The performance of parallel programs suffers from memory access latencies induced by cache misses. In this paper, to investigate the causes of these cache misses, data-parallel applications were executed on shared-memory multiprocessors. The experiments showed that conflict misses accounted for most of the cache misses. These misses were due to cross-interference among grains, each of which consists of a part of the data arrays. To address this problem, a grain size tailored to the underlying cache architecture was devised. Besides the interference among grains, cache performance was also sensitive to the way the data were structured. To construct data structures that exhibit good cache behavior, a stride merging-arrays method is presented. This method reduces cache conflict misses and also reduces useless prefetches in cache lines that hold multiple words. Simulation results show that these techniques may enhance the performance of parallel applications through improved cache performance.