An overview of cache optimization techniques and cache-aware numerical algorithms

M Kowarschik, C Weiß - Algorithms for memory hierarchies: advanced …, 2003 - Springer
In order to mitigate the impact of the growing gap between CPU speed and main memory
performance, today's computer architectures implement hierarchical memory structures. The …

[BOOK][B] Parallel computer architecture: a hardware/software approach

D Culler, JP Singh, A Gupta - 1999 - books.google.com
The most exciting development in parallel computer architecture is the convergence of
traditionally disparate approaches on a common machine structure. This book explains the …

[BOOK][B] Memory systems: cache, DRAM, disk

B Jacob, D Wang, S Ng - 2010 - books.google.com
Is your memory hierarchy stopping your microprocessor from performing at the high level it
should be? Memory Systems: Cache, DRAM, Disk shows you how to resolve this problem …

[PDF] Database architecture optimized for the new bottleneck: Memory access

PA Boncz, S Manegold, ML Kersten - VLDB, 1999 - cs.cmu.edu
In the past decade, advances in speed of commodity CPUs have far out-paced advances in
memory latency. Main-memory access is therefore increasingly a performance bottleneck for …

Compiler-based prefetching for recursive data structures

CK Luk, TC Mowry - Proceedings of the seventh international conference …, 1996 - dl.acm.org
Software-controlled data prefetching offers the potential for bridging the ever-increasing
speed gap between the memory subsystem and today's high-performance processors …

Meta optimization: Improving compiler heuristics with machine learning

M Stephenson, S Amarasinghe, M Martin… - ACM sigplan …, 2003 - dl.acm.org
Compiler writers have crafted many heuristics over the years to approximately solve
NP-hard problems efficiently. Finding a heuristic that performs well on a broad range of …

Improving hash join performance through prefetching

S Chen, A Ailamaki, PB Gibbons… - ACM Transactions on …, 2007 - dl.acm.org
Hash join algorithms suffer from extensive CPU cache stalls. This article shows that the
standard hash join algorithm for disk-oriented databases (i.e., GRACE) spends over 80% of its …

Optimizing main-memory join on modern hardware

S Manegold, P Boncz, M Kersten - IEEE transactions on …, 2002 - ieeexplore.ieee.org
In the past decade, the exponential growth in commodity CPU's speed has far outpaced
advances in memory latency. A second trend is that CPU performance advances are not …

Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors

CK Luk - Proceedings of the 28th annual international …, 2001 - dl.acm.org
Hardly predictable data addresses in many irregular applications have rendered prefetching
ineffective. In many cases, the only accurate way to predict these addresses is to directly …

Improving the memory-system performance of sparse-matrix vector multiplication

S Toledo - IBM Journal of research and development, 1997 - ieeexplore.ieee.org
Sparse-matrix vector multiplication is an important kernel that often runs inefficiently on
superscalar RISC processors. This paper describes techniques that increase instruction …