A survey of recent prefetching techniques for processor caches

S Mittal - ACM Computing Surveys (CSUR), 2016 - dl.acm.org
As the trends of process scaling make memory systems an even more crucial bottleneck, the
importance of latency hiding techniques such as prefetching grows further. However, naively …

Pythia: A customizable hardware prefetching framework using online reinforcement learning

R Bera, K Kanellopoulos, A Nori, T Shahroodi… - MICRO-54: 54th Annual …, 2021 - dl.acm.org
Past research has proposed numerous hardware prefetching techniques, most of which rely
on exploiting one specific type of program context information (e.g., program counter …

Accelerating pointer chasing in 3D-stacked memory: Challenges, mechanisms, evaluation

K Hsieh, S Khan, N Vijaykumar… - 2016 IEEE 34th …, 2016 - ieeexplore.ieee.org
Pointer chasing is a fundamental operation, used by many important data-intensive
applications (e.g., databases, key-value stores, graph processing workloads) to traverse …

High performance RDMA-based MPI implementation over InfiniBand

J Liu, J Wu, SP Kini, P Wyckoff, DK Panda - Proceedings of the 17th …, 2003 - dl.acm.org
Although InfiniBand Architecture is relatively new in the high performance computing area, it
offers many features which help us to improve the performance of communication …

Runahead execution: An alternative to very large instruction windows for out-of-order processors

O Mutlu, J Stark, C Wilkerson… - The Ninth International …, 2003 - ieeexplore.ieee.org
Today's high performance processors tolerate long latency operations by means of out-of-
order execution. However, as latencies increase, the size of the instruction window must …

IMP: Indirect memory prefetcher

X Yu, CJ Hughes, N Satish, S Devadas - Proceedings of the 48th …, 2015 - dl.acm.org
Machine learning, graph analytics and sparse linear algebra-based applications are
dominated by irregular memory accesses resulting from following edges in a graph or non …

Multithreaded processors

T Ungerer, B Robič, J Šilc - The Computer Journal, 2002 - academic.oup.com
The instruction-level parallelism found in a conventional instruction stream is limited. Studies
have shown the limits of processor utilization even for today's superscalar microprocessors …

Prodigy: Improving the memory latency of data-indirect irregular workloads using hardware-software co-design

N Talati, K May, A Behroozi, Y Yang… - … Symposium on High …, 2021 - ieeexplore.ieee.org
Irregular workloads are typically bottlenecked by the memory system. These workloads often
use sparse data representations, e.g., compressed sparse row/column (CSR/CSC), to …

When prefetching works, when it doesn't, and why

J Lee, H Kim, R Vuduc - ACM Transactions on Architecture and Code …, 2012 - dl.acm.org
In emerging and future high-end processor systems, tolerating increasing cache miss
latency and properly managing memory bandwidth will be critical to achieving high …

Microarchitecture optimizations for exploiting memory-level parallelism

Y Chou, B Fahs, S Abraham - ACM SIGARCH Computer Architecture …, 2004 - dl.acm.org
The performance of memory-bound commercial applications such as databases is limited by
increasing memory latencies. In this paper, we show that exploiting memory-level parallelism …