A modern primer on processing in memory

O Mutlu, S Ghose, J Gómez-Luna… - … computing: from devices …, 2022 - Springer
Modern computing systems are overwhelmingly designed to move data to computation. This
design choice goes directly against at least three key trends in computing that cause …

Processing-in-memory: A workload-driven perspective

S Ghose, A Boroumand, JS Kim… - IBM Journal of …, 2019 - ieeexplore.ieee.org
Many modern and emerging applications must process increasingly large volumes of data.
Unfortunately, prevalent computing paradigms are not designed to efficiently handle such …

DAMOV: A new methodology and benchmark suite for evaluating data movement bottlenecks

GF Oliveira, J Gómez-Luna, L Orosa, S Ghose… - IEEE …, 2021 - ieeexplore.ieee.org
Data movement between the CPU and main memory is a first-order obstacle against
improving performance, scalability, and energy efficiency in modern systems. Computer systems …

SMASH: Co-designing software compression and hardware-accelerated indexing for efficient sparse matrix operations

K Kanellopoulos, N Vijaykumar, C Giannoula… - Proceedings of the …, 2019 - dl.acm.org
Important workloads, such as machine learning and graph analytics applications, heavily
involve sparse linear algebra operations. These operations use sparse matrix compression …

MGPUSim: Enabling multi-GPU performance modeling and optimization

Y Sun, T Baruah, SA Mojumder, S Dong… - Proceedings of the 46th …, 2019 - dl.acm.org
The rapidly growing popularity and scale of data-parallel workloads demand a
corresponding increase in raw computational power of Graphics Processing Units (GPUs) …

FIGARO: Improving system performance via fine-grained in-DRAM data relocation and caching

Y Wang, L Orosa, X Peng, Y Guo… - 2020 53rd Annual …, 2020 - ieeexplore.ieee.org
Main memory, composed of DRAM, is a performance bottleneck for many applications, due
to the high DRAM access latency. In-DRAM caches work to mitigate this latency by …

Architecting waferscale processors: A GPU case study

S Pal, D Petrisko, M Tomei, P Gupta… - … Symposium on High …, 2019 - ieeexplore.ieee.org
Increasing communication overheads are already threatening computer system scaling. One
approach to dramatically reduce communication overheads is waferscale processing …

Stream-based memory access specialization for general purpose processors

Z Wang, T Nowatzki - Proceedings of the 46th International Symposium …, 2019 - dl.acm.org
Because of severe limitations in technology scaling, architects have innovated in
specializing general purpose processors for computation primitives (e.g., vector instructions …

PAVER: Locality graph-based thread block scheduling for GPUs

D Tripathy, A Abdolrashidi, LN Bhuyan, L Zhou… - ACM Transactions on …, 2021 - dl.acm.org
The massive parallelism present in GPUs comes at the cost of reduced L1 and L2 cache
sizes per thread, leading to serious cache contention problems such as thrashing. Hence …

Common counters: Compressed encryption counters for secure GPU memory

S Na, S Lee, Y Kim, J Park, J Huh - 2021 IEEE International …, 2021 - ieeexplore.ieee.org
Hardware-based trusted execution has opened a promising new opportunity for enabling
secure cloud computing. Nevertheless, the current trusted execution environments are …