A modern primer on processing in memory
Modern computing systems are overwhelmingly designed to move data to computation. This design choice goes directly against at least three key trends in computing that cause …
Processing-in-memory: A workload-driven perspective
Many modern and emerging applications must process increasingly large volumes of data. Unfortunately, prevalent computing paradigms are not designed to efficiently handle such …
DAMOV: A new methodology and benchmark suite for evaluating data movement bottlenecks
Data movement between the CPU and main memory is a first-order obstacle against improving performance, scalability, and energy efficiency in modern systems. Computer systems …
Figaro: Improving system performance via fine-grained in-DRAM data relocation and caching
Main memory, composed of DRAM, is a performance bottleneck for many applications, due to the high DRAM access latency. In-DRAM caches work to mitigate this latency by …
Smash: Co-designing software compression and hardware-accelerated indexing for efficient sparse matrix operations
Important workloads, such as machine learning and graph analytics applications, heavily involve sparse linear algebra operations. These operations use sparse matrix compression …
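To make the data layout concrete, here is a minimal sketch of one widely used sparse compression format, CSR (compressed sparse row), and the sparse matrix-vector multiply it supports. This is a generic illustration of sparse matrix compression, not SMASH's own scheme; the identifiers are illustrative.

```c
/* Minimal CSR (compressed sparse row) example: only nonzeros are stored,
   plus per-row offsets and column indices. Generic sketch, not SMASH's
   actual compression or indexing scheme. */
#include <stdio.h>

/* y = A * x, with A stored in CSR form */
void spmv_csr(int n_rows,
              const int *row_ptr,   /* row_ptr[i]..row_ptr[i+1]-1 index row i's nonzeros */
              const int *col_idx,   /* column of each stored nonzero */
              const double *vals,   /* value of each stored nonzero */
              const double *x, double *y)
{
    for (int i = 0; i < n_rows; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += vals[k] * x[col_idx[k]];  /* indirect, irregular access to x */
        y[i] = sum;
    }
}

int main(void) {
    /* 3x3 matrix [[5,0,0],[0,8,3],[0,0,6]] in CSR */
    int row_ptr[] = {0, 1, 3, 4};
    int col_idx[] = {0, 1, 2, 2};
    double vals[]  = {5, 8, 3, 6};
    double x[] = {1, 2, 3}, y[3];
    spmv_csr(3, row_ptr, col_idx, vals, x, y);
    for (int i = 0; i < 3; i++) printf("%g\n", y[i]);  /* prints 5 25 18 */
    return 0;
}
```

The indirect load of x[col_idx[k]] illustrates the kind of irregular indexing that hardware-accelerated indexing schemes such as this one target.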
MGPUSim: Enabling multi-GPU performance modeling and optimization
The rapidly growing popularity and scale of data-parallel workloads demand a corresponding increase in raw computational power of Graphics Processing Units (GPUs) …
Paver: Locality graph-based thread block scheduling for GPUs
The massive parallelism present in GPUs comes at the cost of reduced L1 and L2 cache sizes per thread, leading to serious cache contention problems such as thrashing. Hence …
Stream-based memory access specialization for general purpose processors
Because of severe limitations in technology scaling, architects have innovated in specializing general purpose processors for computation primitives (e.g., vector instructions …
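As a concrete illustration of the memory-access "streams" the title refers to, the loop below touches memory in fully predictable affine patterns. This is a sketch of the access pattern only, not the paper's proposed ISA or hardware interface.

```c
/* Sketch of affine "stream" access patterns that stream-based
   specialization targets. Illustrative only; the function name and
   structure here are generic, not taken from the paper. */
#include <stddef.h>

/* c[i] = alpha * a[i] + b[i]: three linear streams, each fully
   described by (base address, stride 1, trip count n) before the
   loop runs, so specialized hardware could be configured with the
   pattern once instead of computing one address per iteration. */
void daxpy(size_t n, double alpha,
           const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        c[i] = alpha * a[i] + b[i];
}
```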
Architecting waferscale processors - a GPU case study
Increasing communication overheads are already threatening computer system scaling. One approach to dramatically reduce communication overheads is waferscale processing …
Understanding the future of energy efficiency in multi-module GPUs
As Moore's law slows down, GPUs must pivot towards multi-module designs to continue scaling performance at historical rates. Prior work on multi-module GPUs has focused on …