CoNDA: Efficient cache coherence support for near-data accelerators

A Boroumand, S Ghose, M Patel, H Hassan… - Proceedings of the 46th …, 2019 - dl.acm.org
Specialized on-chip accelerators are widely used to improve the energy efficiency of
computing systems. Recent advances in memory technology have enabled near-data …

Beyond the socket: NUMA-aware GPUs

U Milic, O Villa, E Bolotin, A Arunkumar… - Proceedings of the 50th …, 2017 - dl.acm.org
GPUs achieve high throughput and power efficiency by employing many small single
instruction multiple thread (SIMT) cores. To minimize scheduling logic and performance …

Combining HW/SW mechanisms to improve NUMA performance of multi-GPU systems

V Young, A Jaleel, E Bolotin, E Ebrahimi… - 2018 51st Annual …, 2018 - ieeexplore.ieee.org
Historically, improvement in GPU performance has been tightly coupled with transistor
scaling. As Moore's Law slows down, performance of single GPUs may ultimately plateau …

A formal analysis of the NVIDIA PTX memory consistency model

D Lustig, S Sahasrabuddhe, O Giroux - Proceedings of the Twenty …, 2019 - dl.acm.org
This paper presents the first formal analysis of the official memory consistency model for the
NVIDIA PTX virtual ISA. Like other GPU memory models, the PTX memory model is weakly …

Chronos: Efficient speculative parallelism for accelerators

M Abeydeera, D Sanchez - Proceedings of the Twenty-Fifth International …, 2020 - dl.acm.org
We present Chronos, a framework to build accelerators for applications with speculative
parallelism. These applications consist of atomic tasks, sometimes with order constraints …

Demystifying BERT: System design implications

S Pati, S Aga, N Jayasena… - 2022 IEEE International …, 2022 - ieeexplore.ieee.org
Transfer learning in natural language processing (NLP) uses increasingly large models that
tackle challenging problems. Consequently, these applications are driving the requirements …

Spandex: A flexible interface for efficient heterogeneous coherence

J Alsop, M Sinclair, S Adve - 2018 ACM/IEEE 45th Annual …, 2018 - ieeexplore.ieee.org
Recent heterogeneous architectures have trended toward tighter integration and shared
memory largely due to the efficient communication and programmability enabled by this …

Selective GPU caches to eliminate CPU-GPU HW cache coherence

N Agarwal, D Nellans, E Ebrahimi… - … Symposium on High …, 2016 - ieeexplore.ieee.org
Cache coherence is ubiquitous in shared memory multiprocessors because it provides a
simple, high performance memory abstraction to programmers. Recent work suggests …

HMG: Extending cache coherence protocols across modern hierarchical multi-GPU systems

X Ren, D Lustig, E Bolotin, A Jaleel… - … Symposium on High …, 2020 - ieeexplore.ieee.org
Prior work on GPU cache coherence has shown that simple hardware- or software-based
protocols can be more than sufficient. However, in recent years, features such as multi-chip …

Chasing away RAts: Semantics and evaluation for relaxed atomics on heterogeneous systems

MD Sinclair, J Alsop, SV Adve - Proceedings of the 44th Annual …, 2017 - dl.acm.org
An unambiguous and easy-to-understand memory consistency model is crucial for ensuring
correct synchronization and guiding future design of heterogeneous systems. In a widely …