CoNDA: Efficient cache coherence support for near-data accelerators
Specialized on-chip accelerators are widely used to improve the energy efficiency of
computing systems. Recent advances in memory technology have enabled near-data …
Beyond the socket: NUMA-aware GPUs
GPUs achieve high throughput and power efficiency by employing many small single
instruction multiple thread (SIMT) cores. To minimize scheduling logic and performance …
Combining HW/SW mechanisms to improve NUMA performance of multi-GPU systems
Historically, improvement in GPU performance has been tightly coupled with transistor
scaling. As Moore's Law slows down, performance of single GPUs may ultimately plateau …
A formal analysis of the NVIDIA PTX memory consistency model
D. Lustig, S. Sahasrabuddhe, O. Giroux - Proceedings of the Twenty …, 2019 - dl.acm.org
This paper presents the first formal analysis of the official memory consistency model for the
NVIDIA PTX virtual ISA. Like other GPU memory models, the PTX memory model is weakly …
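To make the consequence of that weak ordering concrete, here is a minimal message-passing sketch of my own (not drawn from the paper); it assumes CUDA 11.7+ for libcu++ cuda::atomic_ref, a GPU built for sm_70 or newer, and an illustrative kernel name message_passing. The release store and acquire load are what lower to PTX st.release/ld.acquire and keep the consumer thread from seeing the flag set while still reading a stale payload:

    #include <cuda/atomic>
    #include <cstdio>

    // Thread 0 publishes a payload and releases a flag; thread 1 acquires the
    // flag before reading the payload. Dropping the memory orders would allow
    // thread 1, under the weak PTX model, to see flag == 1 yet read stale data.
    __global__ void message_passing(int *payload, int *flag, int *out) {
        cuda::atomic_ref<int, cuda::thread_scope_device> f(*flag);
        if (threadIdx.x == 0) {
            *payload = 42;                                   // ordinary store
            f.store(1, cuda::std::memory_order_release);     // publish
        } else {
            while (f.load(cuda::std::memory_order_acquire) == 0)
                ;                                            // spin until published
            *out = *payload;                                 // guaranteed to read 42
        }
    }

    int main() {
        int *buf;                                            // payload, flag, out
        cudaMallocManaged(&buf, 3 * sizeof(int));
        buf[0] = buf[1] = buf[2] = 0;
        message_passing<<<1, 2>>>(&buf[0], &buf[1], &buf[2]);
        cudaDeviceSynchronize();
        printf("consumer read payload = %d\n", buf[2]);      // expect 42
        cudaFree(buf);
        return 0;
    }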
Chronos: Efficient speculative parallelism for accelerators
We present Chronos, a framework to build accelerators for applications with speculative
parallelism. These applications consist of atomic tasks, sometimes with order constraints …
Demystifying BERT: System design implications
Transfer learning in natural language processing (NLP) uses increasingly large models that
tackle challenging problems. Consequently, these applications are driving the requirements …
Spandex: A flexible interface for efficient heterogeneous coherence
Recent heterogeneous architectures have trended toward tighter integration and shared
memory largely due to the efficient communication and programmability enabled by this …
Selective GPU caches to eliminate CPU-GPU HW cache coherence
Cache coherence is ubiquitous in shared memory multiprocessors because it provides a
simple, high performance memory abstraction to programmers. Recent work suggests …
HMG: Extending cache coherence protocols across modern hierarchical multi-GPU systems
Prior work on GPU cache coherence has shown that simple hardware- or software-based
protocols can be more than sufficient. However, in recent years, features such as multi-chip …
Chasing away RAts: Semantics and evaluation for relaxed atomics on heterogeneous systems
An unambiguous and easy-to-understand memory consistency model is crucial for ensuring
correct synchronization and guiding future design of heterogeneous systems. In a widely …
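For a sense of the code this debate is about, here is a small sketch of my own (not from the paper; it assumes CUDA 11.7+ for libcu++ cuda::atomic_ref and an illustrative kernel name count_matches). The per-thread increments must be atomic, but nothing else needs to be ordered around them, so memory_order_relaxed suffices; pinning down what programmers may legitimately conclude from such relaxed operations on heterogeneous hardware is the kind of question the paper examines:

    #include <cuda/atomic>
    #include <cstdio>

    // Each matching thread bumps a shared counter. The increment must be
    // atomic, but it needs no ordering with surrounding accesses, so the
    // relaxed memory order is sufficient and cheaper than the seq_cst default.
    __global__ void count_matches(const int *data, int n, int key, int *counter) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && data[i] == key) {
            cuda::atomic_ref<int, cuda::thread_scope_device> c(*counter);
            c.fetch_add(1, cuda::std::memory_order_relaxed);
        }
    }

    int main() {
        const int n = 1 << 20;
        int *data, *counter;
        cudaMallocManaged(&data, n * sizeof(int));
        cudaMallocManaged(&counter, sizeof(int));
        for (int i = 0; i < n; ++i) data[i] = i % 4;         // keys 0..3
        *counter = 0;
        count_matches<<<(n + 255) / 256, 256>>>(data, n, /*key=*/3, counter);
        cudaDeviceSynchronize();
        printf("matches for key 3: %d (expected %d)\n", *counter, n / 4);
        cudaFree(data);
        cudaFree(counter);
        return 0;
    }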