Stash: Have your scratchpad and cache it too
Heterogeneous systems employ specialization for energy efficiency. Since data movement
is expected to be a dominant consumer of energy, these systems employ specialized …
Efficient GPU synchronization without scopes: Saying no to complex consistency models
As GPUs have become increasingly general purpose, applications with more general
sharing patterns and fine-grained synchronization have started to emerge. Unfortunately …
Spandex: A flexible interface for efficient heterogeneous coherence
Recent heterogeneous architectures have trended toward tighter integration and shared
memory largely due to the efficient communication and programmability enabled by this …
Selective GPU caches to eliminate CPU-GPU HW cache coherence
Cache coherence is ubiquitous in shared memory multiprocessors because it provides a
simple, high performance memory abstraction to programmers. Recent work suggests …
Chasing away RAts: Semantics and evaluation for relaxed atomics on heterogeneous systems
An unambiguous and easy-to-understand memory consistency model is crucial for ensuring
correct synchronization and guiding future design of heterogeneous systems. In a widely …
Lazy release consistency for GPUs
The heterogeneous-race-free (HRF) memory model has been embraced by the
Heterogeneous System Architecture (HSA) Foundation and OpenCL™ because it clearly …
Coherence domain restriction on large scale systems
Designing massive scale cache coherence systems has been an elusive goal. Whether it be
on large-scale GPUs, future thousand-core chips, or across million-core warehouse scale …
Mozart: Taming taxes and composing accelerators with shared-memory
Resource-constrained system-on-chips (SoCs) are increasingly heterogeneous with
specialized accelerators for various tasks. Acceleration taxes due to control and data …
Racer: TSO consistency via race detection
Several recent efforts aim to simplify coherence and its associated costs (e.g., directory size,
complexity) in multicores. The bulk of these efforts rely on program data-race-free (DRF) …
Callback: Efficient synchronization without invalidation with a directory just for spin-waiting
Cache coherence protocols based on self-invalidation allow a simpler design compared to
traditional invalidation-based protocols, by relying on data-race-free (DRF) semantics and …