Accel-Sim: An extensible simulation framework for validated GPU modeling

M Khairy, Z Shen, TM Aamodt… - 2020 ACM/IEEE 47th …, 2020 - ieeexplore.ieee.org
In computer architecture, significant innovation frequently comes from industry. However, the
simulation tools used by industry are often not released for open use, and even when they …

PHI: Architectural support for synchronization-and bandwidth-efficient commutative scatter updates

A Mukkara, N Beckmann, D Sanchez - … of the 52nd Annual IEEE/ACM …, 2019 - dl.acm.org
Many applications perform frequent scatter update operations to large data structures. For
example, in push-style graph algorithms, processing each vertex requires updating the data …

[BOK][B] Shared-memory synchronization

ML Scott, T Brown - 2013 - Springer
This monograph grows out of nearly 40 years of experience in synchronization and
concurrent data structures. Though written primarily from the perspective of systems …

A formal analysis of the NVIDIA PTX memory consistency model

D Lustig, S Sahasrabuddhe, O Giroux - Proceedings of the Twenty …, 2019 - dl.acm.org
This paper presents the first formal analysis of the official memory consistency model for the
NVIDIA PTX virtual ISA. Like other GPU memory models, the PTX memory model is weakly …

SEP-graph: finding shortest execution paths for graph processing under a hybrid framework on GPU

H Wang, L Geng, R Lee, K Hou, Y Zhang… - Proceedings of the 24th …, 2019 - dl.acm.org
In general, the performance of parallel graph processing is determined by three pairs of
critical parameters, namely synchronous or asynchronous execution mode (Sync or Async) …

Täkō: A polymorphic cache hierarchy for general-purpose optimization of data movement

BC Schwedock, P Yoovidhya, J Seibert… - Proceedings of the 49th …, 2022 - dl.acm.org
Current systems hide data movement from software behind the load-store interface.
Software's inability to observe and respond to data movement is the root cause of many …

Spandex: A flexible interface for efficient heterogeneous coherence

J Alsop, M Sinclair, S Adve - 2018 ACM/IEEE 45th Annual …, 2018 - ieeexplore.ieee.org
Recent heterogeneous architectures have trended toward tighter integration and shared
memory largely due to the efficient communication and programmability enabled by this …

CPElide: Efficient Multi-Chiplet GPU Implicit Synchronization

P Dalmia, RS Kumar, MD Sinclair - 2024 57th IEEE/ACM …, 2024 - ieeexplore.ieee.org
Chiplets are transforming computer system designs, allowing system designers to combine
heterogeneous computing resources at unprecedented scales. Breaking larger, mono-lithic …

Inter-kernel reuse-aware thread block scheduling

M Huzaifa, J Alsop, A Mahmoud, G Salvador… - ACM Transactions on …, 2020 - dl.acm.org
As GPUs have become more programmable, their performance and energy benefits have
made them increasingly popular. However, while GPU compute units continue to improve in …

Photon: A fine-grained sampled simulation methodology for GPU workloads

C Liu, Y Sun, TE Carlson - Proceedings of the 56th Annual IEEE/ACM …, 2023 - dl.acm.org
GPUs, due to their massively-parallel computing architectures, provide high performance for
data-parallel applications. However, existing GPU simulators are too slow to enable …