Symphony: Orchestrating sparse and dense tensors with hierarchical heterogeneous processing
Sparse tensor algorithms are becoming widespread, particularly in the domains of deep
learning, graph and data analytics, and scientific computing. Current high-performance …
Going further with Winograd convolutions: Tap-wise quantization for efficient inference on 4x4 tiles
Most of today's computer vision pipelines are built around deep neural networks, where
convolution operations require most of the generally high compute effort. The Winograd …
Principal kernel analysis: A tractable methodology to simulate scaled GPU workloads
C Avalos Baddouh, M Khairy, RN Green… - MICRO-54: 54th Annual …, 2021 - dl.acm.org
Simulating all threads in a scaled GPU workload results in prohibitive simulation cost. Cycle-level simulation is orders of magnitude slower than native silicon; the only solution is to …
Navisim: A highly accurate GPU simulator for AMD RDNA GPUs
As GPUs continue to grow in popularity for accelerating demanding applications, such as
high-performance computing and machine learning, GPU architects need to deliver more …
GPS: A global publish-subscribe model for multi-GPU memory management
Suboptimal management of memory and bandwidth is one of the primary causes of low
performance on systems comprising multiple GPUs. Existing memory management solutions …
Path Forward Beyond Simulators: Fast and Accurate GPU Execution Time Prediction for DNN Workloads
Today, DNNs' high computational complexity and sub-optimal device utilization present a
major roadblock to democratizing DNNs. To reduce the execution time and improve device …
REC: Enhancing fine-grained cache coherence protocol in multi-GPU systems
With the increasing demands of modern workloads, multi-GPU systems have emerged as a
scalable solution, extending performance beyond the capabilities of single GPUs. However …
Photon: A fine-grained sampled simulation methodology for GPU workloads
GPUs, due to their massively-parallel computing architectures, provide high performance for
data-parallel applications. However, existing GPU simulators are too slow to enable …
FinePack: Transparently improving the efficiency of fine-grained transfers in multi-GPU systems
Recent studies have shown that using fine-grained peer-to-peer (P2P) stores to
communicate among devices in multi-GPU systems is a promising path to achieve strong …
A Survey on Heterogeneous CPU–GPU Architectures and Simulators
M Alaei, F Yazdanpanah - Concurrency and Computation …, 2025 - Wiley Online Library
Heterogeneous architectures are widely used in various high-performance computing
systems, from IoT-based embedded architectures to edge and cloud systems. Although …