Symphony: Orchestrating sparse and dense tensors with hierarchical heterogeneous processing

M Pellauer, J Clemons, V Balaji, N Crago… - ACM Transactions on …, 2023 - dl.acm.org
Sparse tensor algorithms are becoming widespread, particularly in the domains of deep
learning, graph and data analytics, and scientific computing. Current high-performance …

Going further with winograd convolutions: Tap-wise quantization for efficient inference on 4x4 tiles

R Andri, B Bussolino, A Cipolletta… - 2022 55th IEEE/ACM …, 2022 - ieeexplore.ieee.org
Most of today's computer vision pipelines are built around deep neural networks, where
convolution operations require most of the generally high compute effort. The Winograd …

Principal kernel analysis: A tractable methodology to simulate scaled GPU workloads

C Avalos Baddouh, M Khairy, RN Green… - MICRO-54: 54th Annual …, 2021 - dl.acm.org
Simulating all threads in a scaled GPU workload results in prohibitive simulation cost. Cycle-
level simulation is orders of magnitude slower than native silicon, the only solution is to …

Navisim: A highly accurate GPU simulator for AMD RDNA GPUs

Y Bao, Y Sun, Z Feric, MT Shen, M Weston… - Proceedings of the …, 2022 - dl.acm.org
As GPUs continue to grow in popularity for accelerating demanding applications, such as
high-performance computing and machine learning, GPU architects need to deliver more …

Gps: A global publish-subscribe model for multi-gpu memory management

H Muthukrishnan, D Lustig, D Nellans… - MICRO-54: 54th Annual …, 2021 - dl.acm.org
Suboptimal management of memory and bandwidth is one of the primary causes of low
performance on systems comprising multiple GPUs. Existing memory management solutions …

Path Forward Beyond Simulators: Fast and Accurate GPU Execution Time Prediction for DNN Workloads

Y Li, Y Sun, A Jog - Proceedings of the 56th Annual IEEE/ACM …, 2023 - dl.acm.org
Today, DNNs' high computational complexity and sub-optimal device utilization present a
major roadblock to democratizing DNNs. To reduce the execution time and improve device …

REC: Enhancing fine-grained cache coherence protocol in multi-GPU systems

G Ko, J Lee, H Kal, H Lee, WW Ro - Journal of Systems Architecture, 2025 - Elsevier
With the increasing demands of modern workloads, multi-GPU systems have emerged as a
scalable solution, extending performance beyond the capabilities of single GPUs. However …

Photon: A fine-grained sampled simulation methodology for GPU workloads

C Liu, Y Sun, TE Carlson - Proceedings of the 56th Annual IEEE/ACM …, 2023 - dl.acm.org
GPUs, due to their massively-parallel computing architectures, provide high performance for
data-parallel applications. However, existing GPU simulators are too slow to enable …

Finepack: Transparently improving the efficiency of fine-grained transfers in multi-gpu systems

H Muthukrishnan, D Lustig, O Villa… - … Symposium on High …, 2023 - ieeexplore.ieee.org
Recent studies have shown that using fine-grained peer-to-peer (P2P) stores to
communicate among devices in multi-GPU systems is a promising path to achieve strong …

A Survey on Heterogeneous CPU–GPU Architectures and Simulators

M Alaei, F Yazdanpanah - Concurrency and Computation …, 2025 - Wiley Online Library
Heterogeneous architectures are vastly used in various high performance computing
systems from IoT‐based embedded architectures to edge and cloud systems. Although …