Chai: Collaborative heterogeneous applications for integrated-architectures

J Gómez-Luna, I El Hajj, LW Chang… - … Analysis of Systems …, 2017 - ieeexplore.ieee.org
Heterogeneous system architectures are evolving towards tighter integration among
devices, with emerging features such as shared virtual memory, memory coherence, and …

Fast segmented sort on GPUs

K Hou, W Liu, H Wang, W Feng - Proceedings of the International …, 2017 - dl.acm.org
Segmented sort, as a generalization of classical sort, orders a batch of independent
segments in a whole array. Along with the wider adoption of manycore processors for HPC …

CPElide: Efficient Multi-Chiplet GPU Implicit Synchronization

P Dalmia, RS Kumar, MD Sinclair - 2024 57th IEEE/ACM …, 2024 - ieeexplore.ieee.org
Chiplets are transforming computer system designs, allowing system designers to combine
heterogeneous computing resources at unprecedented scales. Breaking larger, monolithic …

IRIS: A performance-portable framework for cross-platform heterogeneous computing

J Kim, S Lee, B Johnston… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
From edge to exascale, computer architectures are becoming more heterogeneous and
complex. The systems typically have fat nodes, with multicore CPUs and multiple hardware …

Wireframe: Supporting data-dependent parallelism through dependency graph execution in GPUs

AA Abdolrashidi, D Tripathy, ME Belviranli… - Proceedings of the 50th …, 2017 - dl.acm.org
GPUs lack fundamental support for data-dependent parallelism and synchronization. While
CUDA Dynamic Parallelism signals progress in this direction, many limitations and …

Computation vs. communication scaling for future transformers on future hardware

S Pati, S Aga, M Islam, N Jayasena… - arXiv preprint arXiv …, 2023 - arxiv.org
Scaling neural network models has delivered dramatic quality gains across ML problems.
However, this scaling has increased the reliance on efficient distributed training techniques …

BlockMaestro: Enabling programmer-transparent task-based execution in GPU systems

AA Abdolrashidi, HA Esfeden… - 2021 ACM/IEEE 48th …, 2021 - ieeexplore.ieee.org
As modern GPU workloads grow in size and complexity, there is an ever-increasing demand
for GPU computational power. Emerging workloads contain hundreds or thousands of GPU …

VersaPipe: A versatile programming framework for pipelined computing on GPU

Z Zheng, C Oh, J Zhai, X Shen, Y Yi… - Proceedings of the 50th …, 2017 - dl.acm.org
Pipeline is an important programming pattern, while GPU, designed mostly for data-level
parallel executions, lacks an efficient mechanism to support pipeline programming and …

A survey of architectural approaches for improving GPGPU performance, programmability and heterogeneity

M Khairy, AG Wassal, M Zahran - Journal of Parallel and Distributed …, 2019 - Elsevier
With the skyrocketing advances of process technology, the increased need to process huge
amount of data, and the pivotal need for power efficiency, the usage of Graphics Processing …

Oversubscribed command queues in GPUs

S Puthoor, X Tang, J Gross, BM Beckmann - Proceedings of the 11th …, 2018 - dl.acm.org
As GPUs become larger and provide an increasing number of parallel execution units, a
single kernel is no longer sufficient to utilize all available resources. As a result, GPU …