Chai: Collaborative heterogeneous applications for integrated-architectures
Heterogeneous system architectures are evolving towards tighter integration among
devices, with emerging features such as shared virtual memory, memory coherence, and …
Fast segmented sort on GPUs
Segmented sort, as a generalization of classical sort, orders a batch of independent
segments in a whole array. Along with the wider adoption of manycore processors for HPC …
CPElide: Efficient Multi-Chiplet GPU Implicit Synchronization
Chiplets are transforming computer system designs, allowing system designers to combine
heterogeneous computing resources at unprecedented scales. Breaking larger, monolithic …
IRIS: A performance-portable framework for cross-platform heterogeneous computing
From edge to exascale, computer architectures are becoming more heterogeneous and
complex. The systems typically have fat nodes, with multicore CPUs and multiple hardware …
Wireframe: Supporting data-dependent parallelism through dependency graph execution in GPUs
GPUs lack fundamental support for data-dependent parallelism and synchronization. While
CUDA Dynamic Parallelism signals progress in this direction, many limitations and …
Computation vs. communication scaling for future transformers on future hardware
Scaling neural network models has delivered dramatic quality gains across ML problems.
However, this scaling has increased the reliance on efficient distributed training techniques …
Blockmaestro: Enabling programmer-transparent task-based execution in GPU systems
As modern GPU workloads grow in size and complexity, there is an ever-increasing demand
for GPU computational power. Emerging workloads contain hundreds or thousands of GPU …
Versapipe: a versatile programming framework for pipelined computing on GPU
Pipeline is an important programming pattern, while GPU, designed mostly for data-level
parallel executions, lacks an efficient mechanism to support pipeline programming and …
A survey of architectural approaches for improving GPGPU performance, programmability and heterogeneity
With the skyrocketing advances of process technology, the increased need to process huge
amount of data, and the pivotal need for power efficiency, the usage of Graphics Processing …
Oversubscribed command queues in GPUs
As GPUs become larger and provide an increasing number of parallel execution units, a
single kernel is no longer sufficient to utilize all available resources. As a result, GPU …