A survey of techniques for architecting and managing GPU register file

S Mittal - IEEE Transactions on Parallel and Distributed …, 2016 - ieeexplore.ieee.org
To support their massively multithreaded architecture, GPUs use a very large register file (RF)
which has a capacity higher than even L1 and L2 caches. In total contrast, traditional CPUs …

ConvStencil: Transform stencil computation to matrix multiplication on tensor cores

Y Chen, K Li, Y Wang, D Bai, L Wang, L Ma… - Proceedings of the 29th …, 2024 - dl.acm.org
Tensor Core Unit (TCU) is increasingly integrated into modern high-performance processors
to enhance matrix multiplication performance. However, constrained to its over-specification …

Machine learning based auto-tuning for enhanced OpenCL performance portability

TL Falch, AC Elster - 2015 IEEE International Parallel and …, 2015 - ieeexplore.ieee.org
Heterogeneous computing, which combines devices with different architectures, is rising in
popularity, and promises increased performance combined with reduced energy …

Warp-consolidation: A novel execution model for GPUs

A Li, W Liu, L Wang, K Barker, SL Song - Proceedings of the 2018 …, 2018 - dl.acm.org
With the unprecedented development of compute capability and extension of memory
bandwidth on modern GPUs, parallel communication and synchronization soon become a …

Machine learning‐based auto‐tuning for enhanced performance portability of OpenCL applications

TL Falch, AC Elster - Concurrency and Computation: Practice …, 2017 - Wiley Online Library
Heterogeneous computing, combining devices with different architectures such as CPUs
and GPUs, is rising in popularity and promises increased performance combined with …

GPU-UniCache: Automatic code generation of spatial blocking for stencils on GPUs

K Hou, H Wang, W Feng - Proceedings of the computing frontiers …, 2017 - dl.acm.org
Spatial blocking is a critical memory-access optimization to efficiently exploit the computing
resources of parallel processors, such as many-core GPUs. By reusing cache-loaded data …

LoRAStencil: Low-Rank Adaptation of Stencil Computation on Tensor Cores

Y Zhang, K Li, L Yuan, J Cheng… - … Conference for High …, 2024 - ieeexplore.ieee.org
Stencil computations play a pivotal role in numerous scientific and industrial applications,
yet their efficient execution on specialized hardware accelerators like Tensor Core Units …

Moirae: Generating High-Performance Composite Stencil Programs with Global Optimizations

X Liu, X Yang, K Ma, S Liu, K Zhang… - … Conference for High …, 2024 - ieeexplore.ieee.org
Stencil computation is one of the most universal computation motifs in scientific applications
such as weather prediction. Due to the complexity of scientific simulation, the stencil …

Batched triangular dense linear algebra kernels for very small matrix sizes on GPUs

A Charara, D Keyes, H Ltaief - ACM Transactions on Mathematical …, 2019 - dl.acm.org
Batched dense linear algebra kernels are becoming ubiquitous in scientific applications,
ranging from tensor contractions in deep learning to data compression in hierarchical low …

Memory access optimization of high-order CFD stencil computations on GPU

S Wang, Z Li, Y Che - … , PDCAT 2020, Shenzhen, China, December 28–30 …, 2021 - Springer
Stencil computations are a class of computations commonly found in scientific and
engineering applications. They have relatively low arithmetic intensity. Therefore, their …