Zorua: A holistic approach to resource virtualization in GPUs
This paper introduces a new resource virtualization framework, Zorua, that decouples the
programmer-specified resource usage of a GPU application from the actual allocation in the …
Virtual thread: Maximizing thread-level parallelism beyond GPU scheduling limit
Modern GPUs require tens of thousands of concurrent threads to fully utilize the massive
amount of processing resources. However, thread concurrency in GPUs can be diminished …
Warped-preexecution: A GPU pre-execution approach for improving latency hiding
This paper presents a pre-execution approach for improving GPU performance, called
P-mode (pre-execution mode). GPUs utilize a number of concurrent threads for hiding …
Regless: Just-in-time operand staging for GPUs
J Kloosterman, J Beaumont, DA Jamshidi… - Proceedings of the 50th …, 2017 - dl.acm.org
The register file is one of the largest and most power-hungry structures in a Graphics
Processing Unit (GPU), because massive multithreading requires all the register state for …
Hierarchical register file at a graphics processing unit
Y Eckert, N Jayasena - US Patent 10,853,904, 2020 - Google Patents
A processor employs a hierarchical register file for a graphics processing unit (GPU). A top
level of the hierarchical register file is stored at a local memory of the GPU (e.g., a memory on …
Many-BSP: an analytical performance model for CUDA kernels
The unknown behavior of GPUs and the differing characteristics among their generations
present a serious challenge in the analysis and optimization of programs in these …
A stall-aware warp scheduling for dynamically optimizing thread-level parallelism in GPGPUs
General-Purpose Graphic Processing Units (GPGPU) have been widely used in high
performance computing as application accelerators due to their massive parallelism and …
Unified on-chip memory allocation for SIMT architecture
The popularity of general purpose Graphic Processing Unit (GPU) is largely attributed to the
tremendous concurrency enabled by its underlying architecture--single instruction multiple …
Phase aware warp scheduling: Mitigating effects of phase behavior in GPGPU applications
Graphics Processing Units (GPUs) have been widely adopted as accelerators for high
performance computing due to the immense amount of computational throughput they offer …
Efficient exception handling support for GPUs
Operating systems have long relied on the exception handling mechanism to implement
numerous virtual memory features and optimizations. However, today's GPUs have a limited …