Snake: A variable-length chain-based prefetching for GPUs

S Mostofi, H Falahati, N Mahani… - Proceedings of the 56th …, 2023 - dl.acm.org
Graphics Processing Units (GPUs) utilize memory hierarchy and Thread-Level Parallelism
(TLP) to tolerate off-chip memory latency, which is a significant bottleneck for memory-bound …
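The snippet describes chain-based prefetching only at a high level, so here is a minimal, hypothetical sketch of the general idea: once a warp's loads show a stable stride, issue a chain of prefetch addresses ahead of the demand stream. All names and the chain-length policy are illustrative assumptions, not details from the paper.

```python
class ChainPrefetcher:
    """Toy per-warp stride-chain prefetcher (illustrative, not Snake's design).

    After two consecutive loads from a warp exhibit the same stride, a
    chain of up to `max_chain` future addresses is returned for prefetch.
    """

    def __init__(self, max_chain=4):
        self.max_chain = max_chain
        self.last_addr = {}  # warp id -> last observed load address
        self.stride = {}     # warp id -> last observed stride

    def observe(self, warp, addr):
        """Record a load; return a (possibly empty) chain of prefetch addresses."""
        prefetches = []
        if warp in self.last_addr:
            s = addr - self.last_addr[warp]
            if s != 0 and self.stride.get(warp) == s:
                # Stride confirmed twice: issue a variable-length chain.
                prefetches = [addr + s * i for i in range(1, self.max_chain + 1)]
            self.stride[warp] = s
        self.last_addr[warp] = addr
        return prefetches
```

For example, after a warp loads addresses 100, 104, 108, the detector confirms a stride of 4 and the third call returns the next `max_chain` addresses in the chain.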

RegMutex: Inter-warp GPU register time-sharing

F Khorasani, HA Esfeden… - 2018 ACM/IEEE 45th …, 2018 - ieeexplore.ieee.org
Registers are the fastest and simultaneously the most expensive kind of memory available to
GPU threads. Due to the existence of a great number of concurrently executing threads, and the …

Virtual Thread: Maximizing thread-level parallelism beyond GPU scheduling limit

MK Yoon, K Kim, S Lee, WW Ro… - ACM SIGARCH Computer …, 2016 - dl.acm.org
Modern GPUs require tens of thousands of concurrent threads to fully utilize the massive
amount of processing resources. However, thread concurrency in GPUs can be diminished …

GhOST: A GPU out-of-order scheduling technique for stall reduction

I Chaturvedi, BR Godala, Y Wu, Z Xu… - 2024 ACM/IEEE 51st …, 2024 - ieeexplore.ieee.org
Graphics Processing Units (GPUs) use massive multi-threading coupled with static
scheduling to hide instruction latencies. Despite this, memory instructions pose a challenge …

A survey of architectural approaches for improving GPGPU performance, programmability and heterogeneity

M Khairy, AG Wassal, M Zahran - Journal of Parallel and Distributed …, 2019 - Elsevier
With the skyrocketing advances of process technology, the increased need to process huge
amounts of data, and the pivotal need for power efficiency, the usage of Graphics Processing …

APRES: Improving cache efficiency by exploiting load characteristics on GPUs

Y Oh, K Kim, MK Yoon, JH Park, Y Park… - ACM SIGARCH …, 2016 - dl.acm.org
Long memory latency and limited throughput become performance bottlenecks of GPGPU
applications. The latency spans hundreds of cycles, which is difficult to hide by simply …

DrGPUM: Guiding memory optimization for GPU-accelerated applications

M Lin, K Zhou, P Su - Proceedings of the 28th ACM International …, 2023 - dl.acm.org
GPUs are widely used in today's computing platforms to accelerate applications in various
domains. However, scarce GPU memory resources are often the dominant limiting factor in …

WIR: Warp instruction reuse to minimize repeated computations in GPUs

K Kim, WW Ro - 2018 IEEE International Symposium on High …, 2018 - ieeexplore.ieee.org
Warp instructions with an identical arithmetic operation on the same input values produce
identical computation results. This paper proposes warp instruction reuse to allow such …
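The reuse idea stated in the snippet amounts to memoizing instruction results keyed by opcode and operand values. Below is a minimal, hypothetical sketch of such a reuse table; the class name, capacity policy, and `alu` callback are illustrative assumptions, not the paper's microarchitecture.

```python
class ReuseBuffer:
    """Toy instruction-reuse table (illustrative, not WIR's hardware design).

    Results of an arithmetic "warp instruction" are cached keyed by
    (opcode, operand values); a repeated instruction with identical
    inputs skips the ALU and reuses the stored result.
    """

    def __init__(self, capacity=64):
        self.capacity = capacity
        self.table = {}   # (opcode, operands) -> cached result
        self.hits = 0
        self.misses = 0

    def execute(self, opcode, operands, alu):
        """Return the instruction result, reusing a cached one when possible."""
        key = (opcode, operands)
        if key in self.table:
            self.hits += 1
            return self.table[key]
        self.misses += 1
        result = alu(*operands)      # fall back to actual computation
        if len(self.table) < self.capacity:
            self.table[key] = result
        return result
```

Executing the same `("add", (1, 2))` instruction twice performs the ALU operation once and serves the second occurrence from the table.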

CTA-aware prefetching and scheduling for GPU

G Koo, H Jeon, Z Liu, NS Kim… - 2018 IEEE International …, 2018 - ieeexplore.ieee.org
Although GPUs are supposed to tolerate the long latency of data-fetch operations, we observe
that L1 cache misses occur in a bursty manner for many memory-intensive applications. This …

FineReg: Fine-grained register file management for augmenting GPU throughput

Y Oh, MK Yoon, WJ Song… - 2018 51st Annual IEEE …, 2018 - ieeexplore.ieee.org
Graphics processing units (GPUs) include a large amount of hardware resources for parallel
thread executions. However, the resources are not fully utilized during runtime, and …