Snake: A variable-length chain-based prefetching for GPUs
Graphics Processing Units (GPUs) utilize memory hierarchy and Thread-Level Parallelism
(TLP) to tolerate off-chip memory latency, which is a significant bottleneck for memory-bound …
RegMutex: Inter-warp GPU register time-sharing
Registers are the fastest and simultaneously the most expensive kind of memory available to
GPU threads. Due to the existence of a great number of concurrently executing threads, and the …
Virtual thread: Maximizing thread-level parallelism beyond GPU scheduling limit
Modern GPUs require tens of thousands of concurrent threads to fully utilize the massive
amount of processing resources. However, thread concurrency in GPUs can be diminished …
GhOST: A GPU out-of-order scheduling technique for stall reduction
Graphics Processing Units (GPUs) use massive multi-threading coupled with static
scheduling to hide instruction latencies. Despite this, memory instructions pose a challenge …
A survey of architectural approaches for improving GPGPU performance, programmability and heterogeneity
With the skyrocketing advances of process technology, the increased need to process huge
amounts of data, and the pivotal need for power efficiency, the usage of Graphics Processing …
APRES: Improving cache efficiency by exploiting load characteristics on GPUs
Long memory latency and limited throughput become performance bottlenecks of GPGPU
applications. The latency spans hundreds of cycles, which is difficult to hide by simply …
DrGPUM: Guiding memory optimization for GPU-accelerated applications
GPUs are widely used in today's computing platforms to accelerate applications in various
domains. However, scarce GPU memory resources are often the dominant limiting factor in …
WIR: Warp instruction reuse to minimize repeated computations in GPUs
Warp instructions that perform the same arithmetic operation on the same input values produce
identical computation results. This paper proposes warp instruction reuse to allow such …
CTA-aware prefetching and scheduling for GPU
Although GPUs are supposed to tolerate the long latency of data fetch operations, we observe
that L1 cache misses occur in a bursty manner for many memory-intensive applications. This …
FineReg: Fine-grained register file management for augmenting GPU throughput
Graphics processing units (GPUs) include a large amount of hardware resources for parallel
thread executions. However, the resources are not fully utilized during runtime, and …