Snake: A variable-length chain-based prefetching for GPUs
Graphics Processing Units (GPUs) utilize memory hierarchy and Thread-Level Parallelism
(TLP) to tolerate off-chip memory latency, which is a significant bottleneck for memory-bound …
RegMutex: Inter-warp GPU register time-sharing
Registers are the fastest and simultaneously the most expensive kind of memory available to
GPU threads. Due to the existence of a great number of concurrently executing threads, and the …
Virtual thread: Maximizing thread-level parallelism beyond GPU scheduling limit
Modern GPUs require tens of thousands of concurrent threads to fully utilize the massive
amount of processing resources. However, thread concurrency in GPUs can be diminished …
GhOST: A GPU out-of-order scheduling technique for stall reduction
Graphics Processing Units (GPUs) use massive multi-threading coupled with static
scheduling to hide instruction latencies. Despite this, memory instructions pose a challenge …
A survey of architectural approaches for improving GPGPU performance, programmability and heterogeneity
With the skyrocketing advances of process technology, the increased need to process huge
amounts of data, and the pivotal need for power efficiency, the usage of Graphics Processing …
APRES: Improving cache efficiency by exploiting load characteristics on GPUs
Long memory latency and limited throughput become performance bottlenecks of GPGPU
applications. The latency spans hundreds of cycles, which is difficult to hide by simply …
DrGPUM: Guiding memory optimization for GPU-accelerated applications
GPUs are widely used in today's computing platforms to accelerate applications in various
domains. However, scarce GPU memory resources are often the dominant limiting factor in …
WIR: Warp instruction reuse to minimize repeated computations in GPUs
Warp instructions that perform the same arithmetic operation on the same input values produce
identical computation results. This paper proposes warp instruction reuse to allow such …
CTA-aware prefetching and scheduling for GPU
Although GPUs are supposed to tolerate the long latency of data fetch operations, we observe
that L1 cache misses occur in a bursty manner for many memory-intensive applications. This …
FineReg: Fine-grained register file management for augmenting GPU throughput
Graphics processing units (GPUs) include a large amount of hardware resources for parallel
thread executions. However, the resources are not fully utilized during runtime, and …