Dissecting GPU memory hierarchy through microbenchmarking

X Mei, X Chu - IEEE Transactions on Parallel and Distributed …, 2016 - ieeexplore.ieee.org
Memory access efficiency is a key factor in fully utilizing the computational power of graphics
processing units (GPUs). However, many details of the GPU memory hierarchy are not …

Duality cache for data parallel acceleration

D Fujiki, S Mahlke, R Das - … of the 46th International Symposium on …, 2019 - dl.acm.org
Duality Cache is an in-cache computation architecture that enables general purpose data
parallel applications to run on caches. This paper presents a holistic approach of building …

Adaptive cache management for energy-efficient GPU computing

X Chen, LW Chang, CI Rodrigues, J Lv… - 2014 47th Annual …, 2014 - ieeexplore.ieee.org
With the SIMT execution model, GPUs can hide memory latency through massive
multithreading for many applications that have regular memory access patterns. To support …

Coordinated static and dynamic cache bypassing for GPUs

X Xie, Y Liang, Y Wang, G Sun… - 2015 IEEE 21st …, 2015 - ieeexplore.ieee.org
The massive parallel architecture enables graphics processing units (GPUs) to boost
performance for a wide range of applications. Initially, GPUs only employ scratchpad …

A survey of cache bypassing techniques

S Mittal - Journal of Low Power Electronics and Applications, 2016 - mdpi.com
With increasing core-count, the cache demand of modern processors has also increased.
However, due to strict area/power budgets and presence of poor data-locality workloads …

MASK: Redesigning the GPU memory hierarchy to support multi-application concurrency

R Ausavarungnirun, V Miller, J Landgraf… - ACM SIGPLAN …, 2018 - dl.acm.org
Graphics Processing Units (GPUs) exploit large amounts of thread-level parallelism to
provide high instruction throughput and to efficiently hide long-latency stalls. The resulting …

Locality-driven dynamic GPU cache bypassing

C Li, SL Song, H Dai, A Sidelnik, SKS Hari… - Proceedings of the 29th …, 2015 - dl.acm.org
This paper presents novel cache optimizations for massively parallel, throughput-oriented
architectures like GPUs. L1 data caches (L1 D-caches) are critical resources for providing …

GNNMark: A benchmark suite to characterize graph neural network training on GPUs

T Baruah, K Shivdikar, S Dong, Y Sun… - … Analysis of Systems …, 2021 - ieeexplore.ieee.org
Graph Neural Networks (GNNs) have emerged as a promising class of Machine Learning
algorithms to train on non-Euclidean data. GNNs are widely used in recommender systems …

Locality-aware CTA clustering for modern GPUs

A Li, SL Song, W Liu, X Liu, A Kumar… - ACM SIGARCH …, 2017 - dl.acm.org
Cache is designed to exploit locality; however, the role of on-chip L1 data caches on modern
GPUs is often awkward. The locality among global memory requests from different SMs …

CAWA: Coordinated warp scheduling and cache prioritization for critical warp acceleration of GPGPU workloads

SY Lee, A Arunkumar, CJ Wu - ACM SIGARCH Computer Architecture …, 2015 - dl.acm.org
The ubiquity of graphics processing unit (GPU) architectures has made them efficient
alternatives to chip-multiprocessors for parallel workloads. GPUs achieve superior …