Dissecting GPU memory hierarchy through microbenchmarking

X Mei, X Chu - IEEE Transactions on Parallel and Distributed …, 2016 - ieeexplore.ieee.org
Memory access efficiency is a key factor in fully utilizing the computational power of graphics
processing units (GPUs). However, many details of the GPU memory hierarchy are not …

Duality cache for data parallel acceleration

D Fujiki, S Mahlke, R Das - … of the 46th International Symposium on …, 2019 - dl.acm.org
Duality Cache is an in-cache computation architecture that enables general purpose data
parallel applications to run on caches. This paper presents a holistic approach of building …

Adaptive cache management for energy-efficient GPU computing

X Chen, LW Chang, CI Rodrigues, J Lv… - 2014 47th Annual …, 2014 - ieeexplore.ieee.org
With the SIMT execution model, GPUs can hide memory latency through massive
multithreading for many applications that have regular memory access patterns. To support …

Coordinated static and dynamic cache bypassing for GPUs

X Xie, Y Liang, Y Wang, G Sun… - 2015 IEEE 21st …, 2015 - ieeexplore.ieee.org
The massive parallel architecture enables graphics processing units (GPUs) to boost
performance for a wide range of applications. Initially, GPUs only employ scratchpad …

A survey of cache bypassing techniques

S Mittal - Journal of Low Power Electronics and Applications, 2016 - mdpi.com
With increasing core-count, the cache demand of modern processors has also increased.
However, due to strict area/power budgets and presence of poor data-locality workloads …

MASK: Redesigning the GPU memory hierarchy to support multi-application concurrency

R Ausavarungnirun, V Miller, J Landgraf… - ACM SIGPLAN …, 2018 - dl.acm.org
Graphics Processing Units (GPUs) exploit large amounts of thread-level parallelism to
provide high instruction throughput and to efficiently hide long-latency stalls. The resulting …

Locality-driven dynamic GPU cache bypassing

C Li, SL Song, H Dai, A Sidelnik, SKS Hari… - Proceedings of the 29th …, 2015 - dl.acm.org
This paper presents novel cache optimizations for massively parallel, throughput-oriented
architectures like GPUs. L1 data caches (L1 D-caches) are critical resources for providing …

GNNMark: A benchmark suite to characterize graph neural network training on GPUs

T Baruah, K Shivdikar, S Dong, Y Sun… - … Analysis of Systems …, 2021 - ieeexplore.ieee.org
Graph Neural Networks (GNNs) have emerged as a promising class of Machine Learning
algorithms to train on non-Euclidean data. GNNs are widely used in recommender systems …

Locality-aware CTA clustering for modern GPUs

A Li, SL Song, W Liu, X Liu, A Kumar… - ACM SIGARCH …, 2017 - dl.acm.org
Cache is designed to exploit locality; however, the role of on-chip L1 data caches on modern
GPUs is often awkward. The locality among global memory requests from different SMs …

CAWA: Coordinated warp scheduling and cache prioritization for critical warp acceleration of GPGPU workloads

SY Lee, A Arunkumar, CJ Wu - ACM SIGARCH Computer Architecture …, 2015 - dl.acm.org
The ubiquity of graphics processing unit (GPU) architectures has made them efficient
alternatives to chip-multiprocessors for parallel workloads. GPUs achieve superior …