Dissecting the NVIDIA Volta GPU architecture via microbenchmarking

Z Jia, M Maggioni, B Staiger, DP Scarpazza - arXiv preprint arXiv …, 2018 - arxiv.org
Every year, novel NVIDIA GPU designs are introduced. This rapid architectural and
technological progression, coupled with a reluctance by manufacturers to disclose low-level …

Rammer: Enabling holistic deep learning compiler optimizations with rTasks

L Ma, Z Xie, Z Yang, J Xue, Y Miao, W Cui… - … USENIX Symposium on …, 2020 - usenix.org
Performing Deep Neural Network (DNN) computation on hardware accelerators efficiently is
challenging. Existing DNN frameworks and compilers often treat the DNN operators in a …

Demystifying tensor cores to optimize half-precision matrix multiply

D Yan, W Wang, X Chu - 2020 IEEE International Parallel and …, 2020 - ieeexplore.ieee.org
Half-precision matrix multiply has played a key role in the training of deep learning models.
The newly designed Nvidia Tensor Cores offer the native instructions for half-precision small …

CLTune: A generic auto-tuner for OpenCL kernels

C Nugteren, V Codreanu - 2015 IEEE 9th International …, 2015 - ieeexplore.ieee.org
This work presents CLTune, an auto-tuner for OpenCL kernels. It evaluates and tunes kernel
performance of a generic, user-defined search space of possible parameter-value …

Performance, design, and autotuning of batched GEMM for GPUs

A Abdelfattah, A Haidar, S Tomov… - … Conference, ISC High …, 2016 - Springer
The general matrix-matrix multiplication (GEMM) is the most important numerical kernel in
dense linear algebra, and is the key component for obtaining high performance in most …

[BOOK][B] General-purpose graphics processor architectures

Originally developed to support video games, graphics processor units (GPUs) are now
increasingly used for general-purpose (non-graphics) applications ranging from machine …

CLBlast: A tuned OpenCL BLAS library

C Nugteren - Proceedings of the International Workshop on OpenCL, 2018 - dl.acm.org
This work introduces CLBlast, an open-source BLAS library providing optimized OpenCL
routines to accelerate dense linear algebra for a wide variety of devices. It is targeted at …

GPU register file virtualization

H Jeon, GS Ravi, NS Kim, M Annavaram - Proceedings of the 48th …, 2015 - dl.acm.org
To support massive number of parallel thread contexts, Graphics Processing Units (GPUs)
use a huge register file, which is responsible for a large fraction of GPU's total power and …

Optimizing batched Winograd convolution on GPUs

D Yan, W Wang, X Chu - Proceedings of the 25th ACM SIGPLAN …, 2020 - dl.acm.org
In this paper, we present an optimized implementation for single-precision Winograd
convolution on NVIDIA Volta and Turing GPUs. Compared with the state-of-the-art Winograd …

Scalable kernel fusion for memory-bound GPU applications

M Wahib, N Maruyama - SC'14: Proceedings of the …, 2014 - ieeexplore.ieee.org
GPU implementations of HPC applications relying on finite difference methods can include
tens of kernels that are memory-bound. Kernel fusion can improve performance by reducing …