Dissecting the NVIDIA Volta GPU architecture via microbenchmarking
Every year, novel NVIDIA GPU designs are introduced. This rapid architectural and
technological progression, coupled with a reluctance by manufacturers to disclose low-level …
Rammer: Enabling holistic deep learning compiler optimizations with rTasks
Performing Deep Neural Network (DNN) computation on hardware accelerators efficiently is
challenging. Existing DNN frameworks and compilers often treat the DNN operators in a …
Demystifying tensor cores to optimize half-precision matrix multiply
Half-precision matrix multiply has played a key role in the training of deep learning models.
The newly designed Nvidia Tensor Cores offer the native instructions for half-precision small …
CLTune: A generic auto-tuner for OpenCL kernels
C Nugteren, V Codreanu - 2015 IEEE 9th International …, 2015 - ieeexplore.ieee.org
This work presents CLTune, an auto-tuner for OpenCL kernels. It evaluates and tunes kernel
performance of a generic, user-defined search space of possible parameter-value …
Performance, design, and autotuning of batched GEMM for GPUs
The general matrix-matrix multiplication (GEMM) is the most important numerical kernel in
dense linear algebra, and is the key component for obtaining high performance in most …
[BOOK][B] General-purpose graphics processor architectures
Originally developed to support video games, graphics processor units (GPUs) are now
increasingly used for general-purpose (non-graphics) applications ranging from machine …
CLBlast: A tuned OpenCL BLAS library
C Nugteren - Proceedings of the International Workshop on OpenCL, 2018 - dl.acm.org
This work introduces CLBlast, an open-source BLAS library providing optimized OpenCL
routines to accelerate dense linear algebra for a wide variety of devices. It is targeted at …
GPU register file virtualization
To support massive number of parallel thread contexts, Graphics Processing Units (GPUs)
use a huge register file, which is responsible for a large fraction of GPU's total power and …
Optimizing batched Winograd convolution on GPUs
In this paper, we present an optimized implementation for single-precision Winograd
convolution on NVIDIA Volta and Turing GPUs. Compared with the state-of-the-art Winograd …
Scalable kernel fusion for memory-bound GPU applications
M Wahib, N Maruyama - SC'14: Proceedings of the …, 2014 - ieeexplore.ieee.org
GPU implementations of HPC applications relying on finite difference methods can include
tens of kernels that are memory-bound. Kernel fusion can improve performance by reducing …