- Academic Search

Dissecting the NVIDIA volta GPU architecture via microbenchmarking

Z Jia, M Maggioni, B Staiger, DP Scarpazza - arXiv preprint arXiv …, 2018 - arxiv.org

Every year, novel NVIDIA GPU designs are introduced. This rapid architectural and
technological progression, coupled with a reluctance by manufacturers to disclose low-level …

Cited by 389

[PDF] usenix.org

Rammer: Enabling holistic deep learning compiler optimizations with rTasks

L Ma, Z Xie, Z Yang, J Xue, Y Miao, W Cui… - … USENIX Symposium on …, 2020 - usenix.org

Performing Deep Neural Network (DNN) computation on hardware accelerators efficiently is
challenging. Existing DNN frameworks and compilers often treat the DNN operators in a …

Cited by 159

[PDF] ust.hk

Demystifying tensor cores to optimize half-precision matrix multiply

D Yan, W Wang, X Chu - 2020 IEEE International Parallel and …, 2020 - ieeexplore.ieee.org

Half-precision matrix multiply has played a key role in the training of deep learning models.
The newly designed Nvidia Tensor Cores offer the native instructions for half-precision small …

Cited by 94

[PDF] arxiv.org

CLTune: A generic auto-tuner for OpenCL kernels

C Nugteren, V Codreanu - 2015 IEEE 9th International …, 2015 - ieeexplore.ieee.org

This work presents CLTune, an auto-tuner for OpenCL kernels. It evaluates and tunes kernel
performance of a generic, user-defined search space of possible parameter-value …

Cited by 175

[PDF] utk.edu

Performance, design, and autotuning of batched GEMM for GPUs

A Abdelfattah, A Haidar, S Tomov… - … Conference, ISC High …, 2016 - Springer

The general matrix-matrix multiplication (GEMM) is the most important numerical kernel in
dense linear algebra, and is the key component for obtaining high performance in most …

Cited by 152

[BOOK][B] General-purpose graphics processor architectures

TM Aamodt, WWL Fung, TG Rogers, M Martonosi - 2018 - Springer

Originally developed to support video games, graphics processor units (GPUs) are now
increasingly used for general-purpose (non-graphics) applications ranging from machine …

Cited by 107

[PDF] arxiv.org

CLBlast: A tuned OpenCL BLAS library

C Nugteren - Proceedings of the International Workshop on OpenCL, 2018 - dl.acm.org

This work introduces CLBlast, an open-source BLAS library providing optimized OpenCL
routines to accelerate dense linear algebra for a wide variety of devices. It is targeted at …

Cited by 112

[PDF] acm.org

GPU register file virtualization

H Jeon, GS Ravi, NS Kim, M Annavaram - Proceedings of the 48th …, 2015 - dl.acm.org

To support massive number of parallel thread contexts, Graphics Processing Units (GPUs)
use a huge register file, which is responsible for a large fraction of GPU's total power and …

Cited by 126

[PDF] ust.hk

Optimizing batched winograd convolution on GPUs

D Yan, W Wang, X Chu - Proceedings of the 25th ACM SIGPLAN …, 2020 - dl.acm.org

In this paper, we present an optimized implementation for single-precision Winograd
convolution on NVIDIA Volta and Turing GPUs. Compared with the state-of-the-art Winograd …

Cited by 74

[PDF] archive.org

Scalable kernel fusion for memory-bound GPU applications

M Wahib, N Maruyama - SC'14: Proceedings of the …, 2014 - ieeexplore.ieee.org

GPU implementations of HPC applications relying on finite difference methods can include
tens of kernels that are memory-bound. Kernel fusion can improve performance by reducing …

Cited by 127

Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs

Dissecting the NVIDIA volta GPU architecture via microbenchmarking
