Fast convolutional nets with fbfft: A GPU performance evaluation
We examine the performance profile of Convolutional Neural Network training on the current
generation of NVIDIA Graphics Processing Units. We introduce two new Fast Fourier …
BLIS: A framework for rapidly instantiating BLAS functionality
The BLAS-like Library Instantiation Software (BLIS) framework is a new infrastructure for
rapidly instantiating Basic Linear Algebra Subprograms (BLAS) functionality. Its fundamental …
Anatomy of high-performance matrix multiplication
K Goto, RA Geijn - ACM Transactions on Mathematical Software (TOMS), 2008 - dl.acm.org
We present the basic principles that underlie the high-performance implementation of the
matrix-matrix multiplication that is part of the widely used GotoBLAS library. Design …
[BOOK][B] Automatic performance tuning of sparse matrix kernels
RW Vuduc - 2003 - search.proquest.com
This dissertation presents an automated system to generate highly efficient, platform-
adapted implementations of sparse matrix kernels. We show that conventional …
FLAME: Formal linear algebra methods environment
JA Gunnels, FG Gustavson, GM Henry… - ACM Transactions on …, 2001 - dl.acm.org
Since the advent of high-performance distributed-memory parallel computing, the need for
intelligible code has become ever greater. The development and maintenance of libraries …
Analytical modeling is enough for high-performance BLIS
We show how the BLAS-like Library Instantiation Software (BLIS) framework, which provides
a more detailed layering of the GotoBLAS (now maintained as OpenBLAS) implementation …
Rotation left digits to enhance the security level of message blocks cryptography
Due to the availability of several social media platforms and their use in sending text
messages, it is necessary to provide an easy and safe way to protect messages from being …
High performance zero-memory overhead direct convolutions
The computation of convolution layers in deep neural networks typically relies on high
performance routines that trade space for time by using additional memory (either for …
Design of a high-performance GEMM-like tensor–tensor multiplication
P Springer, P Bientinesi - ACM Transactions on Mathematical Software …, 2018 - dl.acm.org
We present “GEMM-like Tensor–Tensor multiplication” (GETT), a novel approach for dense
tensor contractions that mirrors the design of a high-performance general matrix–matrix …
High-performance tensor contraction without transposition
DA Matthews - SIAM Journal on Scientific Computing, 2018 - SIAM
Tensor computations, in particular tensor contraction (TC), are important kernels in many
scientific computing applications. Due to the fundamental similarity of TC to matrix …