Efficient exascale discretizations: High-order finite element methods
Efficient exploitation of exascale architectures requires rethinking the numerical
algorithms used in many large-scale applications. These architectures favor algorithms that …
DeepCPU: Serving RNN-based deep learning models 10x faster
Recurrent neural networks (RNNs) are an important class of deep learning (DL) models.
Existing DL frameworks have unsatisfactory performance for online serving: many RNN …
CLBlast: A tuned OpenCL BLAS library
C Nugteren - Proceedings of the International Workshop on OpenCL, 2018 - dl.acm.org
This work introduces CLBlast, an open-source BLAS library providing optimized OpenCL
routines to accelerate dense linear algebra for a wide variety of devices. It is targeted at …
A data-centric approach to extreme-scale ab initio dissipative quantum transport simulations
The computational efficiency of a state-of-the-art ab initio quantum transport (QT) solver,
capable of revealing the coupled electrothermal properties of atomically-resolved nano …
A benchmark set of highly-efficient CUDA and OpenCL kernels and its dynamic autotuning with Kernel Tuning Toolkit
In recent years, the heterogeneity of both commodity and supercomputer hardware has
increased sharply. Accelerators, such as GPUs or Intel Xeon Phi co-processors, are often …
Fast batched matrix multiplication for small sizes using half-precision arithmetic on GPUs
A Abdelfattah, S Tomov… - 2019 IEEE international …, 2019 - ieeexplore.ieee.org
Matrix multiplication (GEMM) is the most important operation in dense linear algebra.
Because it is a compute-bound operation that is rich in data reuse, many applications from …
High-performance tensor contractions for GPUs
We present a computational framework for high-performance tensor contractions on GPUs.
High performance is difficult to obtain using existing libraries, especially for many …
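As a minimal illustration of the kind of tensor contraction such a framework accelerates (the tensor names and shapes below are illustrative, not taken from the paper), NumPy's `einsum` can express one; on GPUs this pattern is typically mapped to batched GEMM calls:

```python
import numpy as np

# Contract a rank-3 tensor A with a matrix B over the shared index k:
#   C[i, j, l] = sum_k A[i, j, k] * B[k, l]
A = np.random.rand(4, 5, 6)
B = np.random.rand(6, 7)
C = np.einsum('ijk,kl->ijl', A, B)
print(C.shape)  # (4, 5, 7)
```

The same contraction can be computed as a stack of small matrix multiplies (`np.matmul(A, B)`), which is why high-performance tensor-contraction libraries lean so heavily on optimized batched GEMM kernels.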
Transmuter: Bridging the efficiency gap using memory and dataflow reconfiguration
With the end of Dennard scaling and Moore's law, it is becoming increasingly difficult to build
hardware for emerging applications that meet power and performance targets, while …
A set of batched basic linear algebra subprograms and LAPACK routines
This article describes a standard API for a set of Batched Basic Linear Algebra Subprograms
(Batched BLAS or BBLAS). The focus is on many independent BLAS operations on small …
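The core Batched BLAS idea — many independent GEMMs on small matrices, launched as a single operation — can be sketched with NumPy's stacked `matmul` (a stand-in for illustration only; the actual BBLAS API described in the article is a C interface, not this):

```python
import numpy as np

# A batch of many small, independent GEMM problems of identical size.
batch, m, k, n = 1000, 8, 8, 8
A = np.random.rand(batch, m, k)
B = np.random.rand(batch, k, n)

# One "batched GEMM" call computes C[i] = A[i] @ B[i] for every i,
# amortizing launch overhead that per-matrix calls would pay 1000 times.
C = np.matmul(A, B)
print(C.shape)  # (1000, 8, 8)
```

Grouping the small problems into one call is what lets GPU implementations keep the device saturated, since any single 8x8 GEMM is far too small to do so on its own.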
Design, optimization, and benchmarking of dense linear algebra algorithms on AMD GPUs
C Brown, A Abdelfattah, S Tomov… - 2020 IEEE High …, 2020 - ieeexplore.ieee.org
Dense linear algebra (DLA) has historically been in the vanguard of software that must be
adapted first to hardware changes. This is because DLA is both critical to the accuracy and …