Preparing sparse solvers for exascale computing
Sparse solvers provide essential functionality for a wide variety of scientific applications.
Highly parallel sparse solvers are essential for continuing advances in high-fidelity, multi …
The Kokkos ecosystem: Comprehensive performance portability for high performance computing
State-of-the-art engineering and science codes have grown in complexity dramatically over
the last two decades. Application teams have adopted more sophisticated development …
Kokkos kernels: Performance portable sparse/dense linear algebra and graph kernels
S Rajamanickam, S Acer, L Berger-Vergiat… - arXiv
Computational Science and Engineering (CSE) applications depend on performance …
Software for sparse tensor decomposition on emerging computing architectures
In this paper, we develop software for decomposing sparse tensors that is portable to and
performant on a variety of multicore, manycore, and GPU computing architectures. The result …
Fast batched matrix multiplication for small sizes using half-precision arithmetic on GPUs
A Abdelfattah, S Tomov… - 2019 IEEE international …, 2019 - ieeexplore.ieee.org
Matrix multiplication (GEMM) is the most important operation in dense linear algebra.
Because it is a compute-bound operation that is rich in data reuse, many applications from …
Evaluating spatial accelerator architectures with tiled matrix-matrix multiplication
There is a growing interest in custom spatial accelerators for machine learning applications.
These accelerators employ a spatial array of processing elements (PEs) interacting via …
Improving scalability of parallel CNN training by adjusting mini-batch size at run-time
Training Convolutional Neural Network (CNN) is a computationally intensive task, requiring
efficient parallelization to shorten the execution time. Considering the ever-increasing size of …
Addressing irregular patterns of matrix computations on GPUs and their impact on applications powered by sparse direct solvers
Many scientific applications rely on sparse direct solvers for their numerical robustness.
However, performance optimization for these solvers remains a challenging task, especially …
Cache Optimization and Performance Modeling of Batched, Small, and Rectangular Matrix Multiplication on Intel, AMD, and Fujitsu Processors
Factorization and multiplication of dense matrices and tensors are critical, yet extremely
expensive pieces of the scientific toolbox. Careful use of low rank approximation can …
Speeding up particle track reconstruction using a parallel Kalman filter algorithm
S Lantz, K McDermott, M Reid, D Riley… - Journal of …, 2020 - iopscience.iop.org
One of the most computationally challenging problems expected for the High-Luminosity
Large Hadron Collider (HL-LHC) is determining the trajectory of charged particles during …