DASP: Specific dense matrix multiply-accumulate units accelerated general sparse matrix-vector multiplication
Sparse matrix-vector multiplication (SpMV) plays a key role in computational science and
engineering, graph processing, and machine learning applications. Much work on SpMV …
ConvStencil: Transform stencil computation to matrix multiplication on tensor cores
Tensor Core Unit (TCU) is increasingly integrated into modern high-performance processors
to enhance matrix multiplication performance. However, constrained to its over-specification …
Adaptive auto-tuning framework for global exploration of stencil optimization on GPUs
Stencil computations are widely used in high performance computing (HPC) applications.
Many HPC platforms utilize the high computation capability of GPUs to accelerate stencil …
AMGT: Algebraic multigrid solver on tensor cores
Algebraic multigrid (AMG) methods are particularly efficient to solve a wide range of sparse
linear systems, due to their good flexibility and adaptability. Even though modern parallel …
A compression-based memory-efficient optimization for out-of-core GPU stencil computation
A code for out-of-core stencil computation manages data that exceeds the memory capacity
of a GPU. However, such a code necessitates frequent data transfers between the CPU and …
Accelerating range minimum queries with ray tracing cores
Over the past decade, GPU technology has undergone a notable transformation, evolving
from pure general-purpose computation to the integration of application-specific integrated …
Revisiting temporal blocking stencil optimizations
Iterative stencils are used widely across the spectrum of High Performance Computing
(HPC) applications. Many efforts have been put into optimizing stencil GPU kernels, given …
Mixed-precision block incomplete sparse approximate preconditioner on Tensor core
H Zhang, W Ma, W Yuan, J Zhang, Z Lu - CCF Transactions on High …, 2024 - Springer
In this paper, we propose and implement a mixed-precision Block-ISAI preconditioner for
solving linear systems from multiphysics areas. By leveraging FP32 computing, our …
MAD MAcce: Supporting multiply-add operations for democratizing matrix-multiplication accelerators
Modern GPUs commonly employ specialized matrix multiplication units (MXUs) to
accelerate matrix multiplication, the core computation of deep learning workloads. However …
High Performance Unstructured SpMM Computation Using Tensor Cores
High-performance sparse matrix-matrix (SpMM) multiplication is paramount for science and
industry, as the ever-increasing sizes of data prohibit using dense data structures. Yet …