DASP: Specific dense matrix multiply-accumulate units accelerated general sparse matrix-vector multiplication

Y Lu, W Liu - Proceedings of the International Conference for High …, 2023 - dl.acm.org
Sparse matrix-vector multiplication (SpMV) plays a key role in computational science and
engineering, graph processing, and machine learning applications. Much work on SpMV …

ConvStencil: Transform stencil computation to matrix multiplication on tensor cores

Y Chen, K Li, Y Wang, D Bai, L Wang, L Ma… - Proceedings of the 29th …, 2024 - dl.acm.org
Tensor Core Unit (TCU) is increasingly integrated into modern high-performance processors
to enhance matrix multiplication performance. However, constrained to its over-specification …

Adaptive auto-tuning framework for global exploration of stencil optimization on GPUs

Q Sun, Y Liu, H Yang, Z Jiang, Z Luan… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Stencil computations are widely used in high performance computing (HPC) applications.
Many HPC platforms utilize the high computation capability of GPUs to accelerate stencil …

AmgT: Algebraic multigrid solver on tensor cores

Y Lu, L Zeng, T Wang, X Fu, W Li… - … Conference for High …, 2024 - ieeexplore.ieee.org
Algebraic multigrid (AMG) methods are particularly efficient to solve a wide range of sparse
linear systems, due to their good flexibility and adaptability. Even though modern parallel …

A compression-based memory-efficient optimization for out-of-core GPU stencil computation

J Shen, L Long, X Deng, M Okita, F Ino - The Journal of Supercomputing, 2023 - Springer
A code for out-of-core stencil computation manages data that exceeds the memory capacity
of a GPU. However, such a code necessitates frequent data transfers between the CPU and …

Accelerating range minimum queries with ray tracing cores

E Meneses, CA Navarro, H Ferrada… - Future Generation …, 2024 - Elsevier
Over the past decade, GPU technology has undergone a notable transformation, evolving
from pure general-purpose computation to the integration of application-specific integrated …

Revisiting temporal blocking stencil optimizations

L Zhang, M Wahib, P Chen, J Meng, X Wang… - Proceedings of the 37th …, 2023 - dl.acm.org
Iterative stencils are used widely across the spectrum of High Performance Computing
(HPC) applications. Many efforts have been put into optimizing stencil GPU kernels, given …

Mixed-precision block incomplete sparse approximate preconditioner on Tensor core

H Zhang, W Ma, W Yuan, J Zhang, Z Lu - CCF Transactions on High …, 2024 - Springer
In this paper, we propose and implement a mixed-precision Block-ISAI preconditioner for
solving linear systems from multiphysics areas. By leveraging FP32 computing, our …

MAD MAcce: Supporting multiply-add operations for democratizing matrix-multiplication accelerators

S Sung, S Hur, S Kim, D Ha, Y Oh, WW Ro - Proceedings of the 56th …, 2023 - dl.acm.org
Modern GPUs commonly employ specialized matrix multiplication units (MXUs) to
accelerate matrix multiplication, the core computation of deep learning workloads. However …

High Performance Unstructured SpMM Computation Using Tensor Cores

P Okanovic, G Kwasniewski, PS Labini… - … Conference for High …, 2024 - ieeexplore.ieee.org
High-performance sparse matrix-matrix (SpMM) multiplication is paramount for science and
industry, as the ever-increasing sizes of data prohibit using dense data structures. Yet …