Simple hardware-efficient long convolutions for sequence modeling

DY Fu, EL Epstein, E Nguyen… - International …, 2023 - proceedings.mlr.press
State space models (SSMs) have high performance on long sequence modeling but require
sophisticated initialization techniques and specialized implementations for high quality and …

Acceleration of tensor-product operations with tensor cores

C Cui - ACM Transactions on Parallel Computing, 2024 - dl.acm.org
In this article, we explore the acceleration of tensor product operations in finite element
methods, leveraging the computational power of the NVIDIA A100 GPU Tensor Cores. We …

Bind the gap: Compiling real software to hardware FFT accelerators

J Woodruff, J Armengol-Estapé, S Ainsworth… - Proceedings of the 43rd …, 2022 - dl.acm.org
Specialized hardware accelerators continue to be a source of performance improvement.
However, such specialization comes at a programming price. The fundamental issue is that …

Accelerating range minimum queries with ray tracing cores

E Meneses, CA Navarro, H Ferrada… - Future Generation …, 2024 - Elsevier
Over the past decade, GPU technology has undergone a notable transformation, evolving
from pure general-purpose computation to the integration of application-specific integrated …

Reducing shared memory footprint to leverage high throughput on Tensor Cores and its flexible API extension library

H Ootomo, R Yokota - Proceedings of the International Conference on …, 2023 - dl.acm.org
Matrix-matrix multiplication is used for various linear algebra algorithms such as matrix
decomposition and tensor contraction. NVIDIA Tensor Core is a mixed-precision matrix …