The Impact of Space-Filling Curves on Data Movement in Parallel Systems

D Walker, A Skjellum - arXiv preprint arXiv:2307.07828, 2023 - arxiv.org
Modern computer systems are characterized by deep memory hierarchies, composed of
main memory, multiple layers of cache, and other specialized types of memory. In parallel …

Distributed and heterogeneous tensor–vector contraction algorithms for high performance computing

PJ Martinez-Ferrer, AJ Yzelman, V Beltran - Future Generation Computer …, 2025 - Elsevier
The tensor–vector contraction (TVC) is the most memory-bound operation of its class and a
core component of the higher-order power method (HOPM). This paper brings distributed …

A heterogeneous parallel computing approach optimizing SpTTM on CPU-GPU via GCN

H Wang, W Yang, R Ouyang, R Hu, K Li… - ACM Transactions on …, 2023 - dl.acm.org
Sparse Tensor-Times-Matrix (SpTTM) is the core calculation in tensor analysis. The sparse
distributions of different tensors vary greatly, which poses a big challenge to designing …

Using Evolutionary Algorithms to Find Cache-Friendly Generalized Morton Layouts for Arrays

SN Swatman, AL Varbanescu, AD Pimentel… - Proceedings of the 15th …, 2024 - dl.acm.org
The layout of multi-dimensional data can have a significant impact on the efficacy of
hardware caches and, by extension, the performance of applications. Common multi …

Finding Morton-Like Layouts for Multi-Dimensional Arrays Using Evolutionary Algorithms

SN Swatman, AL Varbanescu, AD Pimentel… - arXiv preprint arXiv …, 2023 - arxiv.org
The layout of multi-dimensional data can have a significant impact on the efficacy of
hardware caches and, by extension, the performance of applications. Common multi …

Improved Data Locality Using Morton-order Curve on the Example of LU Decomposition

M Perdacher, C Plant, C Böhm - 2020 IEEE International …, 2020 - ieeexplore.ieee.org
The LU decomposition is an essential element used in many linear algebra applications.
Furthermore, it is used in LINPACK to benchmark the performance of modern multi-core …

Fast and Layout-Oblivious Tensor-Matrix Multiplication with BLAS

CS Başsoy - International Conference on Computational Science, 2024 - Springer
The tensor-matrix multiplication is a basic tensor operation required by various tensor
methods such as the ALS and the HOSVD. This paper presents flexible high-performance …

High performance tensor–vector multiplication on shared-memory systems

F Pawłowski, B Uçar, AJ Yzelman - International Conference on Parallel …, 2019 - Springer
Tensor–vector multiplication is one of the core components in tensor computations. We have
recently investigated high performance, single core implementation of this bandwidth-bound …

A native tensor–vector multiplication algorithm for high performance computing

PJ Martinez-Ferrer, AN Yzelman… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Tensor computations are important mathematical operations for applications that rely on
multidimensional data. The tensor–vector multiplication (TVM) is the most memory-bound …

High performance tensor-vector multiplies on shared memory systems

F Pawłowski, B Uçar, AJ Yzelman - 2019 - inria.hal.science
Tensor–vector multiplication is one of the core components in tensor computations. We have
recently investigated high performance, single core implementation of this bandwidth-bound …