Gram‐Schmidt orthogonalization: 100 years and more

SJ Leon, Å Björck, W Gander - Numerical Linear Algebra with …, 2013 - Wiley Online Library
SUMMARY In 1907, Erhard Schmidt published a paper in which he introduced an
orthogonalization algorithm that has since become known as the classical Gram‐Schmidt …

Towards dense linear algebra for hybrid GPU accelerated manycore systems

S Tomov, J Dongarra, M Baboulin - Parallel Computing, 2010 - Elsevier
We highlight the trends leading to the increased appeal of using hybrid multicore+GPU
systems for high performance computing. We present a set of techniques that can be used to …

Communication-optimal parallel and sequential QR and LU factorizations

J Demmel, L Grigori, M Hoemmen, J Langou - SIAM Journal on Scientific …, 2012 - SIAM
We present parallel and sequential dense QR factorization algorithms that are both optimal
(up to polylogarithmic factors) in the amount of communication they perform and just as …

System identification at the extreme edge for network load reduction in vibration-based monitoring

F Zonzini, V Dertimanis, E Chatzi… - IEEE Internet of Things …, 2022 - ieeexplore.ieee.org
Mechanical complexity, wide dimensions, and big data volume may hamper the
implementation of Internet of Things (IoT)-enabled structural health monitoring (SHM) …

Unified Communication Optimization Strategies for Sparse Triangular Solver on CPU and GPU Clusters

Y Liu, N Ding, P Sao, S Williams, XS Li - Proceedings of the International …, 2023 - dl.acm.org
This paper presents a unified communication optimization framework for sparse triangular
solve (SpTRSV) algorithms on CPU and GPU clusters. The framework builds upon a 3D …

Optimizing Halley's iteration for computing the matrix polar decomposition

Y Nakatsukasa, Z Bai, F Gygi - SIAM Journal on Matrix Analysis and …, 2010 - SIAM
We introduce a dynamically weighted Halley (DWH) iteration for computing the polar
decomposition of a matrix, and we prove that the new method is globally and asymptotically …

CholeskyQR2: a simple and communication-avoiding algorithm for computing a tall-skinny QR factorization on a large-scale parallel system

T Fukaya, Y Nakatsukasa… - 2014 5th workshop …, 2014 - ieeexplore.ieee.org
Designing communication-avoiding algorithms is crucial for high performance computing on
a large-scale parallel system. The TSQR algorithm is a communication-avoiding algorithm …

Accelerating the LOBPCG method on GPUs using a blocked sparse matrix vector product.

H Anzt, S Tomov, JJ Dongarra - SpringSim (HPS), 2015 - researchgate.net
This paper presents a heterogeneous CPU-GPU implementation for a sparse iterative
eigensolver–the Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG). For …

Algorithm 980: Sparse QR factorization on the GPU

SN Yeralan, TA Davis, WM Sid-Lakhdar… - ACM Transactions on …, 2017 - dl.acm.org
Sparse matrix factorization involves a mix of regular and irregular computation, which is a
particular challenge when trying to obtain high-performance on the highly parallel general …

On the performance and energy efficiency of sparse linear algebra on GPUs

H Anzt, S Tomov, J Dongarra - The International Journal of …, 2017 - journals.sagepub.com
In this paper we unveil some performance and energy efficiency frontiers for sparse
computations on GPU-based supercomputers. We compare the resource efficiency of …