Gram‐Schmidt orthogonalization: 100 years and more
SUMMARY In 1907, Erhard Schmidt published a paper in which he introduced an
orthogonalization algorithm that has since become known as the classical Gram‐Schmidt …
Towards dense linear algebra for hybrid GPU accelerated manycore systems
We highlight the trends leading to the increased appeal of using hybrid multicore+GPU
systems for high performance computing. We present a set of techniques that can be used to …
Communication-optimal parallel and sequential QR and LU factorizations
We present parallel and sequential dense QR factorization algorithms that are both optimal
(up to polylogarithmic factors) in the amount of communication they perform and just as …
System identification at the extreme edge for network load reduction in vibration-based monitoring
Mechanical complexity, wide dimensions, and big data volume may hamper the
implementation of Internet of Things (IoT)-enabled structural health monitoring (SHM) …
Unified Communication Optimization Strategies for Sparse Triangular Solver on CPU and GPU Clusters
This paper presents a unified communication optimization framework for sparse triangular
solve (SpTRSV) algorithms on CPU and GPU clusters. The framework builds upon a 3D …
Optimizing Halley's iteration for computing the matrix polar decomposition
We introduce a dynamically weighted Halley (DWH) iteration for computing the polar
decomposition of a matrix, and we prove that the new method is globally and asymptotically …
CholeskyQR2: a simple and communication-avoiding algorithm for computing a tall-skinny QR factorization on a large-scale parallel system
Designing communication-avoiding algorithms is crucial for high performance computing on
a large-scale parallel system. The TSQR algorithm is a communication-avoiding algorithm …
Accelerating the LOBPCG method on GPUs using a blocked sparse matrix vector product.
This paper presents a heterogeneous CPU-GPU implementation for a sparse iterative
eigensolver–the Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG). For …
Algorithm 980: Sparse QR factorization on the GPU
Sparse matrix factorization involves a mix of regular and irregular computation, which is a
particular challenge when trying to obtain high-performance on the highly parallel general …
On the performance and energy efficiency of sparse linear algebra on GPUs
In this paper we unveil some performance and energy efficiency frontiers for sparse
computations on GPU-based supercomputers. We compare the resource efficiency of …