A survey of CPU-GPU heterogeneous computing techniques

S Mittal, JS Vetter - ACM Computing Surveys (CSUR), 2015 - dl.acm.org
As both CPUs and GPUs become employed in a wide range of applications, it has been
acknowledged that both of these Processing Units (PUs) have their unique features and …

Kernel methods through the roof: handling billions of points efficiently

G Meanti, L Carratino, L Rosasco… - Advances in Neural …, 2020 - proceedings.neurips.cc
Kernel methods provide an elegant and principled approach to nonparametric learning, but
so far could hardly be used in large scale problems, since naïve implementations scale …

Dense linear algebra solvers for multicore with GPU accelerators

S Tomov, R Nath, H Ltaief… - 2010 IEEE International …, 2010 - ieeexplore.ieee.org
Solving dense linear systems of equations is a fundamental problem in scientific computing.
Numerical simulations involving complex systems represented in terms of unknown …

PCBDDC: a class of robust dual-primal methods in PETSc

S Zampini - SIAM Journal on Scientific Computing, 2016 - SIAM
A class of preconditioners based on balancing domain decomposition by constraints
methods is introduced in the Portable, Extensible Toolkit for Scientific Computation (PETSc) …

Keeneland: Bringing heterogeneous GPU computing to the computational science community

JS Vetter, R Glassbrook, J Dongarra, K Schwan… - Computing in Science …, 2011 - netlib.org
The Keeneland project—named for a historic thoroughbred horse racing track in Lexington,
Kentucky—is a five-year Track 2D grant awarded by the US National Science Foundation …

Data-aware task scheduling on multi-accelerator based platforms

C Augonnet, J Clet-Ortega, S Thibault… - 2010 IEEE 16th …, 2010 - ieeexplore.ieee.org
To fully tap into the potential of heterogeneous machines composed of multicore processors
and multiple accelerators, simple offloading approaches in which the main trunk of the …

Performance analysis of a hybrid MPI/CUDA implementation of the NASLU benchmark

SJ Pennycook, SD Hammond, SA Jarvis… - ACM SIGMETRICS …, 2011 - dl.acm.org
We present the performance analysis of a port of the LU benchmark from the NAS Parallel
Benchmark (NPB) suite to NVIDIA's Compute Unified Device Architecture (CUDA), and …

Accelerating the reduction to upper Hessenberg, tridiagonal, and bidiagonal forms through hybrid GPU-based computing

S Tomov, R Nath, J Dongarra - Parallel Computing, 2010 - Elsevier
We present a Hessenberg reduction (HR) algorithm for hybrid systems of homogeneous
multicore with GPU accelerators that can exceed 25× the performance of the corresponding …

Multifrontal factorization of sparse SPD matrices on GPUs

T George, V Saxena, A Gupta, A Singh… - … Parallel & Distributed …, 2011 - ieeexplore.ieee.org
Solving large sparse linear systems is often the most computationally intensive component
of many scientific computing applications. In the past, sparse multifrontal direct factorization …

A guide for achieving high performance with very small matrices on GPU: a case study of batched LU and Cholesky factorizations

A Haidar, A Abdelfattah, M Zounon… - … on Parallel and …, 2017 - ieeexplore.ieee.org
We present a high-performance GPU kernel with a substantial speedup over vendor
libraries for very small matrix computations. In addition, we discuss most of the challenges …