A survey of CPU-GPU heterogeneous computing techniques
As both CPUs and GPUs become employed in a wide range of applications, it has been
acknowledged that both of these Processing Units (PUs) have their unique features and …
Kernel methods through the roof: handling billions of points efficiently
Kernel methods provide an elegant and principled approach to nonparametric learning, but
so far could hardly be used in large scale problems, since naïve implementations scale …
Dense linear algebra solvers for multicore with GPU accelerators
S Tomov, R Nath, H Ltaief… - 2010 IEEE International …, 2010 - ieeexplore.ieee.org
Solving dense linear systems of equations is a fundamental problem in scientific computing.
Numerical simulations involving complex systems represented in terms of unknown …
PCBDDC: a class of robust dual-primal methods in PETSc
S Zampini - SIAM Journal on Scientific Computing, 2016 - SIAM
A class of preconditioners based on balancing domain decomposition by constraints
methods is introduced in the Portable, Extensible Toolkit for Scientific Computation (PETSc) …
Keeneland: Bringing heterogeneous GPU computing to the computational science community
The Keeneland project—named for a historic thoroughbred horse racing track in Lexington,
Kentucky—is a five-year Track 2D grant awarded by the US National Science Foundation …
Data-aware task scheduling on multi-accelerator based platforms
To fully tap into the potential of heterogeneous machines composed of multicore processors
and multiple accelerators, simple offloading approaches in which the main trunk of the …
Performance analysis of a hybrid MPI/CUDA implementation of the NAS-LU benchmark
We present the performance analysis of a port of the LU benchmark from the NAS Parallel
Benchmark (NPB) suite to NVIDIA's Compute Unified Device Architecture (CUDA), and …
Accelerating the reduction to upper Hessenberg, tridiagonal, and bidiagonal forms through hybrid GPU-based computing
S Tomov, R Nath, J Dongarra - Parallel Computing, 2010 - Elsevier
We present a Hessenberg reduction (HR) algorithm for hybrid systems of homogeneous
multicore with GPU accelerators that can exceed 25× the performance of the corresponding …
Multifrontal factorization of sparse SPD matrices on GPUs
Solving large sparse linear systems is often the most computationally intensive component
of many scientific computing applications. In the past, sparse multifrontal direct factorization …
A guide for achieving high performance with very small matrices on GPU: a case study of batched LU and Cholesky factorizations
We present a high-performance GPU kernel with a substantial speedup over vendor
libraries for very small matrix computations. In addition, we discuss most of the challenges …