Benchmarking GPUs to tune dense linear algebra

V Volkov, JW Demmel - SC'08: Proceedings of the 2008 ACM …, 2008 - ieeexplore.ieee.org
We present performance results for dense linear algebra using recent NVIDIA GPUs. Our
matrix-matrix multiply routine (GEMM) runs up to 60% faster than the vendor's …

Towards dense linear algebra for hybrid GPU accelerated manycore systems

S Tomov, J Dongarra, M Baboulin - Parallel Computing, 2010 - Elsevier
We highlight the trends leading to the increased appeal of using hybrid multicore + GPU
systems for high performance computing. We present a set of techniques that can be used to …

Xkaapi: A runtime system for data-flow task programming on heterogeneous architectures

T Gautier, JVF Lima, N Maillard… - 2013 IEEE 27th …, 2013 - ieeexplore.ieee.org
Most recent HPC platforms have heterogeneous nodes composed of multi-core CPUs and
accelerators, like GPUs. Programming such nodes is typically based on a combination of …

Achieving a single compute device image in OpenCL for multiple GPUs

J Kim, H Kim, JH Lee, J Lee - ACM Sigplan Notices, 2011 - dl.acm.org
In this paper, we propose an OpenCL framework that combines multiple GPUs and treats
them as a single compute device. Providing a single virtual compute device image to the …

An extension of the StarSs programming model for platforms with multiple GPUs

E Ayguadé, RM Badia, FD Igual, J Labarta… - Euro-Par 2009 Parallel …, 2009 - Springer
While general-purpose homogeneous multi-core architectures are becoming ubiquitous,
there are clear indications that, for a number of important applications, a better …

LU, QR and Cholesky factorizations using vector capabilities of GPUs

V Volkov, J Demmel - 2008 - eecs.berkeley.edu
We present performance results for dense linear algebra using the 8-series NVIDIA GPUs.
Our matrix-matrix multiply routine (GEMM) runs 60% faster than the vendor implementation …

Communication-avoiding QR decomposition for GPUs

M Anderson, G Ballard, J Demmel… - 2011 IEEE International …, 2011 - ieeexplore.ieee.org
We describe an implementation of the Communication-Avoiding QR (CAQR) factorization
that runs entirely on a single graphics processor (GPU). We show that the reduction in …

Overlapping communication and computation by using a hybrid MPI/SMPSs approach

V Marjanović, J Labarta, E Ayguadé… - Proceedings of the 24th …, 2010 - dl.acm.org
Communication overhead is one of the dominant factors affecting performance in high-end
computing systems. To reduce the negative impact of communication, programmers overlap …

Hierarchical DAG scheduling for hybrid distributed systems

W Wu, A Bouteiller, G Bosilca… - 2015 IEEE …, 2015 - ieeexplore.ieee.org
Accelerator-enhanced computing platforms have drawn a lot of attention due to their
massive peak computational capacity. Despite significant advances in the programming …

The libflame library for dense matrix computations

FG Van Zee, E Chan, RA Van de Geijn… - … in science & …, 2009 - ieeexplore.ieee.org
Researchers from the Formal Linear Algebra Method Environment (Flame) project have
developed new methodologies for analyzing, designing, and implementing linear algebra …