Recursive blocked algorithms and hybrid data structures for dense matrix library software
Matrix computations are both fundamental and ubiquitous in computational science and its
vast application areas. Along with the development of more advanced computer systems …
[BOOK][B] Automatic performance tuning of sparse matrix kernels
RW Vuduc - 2003 - search.proquest.com
This dissertation presents an automated system to generate highly efficient, platform-
adapted implementations of sparse matrix kernels. We show that conventional …
Tiling optimizations for 3D scientific computations
G Rivera, CW Tseng - SC'00: Proceedings of the 2000 ACM …, 2000 - ieeexplore.ieee.org
Compiler transformations can significantly improve data locality for many scientific programs.
In this paper, we show iterative solvers for partial differential equations (PDEs) in three …
Program locality analysis using reuse distance
On modern computer systems, the memory performance of an application depends on its
locality. For a single execution, locality-correlated measures like average miss rate or …
Single Assignment C: efficient support for high-level array operations in a functional setting
SB Scholz - Journal of functional programming, 2003 - cambridge.org
This paper presents a novel approach for integrating arrays with access time O(1) into
functional languages. It introduces n-dimensional arrays combined with a type system that …
Heap data allocation to scratch-pad memory in embedded systems
A Dominguez, S Udayakumaran… - Journal of Embedded …, 2005 - content.iospress.com
This paper presents the first-ever compile-time method for allocating a portion of the heap
data to scratch-pad memory. A scratch-pad is a fast directly addressed compiler-managed …
Tiling, block data layout, and memory hierarchy performance
Recently, several experimental studies have been conducted on block data layout in
conjunction with tiling as a data transformation technique to improve cache performance. In …
Statistical models for empirical search-based performance tuning
Achieving peak performance from the computational kernels that dominate application
performance often requires extensive machine-dependent tuning by hand. Automatic tuning …
Improving effective bandwidth through compiler enhancement of global cache reuse
C Ding, K Kennedy - Journal of Parallel and Distributed Computing, 2004 - Elsevier
The performance of modern machines is increasingly limited by insufficient memory
bandwidth. One way to alleviate this bandwidth limitation for a given program is to minimize …
Synthesizing transformations for locality enhancement of imperfectly-nested loop nests
N Ahmed, N Mateev, K Pingali - … of the 14th international conference on …, 2000 - dl.acm.org
We present an approach for synthesizing transformations to enhance locality in imperfectly-
nested loops. The key idea is to embed the iteration space of every statement in a loop nest …