Recursive blocked algorithms and hybrid data structures for dense matrix library software

E Elmroth, F Gustavson, I Jonsson, B Kågström - SIAM review, 2004 - SIAM
Matrix computations are both fundamental and ubiquitous in computational science and its
vast application areas. Along with the development of more advanced computer systems …

[BOOK][B] Automatic performance tuning of sparse matrix kernels

RW Vuduc - 2003 - search.proquest.com
This dissertation presents an automated system to generate highly efficient, platform-
adapted implementations of sparse matrix kernels. We show that conventional …

Tiling optimizations for 3D scientific computations

G Rivera, CW Tseng - SC'00: Proceedings of the 2000 ACM …, 2000 - ieeexplore.ieee.org
Compiler transformations can significantly improve data locality for many scientific programs.
In this paper, we show iterative solvers for partial differential equations (PDEs) in three …

Program locality analysis using reuse distance

Y Zhong, X Shen, C Ding - ACM Transactions on Programming …, 2009 - dl.acm.org
On modern computer systems, the memory performance of an application depends on its
locality. For a single execution, locality-correlated measures like average miss rate or …

Single Assignment C: efficient support for high-level array operations in a functional setting

SB Scholz - Journal of functional programming, 2003 - cambridge.org
This paper presents a novel approach for integrating arrays with access time O(1) into
functional languages. It introduces n-dimensional arrays combined with a type system that …

Heap data allocation to scratch-pad memory in embedded systems

A Dominguez, S Udayakumaran… - Journal of Embedded …, 2005 - content.iospress.com
This paper presents the first-ever compile-time method for allocating a portion of the heap
data to scratch-pad memory. A scratch-pad is a fast directly addressed compiler-managed …

Tiling, block data layout, and memory hierarchy performance

N Park, B Hong, VK Prasanna - IEEE Transactions on Parallel …, 2003 - ieeexplore.ieee.org
Recently, several experimental studies have been conducted on block data layout in
conjunction with tiling as a data transformation technique to improve cache performance. In …

Statistical models for empirical search-based performance tuning

R Vuduc, JW Demmel… - The International Journal …, 2004 - journals.sagepub.com
Achieving peak performance from the computational kernels that dominate application
performance often requires extensive machine-dependent tuning by hand. Automatic tuning …

Improving effective bandwidth through compiler enhancement of global cache reuse

C Ding, K Kennedy - Journal of Parallel and Distributed Computing, 2004 - Elsevier
The performance of modern machines is increasingly limited by insufficient memory
bandwidth. One way to alleviate this bandwidth limitation for a given program is to minimize …

Synthesizing transformations for locality enhancement of imperfectly-nested loop nests

N Ahmed, N Mateev, K Pingali - … of the 14th international conference on …, 2000 - dl.acm.org
We present an approach for synthesizing transformations to enhance locality in imperfectly-
nested loops. The key idea is to embed the iteration space of every statement in a loop nest …