3.5-D blocking optimization for stencil computations on modern CPUs and GPUs

A Nguyen, N Satish, J Chhugani… - SC'10: Proceedings …, 2010 - ieeexplore.ieee.org
Stencil computation sweeps over a spatial grid over multiple time steps to perform nearest-
neighbor computations. The bandwidth-to-compute requirement for a large class of stencil …

Tiling stencil computations to maximize parallelism

V Bandishti, I Pananilath… - SC'12: Proceedings of …, 2012 - ieeexplore.ieee.org
Most stencil computations allow tile-wise concurrent start, i.e., there always exists a face of
the iteration space and a set of tiling hyperplanes such that all tiles along that face can be …

High performance stencil code generation with lift

B Hagedorn, L Stoltzfus, M Steuwer… - Proceedings of the …, 2018 - dl.acm.org
Stencil computations are widely used from physical simulations to machine-learning. They
are embarrassingly parallel and perfectly fit modern hardware such as Graphic Processing …

Performance modeling and automatic ghost zone optimization for iterative stencil loops on GPUs

J Meng, K Skadron - Proceedings of the 23rd international conference …, 2009 - dl.acm.org
Iterative stencil loops (ISLs) are used in many applications and tiling is a well-known
technique to localize their computation. When ISLs are tiled across a parallel architecture …

Automatic data movement and computation mapping for multi-level parallel architectures with explicitly managed memories

MM Baskaran, U Bondhugula… - Proceedings of the 13th …, 2008 - dl.acm.org
Several parallel architectures such as GPUs and the Cell processor have fast explicitly
managed on-chip memories, in addition to slow off-chip memory. They also have very high …

Cache accurate time skewing in iterative stencil computations

R Strzodka, M Shaheen, D Pajak… - … Conference on Parallel …, 2011 - ieeexplore.ieee.org
We present a time skewing algorithm that breaks the memory wall for certain iterative stencil
computations. A stencil computation, even with constant weights, is a completely memory …

Multi-level tiling: M for the price of one

DG Kim, L Renganarayanan, D Rostron… - Proceedings of the …, 2007 - dl.acm.org
Tiling is a widely used loop transformation for exposing/exploiting parallelism and data
locality. High-performance implementations use multiple levels of tiling to exploit the …

Optimization principles for collective neighborhood communications

T Hoefler, T Schneider - SC'12: Proceedings of the …, 2012 - ieeexplore.ieee.org
Many scientific applications operate in a bulk-synchronous mode of iterative communication
and computation steps. Even though the communication steps happen at the same logical …

On how to accelerate iterative stencil loops: a scalable streaming-based approach

R Cattaneo, G Natale, C Sicignano, D Sciuto… - ACM Transactions on …, 2015 - dl.acm.org
In high-performance systems, stencil computations play a crucial role as they appear in a
variety of different fields of application, ranging from partial differential equation solving, to …

A performance study for iterative stencil loops on GPUs with ghost zone optimizations

J Meng, K Skadron - International Journal of Parallel Programming, 2011 - Springer
Iterative stencil loops (ISLs) are used in many applications and tiling is a well-known
technique to localize their computation. When ISLs are tiled across a parallel architecture …