UPC++: a PGAS extension for C++

Y Zheng, A Kamil, MB Driscoll, H Shan… - 2014 IEEE 28th …, 2014 - ieeexplore.ieee.org
Partitioned Global Address Space (PGAS) languages are convenient for expressing
algorithms with large, random-access data, and they have proven to provide high …

Sequoia: Programming the memory hierarchy

K Fatahalian, DR Horn, TJ Knight, L Leem… - Proceedings of the …, 2006 - dl.acm.org
We present Sequoia, a programming language designed to facilitate the development of
memory hierarchy aware parallel programs that remain portable across modern machines …

SPIRAL: Extreme performance portability

F Franchetti, TM Low, DT Popovici… - Proceedings of the …, 2018 - ieeexplore.ieee.org
In this paper, we address the question of how to automatically map computational kernels to
highly efficient code for a wide range of computing platforms and establish the correctness of …

Trends in data locality abstractions for HPC systems

D Unat, A Dubey, T Hoefler, J Shalf… - … on Parallel and …, 2017 - ieeexplore.ieee.org
The cost of data movement has always been an important concern in high performance
computing (HPC) systems. It has now become the dominant factor in terms of both energy …

Exascale computing trends: Adjusting to the "new normal" for computer architecture

P Kogge, J Shalf - Computing in Science & Engineering, 2013 - ieeexplore.ieee.org
We now have 20 years of data under our belt about the performance of supercomputers
against at least a single floating-point benchmark from dense linear algebra. Until about …

UPC++: A high-performance communication framework for asynchronous computation

J Bachan, SB Baden, S Hofmeyr… - 2019 IEEE …, 2019 - ieeexplore.ieee.org
UPC++ is a C++ library that supports high-performance computation via an asynchronous
communication framework. This paper describes a new incarnation that differs substantially …

The locality descriptor: A holistic cross-layer abstraction to express data locality in GPUs

N Vijaykumar, E Ebrahimi, K Hsieh… - 2018 ACM/IEEE 45th …, 2018 - ieeexplore.ieee.org
Exploiting data locality in GPUs is critical to making more efficient use of the existing caches
and the NUMA-based memory hierarchy expected in future GPUs. While modern GPU …

Runnemede: An architecture for ubiquitous high-performance computing

NP Carter, A Agrawal, S Borkar… - 2013 IEEE 19th …, 2013 - ieeexplore.ieee.org
DARPA's Ubiquitous High-Performance Computing (UHPC) program asked researchers to
develop computing systems capable of achieving energy efficiencies of 50 GOPS/Watt …

Partitioning streaming parallelism for multi-cores: a machine learning based approach

Z Wang, MFP O'Boyle - Proceedings of the 19th international conference …, 2010 - dl.acm.org
Stream based languages are a popular approach to expressing parallelism in modern
applications. The efficient mapping of streaming parallelism to multi-core processors is …

Hierarchical place trees: A portable abstraction for task parallelism and data movement

Y Yan, J Zhao, Y Guo, V Sarkar - … , LCPC 2009, Newark, DE, USA, October …, 2010 - Springer
Modern computer systems feature multiple homogeneous or heterogeneous computing
units with deep memory hierarchies, and expect a high degree of thread-level parallelism …