An in-depth analysis of the Slingshot interconnect

D De Sensi, S Di Girolamo… - … Conference for High …, 2020 - ieeexplore.ieee.org
The interconnect is one of the most critical components in large scale computing systems,
and its impact on the performance of applications is going to increase with the system size …

A large-scale study of MPI usage in open-source HPC applications

I Laguna, R Marshall, K Mohror, M Ruefenacht… - Proceedings of the …, 2019 - dl.acm.org
Understanding the state-of-the-practice in MPI usage is paramount for many aspects of
supercomputing, including optimizing the communication of HPC applications and informing …

Flare: Flexible in-network allreduce

D De Sensi, S Di Girolamo, S Ashkboos, S Li… - Proceedings of the …, 2021 - dl.acm.org
The allreduce operation is one of the most commonly used communication routines in
distributed applications. To improve its bandwidth and to reduce network traffic, this …

PID-Comm: A Fast and Flexible Collective Communication Framework for Commodity Processing-in-DIMM Devices

SU Noh, J Hong, C Lim, S Park, J Kim… - 2024 ACM/IEEE 51st …, 2024 - ieeexplore.ieee.org
Recent dual in-line memory modules (DIMMs) are starting to support processing-in-memory
(PIM) by associating their memory banks with processing elements (PEs), allowing …

Near-optimal wafer-scale reduce

P Luczynski, L Gianinazzi, P Iff, L Wilson… - Proceedings of the 33rd …, 2024 - dl.acm.org
Efficient Reduce and AllReduce communication collectives are a critical cornerstone of
high-performance computing (HPC) applications. We present the first systematic investigation of …

gZCCL: Compression-accelerated collective communication framework for GPU clusters

J Huang, S Di, X Yu, Y Zhai, J Liu, Y Huang… - Proceedings of the 38th …, 2024 - dl.acm.org
GPU-aware collective communication has become a major bottleneck for modern computing
platforms as GPU computing power rapidly rises. A traditional approach is to directly …

Understanding the use of message passing interface in exascale proxy applications

N Sultana, M Rüfenacht, A Skjellum… - Concurrency and …, 2021 - Wiley Online Library
The Exascale Computing Project (ECP) focuses on the development of future
exascale‐capable applications. Most ECP applications use the message passing interface …

RAMP: a flat nanosecond optical network and MPI operations for distributed deep learning systems

A Ottino, J Benjamin, G Zervas - Optical Switching and Networking, 2024 - Elsevier
Distributed deep learning (DDL) systems strongly depend on network performance. Current
electronic packet switched (EPS) network architectures and technologies suffer from …

Characterization and identification of HPC applications at leadership computing facility

Z Liu, R Lewis, R Kettimuthu, K Harms… - Proceedings of the 34th …, 2020 - dl.acm.org
High Performance Computing (HPC) is an important method for scientific discovery via
large-scale simulation, data analysis, or artificial intelligence. Leadership-class supercomputers …

Swing: Short-cutting rings for higher bandwidth allreduce

D De Sensi, T Bonato, D Saam, T Hoefler - 21st USENIX Symposium on …, 2024 - usenix.org
The allreduce collective operation accounts for a significant fraction of the runtime of
workloads running on distributed systems. One factor determining its performance is the …