An in-depth analysis of the Slingshot interconnect

D De Sensi, S Di Girolamo… - … Conference for High …, 2020 - ieeexplore.ieee.org
The interconnect is one of the most critical components in large scale computing systems,
and its impact on the performance of applications is going to increase with the system size …

Frontier: exploring exascale

S Atchley, C Zimmer, J Lange, D Bernholdt… - Proceedings of the …, 2023 - dl.acm.org
As the US Department of Energy (DOE) computing facilities began deploying petascale
systems in 2008, DOE was already setting its sights on exascale. In that year, DARPA …

WfBench: Automated generation of scientific workflow benchmarks

T Coleman, H Casanova, K Maheshwari… - 2022 IEEE/ACM …, 2022 - ieeexplore.ieee.org
The prevalence of scientific workflows with high computational demands calls for their
execution on various distributed computing platforms, including large-scale leadership-class …

Exploring GPU-to-GPU communication: Insights into supercomputer interconnects

D De Sensi, L Pichetti, F Vella… - … Conference for High …, 2024 - ieeexplore.ieee.org
Multi-GPU nodes are increasingly common in the rapidly evolving landscape of exascale
supercomputers. On these systems, GPUs on the same node are connected through …

Disruptive changes in field equation modeling: A simple interface for wafer scale engines

M Woo, T Jordan, R Schreiber, I Sharapov… - arXiv preprint arXiv …, 2022 - arxiv.org
We present a high-level and accessible Application Programming Interface (API) for the
solution of field equations on the Cerebras Systems Wafer-Scale Engine (WSE) with over …

Runtime Performance Anomaly Diagnosis in Production HPC Systems Using Active Learning

B Aksar, E Sencan, B Schwaller, O Aaziz… - … on Parallel and …, 2024 - ieeexplore.ieee.org
With the increasing scale and complexity of High-Performance Computing (HPC) systems,
performance variations in applications caused by anomalies have become significant …

Understanding hot interconnects with an extensive benchmark survey

Y Li, H Qi, G Lu, F Jin, Y Guo, X Lu - BenchCouncil Transactions on …, 2022 - Elsevier
Understanding the designs and performance characterizations of hot interconnects on
modern data center and high-performance computing (HPC) clusters is a fruitful research …

Quantifying the impact of network congestion on application performance and network metrics

Y Zhang, T Groves, B Cook, NJ Wright… - … on Cluster Computing …, 2020 - ieeexplore.ieee.org
In modern high-performance computing (HPC) systems, network congestion is an important
factor that contributes to performance degradation. However, how network congestion …

Live forensics for HPC systems: A case study on distributed storage systems

S Jha, S Cui, SS Banerjee, T Xu, J Enos… - … Conference for High …, 2020 - ieeexplore.ieee.org
Large-scale high-performance computing systems frequently experience a wide range of
failure modes, such as reliability failures (e.g., hang or crash), and resource overload-related …

An optimisation of allreduce communication in message-passing systems

A Jocksch, N Ohana, E Lanti, E Koutsaniti… - Parallel Computing, 2021 - Elsevier
Collective communication, namely the pattern allreduce in message-passing systems, is
optimised based on measurements at the installation time of the library. The algorithms used …