Flare: Flexible in-network allreduce

D De Sensi, S Di Girolamo, S Ashkboos, S Li… - Proceedings of the …, 2021 - dl.acm.org
The allreduce operation is one of the most commonly used communication routines in
distributed applications. To improve its bandwidth and to reduce network traffic, this …

Noise in the clouds: Influence of network performance variability on application scalability

D De Sensi, T De Matteis, K Taranov… - Proceedings of the …, 2022 - dl.acm.org
Cloud computing represents an appealing opportunity for cost-effective deployment of HPC
workloads on the best-fitting hardware. However, although cloud and on-premise HPC …

Study of workload interference with intelligent routing on dragonfly

Y Kang, X Wang, Z Lan - SC22: International Conference for …, 2022 - ieeexplore.ieee.org
Dragonfly interconnect is a crucial network technology for supercomputers. To support
exascale systems, network resources are shared such that links and routers are not …

mpi4py.futures: MPI-based asynchronous task execution for Python

M Rogowski, S Aseeri, D Keyes… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
We present mpi4py.futures, a lightweight, asynchronous task execution framework targeting
the Python programming language and using the Message Passing Interface (MPI) for …

Prodigy: Towards unsupervised anomaly detection in production HPC systems

B Aksar, E Sencan, B Schwaller, O Aaziz… - Proceedings of the …, 2023 - dl.acm.org
Performance variations caused by anomalies in modern High Performance Computing
(HPC) systems lead to decreased efficiency, impaired application performance, and …

GPCNeT: Designing a benchmark suite for inducing and measuring contention in HPC networks

S Chunduri, T Groves, P Mendygral, B Austin… - Proceedings of the …, 2019 - dl.acm.org
Network congestion is one of the biggest problems facing HPC systems today, affecting
system throughput, performance, user experience, and reproducibility. Congestion manifests …

Q-adaptive: A multi-agent reinforcement learning based routing on dragonfly network

Y Kang, X Wang, Z Lan - … of the 30th International Symposium on High …, 2021 - dl.acm.org
High-radix interconnects such as Dragonfly and its variants rely on adaptive routing to
balance network traffic for optimum performance. Ideally, adaptive routing attempts to …

Workload interference prevention with intelligent routing and flexible job placement on dragonfly

Y Kang, X Wang, Z Lan - Proceedings of the 2023 ACM SIGSIM …, 2023 - dl.acm.org
Dragonfly is an indispensable interconnect topology for exascale HPC systems. To link tens
of thousands of compute nodes at a reasonable cost, Dragonfly shares network resources …

Unlocking massively parallel spectral proper orthogonal decompositions in the PySPOD package

M Rogowski, BCY Yeung, OT Schmidt, R Maulik… - Computer Physics …, 2024 - Elsevier
We propose a parallel (distributed) version of the spectral proper orthogonal decomposition
(SPOD) technique. The parallel SPOD algorithm distributes the spatial dimension of the …

Mitigating network noise on dragonfly networks through application-aware routing

D De Sensi, S Di Girolamo, T Hoefler - Proceedings of the International …, 2019 - dl.acm.org
System noise can negatively impact the performance of HPC systems, and the
interconnection network is one of the main factors contributing to this problem. To mitigate …