Flare: Flexible in-network allreduce

D De Sensi, S Di Girolamo, S Ashkboos, S Li… - Proceedings of the …, 2021 - dl.acm.org
The allreduce operation is one of the most commonly used communication routines in
distributed applications. To improve its bandwidth and to reduce network traffic, this …

Mitigating network noise on Dragonfly networks through application-aware routing

D De Sensi, S Di Girolamo, T Hoefler - Proceedings of the International …, 2019 - dl.acm.org
System noise can negatively impact the performance of HPC systems, and the
interconnection network is one of the main factors contributing to this problem. To mitigate …

Reimagining codesign for advanced scientific computing: Report for the ASCR Workshop on Reimagining Codesign

J Ang, AA Chien, SD Hammond, A Hoisie, I Karlin… - 2022 - osti.gov
In March 2021, the US Department of Energy's Advanced Scientific Computing Research
program convened the Workshop on Reimagining Codesign. The workshop, also known as …

The effect of system utilization on application performance variability

B Li, S Chunduri, K Harms, Y Fan, Z Lan - Proceedings of the 9th …, 2019 - dl.acm.org
Application performance variability caused by network contention is a major issue on
dragonfly based systems. This work-in-progress study makes two contributions. First, we …

Workload interference prevention with intelligent routing and flexible job placement on dragonfly

Y Kang, X Wang, Z Lan - Proceedings of the 2023 ACM SIGSIM …, 2023 - dl.acm.org
Dragonfly is an indispensable interconnect topology for exascale HPC systems. To link tens
of thousands of compute nodes at a reasonable cost, Dragonfly shares network resources …

GRAP: group-level resource allocation policy for reconfigurable Dragonfly network in HPC

G Feng, D Dong, S Zhao, Y Lu - … of the 37th International conference on …, 2023 - dl.acm.org
Dragonfly is a highly scalable, low-diameter, and cost-efficient network topology, which has
been adopted in new exascale High Performance Computing (HPC) systems. However …

Machine Learning for Interconnect Network Traffic Forecasting: Investigation and Exploitation

X Xu, X Wang, E Cruz-Camacho… - Proceedings of the …, 2023 - dl.acm.org
Interconnect networks play a key role in high-performance computing (HPC) systems.
Parallel discrete event simulation (PDES) has been a long-standing pillar for studying large …

Optimized MPI collective algorithms for Dragonfly topology

G Feng, D Dong, Y Lu - Proceedings of the 36th ACM International …, 2022 - dl.acm.org
The Message Passing Interface (MPI) is the most prominent and dominant programming
model for scientific computing in super-computing systems today. Although many general …

Union: An automatic workload manager for accelerating network simulation

X Wang, M Mubarak, Y Kang… - 2020 IEEE International …, 2020 - ieeexplore.ieee.org
With the rapid growth of the machine learning applications, the workloads of future HPC
systems are anticipated to be a mix of scientific simulation, big data analytics, and machine …

Modeling and analysis of application interference on Dragonfly+

Y Kang, X Wang, N McGlohon, M Mubarak… - Proceedings of the …, 2019 - dl.acm.org
Dragonfly class of networks are considered as promising interconnects for next-generation
supercomputers. While Dragonfly+ networks offer more path diversity than the original …