Flare: Flexible in-network allreduce
The allreduce operation is one of the most commonly used communication routines in
distributed applications. To improve its bandwidth and to reduce network traffic, this …
Mitigating network noise on dragonfly networks through application-aware routing
System noise can negatively impact the performance of HPC systems, and the
interconnection network is one of the main factors contributing to this problem. To mitigate …
Reimagining codesign for advanced scientific computing: Report for the ASCR workshop on reimagining codesign
In March 2021, the US Department of Energy's Advanced Scientific Computing Research
program convened the Workshop on Reimagining Codesign. The workshop, also known as …
The effect of system utilization on application performance variability
Application performance variability caused by network contention is a major issue on
dragonfly based systems. This work-in-progress study makes two contributions. First, we …
Workload interference prevention with intelligent routing and flexible job placement on dragonfly
Dragonfly is an indispensable interconnect topology for exascale HPC systems. To link tens
of thousands of compute nodes at a reasonable cost, Dragonfly shares network resources …
GRAP: group-level resource allocation policy for reconfigurable Dragonfly network in HPC
Dragonfly is a highly scalable, low-diameter, and cost-efficient network topology, which has
been adopted in new exascale High Performance Computing (HPC) systems. However …
Machine Learning for Interconnect Network Traffic Forecasting: Investigation and Exploitation
X Xu, X Wang, E Cruz-Camacho… - Proceedings of the …, 2023 - dl.acm.org
Interconnect networks play a key role in high-performance computing (HPC) systems.
Parallel discrete event simulation (PDES) has been a long-standing pillar for studying large …
Optimized MPI collective algorithms for dragonfly topology
The Message Passing Interface (MPI) is the most prominent and dominant programming
model for scientific computing in super-computing systems today. Although many general …
Union: An automatic workload manager for accelerating network simulation
With the rapid growth of the machine learning applications, the workloads of future HPC
systems are anticipated to be a mix of scientific simulation, big data analytics, and machine …
Modeling and analysis of application interference on dragonfly+
Dragonfly class of networks are considered as promising interconnects for next-generation
supercomputers. While Dragonfly+ networks offer more path diversity than the original …