An in-depth analysis of the Slingshot interconnect
The interconnect is one of the most critical components in large scale computing systems,
and its impact on the performance of applications is going to increase with the system size …
Flare: Flexible in-network allreduce
The allreduce operation is one of the most commonly used communication routines in
distributed applications. To improve its bandwidth and to reduce network traffic, this …
Noise in the clouds: Influence of network performance variability on application scalability
Cloud computing represents an appealing opportunity for cost-effective deployment of HPC
workloads on the best-fitting hardware. However, although cloud and on-premise HPC …
Study of workload interference with intelligent routing on Dragonfly
Dragonfly interconnect is a crucial network technology for supercomputers. To support
exascale systems, network resources are shared such that links and routers are not …
The case of performance variability on Dragonfly-based systems
Performance of a parallel code running on a large supercomputer can vary significantly from
one run to another even when the executable and its input parameters are left unchanged …
Mitigating network noise on Dragonfly networks through application-aware routing
System noise can negatively impact the performance of HPC systems, and the
interconnection network is one of the main factors contributing to this problem. To mitigate …
Exploring GPU-to-GPU communication: Insights into supercomputer interconnects
Multi-GPU nodes are increasingly common in the rapidly evolving landscape of exascale
supercomputers. On these systems, GPUs on the same node are connected through …
HyperX topology: First at-scale implementation and comparison to the fat-tree
The de-facto standard topology for modern HPC systems and data-centers are Folded Clos
networks, commonly known as Fat-Trees. The number of network endpoints in these …
Workload interference prevention with intelligent routing and flexible job placement on Dragonfly
Dragonfly is an indispensable interconnect topology for exascale HPC systems. To link tens
of thousands of compute nodes at a reasonable cost, Dragonfly shares network resources …
A deep reinforcement learning-based optimization approach for containerized microservice scheduling in Hybrid Fog/Cloud environments
The deployment of microservices in Hybrid Fog/Cloud (HFC) environments for Internet of
Things (IoT) applications presents a significant challenge in efficiently scheduling …