NV-group: Link-efficient reduction for distributed deep learning on modern dense GPU systems

CH Chu, P Kousha, AA Awan, KS Khorassani… - Proceedings of the 34th …, 2020 - dl.acm.org
Advanced fabrics like NVIDIA NVLink are enabling the deployment of dense Graphics
Processing Unit (GPU) systems such as DGX-2 and Summit. With the wide adoption of large …

WRHT: Efficient all-reduce for distributed DNN training in optical interconnect systems

F Dai, Y Chen, Z Huang, H Zhang - Proceedings of the 52nd …, 2023 - dl.acm.org
Communication efficiency is crucial for accelerating distributed deep neural network (DNN)
training. All-reduce, a vital communication primitive, is responsible for reducing model …

Accelerating distributed deep neural network training with pipelined MPI allreduce

A Castelló, ES Quintana-Ortí, J Duato - Cluster Computing, 2021 - Springer
TensorFlow (TF) is usually combined with the Horovod (HVD) workload distribution package
to obtain a parallel tool for training deep neural networks on clusters of computers. HVD in turn …
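
A minimal sketch of the Horovod-on-TensorFlow gradient allreduce pattern this entry refers to (not code from the paper; the tensor shape and variable names are illustrative assumptions):

```python
# Sketch only: averaging a worker's local gradient across all ranks with Horovod.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # one process per GPU; rank/size come from the MPI or Gloo launcher

# Stand-in for a gradient computed on this worker's mini-batch (illustrative shape).
local_grad = tf.random.normal([1024])

# hvd.allreduce hands the tensor to the underlying NCCL/MPI allreduce and,
# with op=hvd.Average, divides the summed result by hvd.size().
avg_grad = hvd.allreduce(local_grad, op=hvd.Average)
```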

Efficient MPI‐AllReduce for large‐scale deep learning on GPU‐clusters

T Thao Nguyen, M Wahib… - … and Computation: Practice …, 2021 - Wiley Online Library
Training models on large-scale GPU-accelerated clusters is becoming commonplace
due to the increase in complexity and size of deep learning models. One of the main …

An allreduce algorithm and network co-design for large-scale training of distributed deep learning

TT Nguyen, M Wahib - 2021 IEEE/ACM 21st International …, 2021 - ieeexplore.ieee.org
Distributed training of Deep Neural Networks (DNNs) on High-Performance Computing
(HPC) systems is becoming increasingly common. HPC systems dedicated entirely or …

2D-HRA: Two-dimensional hierarchical ring-based all-reduce algorithm in large-scale distributed machine learning

Y Jiang, H Gu, Y Lu, X Yu - IEEE Access, 2020 - ieeexplore.ieee.org
Gradient synchronization, a process of communication among machines in large-scale
distributed machine learning (DML), plays a crucial role in improving DML performance …
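
For reference, a minimal single-ring allreduce sketch with mpi4py and NumPy (an illustrative assumption, not code from any paper listed here): a reduce-scatter phase followed by an allgather phase over a logical ring, the building block that hierarchical schemes such as 2D-HRA extend:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMm_WORLD if False else MPI.COMM_WORLD  # standard world communicator
rank, size = comm.Get_rank(), comm.Get_size()

n = 8 * size                      # gradient length, assumed divisible by size
grad = np.full(n, float(rank))    # each worker's "local gradient"
chunks = np.split(grad, size)     # views into grad, one chunk per rank

right, left = (rank + 1) % size, (rank - 1) % size
recv = np.empty_like(chunks[0])

# Phase 1: reduce-scatter. After size-1 steps, rank r fully owns chunk (r+1) % size.
for step in range(size - 1):
    send_idx = (rank - step) % size
    recv_idx = (rank - step - 1) % size
    comm.Sendrecv(chunks[send_idx], dest=right, recvbuf=recv, source=left)
    chunks[recv_idx] += recv      # accumulate the partial sum in place

# Phase 2: allgather. Circulate the fully reduced chunks around the ring.
for step in range(size - 1):
    send_idx = (rank + 1 - step) % size
    recv_idx = (rank - step) % size
    comm.Sendrecv(chunks[send_idx], dest=right, recvbuf=recv, source=left)
    chunks[recv_idx][:] = recv    # overwrite with the final value

# grad now holds the elementwise sum over all ranks: sum(range(size)) in every slot.
```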

Hybrid electrical/optical switch architectures for training distributed deep learning in large-scale

TN Truong, R Takano - IEICE Transactions on Information and …, 2021 - search.ieice.org
Data parallelism is the dominant method used to train deep learning (DL) models on High-
Performance Computing systems such as large-scale GPU clusters. When training a DL …

On the feasibility of hybrid electrical/optical switch architecture for large-scale training of distributed deep learning

TT Nguyen, R Takano - 2019 IEEE/ACM Workshop on …, 2019 - ieeexplore.ieee.org
Data parallelism is the dominant method used to train deep learning (DL) models on High-
Performance Computing systems such as large-scale GPU clusters. When training a DL …

Topology-aware sparse allreduce for large-scale deep learning

TT Nguyen, M Wahib, R Takano - 2019 IEEE 38th International …, 2019 - ieeexplore.ieee.org
Data parallelism is the dominant method used to scale up deep learning (DL) training
across multiple compute nodes. Collective communication of the local gradients between …

COFFEE: Cross-Layer Optimization for Fast and Efficient Executions of Sinkhorn-Knopp Algorithm on HPC Systems

C Sun, H Luo, H Jiang, J Zhang… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
In this paper, we present COFFEE, cross-layer optimization for fast and efficient executions
of the Sinkhorn-Knopp (SK) algorithm on HPC systems with clusters of compute nodes by …
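
A minimal serial NumPy sketch of the Sinkhorn-Knopp iteration that COFFEE accelerates at scale (the matrix size, tolerance, and function name are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def sinkhorn_knopp(A, max_iter=1000, tol=1e-9):
    """Alternately rescale rows and columns of a nonnegative square matrix A
    so that diag(r) @ A @ diag(c) approaches a doubly stochastic matrix."""
    r = np.ones(A.shape[0])
    c = np.ones(A.shape[1])
    for _ in range(max_iter):
        c = 1.0 / (A.T @ r)                          # make column sums 1 for current r
        r = 1.0 / (A @ c)                            # make row sums 1 for new c
        if np.max(np.abs(c * (A.T @ r) - 1.0)) < tol:  # column sums also close to 1?
            break
    return np.diag(r) @ A @ np.diag(c)

P = sinkhorn_knopp(np.random.rand(4, 4))
# Rows and columns of P now each sum to approximately 1.
```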