NV-Group: Link-efficient reduction for distributed deep learning on modern dense GPU systems
Advanced fabrics like NVIDIA NVLink are enabling the deployment of dense Graphics
Processing Unit (GPU) systems such as DGX-2 and Summit. With the wide adoption of large …
WRHT: Efficient all-reduce for distributed DNN training in optical interconnect systems
Communication efficiency is crucial for accelerating distributed deep neural network (DNN)
training. All-reduce, a vital communication primitive, is responsible for reducing model …
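The all-reduce collective referred to here sums each worker's local gradient buffer and delivers the result back to every worker. A minimal sketch of those semantics using mpi4py, not the WRHT optical scheme; the buffer names and toy sizes are illustrative:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

# Each worker contributes a same-shaped gradient buffer (toy values here:
# every element equals the worker's rank).
local_grad = np.full(4, comm.Get_rank(), dtype=np.float64)
summed = np.empty_like(local_grad)

# After the call, every rank holds the identical element-wise sum.
comm.Allreduce(local_grad, summed, op=MPI.SUM)
print(f"rank {comm.Get_rank()}: {summed}")
```

Run, for example, with `mpiexec -n 4 python allreduce_sketch.py`; every rank prints the same summed vector.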
Accelerating distributed deep neural network training with pipelined MPI allreduce
TensorFlow (TF) is usually combined with the Horovod (HVD) workload distribution package
to obtain a parallel tool for training deep neural networks on clusters of computers. HVD in turn …
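The pipelining idea can be illustrated by segmenting one large gradient buffer and issuing a non-blocking all-reduce per segment so that the segments can be reduced concurrently. The sketch below uses mpi4py's Iallreduce with an assumed segment count of 8; it is an illustration of segmentation, not the authors' or Horovod's implementation:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

grad = np.random.rand(1 << 20)       # one large flattened gradient
out = np.empty_like(grad)
num_segments = 8                     # illustrative choice

# One non-blocking all-reduce per segment; the segments proceed
# concurrently, then we wait for all of them.
requests = [
    comm.Iallreduce(seg_in, seg_out, op=MPI.SUM)
    for seg_in, seg_out in zip(np.array_split(grad, num_segments),
                               np.array_split(out, num_segments))
]
MPI.Request.Waitall(requests)        # `out` now holds the summed gradient
```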
Efficient MPI-AllReduce for large-scale deep learning on GPU-clusters
Training models on large-scale GPU-accelerated clusters is becoming commonplace
due to the increase in complexity and size of deep learning models. One of the main …
An allreduce algorithm and network co-design for large-scale training of distributed deep learning
Distributed training of Deep Neural Networks (DNNs) on High-Performance Computing
(HPC) systems is becoming increasingly common. HPC systems dedicated entirely or …
2D-HRA: Two-dimensional hierarchical ring-based all-reduce algorithm in large-scale distributed machine learning
Y Jiang, H Gu, Y Lu, X Yu - IEEE Access, 2020 - ieeexplore.ieee.org
Gradient synchronization, a process of communication among machines in large-scale
distributed machine learning (DML), plays a crucial role in improving DML performance …
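A ring all-reduce, which 2D-HRA arranges hierarchically across two dimensions, runs a reduce-scatter phase followed by an all-gather phase around the ring. A single-process NumPy simulation of the plain one-dimensional ring, for illustration only (worker count and buffer size are toy values; a real implementation exchanges the chunks over the network):

```python
import numpy as np

def ring_allreduce(buffers):
    """Simulate ring all-reduce over equal-length 1-D arrays (one per worker);
    every worker ends up with the element-wise sum."""
    p = len(buffers)
    chunks = [list(np.array_split(np.asarray(b, dtype=float), p)) for b in buffers]

    # Reduce-scatter: in step s, worker r adds the chunk (r - s - 1) mod p
    # received from its left neighbour to its own copy.
    for s in range(p - 1):
        snap = [[c.copy() for c in w] for w in chunks]   # values at start of step
        for r in range(p):
            i = (r - s - 1) % p
            chunks[r][i] = chunks[r][i] + snap[(r - 1) % p][i]

    # All-gather: in step s, worker r overwrites chunk (r - s) mod p with the
    # fully reduced copy received from its left neighbour.
    for s in range(p - 1):
        snap = [[c.copy() for c in w] for w in chunks]
        for r in range(p):
            i = (r - s) % p
            chunks[r][i] = snap[(r - 1) % p][i]

    return [np.concatenate(w) for w in chunks]

workers = [np.arange(6) * (r + 1) for r in range(3)]   # 3 simulated workers
print(ring_allreduce(workers)[0])                       # [ 0.  6. 12. 18. 24. 30.]
```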
Hybrid electrical/optical switch architectures for training distributed deep learning in large-scale
Data parallelism is the dominant method used to train deep learning (DL) models on High-
Performance Computing systems such as large-scale GPU clusters. When training a DL …
On the feasibility of hybrid electrical/optical switch architecture for large-scale training of distributed deep learning
Data parallelism is the dominant method used to train deep learning (DL) models on High-
Performance Computing systems such as large-scale GPU clusters. When training a DL …
Topology-aware sparse allreduce for large-scale deep learning
Data parallelism is the dominant method used to scale-up deep learning (DL) training
across multiple compute nodes. Collective communication of the local gradients between …
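Sparse all-reduce schemes generally exchange only the largest-magnitude gradient entries instead of the dense buffer. The hedged sketch below shows plain top-k sparsification with an object all-gather in mpi4py (k = 100 is an arbitrary budget); it does not reproduce the paper's topology-aware exchange pattern:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

grad = np.random.randn(10_000)
k = 100                                      # illustrative sparsification budget

# Keep only the k largest-magnitude entries of the local gradient.
idx = np.argpartition(np.abs(grad), -k)[-k:]
sparse_piece = (idx, grad[idx])

# Exchange the small (index, value) pairs and accumulate them densely;
# every rank ends with the same sparse sum.
reduced = np.zeros_like(grad)
for indices, values in comm.allgather(sparse_piece):
    np.add.at(reduced, indices, values)
```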
COFFEE: Cross-Layer Optimization for Fast and Efficient Executions of Sinkhorn-Knopp Algorithm on HPC Systems
In this paper, we present COFFEE, cross-layer optimization for fast and efficient executions
of the Sinkhorn-Knopp (SK) algorithm on HPC systems with clusters of compute nodes by …
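The Sinkhorn-Knopp iteration itself alternately rescales the rows and columns of a nonnegative matrix until it becomes approximately doubly stochastic. A minimal single-node NumPy reference for context only; COFFEE's cross-layer HPC optimizations are not shown:

```python
import numpy as np

def sinkhorn_knopp(A, iters=1000, tol=1e-9):
    """Scale a nonnegative matrix A to an (approximately) doubly stochastic
    matrix diag(r) @ A @ diag(c) by alternating row/column normalisation."""
    A = np.asarray(A, dtype=float)
    r = np.ones(A.shape[0])
    c = np.ones(A.shape[1])
    for _ in range(iters):
        r = 1.0 / (A @ c)                    # force the row sums to 1
        c = 1.0 / (A.T @ r)                  # then force the column sums to 1
        P = A * np.outer(r, c)
        if np.allclose(P.sum(axis=1), 1.0, atol=tol):
            break                            # row sums stayed at 1: converged
    return P

P = sinkhorn_knopp(np.random.rand(4, 4))
print(P.sum(axis=0), P.sum(axis=1))          # both close to ones
```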