NV-Group: Link-efficient reduction for distributed deep learning on modern dense GPU systems
Advanced fabrics like NVIDIA NVLink are enabling the deployment of dense Graphics
Processing Unit (GPU) systems such as DGX-2 and Summit. With the wide adoption of large …
WRHT: Efficient all-reduce for distributed DNN training in optical interconnect systems
Communication efficiency is crucial for accelerating distributed deep neural network (DNN)
training. All-reduce, a vital communication primitive, is responsible for reducing model …
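The all-reduce collective referred to here sums each worker's local gradient buffer and delivers the result back to every worker. A minimal sketch of those semantics using mpi4py, not the WRHT optical scheme; the buffer names and toy sizes are illustrative:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

# Each worker contributes a same-shaped gradient buffer (toy values here:
# every element equals the worker's rank).
local_grad = np.full(4, comm.Get_rank(), dtype=np.float64)
summed = np.empty_like(local_grad)

# After the call, every rank holds the identical element-wise sum.
comm.Allreduce(local_grad, summed, op=MPI.SUM)
print(f"rank {comm.Get_rank()}: {summed}")
```

Run, for example, with `mpiexec -n 4 python allreduce_sketch.py`; every rank prints the same summed vector.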
Accelerating distributed deep neural network training with pipelined MPI allreduce
TensorFlow (TF) is usually combined with the Horovod (HVD) workload distribution package
to obtain a parallel tool for training deep neural networks on clusters of computers. HVD in turn …
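The pipelining idea can be illustrated by segmenting one large gradient buffer and issuing a non-blocking all-reduce per segment so that the segments can be reduced concurrently. The sketch below uses mpi4py's Iallreduce with an assumed segment count of 8; it is an illustration of segmentation, not the authors' or Horovod's implementation:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

grad = np.random.rand(1 << 20)       # one large flattened gradient
out = np.empty_like(grad)
num_segments = 8                     # illustrative choice

# One non-blocking all-reduce per segment; the segments proceed
# concurrently, then we wait for all of them.
requests = [
    comm.Iallreduce(seg_in, seg_out, op=MPI.SUM)
    for seg_in, seg_out in zip(np.array_split(grad, num_segments),
                               np.array_split(out, num_segments))
]
MPI.Request.Waitall(requests)        # `out` now holds the summed gradient
```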
Efficient MPI-AllReduce for large-scale deep learning on GPU-clusters
Training models on large-scale GPU-accelerated clusters is becoming commonplace
due to the increase in complexity and size of deep learning models. One of the main …
An allreduce algorithm and network co-design for large-scale training of distributed deep learning
Distributed training of Deep Neural Networks (DNNs) on High-Performance Computing
(HPC) systems is becoming increasingly common. HPC systems dedicated entirely or …
2D-HRA: Two-dimensional hierarchical ring-based all-reduce algorithm in large-scale distributed machine learning
Y Jiang, H Gu, Y Lu, X Yu - IEEE Access, 2020 - ieeexplore.ieee.org
Gradient synchronization, a process of communication among machines in large-scale
distributed machine learning (DML), plays a crucial role in improving DML performance …
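A ring all-reduce, which 2D-HRA arranges hierarchically across two dimensions, runs a reduce-scatter phase followed by an all-gather phase around the ring. A single-process NumPy simulation of the plain one-dimensional ring, for illustration only (worker count and buffer size are toy values; a real implementation exchanges the chunks over the network):

```python
import numpy as np

def ring_allreduce(buffers):
    """Simulate ring all-reduce over equal-length 1-D arrays (one per worker);
    every worker ends up with the element-wise sum."""
    p = len(buffers)
    chunks = [list(np.array_split(np.asarray(b, dtype=float), p)) for b in buffers]

    # Reduce-scatter: in step s, worker r adds the chunk (r - s - 1) mod p
    # received from its left neighbour to its own copy.
    for s in range(p - 1):
        snap = [[c.copy() for c in w] for w in chunks]   # values at start of step
        for r in range(p):
            i = (r - s - 1) % p
            chunks[r][i] = chunks[r][i] + snap[(r - 1) % p][i]

    # All-gather: in step s, worker r overwrites chunk (r - s) mod p with the
    # fully reduced copy received from its left neighbour.
    for s in range(p - 1):
        snap = [[c.copy() for c in w] for w in chunks]
        for r in range(p):
            i = (r - s) % p
            chunks[r][i] = snap[(r - 1) % p][i]

    return [np.concatenate(w) for w in chunks]

workers = [np.arange(6) * (r + 1) for r in range(3)]   # 3 simulated workers
print(ring_allreduce(workers)[0])                       # [ 0.  6. 12. 18. 24. 30.]
```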
Hybrid electrical/optical switch architectures for training distributed deep learning in large-scale
Data parallelism is the dominant method used to train deep learning (DL) models on High-
Performance Computing systems such as large-scale GPU clusters. When training a DL …
On the feasibility of hybrid electrical/optical switch architecture for large-scale training of distributed deep learning
Data parallelism is the dominant method used to train deep learning (DL) models on High-
Performance Computing systems such as large-scale GPU clusters. When training a DL …
Topology-aware sparse allreduce for large-scale deep learning
Data parallelism is the dominant method used to scale-up deep learning (DL) training
across multiple compute nodes. Collective communication of the local gradients between …
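Sparse all-reduce schemes generally exchange only the largest-magnitude gradient entries instead of the dense buffer. The hedged sketch below shows plain top-k sparsification with an object all-gather in mpi4py (k = 100 is an arbitrary budget); it does not reproduce the paper's topology-aware exchange pattern:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

grad = np.random.randn(10_000)
k = 100                                      # illustrative sparsification budget

# Keep only the k largest-magnitude entries of the local gradient.
idx = np.argpartition(np.abs(grad), -k)[-k:]
sparse_piece = (idx, grad[idx])

# Exchange the small (index, value) pairs and accumulate them densely;
# every rank ends with the same sparse sum.
reduced = np.zeros_like(grad)
for indices, values in comm.allgather(sparse_piece):
    np.add.at(reduced, indices, values)
```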
COFFEE: Cross-Layer Optimization for Fast and Efficient Executions of Sinkhorn-Knopp Algorithm on HPC Systems
In this paper, we present COFFEE, cross-layer optimization for fast and efficient executions
of the Sinkhorn-Knopp (SK) algorithm on HPC systems with clusters of compute nodes by …
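The Sinkhorn-Knopp iteration itself alternately rescales the rows and columns of a nonnegative matrix until it becomes approximately doubly stochastic. A minimal single-node NumPy reference for context only; COFFEE's cross-layer HPC optimizations are not shown:

```python
import numpy as np

def sinkhorn_knopp(A, iters=1000, tol=1e-9):
    """Scale a nonnegative matrix A to an (approximately) doubly stochastic
    matrix diag(r) @ A @ diag(c) by alternating row/column normalisation."""
    A = np.asarray(A, dtype=float)
    r = np.ones(A.shape[0])
    c = np.ones(A.shape[1])
    for _ in range(iters):
        r = 1.0 / (A @ c)                    # force the row sums to 1
        c = 1.0 / (A.T @ r)                  # then force the column sums to 1
        P = A * np.outer(r, c)
        if np.allclose(P.sum(axis=1), 1.0, atol=tol):
            break                            # row sums stayed at 1: converged
    return P

P = sinkhorn_knopp(np.random.rand(4, 4))
print(P.sum(axis=0), P.sum(axis=1))          # both close to ones
```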