TopoOpt: Co-optimizing network topology and parallelization strategy for distributed training jobs

W Wang, M Khazraee, Z Zhong, M Ghobadi… - … USENIX Symposium on …, 2023 - usenix.org
We propose TopoOpt, a novel direct-connect fabric for deep neural network (DNN) training
workloads. TopoOpt co-optimizes the distributed training process across three dimensions …

Communication optimization strategies for distributed deep neural network training: A survey

S Ouyang, D Dong, Y Xu, L Xiao - Journal of Parallel and Distributed …, 2021 - Elsevier
Recent trends in high-performance computing and deep learning have led to the
proliferation of studies on large-scale deep neural network training. However, the frequent …

Communication optimization algorithms for distributed deep learning systems: A survey

E Yu, D Dong, X Liao - IEEE Transactions on Parallel and …, 2023 - ieeexplore.ieee.org
Deep learning's widespread adoption in various fields has made distributed training across
multiple computing nodes essential. However, frequent communication between nodes can …

Communication-efficient ADMM-based distributed algorithms for sparse training

G Wang, Y Lei, Y Qiu, L Lou, Y Li - Neurocomputing, 2023 - Elsevier
In large-scale distributed machine learning (DML), the synchronization efficiency of the
distributed algorithm becomes a critical factor that affects the training time of machine …
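For orientation, the sketch below is a minimal single-process simulation of consensus ADMM for L1-regularized (sparse) least squares, split across "workers". It is not the algorithm proposed in the paper; the problem, the variable names (A_parts, b_parts, rho, lam), and the parameter values are illustrative assumptions. It only shows the communication pattern ADMM-style methods rely on: local solves, one averaging step (an allreduce), and a soft-thresholding consensus update.

```python
# Minimal consensus-ADMM sketch for sparse least squares (illustration only,
# not the paper's algorithm; all names and constants are assumptions).
import numpy as np

def soft_threshold(v, k):
    """Elementwise soft-thresholding: proximal operator of the L1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - k, 0.0)

def consensus_admm(A_parts, b_parts, lam=0.1, rho=1.0, iters=100):
    n_workers = len(A_parts)
    d = A_parts[0].shape[1]
    x = np.zeros((n_workers, d))   # local primal variables
    u = np.zeros((n_workers, d))   # scaled dual variables
    z = np.zeros(d)                # global consensus (sparse) variable
    # Pre-factorize each worker's local system (A_i^T A_i + rho I).
    facts = [np.linalg.inv(A.T @ A + rho * np.eye(d)) for A in A_parts]
    for _ in range(iters):
        # Local updates (done in parallel in a real deployment).
        for i, (A, b) in enumerate(zip(A_parts, b_parts)):
            x[i] = facts[i] @ (A.T @ b + rho * (z - u[i]))
        # Global update: average local variables, then soft-threshold.
        # This averaging is the only communication step (an allreduce).
        x_bar = x.mean(axis=0) + u.mean(axis=0)
        z = soft_threshold(x_bar, lam / (rho * n_workers))
        # Dual updates.
        u += x - z
    return z

# Toy usage: 4 workers, each holding a slice of the data.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 50))
x_true = np.zeros(50); x_true[:5] = 1.0
b = A @ x_true + 0.01 * rng.standard_normal(200)
A_parts, b_parts = np.split(A, 4), np.split(b, 4)
z = consensus_admm(A_parts, b_parts)
print("nonzeros in solution:", int((np.abs(z) > 1e-3).sum()))
```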

A Generic, High-Performance, Compression-Aware Framework for Data Parallel DNN Training

H Wu, S Wang, Y Bai, C Li, Q Zhou, J Yi… - … on Parallel and …, 2023 - ieeexplore.ieee.org
Gradient compression is a promising approach to alleviating the communication bottleneck
in data parallel deep neural network (DNN) training by significantly reducing the data …
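As context for this entry, the following is a minimal sketch of one widely used gradient-compression scheme, top-k sparsification with error feedback. It is a generic illustration rather than the framework described in the paper, and the 1% compression ratio is an arbitrary assumption.

```python
# Top-k gradient sparsification with error feedback (generic illustration,
# not the paper's framework; compression ratio is an assumed placeholder).
import numpy as np

def topk_compress(grad, ratio=0.01):
    """Keep only the largest-magnitude `ratio` fraction of gradient entries."""
    k = max(1, int(grad.size * ratio))
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    return idx, grad[idx]

class ErrorFeedbackCompressor:
    """Accumulates the entries dropped by compression and re-adds them at the
    next step, so compression error is delayed rather than lost."""
    def __init__(self, size):
        self.residual = np.zeros(size)

    def step(self, grad, ratio=0.01):
        corrected = grad + self.residual
        idx, values = topk_compress(corrected, ratio)
        sparse = np.zeros_like(corrected)
        sparse[idx] = values
        self.residual = corrected - sparse   # remember what was dropped
        return idx, values                   # only these are communicated

# Toy usage: one worker compressing a single gradient tensor.
rng = np.random.default_rng(0)
comp = ErrorFeedbackCompressor(size=10_000)
grad = rng.standard_normal(10_000)
idx, values = comp.step(grad)
print(f"sent {values.size} of {grad.size} entries "
      f"({values.size / grad.size:.1%} of the data)")
```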

HSAC-ALADMM: an asynchronous lazy ADMM algorithm based on hierarchical sparse allreduce communication

D Wang, Y Lei, J Xie, G Wang - The Journal of Supercomputing, 2021 - Springer
The distributed alternating direction method of multipliers (ADMM) is an effective algorithm
for solving large-scale optimization problems. However, its high communication cost limits its …
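To illustrate only the "hierarchical allreduce" part of this entry, here is a minimal single-process simulation of a two-level allreduce: workers first reduce within their node-local group, group leaders reduce across groups, and the result is broadcast back. This is not the paper's asynchronous lazy ADMM algorithm; the group size and the use of dense vectors are assumptions.

```python
# Two-level (hierarchical) allreduce simulation (communication-pattern sketch only).
import numpy as np

def hierarchical_allreduce(worker_grads, group_size):
    """worker_grads: list of equal-length gradient vectors, one per worker."""
    n = len(worker_grads)
    assert n % group_size == 0, "workers must split evenly into groups"
    # Stage 1: intra-group reduction (cheap, e.g. shared memory within a node).
    group_sums = [
        np.sum(worker_grads[g:g + group_size], axis=0)
        for g in range(0, n, group_size)
    ]
    # Stage 2: inter-group reduction among one leader per group (the costly network hop).
    total = np.sum(group_sums, axis=0)
    # Stage 3: broadcast the reduced result back to every worker.
    return [total.copy() for _ in range(n)]

# Toy usage: 8 workers, 4 per node.
rng = np.random.default_rng(0)
grads = [rng.standard_normal(16) for _ in range(8)]
out = hierarchical_allreduce(grads, group_size=4)
assert np.allclose(out[0], np.sum(grads, axis=0))
```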

COFFEE: Cross-Layer Optimization for Fast and Efficient Executions of Sinkhorn-Knopp Algorithm on HPC Systems

C Sun, H Luo, H Jiang, J Zhang… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
In this paper, we present COFFEE, cross-layer optimization for fast and efficient executions
of the Sinkhorn-Knopp (SK) algorithm on HPC systems with clusters of compute nodes by …
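For reference, the baseline Sinkhorn-Knopp iteration that COFFEE targets is short: alternately normalize rows and columns of a positive matrix until it is approximately doubly stochastic. The sketch below shows only this plain algorithm, not COFFEE's cross-layer HPC implementation; the tolerance and iteration cap are assumed values.

```python
# Plain Sinkhorn-Knopp iteration (baseline algorithm only; tolerance and
# iteration cap are assumptions).
import numpy as np

def sinkhorn_knopp(M, max_iters=1000, tol=1e-8):
    """Scale a matrix with positive entries toward doubly stochastic form."""
    A = np.asarray(M, dtype=float)
    for _ in range(max_iters):
        A = A / A.sum(axis=1, keepdims=True)   # row normalization
        A = A / A.sum(axis=0, keepdims=True)   # column normalization
        if np.abs(A.sum(axis=1) - 1.0).max() < tol:
            break
    return A

# Toy usage.
rng = np.random.default_rng(0)
M = rng.random((5, 5)) + 0.1   # strictly positive entries
D = sinkhorn_knopp(M)
print(D.sum(axis=0))  # columns sum to ~1
print(D.sum(axis=1))  # rows sum to ~1
```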

Modeling and Simulation of Collective Algorithms on HPC Network Topologies using Structural Simulation Toolkit

SP Chenna, M Steyer, N Kumar… - SC24-W: Workshops …, 2024 - ieeexplore.ieee.org
In the last decade, DL training has emerged as an HPC-scale workload running on large
clusters, the size of the largest supercomputers on the Top500 list. The dominant …
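As a rough counterpart to the simulation work in this entry, the sketch below evaluates the standard analytical alpha-beta (latency-bandwidth) cost model for two common allreduce algorithms. It is not the Structural Simulation Toolkit model from the paper; alpha, beta, process count, and message size are assumed placeholder values.

```python
# Alpha-beta cost model for ring vs. binary-tree allreduce (analytical sketch;
# all parameter values are assumptions).
import math

def ring_allreduce_time(p, n_bytes, alpha, beta):
    """Ring allreduce: 2(p-1) steps, each moving n/p bytes."""
    return 2 * (p - 1) * alpha + 2 * (p - 1) / p * n_bytes * beta

def tree_allreduce_time(p, n_bytes, alpha, beta):
    """Non-pipelined binary-tree reduce + broadcast: ~2*log2(p) steps, each moving n bytes."""
    return 2 * math.ceil(math.log2(p)) * (alpha + n_bytes * beta)

# Toy comparison: 64 ranks, 100 MB gradients, 5 us latency, 10 GB/s per-link bandwidth.
p, n = 64, 100e6
alpha, beta = 5e-6, 1 / 10e9
print(f"ring: {ring_allreduce_time(p, n, alpha, beta) * 1e3:.1f} ms")
print(f"tree: {tree_allreduce_time(p, n, alpha, beta) * 1e3:.1f} ms")
```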

Error Permissive Computing: for Post Moore's Computer System Design

R Takano, T Hirofuchi, M Wahib, TT Nguyen… - error-permissive-computing.github.io
We are exploring a new concept of error permissive computing that improves the capability
and capacity while drastically reducing power consumption. More specifically, we …