A unified architecture for accelerating distributed DNN training in heterogeneous GPU/CPU clusters
Data center clusters that run DNN training jobs are inherently heterogeneous. They have
GPUs and CPUs for computation and network bandwidth for distributed training. However …
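The abstract above is truncated, but the title points at combining GPU workers with spare CPU machines for aggregation. As a rough, hypothetical sketch of that general idea (the class and function names below are made up, not the paper's API), each worker can push partitions of its gradient to CPU-side summation servers and pull back the average:

```python
import numpy as np

def partition(grad, num_servers):
    """Split a flat gradient into roughly equal chunks, one per CPU server."""
    return np.array_split(grad, num_servers)

class CpuSummationServer:
    """Hypothetical CPU-side aggregator: sums the partition it owns across workers."""
    def __init__(self):
        self.acc = None
        self.count = 0

    def push(self, chunk):
        self.acc = chunk.copy() if self.acc is None else self.acc + chunk
        self.count += 1

    def pull(self):
        return self.acc / self.count  # averaged gradient partition

# Example: 3 GPU workers, 2 CPU summation servers.
workers = [np.random.randn(10).astype(np.float32) for _ in range(3)]
servers = [CpuSummationServer() for _ in range(2)]

for grad in workers:                       # each worker pushes its partitions
    for server, chunk in zip(servers, partition(grad, len(servers))):
        server.push(chunk)

averaged = np.concatenate([s.pull() for s in servers])
assert np.allclose(averaged, np.mean(workers, axis=0))
```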
Scaling distributed machine learning with in-network aggregation
Training machine learning models in parallel is an increasingly important workload. We
accelerate distributed parallel training by designing a communication primitive that uses a …
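The primitive referenced above aggregates gradients inside the network. One common way to make switch-side aggregation feasible is to quantize gradients to fixed-point integers so they can be summed with integer arithmetic; the sketch below emulates that idea in plain Python, with an illustrative scaling factor rather than anything from the paper's actual protocol:

```python
import numpy as np

SCALE = 2 ** 16  # illustrative fixed-point scaling factor

def to_fixed_point(grad):
    """Quantize float gradients to integers so a switch can add them exactly."""
    return np.round(grad * SCALE).astype(np.int64)

def from_fixed_point(agg, num_workers):
    """Convert the integer sum back to an averaged float gradient."""
    return agg.astype(np.float64) / (SCALE * num_workers)

def switch_aggregate(packets):
    """Stand-in for the programmable switch: element-wise integer summation."""
    return np.sum(packets, axis=0)

grads = [np.random.randn(8) for _ in range(4)]          # 4 workers
agg = switch_aggregate([to_fixed_point(g) for g in grads])
avg = from_fixed_point(agg, len(grads))
assert np.allclose(avg, np.mean(grads, axis=0), atol=1e-4)
```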
Tiresias: A GPU cluster manager for distributed deep learning
Deep learning (DL) training jobs bring some unique challenges to existing cluster
managers, such as unpredictable training times, an all-or-nothing execution model, and …
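Since the abstract highlights unpredictable training times, a scheduler cannot rely on knowing job durations in advance. A generic least-attained-service policy sidesteps this by always running the job that has received the least GPU time so far; the toy sketch below illustrates that policy only and is not the paper's exact algorithm:

```python
import heapq

def las_schedule(jobs, quantum=1.0, horizon=20.0):
    """Toy least-attained-service scheduler: always run the job that has
    received the least GPU time so far, without knowing remaining durations."""
    # heap entries: (attained_service, job_id, remaining_time)
    heap = [(0.0, jid, rem) for jid, rem in jobs.items()]
    heapq.heapify(heap)
    timeline, clock = [], 0.0
    while heap and clock < horizon:
        attained, jid, remaining = heapq.heappop(heap)
        run = min(quantum, remaining)
        timeline.append((clock, jid, run))
        clock += run
        remaining -= run
        if remaining > 1e-9:
            heapq.heappush(heap, (attained + run, jid, remaining))
    return timeline

# Short jobs finish quickly even though durations are unknown in advance.
print(las_schedule({"jobA": 5.0, "jobB": 1.5, "jobC": 3.0}))
```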
ATP: In-network aggregation for multi-tenant learning
Distributed deep neural network training (DT) systems are widely deployed in clusters where
the network is shared across multiple tenants, i.e., multiple DT jobs. Each DT job computes …
In-network aggregation for data center networks: A survey
A Feng, D Dong, F Lei, J Ma, E Yu, R Wang - Computer Communications, 2023 - Elsevier
Aggregation applications are widely deployed in data centers, such as distributed machine
learning and MapReduce-like frameworks. These applications typically have large …
Distributed hierarchical GPU parameter server for massive scale deep learning ads systems
Neural networks of ads systems usually take input from multiple resources, e.g., query-ad
relevance, ad features and user portraits. These inputs are encoded into one-hot or multi-hot …
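The snippet mentions one-hot and multi-hot encodings of sparse input features. As a small illustration (the table size and sum-pooling choice here are arbitrary assumptions, not the paper's design), a multi-hot feature is a variable-length list of IDs whose embedding rows are looked up and pooled:

```python
import numpy as np

VOCAB, DIM = 1000, 8
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(VOCAB, DIM)).astype(np.float32)

def multi_hot_embed(feature_ids):
    """Sum-pool the embedding rows selected by a multi-hot feature
    (a variable-length list of sparse IDs, e.g. user interest tags)."""
    return embedding_table[feature_ids].sum(axis=0)

# One-hot is the special case of a single ID; multi-hot has several.
one_hot_vec   = multi_hot_embed([42])
multi_hot_vec = multi_hot_embed([3, 97, 512])
print(one_hot_vec.shape, multi_hot_vec.shape)  # (8,) (8,)
```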
GRACE: A compressed communication framework for distributed machine learning
Powerful computer clusters are used nowadays to train complex deep neural networks
(DNN) on large datasets. Distributed training increasingly becomes communication bound …
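The framework above concerns compressed gradient communication. A representative compressor from this family is top-k sparsification, sketched below; this is only an illustrative example of that class of compressor, not GRACE's actual interface:

```python
import numpy as np

def topk_compress(grad, ratio=0.01):
    """Keep only the largest-magnitude entries of the gradient (a common
    sparsification-style compressor); return (indices, values, length)."""
    k = max(1, int(grad.size * ratio))
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    return idx, grad[idx], grad.size

def topk_decompress(idx, vals, length):
    out = np.zeros(length, dtype=vals.dtype)
    out[idx] = vals
    return out

grad = np.random.randn(10_000).astype(np.float32)
idx, vals, n = topk_compress(grad, ratio=0.01)
approx = topk_decompress(idx, vals, n)
print(f"sent {vals.size} of {n} values")
```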
Priority-based parameter propagation for distributed DNN training
Data parallel training is widely used for scaling distributed deep neural network (DNN)
training. However, the performance benefits are often limited by the communication-heavy …
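The entry above is about ordering gradient transmission so that the parameters needed earliest by the next forward pass are sent first. The toy sketch below ranks gradient slices by layer index to show the priority idea; it ignores the online arrival of gradients during backpropagation and the paper's actual slicing policy:

```python
import heapq

def prioritized_transmission(layer_grad_sizes, slice_size=2):
    """Toy priority schedule: split each layer's gradient into slices and
    send slices for front (lower-index) layers first, since the next
    forward pass needs those parameters earliest."""
    heap = []
    for layer, size in enumerate(layer_grad_sizes):
        for offset in range(0, size, slice_size):
            # Lower layer index => higher priority (smaller heap key).
            heapq.heappush(heap, (layer, offset))
    order = []
    while heap:
        layer, offset = heapq.heappop(heap)
        order.append(f"layer{layer}[{offset}:{offset + slice_size}]")
    return order

# Gradients become available back-to-front, but are sent front-first.
print(prioritized_transmission([4, 4, 4]))
```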
Efficient sparse collective communication and its application to accelerate distributed deep learning
Efficient collective communication is crucial to parallel-computing applications such as
distributed training of large-scale recommendation systems and natural language …
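Sparse collective communication exploits the fact that many gradient tensors are mostly zero. A minimal sketch of the general idea, assuming a fixed block size chosen purely for illustration, sends and sums only the non-zero blocks:

```python
import numpy as np

BLOCK = 4

def nonzero_blocks(grad):
    """Split a gradient into fixed-size blocks and keep only non-zero ones,
    so mostly-zero tensors move far fewer bytes."""
    blocks = grad.reshape(-1, BLOCK)
    keep = np.flatnonzero(np.any(blocks != 0, axis=1))
    return {int(i): blocks[i] for i in keep}

def sparse_allreduce(worker_grads):
    """Aggregate per block: only blocks that some worker sent are summed."""
    total = np.zeros_like(worker_grads[0])
    out_blocks = total.reshape(-1, BLOCK)
    for grad in worker_grads:
        for i, blk in nonzero_blocks(grad).items():
            out_blocks[i] += blk
    return total

g1 = np.zeros(16); g1[0:2] = 1.0      # sparse gradients
g2 = np.zeros(16); g2[12] = 3.0
print(sparse_allreduce([g1, g2]))
```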
Accelerating decentralized federated learning in heterogeneous edge computing
In edge computing (EC), federated learning (FL) enables massive devices to collaboratively
train AI models without exposing local data. In order to avoid the possible bottleneck of the …
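Decentralized federated learning avoids a central aggregation server by having devices average models with their neighbors. The sketch below shows one such gossip-averaging round over an illustrative ring topology; the topology and uniform mixing weights are assumptions for the example, not the paper's method:

```python
import numpy as np

def gossip_round(models, neighbors):
    """One round of decentralized averaging: each device mixes its model
    with its neighbors' models instead of talking to a central server."""
    new_models = {}
    for node, model in models.items():
        group = [model] + [models[n] for n in neighbors[node]]
        new_models[node] = np.mean(group, axis=0)
    return new_models

# A ring of 4 edge devices, each averaging with its two ring neighbors.
models = {i: np.full(3, float(i)) for i in range(4)}
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
for _ in range(10):
    models = gossip_round(models, neighbors)
print(models)  # values converge toward the global mean (1.5)
```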