PyTorch distributed: Experiences on accelerating data parallel training

S Li, Y Zhao, R Varma, O Salpekar, P Noordhuis… - arXiv preprint arXiv …, 2020 - arxiv.org
This paper presents the design, implementation, and evaluation of the PyTorch distributed
data parallel module. PyTorch is a widely-adopted scientific computing package used in …

A comprehensive survey on training acceleration for large machine learning models in IoT

H Wang, Z Qu, Q Zhou, H Zhang, B Luo… - IEEE Internet of …, 2021 - ieeexplore.ieee.org
The ever-growing artificial intelligence (AI) applications have greatly reshaped our world in
many areas, e.g., smart home, computer vision, natural language processing, etc. Behind …

MegaScale: Scaling large language model training to more than 10,000 GPUs

Z Jiang, H Lin, Y Zhong, Q Huang, Y Chen… - … USENIX Symposium on …, 2024 - usenix.org
We present the design, implementation and engineering experience in building and
deploying MegaScale, a production system for training large language models (LLMs) at the …

Accelerating distributed MoE training and inference with Lina

J Li, Y Jiang, Y Zhu, C Wang, H Xu - 2023 USENIX Annual Technical …, 2023 - usenix.org
Scaling model parameters improves model quality at the price of high computation
overhead. Sparsely activated models, usually in the form of Mixture of Experts (MoE) …

Machine learning in real-time Internet of Things (IoT) systems: A survey

J Bian, A Al Arafat, H Xiong, J Li, L Li… - IEEE Internet of …, 2022 - ieeexplore.ieee.org
Over the last decade, machine learning (ML) and deep learning (DL) algorithms have
significantly evolved and been employed in diverse applications, such as computer vision …

Efficient training of large language models on distributed infrastructures: a survey

J Duan, S Zhang, Z Wang, L Jiang, W Qu, Q Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) like GPT and LLaMA are revolutionizing the AI industry with
their sophisticated capabilities. Training these models requires vast GPU clusters and …

Parallelizing DNN training on GPUs: Challenges and opportunities

W Xu, Y Zhang, X Tang - … Proceedings of the Web Conference 2021, 2021 - dl.acm.org
In recent years, Deep Neural Networks (DNNs) have emerged as a widely adopted
approach in many application domains. Training DNN models is also becoming a significant …

Elastic parameter server load distribution in deep learning clusters

Y Chen, Y Peng, Y Bao, C Wu, Y Zhu… - Proceedings of the 11th …, 2020 - dl.acm.org
In distributed DNN training, parameter servers (PS) can become performance bottlenecks
due to PS stragglers, caused by imbalanced parameter distribution, bandwidth contention …

Robust searching-based gradient collaborative management in intelligent transportation system

H Shi, H Wang, R Ma, Y Hua, T Song, H Gao… - ACM Transactions on …, 2023 - dl.acm.org
With the rapid development of big data and the Internet of Things (IoT), traffic data from an
Intelligent Transportation System (ITS) is becoming more and more accessible. To …

SHADE: Enable Fundamental Cacheability for Distributed Deep Learning Training

RIS Khan, AH Yazdani, Y Fu, AK Paul, B Ji… - … USENIX Conference on …, 2023 - usenix.org
Deep learning training (DLT) applications exhibit unique I/O workload behaviors that pose
new challenges for storage system design. DLT is I/O intensive since data samples need to …