PyTorch distributed: Experiences on accelerating data parallel training
This paper presents the design, implementation, and evaluation of the PyTorch distributed
data parallel module. PyTorch is a widely-adopted scientific computing package used in …
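For orientation only (a minimal sketch, not taken from the paper), the following shows the typical one-process-per-rank usage pattern of the PyTorch DistributedDataParallel module this entry describes, assuming the script is launched with a tool such as torchrun that sets the process-group environment variables:

    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for each worker process.
        dist.init_process_group(backend="gloo")  # use "nccl" on GPU nodes

        model = nn.Linear(10, 1)
        ddp_model = DDP(model)  # gradients are all-reduced across ranks during backward()
        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

        inputs, targets = torch.randn(8, 10), torch.randn(8, 1)
        loss = nn.functional.mse_loss(ddp_model(inputs), targets)
        loss.backward()   # DDP overlaps the gradient all-reduce with the backward pass
        optimizer.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()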
A comprehensive survey on training acceleration for large machine learning models in IoT
Ever-growing artificial intelligence (AI) applications have greatly reshaped our world in
many areas, e.g., smart home, computer vision, natural language processing, etc. Behind …
MegaScale: Scaling large language model training to more than 10,000 GPUs
We present the design, implementation and engineering experience in building and
deploying MegaScale, a production system for training large language models (LLMs) at the …
Accelerating distributed MoE training and inference with Lina
Scaling model parameters improves model quality at the price of high computation
overhead. Sparsely activated models, usually in the form of Mixture of Experts (MoE) …
Machine learning in real-time Internet of Things (IoT) systems: A survey
Over the last decade, machine learning (ML) and deep learning (DL) algorithms have
significantly evolved and been employed in diverse applications, such as computer vision …
Efficient training of large language models on distributed infrastructures: a survey
Large Language Models (LLMs) like GPT and LLaMA are revolutionizing the AI industry with
their sophisticated capabilities. Training these models requires vast GPU clusters and …
Parallelizing DNN training on GPUs: Challenges and opportunities
In recent years, Deep Neural Networks (DNNs) have emerged as a widely adopted
approach in many application domains. Training DNN models is also becoming a significant …
Elastic parameter server load distribution in deep learning clusters
In distributed DNN training, parameter servers (PS) can become performance bottlenecks
due to PS stragglers, caused by imbalanced parameter distribution, bandwidth contention …
Robust searching-based gradient collaborative management in intelligent transportation system
With the rapid development of big data and the Internet of Things (IoT), traffic data from an
Intelligent Transportation System (ITS) is becoming increasingly accessible. To …
SHADE: Enable Fundamental Cacheability for Distributed Deep Learning Training
Deep learning training (DLT) applications exhibit unique I/O workload behaviors that pose
new challenges for storage system design. DLT is I/O intensive since data samples need to …