Efficient large-scale language model training on GPU clusters using Megatron-LM

D Narayanan, M Shoeybi, J Casper… - Proceedings of the …, 2021 - dl.acm.org
Large language models have led to state-of-the-art accuracies across several tasks.
However, training these models efficiently is challenging because: a) GPU memory capacity …

PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation

W Zeng, X Ren, T Su, H Wang, Y Liao, Z Wang… - arXiv preprint arXiv …, 2021 - arxiv.org
Large-scale Pretrained Language Models (PLMs) have become the new paradigm for
Natural Language Processing (NLP). PLMs with hundreds of billions of parameters such as …

Decentralized training of foundation models in heterogeneous environments

B Yuan, Y He, J Davis, T Zhang… - Advances in …, 2022 - proceedings.neurips.cc
Training foundation models, such as GPT-3 and PaLM, can be extremely expensive, often
involving tens of thousands of GPUs running continuously for months. These models are …

nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training

Z Lin, Y Miao, Q Zhang, F Yang, Y Zhu, C Li… - … USENIX Symposium on …, 2024 - usenix.org
With the growing model size of deep neural networks (DNN), deep learning training is
increasingly relying on handcrafted search spaces to find efficient parallelization execution …

Memory-efficient pipeline-parallel DNN training

D Narayanan, A Phanishayee, K Shi… - International …, 2021 - proceedings.mlr.press
Many state-of-the-art ML results have been obtained by scaling up the number of
parameters in existing models. However, parameters and activations for such large models …

Varuna: scalable, low-cost training of massive deep learning models

S Athlur, N Saran, M Sivathanu, R Ramjee… - Proceedings of the …, 2022 - dl.acm.org
Systems for training massive deep learning models (billions of parameters) today assume
and require specialized "hyperclusters": hundreds or thousands of GPUs wired with …

Chimera: efficiently training large-scale neural networks with bidirectional pipelines

S Li, T Hoefler - Proceedings of the International Conference for High …, 2021 - dl.acm.org
Training large deep learning models at scale is very challenging. This paper proposes
Chimera, a novel pipeline parallelism scheme which combines bidirectional pipelines for …
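
For intuition, a minimal Python sketch of the stage placement behind a bidirectional pipeline (an assumed layout for illustration, not Chimera's implementation or its scheduling logic): two pipelines traverse the same workers in opposite directions, so each worker holds one early and one late stage, and micro-batches injected from both ends keep more workers busy than a single unidirectional pipeline.

```python
# Illustrative sketch of stage placement in a bidirectional pipeline
# (assumed layout, not Chimera's code): W workers jointly host a "down"
# pipeline (stage 0 on worker 0, ...) and an "up" pipeline running the
# opposite way, so each worker owns one early and one late stage.

def bidirectional_stage_map(num_workers: int) -> dict[int, tuple[int, int]]:
    """Per worker, the stage it owns in the down pipeline and in the up pipeline."""
    return {w: (w, num_workers - 1 - w) for w in range(num_workers)}

if __name__ == "__main__":
    for worker, (down_stage, up_stage) in bidirectional_stage_map(4).items():
        print(f"worker {worker}: down stage {down_stage}, up stage {up_stage}")
    # Micro-batches injected from both ends of the worker chain fill bubbles
    # that a single unidirectional pipeline would leave idle.
```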

Oobleck: Resilient distributed training of large models using pipeline templates

I Jang, Z Yang, Z Zhang, X Jin… - Proceedings of the 29th …, 2023 - dl.acm.org
Oobleck enables resilient distributed training of large DNN models with guaranteed fault
tolerance. It takes a planning-execution co-design approach, where it first generates a set of …
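
A rough sketch of the pipeline-template idea (hypothetical helper names; the paper's planner also balances stages and throughput, which this omits): templates are generated ahead of time for each feasible node count, and after a failure the surviving nodes are covered by templates rather than replanning from scratch.

```python
# Rough sketch of the pipeline-template idea (hypothetical helpers, not
# Oobleck's planner): keep one template per feasible node count, then cover
# whatever nodes survive a failure with templates instead of replanning.
from dataclasses import dataclass

@dataclass(frozen=True)
class PipelineTemplate:
    num_nodes: int     # nodes consumed by one pipeline built from this template
    num_stages: int    # how the model's layers are grouped for that node count

def build_templates(min_nodes: int, max_nodes: int) -> dict[int, PipelineTemplate]:
    """One template per node count in the tolerated range (stage balancing omitted)."""
    return {n: PipelineTemplate(num_nodes=n, num_stages=n)
            for n in range(min_nodes, max_nodes + 1)}

def reinstantiate(available_nodes: int,
                  templates: dict[int, PipelineTemplate]) -> list[PipelineTemplate]:
    """Greedily cover the surviving nodes with the largest templates that fit;
    the resulting pipelines then train data-parallel copies of the model."""
    plan, remaining = [], available_nodes
    sizes = sorted(templates, reverse=True)
    while remaining >= min(sizes):
        size = next(s for s in sizes if s <= remaining)
        plan.append(templates[size])
        remaining -= size
    return plan

if __name__ == "__main__":
    templates = build_templates(min_nodes=2, max_nodes=4)
    # 7 surviving nodes -> one 4-node and one 3-node pipeline keep training.
    print(reinstantiate(7, templates))
```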

GSPMD: general and scalable parallelization for ML computation graphs

Y Xu, HJ Lee, D Chen, B Hechtman, Y Huang… - arXiv preprint arXiv …, 2021 - arxiv.org
We present GSPMD, an automatic, compiler-based parallelization system for common
machine learning computations. It allows users to write programs in the same way as for a …
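
As one concrete frontend, JAX's sharding annotations lower to GSPMD-style automatic partitioning in XLA; the sketch below (assuming a recent JAX version, with illustrative mesh and axis names and a batch sized as a multiple of the device count) writes the computation as single-device code and only annotates input shardings, leaving the rest to compiler propagation.

```python
# Sketch using JAX's sharding annotations, one frontend that lowers to
# GSPMD-style automatic partitioning in XLA (mesh/axis names are illustrative).
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

devices = np.array(jax.devices())
mesh = Mesh(devices, axis_names=("data",))        # 1-D device mesh

@jax.jit
def layer(x, w):
    # Written exactly as single-device code; no communication is spelled out.
    return jnp.tanh(x @ w)

batch = 8 * len(devices)                          # keep the batch divisible by #devices
x = jax.device_put(jnp.ones((batch, 128)),
                   NamedSharding(mesh, P("data", None)))   # shard rows across devices
w = jax.device_put(jnp.ones((128, 128)),
                   NamedSharding(mesh, P(None, None)))     # replicate the weights

y = layer(x, w)                                   # compiler propagates the shardings
print(y.sharding)                                 # output stays sharded along "data"
```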

TeraPipe: Token-level pipeline parallelism for training large-scale language models

Z Li, S Zhuang, S Guo, D Zhuo… - International …, 2021 - proceedings.mlr.press
Model parallelism has become a necessity for training modern large-scale deep
language models. In this work, we identify a new and orthogonal dimension from existing …
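
A toy Python/NumPy sketch of token-level pipelining (not TeraPipe's system; the per-stage math is invented purely to preserve causality): because causal self-attention makes a token's representation independent of later tokens, a long sequence can be cut into token slices that flow through the pipeline stages one after another, so a later stage can start on an earlier slice while earlier stages continue with later slices. The sketch runs sequentially but preserves that dependency structure.

```python
# Toy sketch of token-level pipelining (not TeraPipe's implementation).
# Causal attention means token t never depends on tokens after t, so a long
# sequence can be cut into token slices and the slices pipelined across stages.
import numpy as np

def causal_stage(hidden, prefix_cache):
    """Stand-in for one pipeline stage: each slice 'attends' only to itself and
    to the cached earlier slices (invented toy math; causality is the point)."""
    visible = np.concatenate(prefix_cache + [hidden], axis=0) if prefix_cache else hidden
    return hidden + visible.mean(axis=0, keepdims=True)

def token_level_pipeline(sequence, num_stages=2, num_slices=4):
    """Split along the token dimension; in a real system stage s works on slice i
    while stage s+1 works on slice i-1, instead of waiting for the full sequence."""
    slices = np.array_split(sequence, num_slices, axis=0)
    caches = [[] for _ in range(num_stages)]      # per-stage prefix of earlier slices
    outputs = []
    for sl in slices:                             # slices enter in token order
        h = sl
        for s in range(num_stages):
            h_in = h
            h = causal_stage(h_in, caches[s])
            caches[s].append(h_in)                # later slices may attend to this
        outputs.append(h)
    return np.concatenate(outputs, axis=0)

if __name__ == "__main__":
    seq = np.random.randn(16, 8)                  # (tokens, hidden_dim)
    print(token_level_pipeline(seq).shape)        # (16, 8)
```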