ZeRO-Offload: Democratizing billion-scale model training

J Ren, S Rajbhandari, RY Aminabadi… - 2021 USENIX Annual …, 2021 - usenix.org
Large-scale model training has been a playing field for a limited few requiring complex
model refactoring and access to prohibitively expensive GPU clusters. ZeRO-Offload …

ZeRO: Memory optimizations toward training trillion parameter models

S Rajbhandari, J Rasley, O Ruwase… - … Conference for High …, 2020 - ieeexplore.ieee.org
Large deep learning models offer significant accuracy gains, but training billions to trillions
of parameters is challenging. Existing solutions such as data and model parallelisms exhibit …
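As a rough, back-of-the-envelope illustration of why model states alone overwhelm a single GPU: the 16-bytes-per-parameter accounting for mixed-precision Adam follows the ZeRO paper, while the even split across GPUs is an idealization of full state partitioning, and the Python helper below is only a sketch.

    # Memory consumed by model states under mixed-precision Adam,
    # using the ZeRO paper's ~16 bytes per parameter accounting.
    def model_state_gib(num_params: int, num_gpus: int = 1) -> float:
        fp16_params_and_grads = 2 + 2        # fp16 parameters + fp16 gradients
        fp32_optimizer_states = 4 + 4 + 4    # master weights, momentum, variance
        bytes_per_param = fp16_params_and_grads + fp32_optimizer_states  # 16 bytes
        return num_params * bytes_per_param / num_gpus / 2**30

    print(f"1B params, no partitioning : {model_state_gib(10**9):.1f} GiB")
    print(f"1B params, split on 8 GPUs : {model_state_gib(10**9, 8):.1f} GiB")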

8-bit optimizers via block-wise quantization

T Dettmers, M Lewis, S Shleifer… - arXiv preprint arXiv …, 2021 - arxiv.org
Stateful optimizers maintain gradient statistics over time, e.g., the exponentially smoothed
sum (SGD with momentum) or squared sum (Adam) of past gradient values. This state can …
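A minimal sketch of the block-wise idea on a single optimizer-state tensor: the tensor is split into fixed-size blocks, each block is scaled by its own absolute maximum, and values are stored in 8 bits. This uses plain linear int8 quantization for clarity; the actual 8-bit optimizers use a dynamic quantization map and fused GPU kernels that are not reproduced here.

    import numpy as np

    def blockwise_quantize(state: np.ndarray, block_size: int = 2048):
        """Flat fp32 tensor -> int8 values plus one absmax scale per block."""
        flat = state.ravel().astype(np.float32)
        pad = (-flat.size) % block_size
        flat = np.pad(flat, (0, pad))
        blocks = flat.reshape(-1, block_size)
        scales = np.abs(blocks).max(axis=1, keepdims=True) + 1e-12
        q = np.clip(np.round(blocks / scales * 127), -127, 127).astype(np.int8)
        return q, scales, state.shape, pad

    def blockwise_dequantize(q, scales, shape, pad):
        flat = (q.astype(np.float32) / 127 * scales).ravel()
        return (flat[:-pad] if pad else flat).reshape(shape)

    momentum = np.random.randn(10_000).astype(np.float32)   # e.g. an Adam state
    q, s, shape, pad = blockwise_quantize(momentum)
    err = np.abs(blockwise_dequantize(q, s, shape, pad) - momentum).max()
    print(f"~4x smaller state, max abs error {err:.4f}")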

BPipe: Memory-balanced pipeline parallelism for training large language models

T Kim, H Kim, GI Yu, BG Chun - International Conference on …, 2023 - proceedings.mlr.press
Pipeline parallelism is a key technique for training large language models within GPU
clusters. However, it often leads to a memory imbalance problem, where certain GPUs face …
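A back-of-the-envelope sketch of where the imbalance comes from, assuming a standard 1F1B schedule in which stage i holds activations for up to (num_stages - i) in-flight microbatches; the per-microbatch activation size below is a made-up figure for illustration, and the memory-balancing schedule proposed in the paper is not shown.

    # Peak activation memory per pipeline stage under a 1F1B schedule.
    num_stages = 8
    act_gb_per_microbatch = 1.5   # hypothetical activation footprint per stage

    for stage in range(num_stages):
        in_flight = num_stages - stage   # warm-up microbatches this stage keeps alive
        print(f"stage {stage}: ~{in_flight * act_gb_per_microbatch:.1f} GB of activations")
    # The first stage holds roughly num_stages times the activations of the last one,
    # which is the imbalance that memory-balanced schedules aim to remove.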

Petals: Collaborative inference and fine-tuning of large models

A Borzunov, D Baranchuk, T Dettmers… - arXiv preprint arXiv …, 2022 - arxiv.org
Many NLP tasks benefit from using large language models (LLMs) that often have more than
100 billion parameters. With the release of BLOOM-176B and OPT-175B, everyone can …

Distributed inference and fine-tuning of large language models over the internet

A Borzunov, M Ryabinin… - Advances in …, 2024 - proceedings.neurips.cc
Large language models (LLMs) are useful in many NLP tasks and become more capable
with size, with the best open-source models having over 50 billion parameters. However …

Efficient combination of rematerialization and offloading for training dnns

O Beaumont, L Eyraud-Dubois… - Advances in Neural …, 2021 - proceedings.neurips.cc
Rematerialization and offloading are two well known strategies to save memory during the
training phase of deep neural networks, allowing data scientists to consider larger models …
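A minimal PyTorch sketch that combines the two mechanisms on a toy model: rematerialization via torch.utils.checkpoint (activations of some blocks are recomputed during backward instead of being stored) and offloading via the save_on_cpu saved-tensor hook (stored activations are parked in host memory). The model, sizes, and the every-other-block policy are illustrative only; the paper's contribution is the joint scheduling of the two, which is not reproduced.

    import torch
    from torch import nn
    from torch.utils.checkpoint import checkpoint

    model = nn.Sequential(*[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())
                            for _ in range(8)])
    x = torch.randn(32, 1024, requires_grad=True)

    # Offloading: tensors saved for backward are moved to (pinned) CPU memory
    # and copied back on demand; this pays off when the model lives on a GPU.
    with torch.autograd.graph.save_on_cpu(pin_memory=True):
        h = x
        for i, block in enumerate(model):
            if i % 2 == 0:
                # Rematerialization: drop this block's activations, recompute in backward.
                h = checkpoint(block, h, use_reentrant=False)
            else:
                h = block(h)
        h.sum().backward()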

Matching guided distillation

K Yue, J Deng, F Zhou - Computer Vision–ECCV 2020: 16th European …, 2020 - Springer
Feature distillation is an effective way to improve the performance of a smaller student
model, which has fewer parameters and lower computation cost compared to the larger …
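A minimal sketch of plain feature distillation, i.e. matching an intermediate student feature map to the teacher's through a small learned projector; the channel widths are hypothetical, and the matching step that gives the paper its name is not reproduced here.

    import torch
    from torch import nn
    import torch.nn.functional as F

    teacher_feat = torch.randn(8, 256, 14, 14)               # from a frozen teacher
    student_feat = torch.randn(8, 128, 14, 14, requires_grad=True)

    # 1x1 conv lifts the narrower student features to the teacher's width.
    projector = nn.Conv2d(128, 256, kernel_size=1)

    distill_loss = F.mse_loss(projector(student_feat), teacher_feat.detach())
    distill_loss.backward()   # gradients reach the student features and the projector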

Efficient training of large language models on distributed infrastructures: a survey

J Duan, S Zhang, Z Wang, L Jiang, W Qu, Q Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) like GPT and LLaMA are revolutionizing the AI industry with
their sophisticated capabilities. Training these models requires vast GPU clusters and …

Mpress: Democratizing billion-scale model training on multi-gpu servers via memory-saving inter-operator parallelism

Q Zhou, H Wang, X Yu, C Li, Y Bai… - … Symposium on High …, 2023 - ieeexplore.ieee.org
It remains challenging to train billion-scale DNN models on a single modern multi-GPU
server due to the GPU memory wall. Unfortunately, existing memory-saving techniques such …