ZeRO-Offload: Democratizing billion-scale model training
Large-scale model training has been a playing ground for a limited few requiring complex
model refactoring and access to prohibitively expensive GPU clusters. ZeRO-Offload …
ZeRO: Memory optimizations toward training trillion parameter models
Large deep learning models offer significant accuracy gains, but training billions to trillions
of parameters is challenging. Existing solutions such as data and model parallelisms exhibit …
8-bit optimizers via block-wise quantization
Stateful optimizers maintain gradient statistics over time, e.g., the exponentially smoothed
sum (SGD with momentum) or squared sum (Adam) of past gradient values. This state can …
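As a rough illustration of the idea named in this title, the sketch below quantizes a stand-in optimizer-state buffer to int8 with one absmax scale per block. It uses plain linear quantization rather than the dynamic quantization map of the actual 8-bit optimizers, and the buffer and block sizes are assumptions.

import numpy as np

def blockwise_quantize(x, block_size=256):
    # One int8 value per element, one fp32 absmax scale per block of 256 elements.
    x = x.astype(np.float32)
    pad = (-len(x)) % block_size
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) + 1e-12  # avoid division by zero
    q = np.round(blocks / scales * 127).astype(np.int8)
    return q, scales, len(x)

def blockwise_dequantize(q, scales, n):
    return (q.astype(np.float32) / 127 * scales).reshape(-1)[:n]

# Stand-in for an Adam second-moment buffer (exp. average of squared gradients).
state = np.random.rand(10_000).astype(np.float32)
q, scales, n = blockwise_quantize(state)
restored = blockwise_dequantize(q, scales, n)
print(np.max(np.abs(state - restored)))  # small per-block quantization error

Storing the state as int8 plus a per-block scale cuts its memory to roughly a quarter of fp32, which is the motivation for block-wise 8-bit optimizer states.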
BPIPE: memory-balanced pipeline parallelism for training large language models
Pipeline parallelism is a key technique for training large language models within GPU
clusters. However, it often leads to a memory imbalance problem, where certain GPUs face …
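To make the imbalance concrete, here is a back-of-the-envelope sketch, not taken from the paper, of in-flight activation memory per stage under a standard 1F1B pipeline schedule, where earlier stages hold more warm-up micro-batches. The stage count and per-micro-batch footprint are assumed numbers.

# Approximate in-flight micro-batches per stage in a 1F1B schedule:
# stage i keeps roughly (num_stages - i) activations alive at steady state.
num_stages = 8
activation_gb_per_microbatch = 2.0  # assumed activation memory per micro-batch per stage

for stage in range(num_stages):
    in_flight = num_stages - stage
    print(f"stage {stage}: ~{in_flight} micro-batches in flight, "
          f"~{in_flight * activation_gb_per_microbatch:.0f} GB of activations")

The first stage holds several times the activation memory of the last one, which is the imbalance the paper targets.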
Petals: Collaborative inference and fine-tuning of large models
Many NLP tasks benefit from using large language models (LLMs) that often have more than
100 billion parameters. With the release of BLOOM-176B and OPT-175B, everyone can …
Distributed inference and fine-tuning of large language models over the internet
Large language models (LLMs) are useful in many NLP tasks and become more capable
with size, with the best open-source models having over 50 billion parameters. However …
Efficient combination of rematerialization and offloading for training DNNs
Rematerialization and offloading are two well known strategies to save memory during the
training phase of deep neural networks, allowing data scientists to consider larger models …
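A minimal PyTorch sketch of the two strategies in isolation, assuming a toy block and tensor sizes; the combined scheduling policy studied in the paper is not reproduced here.

import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
)
x = torch.randn(32, 1024, requires_grad=True)

# Rematerialization: do not keep this block's activations; recompute them in backward.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()

# Offloading: park a tensor on the CPU and bring it back only when it is needed again.
big = torch.randn(4096, 4096)
if torch.cuda.is_available():
    big = big.cuda()
    offloaded = big.to("cpu", non_blocking=True)   # free GPU memory, keep a CPU copy
    del big
    big = offloaded.cuda(non_blocking=True)        # reload before the tensor is used again

Rematerialization trades extra compute for memory, while offloading trades PCIe transfer time for memory; combining them is the scheduling problem the paper addresses.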
Matching guided distillation
Feature distillation is an effective way to improve the performance for a smaller student
model, which has fewer parameters and lower computation cost compared to the larger …
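For context, a generic feature-distillation loss can be written as an MSE between intermediate teacher and student features, with a 1x1 projection when channel widths differ. This is the plain formulation, not the matching-guided variant the paper proposes, and all shapes here are assumptions.

import torch
import torch.nn.functional as F

# Align a student feature map with a (frozen) teacher feature map.
teacher_feat = torch.randn(8, 256, 14, 14)            # from the larger teacher
student_feat = torch.randn(8, 128, 14, 14, requires_grad=True)

project = torch.nn.Conv2d(128, 256, kernel_size=1)    # lift student channels to teacher width
loss = F.mse_loss(project(student_feat), teacher_feat.detach())
loss.backward()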
Efficient training of large language models on distributed infrastructures: a survey
Large Language Models (LLMs) like GPT and LLaMA are revolutionizing the AI industry with
their sophisticated capabilities. Training these models requires vast GPU clusters and …
MPress: Democratizing billion-scale model training on multi-GPU servers via memory-saving inter-operator parallelism
It remains challenging to train billion-scale DNN models on a single modern multi-GPU
server due to the GPU memory wall. Unfortunately, existing memory-saving techniques such …