Resource-efficient algorithms and systems of foundation models: A survey
Large foundation models, including large language models, vision transformers, diffusion,
and large language model based multimodal models, are revolutionizing the entire machine …
MAST: Global scheduling of ML training across Geo-Distributed datacenters at hyperscale
A Choudhury, Y Wang, T Pelkonen… - 18th USENIX …, 2024
In public clouds, users must manually select a datacenter region to upload their ML training
data and launch ML training workloads in the same region to ensure data and computation …
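The truncated snippet stops at MAST's core constraint: training data and the compute that consumes it should sit in the same region. A minimal sketch of that placement rule, assuming hypothetical region names and a free-GPU tiebreak (MAST's actual global scheduler is far more sophisticated):

```python
# Toy illustration of the co-location constraint the snippet describes: a job
# should land in a region that already holds its dataset, breaking ties by
# free capacity. All names are hypothetical; this is not MAST's algorithm.
from dataclasses import dataclass

@dataclass
class Region:
    name: str
    free_gpus: int
    datasets: set[str]

def place_job(dataset: str, gpus_needed: int, regions: list[Region]) -> str | None:
    # Prefer regions that co-locate data and computation.
    candidates = [r for r in regions
                  if dataset in r.datasets and r.free_gpus >= gpus_needed]
    if not candidates:
        return None  # a real scheduler might instead replicate the data
    return max(candidates, key=lambda r: r.free_gpus).name

regions = [
    Region("us-east", free_gpus=512, datasets={"web-corpus"}),
    Region("eu-west", free_gpus=2048, datasets={"web-corpus", "code-corpus"}),
]
print(place_job("web-corpus", 1024, regions))  # -> eu-west
```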
A survey of resource-efficient LLM and multimodal foundation models
Large foundation models, including large language models (LLMs), vision transformers
(ViTs), diffusion, and LLM-based multimodal models, are revolutionizing the entire machine …
Reducing energy bloat in large model training
Training large AI models on numerous GPUs consumes a massive amount of energy,
making power delivery one of the largest limiting factors in building and operating …
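The snippet ends at the motivation, but the term "energy bloat" suggests the underlying observation: in a synchronized training iteration only the slowest (critical-path) computation bounds iteration time, so work with slack can run slower, and thus at lower power, for free. A toy sketch of that slack calculation, with invented stage times and an assumed linear slowdown model:

```python
# Hedged sketch: a pipeline stage that finishes before the slowest stage has
# slack, so its GPUs can be down-clocked without stretching the iteration.
# The numbers and the two-point frequency model are illustrative only.
stage_time_s = {"stage0": 0.80, "stage1": 1.00, "stage2": 0.70}  # per iteration
SLOWDOWN = 1.25       # assumed runtime multiplier at the reduced frequency
POWER_SAVING = 0.30   # assumed fraction of power saved at that frequency

critical = max(stage_time_s.values())
for stage, t in stage_time_s.items():
    fits = t * SLOWDOWN <= critical  # does the slack absorb the slowdown?
    plan = "reduced freq" if fits else "full freq"
    print(f"{stage}: {t:.2f}s, slack {critical - t:.2f}s -> {plan}")
```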
Efficient training of large language models on distributed infrastructures: a survey
Large Language Models (LLMs) like GPT and LLaMA are revolutionizing the AI industry with
their sophisticated capabilities. Training these models requires vast GPU clusters and …
Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey
With the rapid growth in the volume of data sets, models, and devices in the domain of deep
learning, there is increasing attention on large-scale distributed deep learning. In contrast to …
Tenplex: Dynamic parallelism for deep learning using parallelizable tensor collections
Deep learning (DL) jobs use multi-dimensional parallelism, i.e., combining data, model, and
pipeline parallelism, to use large GPU clusters efficiently. Long-running jobs may …
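As context for "multi-dimensional parallelism": each GPU rank occupies a coordinate in a (data, pipeline, tensor) mesh, and resizing any dimension mid-job forces exactly the state repartitioning that Tenplex targets. A sketch of the rank-to-coordinate mapping (the layout convention here is an assumption, not Tenplex's API):

```python
# Minimal sketch of a 3D parallelism mesh: every GPU rank maps to a
# (data, pipeline, tensor) coordinate. Tenplex's contribution --
# repartitioning live job state when these degrees change -- is not shown;
# this only illustrates the configuration space.
def mesh_coords(rank: int, dp: int, pp: int, tp: int) -> tuple[int, int, int]:
    assert 0 <= rank < dp * pp * tp
    return (rank // (pp * tp), (rank // tp) % pp, rank % tp)

# 16 GPUs laid out as data=2 x pipeline=4 x tensor=2; resizing to (4, 2, 2)
# would require moving model and optimizer state between GPUs mid-training.
for rank in range(16):
    print(rank, mesh_coords(rank, dp=2, pp=4, tp=2))
```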
ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation
Training large Deep Neural Network (DNN) models requires thousands of GPUs over the
course of several days or weeks. At this scale, failures are frequent and can have a big …
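A hedged sketch of what "pipeline adaptation" after a failure can look like: the micro-batches of a failed worker are reassigned to surviving data-parallel replicas of the same pipeline stage, so training continues without spare GPUs. The round-robin reassignment below is illustrative; ReCycle's actual schedule is more involved:

```python
# Sketch of failure handling by rerouting work instead of keeping hot spares:
# when a worker serving one data-parallel replica of a pipeline stage fails,
# its micro-batches are reassigned to surviving replicas of the SAME stage.
from itertools import cycle

def reassign(micro_batches: list[int], replicas: list[str],
             failed: set[str]) -> dict[str, list[int]]:
    alive = [r for r in replicas if r not in failed]
    if not alive:
        raise RuntimeError("stage lost all replicas; restore from checkpoint")
    plan: dict[str, list[int]] = {r: [] for r in alive}
    for mb, r in zip(micro_batches, cycle(alive)):
        plan[r].append(mb)
    return plan

# Stage 2 is replicated across 4 data-parallel pipelines; one replica dies.
print(reassign(list(range(8)), ["dp0", "dp1", "dp2", "dp3"], failed={"dp2"}))
```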
DistTrain: Addressing model and data heterogeneity with disaggregated training for multimodal large language models
Multimodal large language models (LLMs) have demonstrated significant potential in a wide
range of AI applications. Yet, training multimodal LLMs suffers from low efficiency and …
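To make "disaggregated training" concrete: the modality encoders and the LLM backbone run on separately sized GPU pools, so each component can be scaled to its own load rather than sharing one fixed per-GPU slice. The proportional split below is a hypothetical heuristic, not DistTrain's algorithm:

```python
# Illustrative sketch only: "disaggregated" here means encoder and backbone
# occupy independently sized GPU pools. The proportional heuristic is made up.
def split_gpus(total: int, encoder_load: float,
               backbone_load: float) -> dict[str, int]:
    enc = max(1, round(total * encoder_load / (encoder_load + backbone_load)))
    return {"encoder_pool": enc, "backbone_pool": total - enc}

# A vision-heavy batch and a text-heavy batch want different splits of 16 GPUs,
# which is the kind of heterogeneity a single monolithic layout handles poorly.
print(split_gpus(16, encoder_load=3.0, backbone_load=5.0))  # {'encoder_pool': 6, ...}
print(split_gpus(16, encoder_load=1.0, backbone_load=7.0))  # {'encoder_pool': 2, ...}
```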
HybridFlow: A flexible and efficient RLHF framework
Reinforcement Learning from Human Feedback (RLHF) is widely used in Large Language
Model (LLM) alignment. Traditional RL can be modeled as a dataflow, where each node …
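The truncated sentence describes RLHF as a dataflow whose nodes are model computations. A toy rendering of that graph for a PPO-style step, with generic stage names rather than HybridFlow's actual API:

```python
# Toy "RLHF as dataflow": each node is one model computation and edges carry
# intermediate data (prompts, responses, rewards, advantages). Generic names.
graph = {
    "actor.generate":    [],                                  # prompts -> responses
    "reward.score":      ["actor.generate"],                  # responses -> rewards
    "critic.values":     ["actor.generate"],                  # responses -> values
    "compute_advantage": ["reward.score", "critic.values"],
    "actor.update":      ["compute_advantage"],
    "critic.update":     ["compute_advantage"],
}

def topo_order(g: dict[str, list[str]]) -> list[str]:
    # Depth-first topological sort; fine for this small acyclic graph.
    done: list[str] = []
    def visit(node: str) -> None:
        if node in done:
            return
        for dep in g[node]:
            visit(dep)
        done.append(node)
    for node in g:
        visit(node)
    return done

print(topo_order(graph))  # one valid execution order of an RLHF step
```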