Resource-efficient algorithms and systems of foundation models: A survey

M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2025 - dl.acm.org
Large foundation models, including large language models, vision transformers, diffusion,
and large language model based multimodal models, are revolutionizing the entire machine …

MAST: Global scheduling of ML training across Geo-Distributed datacenters at hyperscale

A Choudhury, Y Wang, T Pelkonen… - 18th USENIX …, 2024 - yangwang83.github.io
In public clouds, users must manually select a datacenter region to upload their ML training
data and launch ML training workloads in the same region to ensure data and computation …

A survey of resource-efficient LLM and multimodal foundation models

M Xu, W Yin, D Cai, R Yi, D Xu, Q Wang, B Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large foundation models, including large language models (LLMs), vision transformers
(ViTs), diffusion, and LLM-based multimodal models, are revolutionizing the entire machine …

Reducing energy bloat in large model training

JW Chung, Y Gu, I Jang, L Meng, N Bansal… - Proceedings of the …, 2024 - dl.acm.org
Training large AI models on numerous GPUs consumes a massive amount of energy,
making power delivery one of the largest limiting factors in building and operating …

Efficient training of large language models on distributed infrastructures: a survey

J Duan, S Zhang, Z Wang, L Jiang, W Qu, Q Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) like GPT and LLaMA are revolutionizing the AI industry with
their sophisticated capabilities. Training these models requires vast GPU clusters and …

Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey

F Liang, Z Zhang, H Lu, V Leung, Y Guo… - arXiv preprint arXiv …, 2024 - arxiv.org
With the rapid growth in the volume of data sets, models, and devices in the domain of deep
learning, there is increasing attention on large-scale distributed deep learning. In contrast to …

Tenplex: Dynamic parallelism for deep learning using parallelizable tensor collections

M Wagenländer, G Li, B Zhao, L Mai… - Proceedings of the ACM …, 2024 - dl.acm.org
Deep learning (DL) jobs use multi-dimensional parallelism, i.e., combining data, model, and
pipeline parallelism, to use large GPU clusters efficiently. Long-running jobs may …
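
To make the multi-dimensional parallelism mentioned in this entry concrete, here is a minimal sketch (not taken from the Tenplex paper) of a 3-D device mesh with data, tensor (model), and pipeline axes, written with JAX's sharding API; the 2×2×2 split, the axis names, and the array shapes are illustrative assumptions.

```python
import os
# Simulate 8 devices on a CPU-only host so the sketch runs anywhere.
os.environ["XLA_FLAGS"] = "--xla_force_host_platform_device_count=8"

import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Arrange 8 devices as a 2 x 2 x 2 mesh: data x tensor x pipeline.
devices = np.array(jax.devices()).reshape(2, 2, 2)
mesh = Mesh(devices, axis_names=("data", "tensor", "pipeline"))

# Activations are split along the data axis, weights along the tensor axis;
# the pipeline axis would hold different layer groups (not shown here).
x = jax.device_put(jnp.ones((8, 1024)), NamedSharding(mesh, P("data", None)))
w = jax.device_put(jnp.ones((1024, 512)), NamedSharding(mesh, P(None, "tensor")))

print(mesh.shape)  # {'data': 2, 'tensor': 2, 'pipeline': 2}
print(x.sharding, w.sharding)
```

Reshaping such a mesh while a job is running, e.g. after GPUs are added or removed, is the kind of resource change that motivates dynamic parallelism.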

ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation

S Gandhi, M Zhao, A Skiadopoulos… - Proceedings of the ACM …, 2024 - dl.acm.org
Training large Deep Neural Network (DNN) models requires thousands of GPUs over the
course of several days or weeks. At this scale, failures are frequent and can have a big …

DistTrain: Addressing model and data heterogeneity with disaggregated training for multimodal large language models

Z Zhang, Y Zhong, R Ming, H Hu, J Sun, Z Ge… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal large language models (LLMs) have demonstrated significant potential in a wide
range of AI applications. Yet, training multimodal LLMs suffers from low efficiency and …

HybridFlow: A flexible and efficient RLHF framework

G Sheng, C Zhang, Z Ye, X Wu, W Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement Learning from Human Feedback (RLHF) is widely used in Large Language
Model (LLM) alignment. Traditional RL can be modeled as a dataflow, where each node …
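
A minimal sketch of the dataflow view mentioned in this entry: nodes stand for model computations (generation, scoring, updates) and edges for the data passed between them. The node names (actor_generate, reward_score, etc.) and the graph topology are hypothetical illustrations, not HybridFlow's actual API.

```python
# Hypothetical RLHF dataflow: each node is a computation and edges point to
# the nodes that consume its output.
rlhf_dataflow = {
    "actor_generate":     ["reward_score", "critic_value", "reference_logprobs"],
    "reward_score":       ["actor_update", "critic_update"],
    "critic_value":       ["actor_update", "critic_update"],
    "reference_logprobs": ["actor_update"],
    "actor_update":       [],
    "critic_update":      [],
}

def execution_order(graph):
    """Topological sort: an execution order that respects every data dependency."""
    indegree = {node: 0 for node in graph}
    for consumers in graph.values():
        for c in consumers:
            indegree[c] += 1
    ready = [n for n, d in indegree.items() if d == 0]
    order = []
    while ready:
        node = ready.pop()
        order.append(node)
        for c in graph[node]:
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)
    return order

print(execution_order(rlhf_dataflow))
# e.g. ['actor_generate', 'reference_logprobs', 'critic_value', 'reward_score', ...]
```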