Reducing energy bloat in large model training

JW Chung, Y Gu, I Jang, L Meng, N Bansal… - Proceedings of the …, 2024 - dl.acm.org
Training large AI models on numerous GPUs consumes a massive amount of energy,
making power delivery one of the largest limiting factors in building and operating …

Efficient training of large language models on distributed infrastructures: a survey

J Duan, S Zhang, Z Wang, L Jiang, W Qu, Q Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) like GPT and LLaMA are revolutionizing the AI industry with
their sophisticated capabilities. Training these models requires vast GPU clusters and …

ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation

S Gandhi, M Zhao, A Skiadopoulos… - Proceedings of the ACM …, 2024 - dl.acm.org
Training large Deep Neural Network (DNN) models requires thousands of GPUs over the
course of several days or weeks. At this scale, failures are frequent and can have a big …

A case for server-scale photonic connectivity

AV Kumar, A Devraj, D Bunandar, R Singh - Proceedings of the 23rd …, 2024 - dl.acm.org
The commoditization of machine learning is fuelling the demand for compute required to
both train large models and infer from them. At the same time, scaling the performance of …

SlipStream: Adapting Pipelines for Distributed Training of Large DNNs Amid Failures

S Gandhi, M Zhao, A Skiadopoulos… - arXiv preprint arXiv …, 2024 - arxiv.org
Training large Deep Neural Network (DNN) models requires thousands of GPUs for days or
weeks at a time. At these scales, failures are frequent and can have a big impact on training …

MoEtion: Efficient and Reliable Checkpointing for Mixture-of-Experts Models at Scale

S Gandhi, C Kozyrakis - arXiv preprint arXiv:2412.15411, 2024 - arxiv.org
As large language models scale, distributed training systems increasingly rely on thousands
of GPUs running for days or weeks. Fault tolerance is essential, and periodic model …

PARALLELGPUOS: A Concurrent OS-level GPU Checkpoint and Restore System using Validated Speculation

Z Huang, X Wei, Y Hao, R Chen, M Han, J Gu… - arXiv preprint arXiv …, 2024 - arxiv.org
Checkpointing (C) and restoring (R) are key components for GPU tasks. POS is an OS-level
GPU C/R system: it can transparently checkpoint or restore processes that use the GPU …

Chip-to-chip photonic connectivity in multi-accelerator servers for ML

AV Kumar, A Devraj, D Bunandar, R Singh - arXiv preprint arXiv …, 2025 - arxiv.org
We present a rack-scale compute architecture for ML using multi-accelerator servers
connected via chip-to-chip silicon photonic components. Our architecture achieves (1) multi …

Architecting a reliable quantum operating system: microkernel, message passing and supercomputing

A Paler - arXiv preprint arXiv:2410.13482, 2024 - arxiv.org
A quantum operating system (QCOS) is classical software running on classical hardware. The
QCOS prepares, starts, controls, and manages quantum computations. The reliable …