Reducing energy bloat in large model training

JW Chung, Y Gu, I Jang, L Meng, N Bansal… - Proceedings of the …, 2024 - dl.acm.org
Training large AI models on numerous GPUs consumes a massive amount of energy,
making power delivery one of the largest limiting factors in building and operating …

Efficient training of large language models on distributed infrastructures: a survey

J Duan, S Zhang, Z Wang, L Jiang, W Qu, Q Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) like GPT and LLaMA are revolutionizing the AI industry with
their sophisticated capabilities. Training these models requires vast GPU clusters and …

ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation

S Gandhi, M Zhao, A Skiadopoulos… - Proceedings of the ACM …, 2024 - dl.acm.org
Training large Deep Neural Network (DNN) models requires thousands of GPUs over the
course of several days or weeks. At this scale, failures are frequent and can have a big …

A case for server-scale photonic connectivity

AV Kumar, A Devraj, D Bunandar, R Singh - Proceedings of the 23rd …, 2024 - dl.acm.org
The commoditization of machine learning is fuelling the demand for compute required to
both train large models and infer from them. At the same time, scaling the performance of …

SlipStream: Adapting Pipelines for Distributed Training of Large DNNs Amid Failures

S Gandhi, M Zhao, A Skiadopoulos… - arXiv preprint arXiv …, 2024 - arxiv.org
Training large Deep Neural Network (DNN) models requires thousands of GPUs for days or
weeks at a time. At these scales, failures are frequent and can have a big impact on training …

MoEtion: Efficient and Reliable Checkpointing for Mixture-of-Experts Models at Scale

S Gandhi, C Kozyrakis - arXiv preprint arXiv:2412.15411, 2024 - arxiv.org
As large language models scale, distributed training systems increasingly rely on thousands
of GPUs running for days or weeks. Fault tolerance is essential, and periodic model …

PARALLELGPUOS: A Concurrent OS-level GPU Checkpoint and Restore System using Validated Speculation

Z Huang, X Wei, Y Hao, R Chen, M Han, J Gu… - arXiv preprint arXiv …, 2024 - arxiv.org
Checkpointing (C) and restoring (R) are key components for GPU tasks. POS is an OS-level
GPU C/R system: it can transparently checkpoint or restore processes that use the GPU …

Chip-to-chip photonic connectivity in multi-accelerator servers for ML

AV Kumar, A Devraj, D Bunandar, R Singh - arXiv preprint arXiv …, 2025 - arxiv.org
We present a rack-scale compute architecture for ML using multi-accelerator servers
connected via chip-to-chip silicon photonic components. Our architecture achieves (1) multi …

Architecting a reliable quantum operating system: microkernel, message passing and supercomputing

A Paler - arXiv preprint arXiv:2410.13482, 2024 - arxiv.org
A quantum operating system (QCOS) is classical software running on classical hardware. The
QCOS prepares, starts, controls, and manages quantum computations. The reliable …