Reducing energy bloat in large model training
Training large AI models on numerous GPUs consumes a massive amount of energy,
making power delivery one of the largest limiting factors in building and operating …
Efficient training of large language models on distributed infrastructures: a survey
Large Language Models (LLMs) like GPT and LLaMA are revolutionizing the AI industry with
their sophisticated capabilities. Training these models requires vast GPU clusters and …
ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation
Training large Deep Neural Network (DNN) models requires thousands of GPUs over the
course of several days or weeks. At this scale, failures are frequent and can have a big …
A case for server-scale photonic connectivity
The commoditization of machine learning is fuelling the demand for compute required to
both train large models and infer from them. At the same time, scaling the performance of …
SlipStream: Adapting Pipelines for Distributed Training of Large DNNs Amid Failures
Training large Deep Neural Network (DNN) models requires thousands of GPUs for days or
weeks at a time. At these scales, failures are frequent and can have a big impact on training …
MoEtion: Efficient and Reliable Checkpointing for Mixture-of-Experts Models at Scale
As large language models scale, distributed training systems increasingly rely on thousands
of GPUs running for days or weeks. Fault tolerance is essential and periodic model …
PARALLELGPUOS: A Concurrent OS-level GPU Checkpoint and Restore System using Validated Speculation
Checkpointing (C) and restoring (R) are key components for GPU tasks. POS is an OS-level
GPU C/R system: it can transparently checkpoint or restore processes that use the GPU …
Chip-to-chip photonic connectivity in multi-accelerator servers for ML
We present a rack-scale compute architecture for ML using multi-accelerator servers
connected via chip-to-chip silicon photonic components. Our architecture achieves (1) multi …
Architecting a reliable quantum operating system: microkernel, message passing and supercomputing
A. Paler, arXiv preprint arXiv:2410.13482, 2024, arxiv.org
A quantum operating system (QCOS) is classical software running on classical hardware. The
QCOS prepares, starts, controls and manages quantum computations. The reliable …