Offloading machine learning to programmable data planes: A systematic survey
The demand for machine learning (ML) has increased significantly in recent decades,
enabling several applications, such as speech recognition, computer vision, and …
Oobleck: Resilient distributed training of large models using pipeline templates
Oobleck enables resilient distributed training of large DNN models with guaranteed fault
tolerance. It takes a planning-execution co-design approach, where it first generates a set of …
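The title names pipeline templates as the planning artifact: a set of precomputed pipeline configurations that execution can fall back on when nodes fail. As a rough illustration of that idea only (the even-partition heuristic and every name below are invented, not Oobleck's algorithm or API), a minimal Python sketch:

```python
# Illustrative sketch of the pipeline-template idea: precompute one stage
# partition per feasible node count, so that after a failure the job can
# drop to a smaller precomputed template instead of re-planning.

def make_template(num_layers: int, num_nodes: int) -> list[range]:
    """Evenly split model layers into one pipeline stage per node."""
    base, extra = divmod(num_layers, num_nodes)
    stages, start = [], 0
    for i in range(num_nodes):
        size = base + (1 if i < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages

# One template per node count the job may shrink to after failures.
templates = {n: make_template(num_layers=48, num_nodes=n) for n in range(2, 9)}

def reinstantiate(alive_nodes: int) -> list[range]:
    """Pick the largest precomputed template that still fits."""
    usable = max(n for n in templates if n <= alive_nodes)
    return templates[usable]

print(reinstantiate(alive_nodes=5))  # stage layer-ranges for 5 nodes
```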
Looking beyond GPUs for DNN scheduling on multi-tenant clusters
Training Deep Neural Networks (DNNs) is a popular workload in both enterprises and cloud
data centers. Existing schedulers for DNN training treat the GPU as the dominant resource …
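The snippet's critique is that treating GPUs as the only scarce resource ignores CPU and memory contention. A toy multi-dimensional first-fit check (all names invented; not the paper's algorithm) shows what scheduling beyond GPUs means at its simplest:

```python
# Toy sketch: admit a DNN job only if GPUs, CPUs, and memory all fit,
# rather than checking GPU count alone. Not the paper's scheduler.

from dataclasses import dataclass

@dataclass
class Demand:
    gpus: int
    cpus: int
    mem_gb: int

@dataclass
class Server:
    free: Demand

    def fits(self, d: Demand) -> bool:
        return (d.gpus <= self.free.gpus and d.cpus <= self.free.cpus
                and d.mem_gb <= self.free.mem_gb)

    def admit(self, d: Demand) -> None:
        self.free = Demand(self.free.gpus - d.gpus,
                           self.free.cpus - d.cpus,
                           self.free.mem_gb - d.mem_gb)

server = Server(free=Demand(gpus=8, cpus=64, mem_gb=512))
job = Demand(gpus=2, cpus=24, mem_gb=96)  # CPU-hungry input pipeline
if server.fits(job):
    server.admit(job)
```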
Characterization of large language model development in the datacenter
Large Language Models (LLMs) have demonstrated impressive performance across several
transformative tasks. However, it is non-trivial to efficiently utilize large-scale cluster …
DataStates-LLM: Lazy asynchronous checkpointing for large language models
LLMs have seen rapid adoption in all domains. They need to be trained on high-end
high-performance computing (HPC) infrastructures and ingest massive amounts of input data …
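The mechanism named in the title, lazy asynchronous checkpointing, separates a fast in-memory capture on the training path from a slow background flush to storage. A minimal thread-based sketch (invented names; not the DataStates-LLM API):

```python
# Minimal sketch of lazy asynchronous checkpointing: block training only
# for an in-memory snapshot, then flush to storage on a background thread.

import copy
import pickle
import threading

def async_checkpoint(state: dict, path: str) -> threading.Thread:
    snapshot = copy.deepcopy(state)      # fast capture on the training thread
    def flush() -> None:
        with open(path, "wb") as f:      # slow I/O, off the critical path
            pickle.dump(snapshot, f)
    t = threading.Thread(target=flush, daemon=True)
    t.start()
    return t

state = {"step": 1000, "weights": [0.1, 0.2, 0.3]}
flusher = async_checkpoint(state, "ckpt-1000.pkl")
# ... training continues immediately while the flush proceeds ...
flusher.join()  # wait only when the checkpoint must be durable
```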
Cachew: Machine learning input data processing as a service
D Graur, D Aymon, D Kluser, T Albrici… - 2022 USENIX Annual …, 2022 - usenix.org
Processing input data plays a vital role in ML training, impacting accuracy, throughput, and
cost. The input pipeline, which is responsible for feeding data-hungry GPUs/TPUs with …
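Cachew builds on TensorFlow's tf.data service; the baseline pattern it scales out is an input pipeline whose preprocessing overlaps accelerator compute. A toy tf.data example of that pattern (synthetic data and a trivial map; nothing Cachew-specific):

```python
# Toy tf.data pipeline: per-example preprocessing runs in parallel and
# prefetching overlaps it with training, keeping the GPU/TPU fed.

import tensorflow as tf

ds = (tf.data.Dataset.from_tensor_slices(tf.random.uniform([1024, 32]))
        .map(lambda x: x * 2.0,                  # per-example preprocessing
             num_parallel_calls=tf.data.AUTOTUNE)
        .batch(128)
        .prefetch(tf.data.AUTOTUNE))             # overlap input with compute

for batch in ds:
    pass  # a training step would consume `batch` here
```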
Gemini: Fast failure recovery in distributed training with in-memory checkpoints
Large deep learning models have recently garnered substantial attention from both
academia and industry. Nonetheless, frequent failures are observed during large model …
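Checkpointing to CPU memory, with replicas on peer machines, rather than to remote storage is what makes recovery fast here. The single-process sketch below uses plain dictionaries to stand in for machines' RAM; none of it is Gemini's actual mechanism:

```python
# Heavily simplified sketch of in-memory checkpointing with peer
# replication: dictionaries stand in for this machine's RAM and a peer's.
# After a crash, a node's state is restored from a survivor's replica.

local_ckpts: dict[int, dict] = {}   # this machine's snapshots, by step
peer_ckpts: dict[int, dict] = {}    # replicas held on behalf of a peer

def checkpoint(step: int, state: dict, peer_store: dict) -> None:
    snapshot = dict(state)          # fast copy into host memory
    local_ckpts[step] = snapshot    # local copy for quick restarts
    peer_store[step] = snapshot     # replica so a peer can restore us

def recover(step: int) -> dict:
    # Prefer the local copy; fall back to the peer's replica after a crash.
    return local_ckpts.get(step) or peer_ckpts[step]

checkpoint(500, {"w": [1.0, 2.0]}, peer_store=peer_ckpts)
restored = recover(500)
```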
Check-N-Run: A checkpointing system for training deep learning recommendation models
Checkpoints play an important role in training long running machine learning (ML) models.
Checkpoints take a snapshot of an ML model and store it in non-volatile memory so that …
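The snippet states the generic contract: periodically snapshot the model to non-volatile storage and resume from the newest snapshot after a failure. A bare-bones version of that loop (file layout and interval are invented, not Check-N-Run's design):

```python
# Bare-bones periodic checkpoint/resume loop: persist a snapshot every
# N steps and, on restart, continue from the latest one on disk.

import glob
import os
import pickle

CKPT_DIR = "ckpts"

def save(step: int, state: dict) -> None:
    os.makedirs(CKPT_DIR, exist_ok=True)
    with open(f"{CKPT_DIR}/step-{step:08d}.pkl", "wb") as f:
        pickle.dump(state, f)            # snapshot to non-volatile storage

def latest() -> tuple[int, dict]:
    paths = sorted(glob.glob(f"{CKPT_DIR}/step-*.pkl"))
    if not paths:
        return 0, {}                     # fresh start
    with open(paths[-1], "rb") as f:
        state = pickle.load(f)
    return state["step"], state

step, state = latest()                   # resume after a failure, or start fresh
while step < 3_000:
    step += 1                            # one training iteration (stubbed out)
    if step % 1_000 == 0:
        save(step, {"step": step})
```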
Chronus: A novel deadline-aware scheduler for deep learning training jobs
Modern GPU clusters support Deep Learning training (DLT) jobs in a distributed manner.
Job scheduling is key to improving training performance, resource utilization and …
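As a caricature of deadline awareness, the sketch below simply places jobs earliest-deadline-first onto free GPUs; the snippet is truncated before describing Chronus's actual (much richer) scheduler, so treat this as illustration only:

```python
# Toy deadline-aware placement: order pending jobs by deadline (EDF) and
# greedily admit the most urgent ones that still fit on free GPUs.

from dataclasses import dataclass

@dataclass
class Job:
    name: str
    gpus: int
    deadline: float   # hours from now

def schedule(jobs: list[Job], free_gpus: int) -> list[Job]:
    placed = []
    for job in sorted(jobs, key=lambda j: j.deadline):  # most urgent first
        if job.gpus <= free_gpus:
            free_gpus -= job.gpus
            placed.append(job)
    return placed

jobs = [Job("bert", 8, 12.0), Job("resnet", 4, 3.0), Job("gpt", 16, 24.0)]
print([j.name for j in schedule(jobs, free_gpus=16)])  # ['resnet', 'bert']
```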
Lyra: Elastic scheduling for deep learning clusters
Organizations often build separate training and inference clusters for deep learning, and use
separate schedulers to manage them. This leads to problems for both: inference clusters …
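Merging the two clusters under one elastic scheduler implies loaning idle inference GPUs to training while keeping headroom for inference load spikes. A made-up policy function sketching only that decision (not Lyra's mechanism):

```python
# Invented capacity-loaning policy: off-peak, lend inference GPUs to
# training, but always keep a headroom buffer for sudden load spikes.

def loanable_gpus(total: int, predicted_peak_load: int, headroom: int = 2) -> int:
    """GPUs the inference cluster can lend without risking its SLOs."""
    return max(0, total - predicted_peak_load - headroom)

def rebalance(inference_total: int, predicted_peak_load: int) -> dict:
    loan = loanable_gpus(inference_total, predicted_peak_load)
    return {"inference": inference_total - loan, "loaned_to_training": loan}

print(rebalance(inference_total=32, predicted_peak_load=20))
# {'inference': 22, 'loaned_to_training': 10}
```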