Offloading machine learning to programmable data planes: A systematic survey

R Parizotto, BL Coelho, DC Nunes, I Haque… - ACM Computing …, 2023 - dl.acm.org
The demand for machine learning (ML) has increased significantly in recent decades,
enabling several applications, such as speech recognition, computer vision, and …

Oobleck: Resilient distributed training of large models using pipeline templates

I Jang, Z Yang, Z Zhang, X Jin… - Proceedings of the 29th …, 2023 - dl.acm.org
Oobleck enables resilient distributed training of large DNN models with guaranteed fault
tolerance. It takes a planning-execution co-design approach, where it first generates a set of …

Looking beyond GPUs for DNN scheduling on multi-tenant clusters

J Mohan, A Phanishayee, J Kulkarni… - … USENIX Symposium on …, 2022 - usenix.org
Training Deep Neural Networks (DNNs) is a popular workload in both enterprises and cloud
data centers. Existing schedulers for DNN training consider GPU as the dominant resource …

Characterization of large language model development in the datacenter

Q Hu, Z Ye, Z Wang, G Wang, M Zhang… - … USENIX Symposium on …, 2024 - usenix.org
Large Language Models (LLMs) have presented impressive performance across several
transformative tasks. However, it is non-trivial to efficiently utilize large-scale cluster …

DataStates-LLM: Lazy asynchronous checkpointing for large language models

A Maurya, R Underwood, MM Rafique… - Proceedings of the 33rd …, 2024 - dl.acm.org
LLMs have seen rapid adoption in all domains. They need to be trained on high-end high-
performance computing (HPC) infrastructures and ingest massive amounts of input data …

Cachew: Machine learning input data processing as a service

D Graur, D Aymon, D Kluser, T Albrici… - 2022 USENIX Annual …, 2022 - usenix.org
Processing input data plays a vital role in ML training, impacting accuracy, throughput, and
cost. The input pipeline, which is responsible for feeding data-hungry GPUs/TPUs with …

Gemini: Fast failure recovery in distributed training with in-memory checkpoints

Z Wang, Z Jia, S Zheng, Z Zhang, X Fu… - Proceedings of the 29th …, 2023 - dl.acm.org
Large deep learning models have recently garnered substantial attention from both
academia and industry. Nonetheless, frequent failures are observed during large model …

Check-N-Run: A checkpointing system for training deep learning recommendation models

A Eisenman, KK Matam, S Ingram, D Mudigere… - … USENIX Symposium on …, 2022 - usenix.org
Checkpoints play an important role in training long running machine learning (ML) models.
Checkpoints take a snapshot of an ML model and store it in a non-volatile memory so that …

Chronus: A novel deadline-aware scheduler for deep learning training jobs

W Gao, Z Ye, P Sun, Y Wen, T Zhang - … of the ACM Symposium on Cloud …, 2021 - dl.acm.org
Modern GPU clusters support Deep Learning training (DLT) jobs in a distributed manner.
Job scheduling is the key to improve the training performance, resource utilization and …

Lyra: Elastic scheduling for deep learning clusters

J Li, H Xu, Y Zhu, Z Liu, C Guo, C Wang - Proceedings of the Eighteenth …, 2023 - dl.acm.org
Organizations often build separate training and inference clusters for deep learning, and use
separate schedulers to manage them. This leads to problems for both: inference clusters …