Offloading machine learning to programmable data planes: A systematic survey

R Parizotto, BL Coelho, DC Nunes, I Haque… - ACM Computing …, 2023 - dl.acm.org
The demand for machine learning (ML) has increased significantly in recent decades,
enabling several applications, such as speech recognition, computer vision, and …

Oobleck: Resilient distributed training of large models using pipeline templates

I Jang, Z Yang, Z Zhang, X Jin… - Proceedings of the 29th …, 2023 - dl.acm.org
Oobleck enables resilient distributed training of large DNN models with guaranteed fault
tolerance. It takes a planning-execution co-design approach, where it first generates a set of …

Looking beyond GPUs for DNN scheduling on multi-tenant clusters

J Mohan, A Phanishayee, J Kulkarni… - … USENIX Symposium on …, 2022 - usenix.org
Training Deep Neural Networks (DNNs) is a popular workload in both enterprises and cloud
data centers. Existing schedulers for DNN training consider GPU as the dominant resource …

Characterization of large language model development in the datacenter

Q Hu, Z Ye, Z Wang, G Wang, M Zhang… - … USENIX Symposium on …, 2024 - usenix.org
Large Language Models (LLMs) have presented impressive performance across several
transformative tasks. However, it is non-trivial to efficiently utilize large-scale cluster …

DataStates-LLM: Lazy asynchronous checkpointing for large language models

A Maurya, R Underwood, MM Rafique… - Proceedings of the 33rd …, 2024 - dl.acm.org
LLMs have seen rapid adoption in all domains. They need to be trained on high-end high-
performance computing (HPC) infrastructures and ingest massive amounts of input data …

Cachew: Machine learning input data processing as a service

D Graur, D Aymon, D Kluser, T Albrici… - 2022 USENIX Annual …, 2022 - usenix.org
Processing input data plays a vital role in ML training, impacting accuracy, throughput, and
cost. The input pipeline, which is responsible for feeding data-hungry GPUs/TPUs with …

Gemini: Fast failure recovery in distributed training with in-memory checkpoints

Z Wang, Z Jia, S Zheng, Z Zhang, X Fu… - Proceedings of the 29th …, 2023 - dl.acm.org
Large deep learning models have recently garnered substantial attention from both
academia and industry. Nonetheless, frequent failures are observed during large model …

Check-N-Run: A checkpointing system for training deep learning recommendation models

A Eisenman, KK Matam, S Ingram, D Mudigere… - … USENIX Symposium on …, 2022 - usenix.org
Checkpoints play an important role in training long running machine learning (ML) models.
Checkpoints take a snapshot of an ML model and store it in a non-volatile memory so that …

Chronus: A novel deadline-aware scheduler for deep learning training jobs

W Gao, Z Ye, P Sun, Y Wen, T Zhang - … of the ACM Symposium on Cloud …, 2021 - dl.acm.org
Modern GPU clusters support Deep Learning training (DLT) jobs in a distributed manner.
Job scheduling is the key to improve the training performance, resource utilization and …

Lyra: Elastic scheduling for deep learning clusters

J Li, H Xu, Y Zhu, Z Liu, C Guo, C Wang - Proceedings of the Eighteenth …, 2023 - dl.acm.org
Organizations often build separate training and inference clusters for deep learning, and use
separate schedulers to manage them. This leads to problems for both: inference clusters …