ServerlessLLM: Low-Latency serverless inference for large language models

Y Fu, L Xue, Y Huang, AO Brabete, D Ustiugov… - … USENIX Symposium on …, 2024 - usenix.org
This paper presents ServerlessLLM, a distributed system designed to support low-latency
serverless inference for Large Language Models (LLMs). By harnessing the substantial near …

Towards demystifying serverless machine learning training

J Jiang, S Gan, Y Liu, F Wang, G Alonso… - Proceedings of the …, 2021 - dl.acm.org
The appeal of serverless (FaaS) has triggered a growing interest in how to use it in data-
intensive applications such as ETL, query processing, or machine learning (ML). Several …

Pollux: Co-adaptive cluster scheduling for goodput-optimized deep learning

A Qiao, SK Choe, SJ Subramanya… - … on Operating Systems …, 2021 - usenix.org
Pollux improves scheduling performance in deep learning (DL) clusters by adaptively co-
optimizing inter-dependent factors both at the per-job level and at the cluster-wide level …

Gemini: Fast failure recovery in distributed training with in-memory checkpoints

Z Wang, Z Jia, S Zheng, Z Zhang, X Fu… - Proceedings of the 29th …, 2023 - dl.acm.org
Large deep learning models have recently garnered substantial attention from both
academia and industry. Nonetheless, frequent failures are observed during large model …

ElasticFlow: An elastic serverless training platform for distributed deep learning

D Gu, Y Zhao, Y Zhong, Y Xiong, Z Han… - Proceedings of the 28th …, 2023 - dl.acm.org
This paper proposes ElasticFlow, an elastic serverless training platform for distributed deep
learning. ElasticFlow provides a serverless interface with two distinct features: (i) users …

Ekko: A Large-Scale deep learning recommender system with Low-Latency model update

C Sima, Y Fu, MK Sit, L Guo, X Gong, F Lin… - … USENIX Symposium on …, 2022 - usenix.org
Deep Learning Recommender Systems (DLRSs) need to update models at low latency so that
new users and content are served promptly. Existing DLRSs, however, fail to do so. They …

Heet: Accelerating elastic training in heterogeneous deep learning clusters

Z Mo, H Xu, C Xu - Proceedings of the 29th ACM International …, 2024 - dl.acm.org
Modern GPU clusters inherently exhibit heterogeneity, encompassing various aspects such
as computation and communication. This heterogeneity poses a significant challenge for the …

Shockwave: Fair and efficient cluster scheduling for dynamic adaptation in machine learning

P Zheng, R Pan, T Khan, S Venkataraman… - … USENIX Symposium on …, 2023 - usenix.org
Dynamic adaptation has become an essential technique in accelerating distributed machine
learning (ML) training. Recent studies have shown that dynamically adjusting model …

Distributed analytics for big data: A survey

F Berloco, V Bevilacqua, S Colucci - Neurocomputing, 2024 - Elsevier
In recent years, constant and rapid information growth has characterized digital
applications in the majority of real-life scenarios. Thus, a new information asset, namely Big …

EasyScale: Elastic training with consistent accuracy and improved utilization on GPUs

M Li, W Xiao, H Yang, B Sun, H Zhao, S Ren… - Proceedings of the …, 2023 - dl.acm.org
Distributed synchronized GPU training is commonly used for deep learning. The resource
constraint of using a fixed number of GPUs makes large-scale training jobs suffer from long …