Google Scholar

Efficient training of large language models on distributed infrastructures: a survey

J Duan, S Zhang, Z Wang, L Jiang, W Qu, Q Hu… - arxiv preprint arxiv …, 2024 - arxiv.org

Large Language Models (LLMs) like GPT and LLaMA are revolutionizing the AI industry with
their sophisticated capabilities. Training these models requires vast GPU clusters and …

Cited by 9 - All 5 versions

[PDF] ethz.ch

ML training with Cloud GPU shortages: Is cross-region the answer?

F Strati, P Elvinger, T Kerimoglu… - Proceedings of the 4th …, 2024 - dl.acm.org

The widespread adoption of ML has led to a high demand for GPU hardware and
consequently, severe shortages of GPUs in the public cloud. Allocating a sufficient number …

Cited by 10 - All 5 versions

[PDF] arxiv.org

Lazarus: Resilient and elastic training of mixture-of-experts models with adaptive expert placement

Y Wu, W Qu, T Tao, Z Wang, W Bai, Z Li, Y Tian… - arxiv preprint arxiv …, 2024 - arxiv.org

Sparsely-activated Mixture-of-Experts (MoE) architecture has increasingly been adopted to
further scale large language models (LLMs) due to its sub-linear scaling for computation …

Cited by 5 - All 3 versions

[PDF] arxiv.org

Rethinking cloud abstractions for tenant-provider cooperative optimization of AI workloads

M Canini, R Bianchini, Í Goiri, D Kostić… - arxiv preprint arxiv …, 2025 - arxiv.org

AI workloads, often hosted in multi-tenant cloud environments, require vast computational
resources but suffer inefficiencies due to limited tenant-provider coordination. Tenants lack …

All 4 versions

[PDF] arxiv.org

Stealing Training Data from Large Language Models in Decentralized Training through Activation Inversion Attack

C Dai, L Lu, P Zhou - arxiv preprint arxiv:2502.16086, 2025 - arxiv.org

Decentralized training has become a resource-efficient framework to democratize the
training of large language models (LLMs). However, the privacy risks associated with this …

All 2 versions

Optimizing Distributed Workloads With Infrastructure-Managed Communication and Deployment

Y Wu - 2024 - search.proquest.com

As the scale and complexity of distributed workloads grows, performance is no longer the
sole objective sought by application developers and infrastructure operators, as they …

[PDF] github.io

Adaptive Resource Allocation to Enhance the Kubernetes Performance for Large-Scale Clusters

J Luo, X Zhao, Y Ma, S Pang, S Deng, J Yin - Memory - hotinfra24.github.io

The advent of cloud computing has led to a dramatic increase in the deployment of hyper-
scale, diverse workloads in containerized form on cloud infrastructures. This expansion …

Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances

Efficient training of large language models on distributed infrastructures: a survey
