Deep learning workload scheduling in GPU datacenters: A survey

Z Ye, W Gao, Q Hu, P Sun, X Wang, Y Luo… - ACM Computing …, 2024 - dl.acm.org
Deep learning (DL) has demonstrated remarkable success in a wide variety of fields. The
development of a DL model is a time-consuming and resource-intensive procedure. Hence …

THC: Accelerating distributed deep learning using tensor homomorphic compression

M Li, RB Basat, S Vargaftik, CL Lao, K Xu… - … USENIX Symposium on …, 2024 - usenix.org
Deep neural networks (DNNs) are the de facto standard for essential use cases, such as
image classification, computer vision, and natural language processing. As DNNs and …
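The snippet only hints at the mechanism, so the following is a minimal, generic gradient-quantization sketch in NumPy. It is not THC's homomorphic compression scheme; the function names and the 8-bit uniform quantizer are assumptions chosen purely to illustrate why compressing tensors before aggregation cuts the bytes each worker must send.

```python
# Generic illustration of gradient quantization before aggregation.
# NOT the THC scheme from the paper above; a hypothetical 8-bit uniform
# quantizer that shows how compression shrinks per-worker traffic.
import numpy as np

def quantize_uniform(grad: np.ndarray, num_bits: int = 8):
    """Map float32 gradients onto num_bits unsigned integers plus a scale/offset."""
    lo, hi = grad.min(), grad.max()
    levels = (1 << num_bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.round((grad - lo) / scale).astype(np.uint8)
    return q, float(scale), float(lo)

def dequantize_uniform(q: np.ndarray, scale: float, lo: float) -> np.ndarray:
    """Recover an approximate float32 gradient from the quantized form."""
    return q.astype(np.float32) * scale + lo

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grad = rng.standard_normal(1_000_000).astype(np.float32)  # one worker's gradient
    q, scale, lo = quantize_uniform(grad)
    restored = dequantize_uniform(q, scale, lo)
    print("bytes before:", grad.nbytes, "bytes after:", q.nbytes)  # 4x fewer bytes
    print("max abs error:", float(np.abs(grad - restored).max()))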

Resource allocation and workload scheduling for large-scale distributed deep learning: A survey

F Liang, Z Zhang, H Lu, C Li, V Leung, Y Guo… - arXiv preprint arXiv …, 2024 - arxiv.org
With rapidly increasing distributed deep learning workloads in large-scale data centers,
efficient distributed deep learning framework strategies for resource allocation and workload …

Crux: GPU-efficient communication scheduling for deep learning training

J Cao, Y Guan, K Qian, J Gao, W Xiao, J Dong… - Proceedings of the …, 2024 - dl.acm.org
Deep learning training (DLT), e.g., large language model (LLM) training, has become one of
the most important services in multitenant cloud computing. By deeply studying in …

MLTCP: A Distributed Technique to Approximate Centralized Flow Scheduling For Machine Learning

S Rajasekaran, S Narang, AA Zabreyko… - Proceedings of the 23rd …, 2024 - dl.acm.org
This paper argues that congestion control protocols in machine learning datacenters sit at a
sweet spot between centralized and distributed flow scheduling solutions. We present …

MLTCP: Congestion control for DNN training

S Rajasekaran, S Narang, AA Zabreyko… - arXiv preprint arXiv …, 2024 - arxiv.org
We present MLTCP, a technique to augment today's congestion control algorithms to
accelerate DNN training jobs in shared GPU clusters. MLTCP enables the communication …
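Since the snippet stops before describing the mechanism, here is a small, hypothetical sketch of the general idea of feeding a training-iteration signal into an AIMD-style congestion window update. The progress-weighted increase and all names are assumptions for illustration only, not the algorithm the MLTCP papers propose.

```python
# Hypothetical illustration (not the MLTCP algorithm): an AIMD-style window
# update whose additive increase is weighted by how far a training job has
# progressed through its current communication phase, showing the kind of
# per-iteration signal a DNN-aware congestion control tweak could consume.
def aimd_step(cwnd: float, bytes_sent_this_iter: int, iter_bytes_total: int,
              loss: bool, base_increase: float = 1.0) -> float:
    """Return the next congestion window (in segments)."""
    if loss:
        return max(cwnd / 2.0, 1.0)                      # multiplicative decrease
    progress = bytes_sent_this_iter / max(iter_bytes_total, 1)
    return cwnd + base_increase * (0.5 + progress)       # progress-weighted increase

if __name__ == "__main__":
    cwnd = 10.0
    for step, (sent, loss) in enumerate([(0, False), (4_000, False),
                                         (8_000, True), (2_000, False)]):
        cwnd = aimd_step(cwnd, sent, iter_bytes_total=10_000, loss=loss)
        print(f"step {step}: cwnd = {cwnd:.2f}")
```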

PSscheduler: A parameter synchronization scheduling algorithm for distributed machine learning in reconfigurable optical networks

L Liu, X Xu, P Zhou, X Chen, D Ergu, H Yu, G Sun… - Neurocomputing, 2025 - Elsevier
With the increasing size of training datasets and models, the parameter synchronization stage
puts a heavy burden on the network, and communication has become one of the main …
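To make that burden concrete, below is a minimal, generic synchronous parameter-server round in NumPy. It is not PSscheduler's optical-network-aware algorithm; the helper name and the push/aggregate/pull structure are assumptions that only show how all traffic concentrates in the synchronization stage.

```python
# Minimal, generic parameter-server synchronization round (not PSscheduler):
# each worker pushes a gradient, the server averages them and applies one SGD
# step, and every worker then pulls the new parameters.
import numpy as np

def parameter_server_round(params: np.ndarray, worker_grads: list[np.ndarray],
                           lr: float = 0.1) -> np.ndarray:
    """One synchronous round: aggregate all workers' gradients, apply one SGD step."""
    avg_grad = np.mean(worker_grads, axis=0)  # "push" + aggregation at the server
    return params - lr * avg_grad             # new parameters, "pulled" by all workers

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = np.zeros(1_000, dtype=np.float32)
    grads = [rng.standard_normal(1_000).astype(np.float32) for _ in range(8)]
    params = parameter_server_round(params, grads)
    traffic_bytes = sum(g.nbytes for g in grads) + len(grads) * params.nbytes
    print("bytes on the network this round:", traffic_bytes)
```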

Understanding the Throughput Bounds of Reconfigurable Datacenter Networks

V Addanki, C Avin, S Schmid - arXiv preprint arXiv:2405.20869, 2024 - arxiv.org
The increasing gap between the growth of datacenter traffic volume and the capacity of
electrical switches led to the emergence of reconfigurable datacenter network designs …

Communication optimization for distributed training: architecture, advances, and opportunities

Y Wei, T Hu, C Liang, Y Cui - IEEE Network, 2024 - ieeexplore.ieee.org
The past few years have witnessed the flourishing of large-scale deep neural network
models with ever-growing parameter numbers. Training such large-scale models typically …

Straggler-Aware Gradient Aggregation for Large-Scale Distributed Deep Learning System

Y Li, J Huang, Z Li, J Liu, S Zhou… - IEEE/ACM …, 2024 - ieeexplore.ieee.org
Deep Neural Networks (DNNs) are a critical component of a wide range of applications.
However, with the rapid growth of the training dataset and model size, communication …
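As a rough illustration of one common straggler-mitigation idea (not necessarily the aggregation scheme this paper proposes), the sketch below averages gradients from the first k of n workers to report and ignores the slowest, trading a little gradient quality for lower step latency. All names are hypothetical.

```python
# Generic straggler-mitigation sketch: aggregate gradients from the k
# earliest-arriving workers and skip the stragglers ("backup worker" style).
import heapq
import numpy as np

def aggregate_first_k(arrival_times: list[float], worker_grads: list[np.ndarray],
                      k: int) -> np.ndarray:
    """Average the gradients of the k earliest-arriving workers."""
    earliest = heapq.nsmallest(k, range(len(arrival_times)),
                               key=lambda i: arrival_times[i])
    return np.mean([worker_grads[i] for i in earliest], axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, k = 8, 6
    grads = [rng.standard_normal(4).astype(np.float32) for _ in range(n)]
    arrivals = list(rng.exponential(scale=1.0, size=n))  # a couple of slow workers
    print("aggregated gradient:", aggregate_first_k(arrivals, grads, k))
```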