AuTO: Scaling deep reinforcement learning for datacenter-scale automatic traffic optimization

L Chen, J Lingys, K Chen, F Liu - Proceedings of the 2018 conference of …, 2018 - dl.acm.org
Traffic optimizations (TO, e.g., flow scheduling, load balancing) in datacenters are difficult
online decision-making problems. Previously, they were done with heuristics relying on …

Cluster frameworks for efficient scheduling and resource allocation in data center networks: A survey

K Wang, Q Zhou, S Guo, J Luo - IEEE Communications Surveys …, 2018 - ieeexplore.ieee.org
Data centers are widely used for big data analytics, which often involve data-parallel jobs,
including query and web service. Meanwhile, cluster frameworks are rapidly developed for …

Machine learning for computer systems and networking: A survey

ME Kanakis, R Khalili, L Wang - ACM Computing Surveys, 2022 - dl.acm.org
Machine learning (ML) has become the de facto approach for various scientific domains
such as computer vision and natural language processing. Despite recent breakthroughs …

Sincronia: Near-optimal network design for coflows

S Agarwal, S Rajakrishnan, A Narayan… - Proceedings of the …, 2018 - dl.acm.org
We present Sincronia, a near-optimal network design for coflows that can be implemented
on top of any transport layer (for flows) that supports priority scheduling. Sincronia achieves …

NetworkAI: An intelligent network architecture for self-learning control strategies in software defined networks

H Yao, T Mai, X Xu, P Zhang, M Li… - IEEE Internet of Things …, 2018 - ieeexplore.ieee.org
The past few years have witnessed a wide deployment of software defined networks
facilitating a separation of the control plane from the forwarding plane. However, the work on …

Congestion control in machine learning clusters

S Rajasekaran, M Ghobadi, G Kumar… - Proceedings of the 21st …, 2022 - dl.acm.org
This paper argues that fair-sharing, the holy grail of congestion control algorithms for
decades, is not necessarily a desirable property in Machine Learning (ML) training clusters …

Is advance knowledge of flow sizes a plausible assumption?

V Ðukić, SA Jyothi, B Karlaš, M Owaida… - … USENIX Symposium on …, 2019 - usenix.org
Recent research has proposed several packet, flow, and coflow scheduling methods that
could substantially improve data center network performance. Most of this work assumes …

DeepWeave: Accelerating job completion time with deep reinforcement learning-based coflow scheduling

P Sun, Z Guo, J Wang, J Li, J Lan, Y Hu - Proceedings of the Twenty-Ninth …, 2021 - ijcai.org
To improve the processing efficiency of jobs in distributed computing, the concept of coflow
is proposed. A coflow is a collection of flows that are semantically correlated in a multi-stage …

Flow scheduling with imprecise knowledge

W Li, X He, Y Liu, K Li, K Chen, Z Ge, Z Guan… - … USENIX Symposium on …, 2024 - usenix.org
Most existing data center network (DCN) flow scheduling solutions aim to minimize flow
completion times (FCT). However, these solutions either require precise flow information …

TACC: A full-stack cloud computing infrastructure for machine learning tasks

K Xu, X Wan, H Wang, Z Ren, X Liao, D Sun… - arXiv preprint arXiv …, 2021 - arxiv.org
In Machine Learning (ML) system research, efficient resource scheduling and utilization
have always been an important topic given the compute-intensive nature of ML applications …