Auto: Scaling deep reinforcement learning for datacenter-scale automatic traffic optimization
Traffic optimizations (TO, e.g., flow scheduling, load balancing) in datacenters are difficult
online decision-making problems. Previously, they are done with heuristics relying on …
Cluster frameworks for efficient scheduling and resource allocation in data center networks: A survey
Data centers are widely used for big data analytics, which often involve data-parallel jobs,
including query and web service. Meanwhile, cluster frameworks are rapidly developed for …
Machine learning for computer systems and networking: A survey
Machine learning (ML) has become the de-facto approach for various scientific domains
such as computer vision and natural language processing. Despite recent breakthroughs …
Sincronia: Near-optimal network design for coflows
We present Sincronia, a near-optimal network design for coflows that can be implemented
on top of any transport layer (for flows) that supports priority scheduling. Sincronia achieves …
NetworkAI: An intelligent network architecture for self-learning control strategies in software defined networks
The past few years have witnessed a wide deployment of software defined networks
facilitating a separation of the control plane from the forwarding plane. However, the work on …
Congestion control in machine learning clusters
This paper argues that fair-sharing, the holy grail of congestion control algorithms for
decades, is not necessarily a desirable property in Machine Learning (ML) training clusters …
Is advance knowledge of flow sizes a plausible assumption?
Recent research has proposed several packet, flow, and coflow scheduling methods that
could substantially improve data center network performance. Most of this work assumes …
Deepweave: Accelerating job completion time with deep reinforcement learning-based coflow scheduling
P Sun, Z Guo, J Wang, J Li, J Lan, Y Hu - Proceedings of the Twenty-Ninth …, 2021 - ijcai.org
To improve the processing efficiency of jobs in distributed computing, the concept of coflow
is proposed. A coflow is a collection of flows that are semantically correlated in a multi-stage …
Flow scheduling with imprecise knowledge
Most existing data center network (DCN) flow scheduling solutions aim to minimize flow
completion times (FCT). However, these solutions either require precise flow information …
Tacc: A full-stack cloud computing infrastructure for machine learning tasks
In Machine Learning (ML) system research, efficient resource scheduling and utilization
have always been an important topic given the compute-intensive nature of ML applications …