A Holistic View of AI-driven Network Incident Management

P Hamadanian, B Arzani, S Fouladi… - Proceedings of the …, 2023 - dl.acm.org
We discuss the potential improvement large language models (LLM) can provide in incident
management and how they can overhaul the ways operators conduct incident management …

Running BGP in Data Centers at Scale

A Abhashkumar, K Subramanian, A Andreyev… - … USENIX Symposium on …, 2021 - usenix.org
Border Gateway Protocol (BGP) forms the foundation for routing in the Internet. More
recently, BGP has made serious inroads into data centers on account of its scalability …

A Social Network Under Social Distancing: Risk-Driven Backbone Management During COVID-19 and Beyond

Y Xia, Y Zhang, Z Zhong, G Yan, CL Lim… - … USENIX Symposium on …, 2021 - usenix.org
As the COVID-19 pandemic reshapes our social landscape, its lessons have far-reaching
implications on how online service providers manage their infrastructure to mitigate risks …

A composition framework for change management

A Mahimkar, CE de Andrade, R Sinha… - Proceedings of the 2021 …, 2021 - dl.acm.org
Change management has been a long-standing challenge for network operations. The large
scale and diversity of networks, their complex dependencies, and continuous evolution …

CAPA: An Architecture For Operating Cluster Networks With High Availability

B Liu, C Scott, M Tariq, A Ferguson, P Gill… - … USENIX Symposium on …, 2024 - usenix.org
Management operations are a major source of outages for networks. A number of best
practices designed to reduce and mitigate such outages are well known, but their …

Boosting bandwidth availability over inter-DC WAN

H Zhang, X Shi, X Yin, J Wang, Z Wang… - Proceedings of the 17th …, 2021 - dl.acm.org
Inter-DataCenter Wide Area Network (Inter-DC WAN) that connects geographically
distributed data centers is becoming one of the most critical network infrastructures. Due to …

Klotski: Efficient and Safe Network Migration of Large Production Datacenters

Y Zhao, X Zhang, H Zhu, Y Zhang, Z Wang… - Proceedings of the …, 2023 - dl.acm.org
This paper presents the design, implementation, evaluation, and deployment of Meta's
production network migration system. We first introduce the network migration problem for …

RADiCe: A Risk Analysis Framework for Data Centers

F Mastenbroek, T De Matteis, V van Beek… - Future Generation …, 2025 - Elsevier
Datacenter service providers face engineering and operational challenges involving
numerous risk aspects. Bad decisions can result in financial penalties, competitive …

Occam: A Programming System for Reliable Network Management

J Xing, KF Hsu, Y Xia, Y Cai, Y Li, Y Zhang… - Proceedings of the …, 2024 - dl.acm.org
The complexity of large networks makes their management a daunting task. State-of-the-art
network management tools use workflow systems for automation, but they do not adequately …

Achieving high availability in inter-DC WAN traffic engineering

H Zhang, X Yin, X Shi, J Wang, Z Wang… - IEEE/ACM …, 2022 - ieeexplore.ieee.org
Inter-DataCenter Wide Area Network (Inter-DC WAN) that connects geographically
distributed data centers is becoming one of the most critical network infrastructures. Due to …