A survey on data center networking (DCN): Infrastructure and operations

W Xia, P Zhao, Y Wen, H Xie - IEEE communications surveys & …, 2016 - ieeexplore.ieee.org
Data centers (DCs), owing to the exponential growth of Internet services, have emerged as
an irreplaceable and crucial infrastructure to power this ever-growing trend. A DC typically …

MegaScale: Scaling large language model training to more than 10,000 GPUs

Z Jiang, H Lin, Y Zhong, Q Huang, Y Chen… - … USENIX Symposium on …, 2024 - usenix.org
We present the design, implementation and engineering experience in building and
deploying MegaScale, a production system for training large language models (LLMs) at the …

Fault management in software-defined networking: A survey

Y Yu, X Li, X Leng, L Song, K Bu… - … Surveys & Tutorials, 2018 - ieeexplore.ieee.org
Software-defined networking (SDN) has emerged as a new network paradigm that promises
control/data plane separation and centralized network control. While these features simplify …

PINT: Probabilistic in-band network telemetry

R Ben Basat, S Ramanathan, Y Li, G Antichi… - Proceedings of the …, 2020 - dl.acm.org
Commodity network devices support adding in-band telemetry measurements into data
packets, enabling a wide range of applications, including network troubleshooting …

In-band network telemetry: A survey

L Tan, W Su, W Zhang, J Lv, Z Zhang, J Miao, X Liu… - Computer Networks, 2021 - Elsevier
With the development of software-defined network and programmable data-plane
technology, in-band network telemetry has emerged. In-band network telemetry technology …

An exhaustive survey on P4 programmable data plane switches: Taxonomy, applications, challenges, and future trends

EF Kfoury, J Crichigno, E Bou-Harb - IEEE Access, 2021 - ieeexplore.ieee.org
Traditionally, the data plane has been designed with fixed functions to forward packets using
a small set of protocols. This closed-design paradigm has limited the capability of the …

Sonata: Query-driven streaming network telemetry

A Gupta, R Harrison, M Canini, N Feamster… - Proceedings of the …, 2018 - dl.acm.org
Managing and securing networks requires collecting and analyzing network traffic data in
real time. Existing telemetry systems do not allow operators to express the range of queries …

CocoSketch: High-performance sketch-based measurement over arbitrary partial key query

Y Zhang, Z Liu, R Wang, T Yang, J Li, R Miao… - Proceedings of the …, 2021 - dl.acm.org
Sketch-based measurement has emerged as a promising alternative to the traditional
sampling-based network measurement approaches due to its high accuracy and resource …

Language-directed hardware design for network performance monitoring

S Narayana, A Sivaraman, V Nathan, P Goyal… - Proceedings of the …, 2017 - dl.acm.org
Network performance monitoring today is restricted by existing switch support for
measurement, forcing operators to rely heavily on endpoints with poor visibility into the …

FlowRadar: A better NetFlow for data centers

Y Li, R Miao, C Kim, M Yu - 13th USENIX symposium on networked …, 2016 - usenix.org
NetFlow has been a widely used monitoring tool with a variety of applications. NetFlow
maintains an active working set of flows in a hash table that supports flow insertion, collision …