Twine: A unified cluster management system for shared infrastructure

C Tang, K Yu, K Veeraraghavan, J Kaldor… - … USENIX Symposium on …, 2020 - usenix.org
We present Twine, Facebook's cluster management system which has been running in
production for the past decade. Twine has helped convert our infrastructure from a collection …

Fail through the cracks: Cross-system interaction failures in modern cloud systems

L Tang, C Bhandari, Y Zhang, A Karanika, S Ji… - Proceedings of the …, 2023 - dl.acm.org
Modern cloud systems are orchestrations of independent and interacting (sub-)systems,
each specializing in important services (e.g., data processing, storage, resource …

Turbine: Facebook's service management platform for stream processing

Y Mei, L Cheng, V Talwar, MY Levin… - 2020 IEEE 36th …, 2020 - ieeexplore.ieee.org
The demand for stream processing at Facebook has grown as services increasingly rely on
real-time signals to speed up decisions and actions. Emerging real-time applications require …

Capacity-efficient and uncertainty-resilient backbone network planning with hose

SS Ahuja, V Gupta, V Dangui, S Bali… - Proceedings of the …, 2021 - dl.acm.org
This paper presents Facebook's design and operational experience of a Hose-based
backbone network planning system. This initial adoption of the Hose model in network …

Taiji: managing global user traffic for large-scale internet services at the edge

D Chou, T Xu, K Veeraraghavan, A Newell… - Proceedings of the 27th …, 2019 - dl.acm.org
We present Taiji, a new system for managing user traffic for large-scale Internet services that
accomplishes two goals: 1) balancing the utilization of data centers and 2) minimizing …

Check before you change: Preventing correlated failures in service updates

E Zhai, A Chen, R Piskac, M Balakrishnan… - … USENIX Symposium on …, 2020 - usenix.org
The reliability of cloud services can be significantly undermined by correlated failures due to
shared service dependencies, even when the services are already replicated across …

Using distributed tracing to identify inefficient resources composition in cloud applications

C Cassé, P Berthou, P Owezarski… - 2021 IEEE 10th …, 2021 - ieeexplore.ieee.org
Cloud-Applications are the new industry standard way of designing Web-Applications. With
Cloud Computing, Applications are usually designed as microservices, and developers can …

Let's Trace It: Fine-Grained Serverless Benchmarking using Synchronous and Asynchronous Orchestrated Applications

J Scheuner, S Eismann, S Talluri, E Van Eyk… - arXiv preprint arXiv …, 2022 - arxiv.org
Making serverless computing widely applicable requires detailed performance
understanding. Although contemporary benchmarking approaches exist, they report only …

Swing: Providing Long-Range Lossless RDMA via PFC-Relay

Y Chen, C Tian, J Dong, S Feng… - … on Parallel and …, 2022 - ieeexplore.ieee.org
Remote Direct Memory Access (RDMA) has been widely deployed in datacenters for its high
performance. Large-scale high performance cloud services built on geographically …

Defcon: Preventing Overload with Graceful Feature Degradation

JJ Meza, T Gowda, A Eid, T Ijaware… - … USENIX Symposium on …, 2023 - usenix.org
Every day, billions of people depend on Internet services for communication, commerce, and
entertainment. Yet planetary-scale data center infrastructures consisting of millions of …