Twine: A unified cluster management system for shared infrastructure
We present Twine, Facebook's cluster management system which has been running in
production for the past decade. Twine has helped convert our infrastructure from a collection …
Fail through the cracks: Cross-system interaction failures in modern cloud systems
Modern cloud systems are orchestrations of independent and interacting (sub-) systems,
each specializing in important services (e.g., data processing, storage, resource …
Turbine: Facebook's service management platform for stream processing
The demand for stream processing at Facebook has grown as services increasingly rely on
real-time signals to speed up decisions and actions. Emerging real-time applications require …
Capacity-efficient and uncertainty-resilient backbone network planning with hose
This paper presents Facebook's design and operational experience of a Hose-based
backbone network planning system. This initial adoption of the Hose model in network …
Taiji: managing global user traffic for large-scale internet services at the edge
We present Taiji, a new system for managing user traffic for large-scale Internet services that
accomplishes two goals: 1) balancing the utilization of data centers and 2) minimizing …
Check before you change: Preventing correlated failures in service updates
The reliability of cloud services can be significantly undermined by correlated failures due to
shared service dependencies, even when the services are already replicated across …
Using distributed tracing to identify inefficient resources composition in cloud applications
Cloud-Applications are the new industry standard way of designing Web-Applications. With
Cloud Computing, Applications are usually designed as microservices, and developers can …
Let's Trace It: Fine-Grained Serverless Benchmarking using Synchronous and Asynchronous Orchestrated Applications
Making serverless computing widely applicable requires detailed performance
understanding. Although contemporary benchmarking approaches exist, they report only …
Swing: Providing Long-Range Lossless RDMA via PFC-Relay
Remote Direct Memory Access (RDMA) has been widely deployed in datacenters for its high
performance. Large-scale high performance cloud services built on geographically …
Defcon: Preventing Overload with Graceful Feature Degradation
JJ Meza, T Gowda, A Eid, T Ijaware… - … USENIX Symposium on …, 2023 - usenix.org
Every day, billions of people depend on Internet services for communication, commerce, and
entertainment. Yet planetary-scale data center infrastructures consisting of millions of …