Simple testing can prevent most critical failures: An analysis of production failures in distributed data-intensive systems
Large, production quality distributed systems still fail periodically, and do so sometimes
catastrophically, where most or all users experience an outage or data loss. We present the …
Deterministic replay: A survey
Deterministic replay is a type of emerging technique dedicated to providing deterministic
executions of computer programs in the presence of nondeterministic factors. The …
Improving software diagnosability via log enhancement
Diagnosing software failures in the field is notoriously difficult, in part due to the fundamental
complexity of troubleshooting any complex software system, but further exacerbated by the …
X-ray: Automating root-cause diagnosis of performance anomalies in production software
M Attariyan, M Chow, J Flinn - 10th USENIX Symposium on Operating …, 2012 - usenix.org
Troubleshooting the performance of production software is challenging. Most existing tools,
such as profiling, tracing, and logging systems, reveal what events occurred during …
Halfmoon: Log-optimal fault-tolerant stateful serverless computing
Serverless computing separates function execution from state management. Simple retry-
based fault tolerance might corrupt the shared state with duplicate updates. Existing …
Be conservative: Enhancing failure diagnosis with proactive logging
When systems fail in the field, logged error or warning messages are frequently the only
evidence available for assessing and diagnosing the underlying cause. Consequently, the …
TaxDC: A taxonomy of non-deterministic concurrency bugs in datacenter distributed systems
We present TaxDC, the largest and most comprehensive taxonomy of non-deterministic
concurrency bugs in distributed systems. We study 104 distributed concurrency (DC) bugs …
Rollback-recovery for middleboxes
Network middleboxes must offer high availability, with automatic failover when a device fails.
Achieving high availability is challenging because failover must correctly restore lost state …
All about Eve: Execute-Verify replication for multi-core servers
This paper presents Eve, a new Execute-Verify architecture that allows state machine
replication to scale to multi-core servers. Eve departs from the traditional agree-execute …
Log20: Fully automated optimal placement of log printing statements under specified overhead threshold
When systems fail in production environments, log data is often the only information
available to programmers for postmortem debugging. Consequently, programmers' decision …