Simple testing can prevent most critical failures: An analysis of production failures in distributed data-intensive systems

D Yuan, Y Luo, X Zhuang, GR Rodrigues… - … USENIX Symposium on …, 2014 - usenix.org
Large, production quality distributed systems still fail periodically, and do so sometimes
catastrophically, where most or all users experience an outage or data loss. We present the …

Deterministic replay: A survey

Y Chen, S Zhang, Q Guo, L Li, R Wu… - ACM Computing Surveys …, 2015 - dl.acm.org
Deterministic replay is a type of emerging technique dedicated to providing deterministic
executions of computer programs in the presence of nondeterministic factors. The …

Improving software diagnosability via log enhancement

D Yuan, J Zheng, S Park, Y Zhou… - ACM Transactions on …, 2012 - dl.acm.org
Diagnosing software failures in the field is notoriously difficult, in part due to the fundamental
complexity of troubleshooting any complex software system, but further exacerbated by the …

X-ray: Automating root-cause diagnosis of performance anomalies in production software

M Attariyan, M Chow, J Flinn - 10th USENIX Symposium on Operating …, 2012 - usenix.org
Troubleshooting the performance of production software is challenging. Most existing tools,
such as profiling, tracing, and logging systems, reveal what events occurred during …

Halfmoon: Log-optimal fault-tolerant stateful serverless computing

S Qi, X Liu, X Jin - Proceedings of the 29th Symposium on Operating …, 2023 - dl.acm.org
Serverless computing separates function execution from state management. Simple retry-
based fault tolerance might corrupt the shared state with duplicate updates. Existing …

Be conservative: Enhancing failure diagnosis with proactive logging

D Yuan, S Park, P Huang, Y Liu, MM Lee… - … USENIX Symposium on …, 2012 - usenix.org
When systems fail in the field, logged error or warning messages are frequently the only
evidence available for assessing and diagnosing the underlying cause. Consequently, the …

TaxDC: A taxonomy of non-deterministic concurrency bugs in datacenter distributed systems

T Leesatapornwongsa, JF Lukman, S Lu… - Proceedings of the …, 2016 - dl.acm.org
We present TaxDC, the largest and most comprehensive taxonomy of non-deterministic
concurrency bugs in distributed systems. We study 104 distributed concurrency (DC) bugs …

Rollback-recovery for middleboxes

J Sherry, PX Gao, S Basu, A Panda… - Proceedings of the …, 2015 - dl.acm.org
Network middleboxes must offer high availability, with automatic failover when a device fails.
Achieving high availability is challenging because failover must correctly restore lost state …

All about Eve: Execute-Verify replication for multi-core servers

M Kapritsos, Y Wang, V Quema, A Clement… - … USENIX Symposium on …, 2012 - usenix.org
This paper presents Eve, a new Execute-Verify architecture that allows state machine
replication to scale to multi-core servers. Eve departs from the traditional agree-execute …

Log20: Fully automated optimal placement of log printing statements under specified overhead threshold

X Zhao, K Rodrigues, Y Luo, M Stumm… - Proceedings of the 26th …, 2017 - dl.acm.org
When systems fail in production environments, log data is often the only information
available to programmers for postmortem debugging. Consequently, programmers' decision …