- Academic Search

Simple testing can prevent most critical failures: An analysis of production failures in distributed Data-Intensive systems

D Yuan, Y Luo, X Zhuang, GR Rodrigues… - … USENIX Symposium on …, 2014 - usenix.org

Large, production quality distributed systems still fail periodically, and do so sometimes
catastrophically, where most or all users experience an outage or data loss. We present the …

Cited by: 304 · All 26 versions

Deterministic replay: A survey

Y Chen, S Zhang, Q Guo, L Li, R Wu… - ACM Computing Surveys …, 2015 - dl.acm.org

Deterministic replay is a type of emerging technique dedicated to providing deterministic
executions of computer programs in the presence of nondeterministic factors. The …

Cited by: 72 · All 4 versions

Improving software diagnosability via log enhancement

D Yuan, J Zheng, S Park, Y Zhou… - ACM Transactions on …, 2012 - dl.acm.org

Diagnosing software failures in the field is notoriously difficult, in part due to the fundamental
complexity of troubleshooting any complex software system, but further exacerbated by the …

Cited by: 367 · All 32 versions

X-ray: Automating Root-Cause diagnosis of performance anomalies in production software

M Attariyan, M Chow, J Flinn - 10th USENIX Symposium on Operating …, 2012 - usenix.org

Troubleshooting the performance of production software is challenging. Most existing tools,
such as profiling, tracing, and logging systems, reveal what events occurred during …

Cited by: 349 · All 18 versions

Halfmoon: Log-optimal fault-tolerant stateful serverless computing

S Qi, X Liu, X Jin - Proceedings of the 29th Symposium on Operating …, 2023 - dl.acm.org

Serverless computing separates function execution from state management. Simple retry-
based fault tolerance might corrupt the shared state with duplicate updates. Existing …

Cited by: 12 · All 4 versions

Be conservative: Enhancing failure diagnosis with proactive logging

D Yuan, S Park, P Huang, Y Liu, MM Lee… - … USENIX Symposium on …, 2012 - usenix.org

When systems fail in the field, logged error or warning messages are frequently the only
evidence available for assessing and diagnosing the underlying cause. Consequently, the …

Cited by: 263 · All 10 versions

TaxDC: A taxonomy of non-deterministic concurrency bugs in datacenter distributed systems

T Leesatapornwongsa, JF Lukman, S Lu… - Proceedings of the …, 2016 - dl.acm.org

We present TaxDC, the largest and most comprehensive taxonomy of non-deterministic
concurrency bugs in distributed systems. We study 104 distributed concurrency (DC) bugs …

Cited by: 195 · All 6 versions

Rollback-recovery for middleboxes

J Sherry, PX Gao, S Basu, A Panda… - Proceedings of the …, 2015 - dl.acm.org

Network middleboxes must offer high availability, with automatic failover when a device fails.
Achieving high availability is challenging because failover must correctly restore lost state …

Cited by: 210 · All 38 versions

All about Eve: Execute-Verify replication for Multi-Core servers

M Kapritsos, Y Wang, V Quema, A Clement… - … USENIX Symposium on …, 2012 - usenix.org

This paper presents Eve, a new Execute-Verify architecture that allows state machine
replication to scale to multi-core servers. Eve departs from the traditional agree-execute …

Cited by: 268 · All 19 versions

Log20: Fully automated optimal placement of log printing statements under specified overhead threshold

X Zhao, K Rodrigues, Y Luo, M Stumm… - Proceedings of the 26th …, 2017 - dl.acm.org

When systems fail in production environments, log data is often the only information
available to programmers for postmortem debugging. Consequently, programmers' decision …

Cited by: 143 · All 7 versions

DoublePlay: Parallelizing sequential logging and replay