Gremlin: Systematic resilience testing of microservices

V Heorhiadi, S Rajagopalan, H Jamjoom… - 2016 IEEE 36th …, 2016 - ieeexplore.ieee.org
Modern Internet applications are being disaggregated into a microservice-based
architecture, with services being updated and deployed hundreds of times a day. The …

SAMC: Semantic-Aware model checking for fast discovery of deep bugs in cloud systems

T Leesatapornwongsa, M Hao, P Joshi… - … USENIX Symposium on …, 2014 - usenix.org
The last five years have seen a rise of implementation-level distributed system model
checkers (dmck) for verifying the reliability of real distributed systems. Existing dmcks …

FlyMC: Highly scalable testing of complex interleavings in distributed systems

JF Lukman, H Ke, CA Stuardo, RO Suminto… - Proceedings of the …, 2019 - dl.acm.org
We present a fast and scalable testing approach for datacenter/cloud systems such as
Cassandra, Hadoop, Spark, and ZooKeeper. The uniqueness of our approach is in its ability …

An empirical study on crash recovery bugs in large-scale distributed systems

Y Gao, W Dou, F Qin, C Gao, D Wang, J Wei… - Proceedings of the …, 2018 - dl.acm.org
In large-scale distributed systems, node crashes are inevitable, and can happen at any time.
As such, distributed systems are usually designed to be resilient to these node crashes via …

Service-level fault injection testing

CS Meiklejohn, A Estrada, Y Song, H Miller… - Proceedings of the …, 2021 - dl.acm.org
Companies today increasingly rely on microservice architectures to deliver service for their
large-scale mobile or web applications. However, not all developers working on these …

MicroFI: Non-intrusive and prioritized request-level fault injection for microservice applications

H Chen, P Chen, G Yu, X Li, Z He - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Microservice is a widely-adopted architecture for constructing cloud-native applications. To
test application resiliency, chaos engineering is widely used to inject faults proactively in …

CrashTuner: Detecting crash-recovery bugs in cloud systems via meta-info analysis

J Lu, C Liu, L Li, X Feng, F Tan, J Yang… - Proceedings of the 27th …, 2019 - dl.acm.org
Crash-recovery bugs (bugs in crash-recovery-related mechanisms) are among the most
severe bugs in cloud systems and can easily cause system failures. It is notoriously difficult …

The operation and maintenance governance of microservices architecture systems: A systematic literature review

L Wang, YX Jiang, Z Wang, QE Huo… - Journal of Software …, 2023 - Wiley Online Library
Due to its development agility, continuous delivery, scalability and other characteristics, the
microservice architecture systems (MASs) have provided complex business functions to …

A study of failure recovery and logging of high-performance parallel file systems

R Han, OR Gatla, M Zheng, J Cao, D Zhang… - ACM Transactions on …, 2022 - dl.acm.org
Large-scale parallel file systems (PFSs) play an essential role in high-performance
computing (HPC). However, despite their importance, their reliability is much less studied or …

FCatch: Automatically detecting time-of-fault bugs in cloud systems

H Liu, X Wang, G Li, S Lu, F Ye, C Tian - ACM SIGPLAN Notices, 2018 - dl.acm.org
It is crucial for distributed systems to achieve high availability. Unfortunately, this is
challenging given the common component failures (i.e., faults). Developers often cannot …