- Academic Search

Processing big data with Apache Hadoop in the current challenging era of COVID-19

O Azeroual, R Fabre - Big Data and Cognitive Computing, 2021 - mdpi.com

Big data have become a global strategic issue, as increasingly large amounts of
unstructured data challenge the IT infrastructure of global organizations and threaten their …

Cited by 48

Metastable failures in the wild

L Huang, M Magnusson, AB Muralikrishna… - … USENIX Symposium on …, 2022 - usenix.org

Recently, Bronson et al. introduced a framework for understanding a class of failures in
distributed systems called metastable failures. The examples of metastable failures …

Cited by 33

Crashtuner: Detecting crash-recovery bugs in cloud systems via meta-info analysis

J Lu, C Liu, L Li, X Feng, F Tan, J Yang… - Proceedings of the 27th …, 2019 - dl.acm.org

Crash-recovery bugs (bugs in crash-recovery-related mechanisms) are among the most
severe bugs in cloud systems and can easily cause system failures. It is notoriously difficult …

Cited by 37

A study of failure recovery and logging of high-performance parallel file systems

R Han, OR Gatla, M Zheng, J Cao, D Zhang… - ACM Transactions on …, 2022 - dl.acm.org

Large-scale parallel file systems (PFSs) play an essential role in high-performance
computing (HPC). However, despite their importance, their reliability is much less studied or …

Cited by 19

Metastable failures in distributed systems

N Bronson, A Aghayev, A Charapko, T Zhu - Proceedings of the …, 2021 - dl.acm.org

We describe metastable failures, a failure pattern in distributed systems. Currently,
metastable failures manifest themselves as black swan events; they are outliers because …

Cited by 27

Vicious Cycles in Distributed Software Systems

S Qian, W Fan, L Tan, Y Zhang - 2023 38th IEEE/ACM …, 2023 - ieeexplore.ieee.org

A major threat to distributed software systems' reliability is vicious cycles, which are
observed when an event in the distributed software system's execution causes a system …

Cited by 3

ExaRD: introducing a framework for empowerment of resource discovery to support distributed exascale computing systems with high consistency

E Adibi, E Mousavi Khaneghah - Cluster Computing, 2020 - Springer

In this paper, we introduced the framework to empowerment resource discovery units for
supporting distributed exascale computing systems with high consistency. In addition to the …

Cited by 11

PerfEstimator: a generic and extensible performance estimator for data parallel DNN training

C Yang, Z Li, C Ruan, G Xu, C Li… - 2021 IEEE/ACM …, 2021 - ieeexplore.ieee.org

Understanding the performance of data parallel DNN training at large-scale is crucial for
supporting efficient DNN cloud deployment as well as facilitating the design and …

Cited by 10

Leopard: A Black-Box Approach for Efficiently Verifying Various Isolation Levels

K Li, S Weng, P Liu, L Ni, C Yang… - 2023 IEEE 39th …, 2023 - ieeexplore.ieee.org

Isolation Levels (IL) act as correct contracts between applications and database
management systems (DBMSs). The complex code logic and concurrent interactions among …

Cited by 2

FaultFuzz: A Coverage Guided Fault Injection Tool for Distributed Systems

W Feng, Q Pei, Y Gao, D Wang, W Dou, J Wei… - Proceedings of the …, 2024 - dl.acm.org

Distributed systems are expected to correctly recover from various faults, e.g., node
crash/reboot and network disconnection/reconnection. However, faults that occur under …

Cited by 1
