Processing big data with Apache Hadoop in the current challenging era of COVID-19

O Azeroual, R Fabre - Big Data and Cognitive Computing, 2021 - mdpi.com
Big data have become a global strategic issue, as increasingly large amounts of
unstructured data challenge the IT infrastructure of global organizations and threaten their …

Metastable failures in the wild

L Huang, M Magnusson, AB Muralikrishna… - … USENIX Symposium on …, 2022 - usenix.org
Recently, Bronson et al. introduced a framework for understanding a class of failures in
distributed systems called metastable failures. The examples of metastable failures …

CrashTuner: Detecting crash-recovery bugs in cloud systems via meta-info analysis

J Lu, C Liu, L Li, X Feng, F Tan, J Yang… - Proceedings of the 27th …, 2019 - dl.acm.org
Crash-recovery bugs (bugs in crash-recovery-related mechanisms) are among the most
severe bugs in cloud systems and can easily cause system failures. It is notoriously difficult …

A study of failure recovery and logging of high-performance parallel file systems

R Han, OR Gatla, M Zheng, J Cao, D Zhang… - ACM Transactions on …, 2022 - dl.acm.org
Large-scale parallel file systems (PFSs) play an essential role in high-performance
computing (HPC). However, despite their importance, their reliability is much less studied or …

Metastable failures in distributed systems

N Bronson, A Aghayev, A Charapko, T Zhu - Proceedings of the …, 2021 - dl.acm.org
We describe metastable failures, a failure pattern in distributed systems. Currently,
metastable failures manifest themselves as black swan events; they are outliers because …

Vicious Cycles in Distributed Software Systems

S Qian, W Fan, L Tan, Y Zhang - 2023 38th IEEE/ACM …, 2023 - ieeexplore.ieee.org
A major threat to distributed software systems' reliability is vicious cycles, which are
observed when an event in the distributed software system's execution causes a system …

ExaRD: introducing a framework for empowerment of resource discovery to support distributed exascale computing systems with high consistency

E Adibi, E Mousavi Khaneghah - Cluster Computing, 2020 - Springer
In this paper, we introduce a framework to empower resource discovery units for
supporting distributed exascale computing systems with high consistency. In addition to the …

PerfEstimator: a generic and extensible performance estimator for data parallel DNN training

C Yang, Z Li, C Ruan, G Xu, C Li… - 2021 IEEE/ACM …, 2021 - ieeexplore.ieee.org
Understanding the performance of data parallel DNN training at large-scale is crucial for
supporting efficient DNN cloud deployment as well as facilitating the design and …

Leopard: A Black-Box Approach for Efficiently Verifying Various Isolation Levels

K Li, S Weng, P Liu, L Ni, C Yang… - 2023 IEEE 39th …, 2023 - ieeexplore.ieee.org
Isolation Levels (IL) act as correct contracts between applications and database
management systems (DBMSs). The complex code logic and concurrent interactions among …

FaultFuzz: A Coverage Guided Fault Injection Tool for Distributed Systems

W Feng, Q Pei, Y Gao, D Wang, W Dou, J Wei… - Proceedings of the …, 2024 - dl.acm.org
Distributed systems are expected to correctly recover from various faults, e.g., node
crash/reboot and network disconnection/reconnection. However, faults that occur under …