Processing big data with Apache Hadoop in the current challenging era of COVID-19
Big data have become a global strategic issue, as increasingly large amounts of
unstructured data challenge the IT infrastructure of global organizations and threaten their …
Metastable failures in the wild
Recently, Bronson et al. introduced a framework for understanding a class of failures in
distributed systems called metastable failures. The examples of metastable failures …
CrashTuner: Detecting crash-recovery bugs in cloud systems via meta-info analysis
Crash-recovery bugs (bugs in crash-recovery-related mechanisms) are among the most
severe bugs in cloud systems and can easily cause system failures. It is notoriously difficult …
A study of failure recovery and logging of high-performance parallel file systems
Large-scale parallel file systems (PFSs) play an essential role in high-performance
computing (HPC). However, despite their importance, their reliability is much less studied or …
Metastable failures in distributed systems
We describe metastable failures---a failure pattern in distributed systems. Currently,
metastable failures manifest themselves as black swan events; they are outliers because …
Vicious Cycles in Distributed Software Systems
A major threat to distributed software systems' reliability is vicious cycles, which are
observed when an event in the distributed software system's execution causes a system …
ExaRD: introducing a framework for empowerment of resource discovery to support distributed exascale computing systems with high consistency
E Adibi, E Mousavi Khaneghah - Cluster Computing, 2020 - Springer
In this paper, we introduced the framework to empowerment resource discovery units for
supporting distributed exascale computing systems with high consistency. In addition to the …
PerfEstimator: a generic and extensible performance estimator for data parallel DNN training
Understanding the performance of data parallel DNN training at large-scale is crucial for
supporting efficient DNN cloud deployment as well as facilitating the design and …
Leopard: A Black-Box Approach for Efficiently Verifying Various Isolation Levels
Isolation Levels (IL) act as correct contracts between applications and database
management systems (DBMSs). The complex code logic and concurrent interactions among …
FaultFuzz: A Coverage Guided Fault Injection Tool for Distributed Systems
Distributed systems are expected to correctly recover from various faults, e.g., node
crash/reboot and network disconnection/reconnection. However, faults that occur under …