Reinit: Evaluating the performance of global-restart recovery methods for MPI fault tolerance

G Georgakoudis, L Guo, I Laguna - International Conference on High …, 2020 - Springer
Scaling supercomputers comes with an increase in failure rates due to the increasing
number of hardware components. In standard practice, applications are made resilient …

Comprehensive and systematic study on the fault tolerance architectures in cloud computing

V Mohammadian, NJ Navimipour… - Journal of Circuits …, 2020 - World Scientific
Providing dynamic resources is based on the virtualization features of the cloud
environment. Cloud computing as an emerging technology uses a high availability of …

Multiple fault-tolerance mechanisms in cloud systems: A systematic review

P Marcotte, F Grégoire, F Petrillo - 2019 IEEE International …, 2019 - ieeexplore.ieee.org
Cloud systems are progressively taking over today's software market. These typically require
constant operations with a minimum of failure. Multiple fault-tolerance mechanisms have …

A fuzzy load balancer for adaptive fault tolerance management in cloud platforms

H Arabnejad, C Pahl, G Estrada, A Samir… - Service-Oriented and …, 2017 - Springer
To achieve high levels of reliability, availability and performance in cloud environments, a
fault tolerance approach to handle failures effectively is needed. In most existing research …

Multilevel fault-tolerance aware scheduling technique in cloud environment

K Devi, D Paulraj - Journal of Internet Technology, 2021 - jit.ndhu.edu.tw
In cloud computing, the resources are delivered to the users on demand at a considerable
cost. Due to low maintenance and high scalability services, enterprises wish to deploy their …

FBSGraph: Accelerating asynchronous graph processing via forward and backward sweeping

Y Zhang, X Liao, H Jin, L Gu… - IEEE Transactions on …, 2017 - ieeexplore.ieee.org
Graph algorithm is pervasive in many applications ranging from targeted advertising to
natural language processing. Recently, Asynchronous Graph Processing (AGP) is becoming …

Match: An MPI fault tolerance benchmark suite

L Guo, G Georgakoudis, K Parasyris… - 2020 IEEE …, 2020 - ieeexplore.ieee.org
MPI has been ubiquitously deployed in flagship HPC systems aiming to accelerate
distributed scientific applications running on tens of hundreds of processes and compute …

Resilient corrective control of asynchronous sequential machines against intermittent loss of actuator outputs

JM Yang, SW Kwak - IEEE Transactions on Cybernetics, 2022 - ieeexplore.ieee.org
This article proposes a resilient corrective control scheme for input/state asynchronous
sequential machines (ASMs) against a class of actuator faults in which certain actuator …

HBP: Hotness balanced partition for prioritized iterative graph computations

S Gong, Y Zhang, G Yu - 2020 IEEE 36th International …, 2020 - ieeexplore.ieee.org
Existing graph partition methods are designed for round-robin synchronous distributed
frameworks. They balance workload without discrimination of vertex importance and fail to …

TSH: Easy-to-be distributed partitioning for large-scale graphs

N Wang, Z Wang, Y Gu, Y Bao, G Yu - Future Generation Computer …, 2019 - Elsevier
The big graph era is coming with strong and ever-growing demands on parallel iterative
analysis. But, before that, balanced graph partitioning is a fundamental problem and is NP …