A survey of rollback-recovery protocols in message-passing systems

EN Elnozahy, L Alvisi, YM Wang… - ACM Computing Surveys …, 2002 - dl.acm.org
This survey covers rollback-recovery techniques that do not require special language
constructs. In the first part of the survey we classify rollback-recovery protocols into …

Reliability and energy efficiency in cloud computing systems: Survey and taxonomy

Y Sharma, B Javadi, W Si, D Sun - Journal of Network and Computer …, 2016 - Elsevier
With the popularity of cloud computing, it has become crucial to provide on-demand services
dynamically according to the user's requirements. Reliability and energy efficiency are two …

Reliable computation offloading for edge-computing-enabled software-defined IoV

X Hou, Z Ren, J Wang, W Cheng, Y Ren… - IEEE Internet of …, 2020 - ieeexplore.ieee.org
Internet of Vehicles (IoV) has drawn great interest in recent years. Various IoV applications
have emerged for improving the safety, efficiency, and comfort on the road. Cloud computing …

Tachyon: Reliable, memory speed storage for cluster computing frameworks

H Li, A Ghodsi, M Zaharia, S Shenker… - Proceedings of the ACM …, 2014 - dl.acm.org
Tachyon is a distributed file system enabling reliable data sharing at memory speed across
cluster computing frameworks. While caching today improves read workloads, writes are …

A large-scale study of failures in high-performance computing systems

B Schroeder, GA Gibson - IEEE transactions on Dependable …, 2009 - ieeexplore.ieee.org
Designing highly dependable systems requires a good understanding of failure
characteristics. Unfortunately, little raw data on failures in large IT installations are publicly …

A dynamic weight–assignment load balancing approach for workflow scheduling in edge-cloud computing using ameliorated moth flame and rock hyrax optimization …

MI Khaleel - Future Generation Computer Systems, 2024 - Elsevier
As the geographically distributed cloud infrastructure continues to grow in scale and the
intricacy of workflow applications increases, there is a growing threat to the operational …

Combining partial redundancy and checkpointing for HPC

J Elliott, K Kharbas, D Fiala, F Mueller… - 2012 IEEE 32nd …, 2012 - ieeexplore.ieee.org
Today's largest High Performance Computing (HPC) systems exceed one Petaflops (10^15
floating point operations per second) and exascale systems are projected within seven …

ReVive: Cost-effective architectural support for rollback recovery in shared-memory multiprocessors

M Prvulovic, Z Zhang, J Torrellas - ACM SIGARCH Computer …, 2002 - dl.acm.org
This paper presents ReVive, a novel general-purpose rollback recovery mechanism for
shared-memory multiprocessors. ReVive carefully balances the conflicting requirements of …

Matching and scheduling algorithms for minimizing execution time and failure probability of applications in heterogeneous computing

A Dogan, F Ozguner - IEEE Transactions on Parallel and …, 2002 - ieeexplore.ieee.org
In a heterogeneous distributed computing system, machine and network failures are
inevitable and can have an adverse effect on applications executing on the system. To …

Adaptive incremental checkpointing for massively parallel systems

S Agarwal, R Garg, MS Gupta, JE Moreira - Proceedings of the 18th …, 2004 - dl.acm.org
Given the scale of massively parallel systems, occurrence of faults is no longer an exception
but a regular event. Periodic checkpointing is becoming increasingly important in these …