Task-Level Checkpointing and Localized Recovery to Tolerate Permanent Node Failures for Nested Fork–Join Programs in Clusters

L Reitz, C Fohry - SN Computer Science, 2024 - Springer
Exascale supercomputers consist of millions of processing units, and this number is still
growing. Therefore, hardware failures, such as permanent node failures, become …

Distributed application checkpointing for replicated state machines

Ö Çelikel, T Ovatman - Scalable Computing: Practice and Experience, 2021 - scpe.org
Application checkpointing is a widely used recovery mechanism that consists of saving an
application's state periodically to be used in case of a failure. In this study we investigate the …

Limitless FaaS: Overcoming serverless functions execution time limits with invoke driven architecture and memory checkpoints

RL Andraca, M Zareei - 2024 10th International Conference on …, 2024 - ieeexplore.ieee.org
Function-as-a-Service (FaaS) allows function code to be submitted directly to a cloud provider
without the burden of managing infrastructure resources. Each cloud provider establishes …

Load balancing, fault tolerance, and resource elasticity for asynchronous many-task systems

J Posner - 2021 - kobra.uni-kassel.de
High-Performance Computing (HPC) enables solving complex problems from
various scientific fields including key societal problems such as COVID-19. Recently …

Fault-tolerant orchestration of bags-of-tasks with application-directed checkpointing in a distributed environment

GL Stavrinides, HD Karatza - 2021 International Conference on …, 2021 - ieeexplore.ieee.org
A wide spectrum of applications, ranging from big data analytics to financial risk modeling
and genomics, feature a high degree of parallelism, forming bags-of-tasks. Such …

Load Balancing, Fault Tolerance, and Resource Elasticity for Asynchronous Many-Task Systems

C Fohry, M Schulz