Task-Level Checkpointing and Localized Recovery to Tolerate Permanent Node Failures for Nested Fork–Join Programs in Clusters
L Reitz, C Fohry - SN Computer Science, 2024 - Springer
Exascale supercomputers consist of millions of processing units, and this number is still
growing. Therefore, hardware failures, such as permanent node failures, become …
Distributed application checkpointing for replicated state machines
Ö Çelikel, T Ovatman - Scalable Computing: Practice and Experience, 2021 - scpe.org
Application checkpointing is a widely used recovery mechanism that consists of saving an
application's state periodically to be used in case of a failure. In this study we investigate the …
Limitless FaaS: Overcoming serverless functions execution time limits with invoke driven architecture and memory checkpoints
RL Andraca, M Zareei - 2024 10th International Conference on …, 2024 - ieeexplore.ieee.org
Function-as-a-Service (FaaS) allows users to submit function code directly to a cloud provider
without the burden of managing infrastructure resources. Each cloud provider establishes …
Load balancing, fault tolerance, and resource elasticity for asynchronous many-task systems
J Posner - 2021 - kobra.uni-kassel.de
High-Performance Computing (HPC) enables solving complex problems from
various scientific fields including key societal problems such as COVID-19. Recently …
Fault-tolerant orchestration of bags-of-tasks with application-directed checkpointing in a distributed environment
A wide spectrum of applications, ranging from big data analytics to financial risk modeling
and genomics, feature a high degree of parallelism, forming bags-of-tasks. Such …
Load Balancing, Fault Tolerance, and Resource Elasticity for Asynchronous Many-Task Systems
C Fohry, M Schulz