- Academic Search

Task-Level Checkpointing and Localized Recovery to Tolerate Permanent Node Failures for Nested Fork–Join Programs in Clusters

L Reitz, C Fohry - SN Computer Science, 2024 - Springer

Exascale supercomputers consist of millions of processing units, and this number is still
growing. Therefore, hardware failures, such as permanent node failures, become …

[PDF] scpe.org

Distributed application checkpointing for replicated state machines

Ö Çelikel, T Ovatman - Scalable Computing: Practice and Experience, 2021 - scpe.org

Application checkpointing is a widely used recovery mechanism that consists of saving an
application's state periodically to be used in case of a failure. In this study we investigate the …

Cited by 5

[PDF] arxiv.org

Limitless FaaS: Overcoming serverless functions execution time limits with invoke driven architecture and memory checkpoints

RL Andraca, M Zareei - 2024 10th International Conference on …, 2024 - ieeexplore.ieee.org

Function-as-a-Service (FaaS) allows to directly submit function code to a cloud provider
without the burden of managing infrastructure resources. Each cloud provider establishes …

[PDF] uni-kassel.de

[PDF] Load balancing, fault tolerance, and resource elasticity for asynchronous many-task systems

J Posner - 2021 - kobra.uni-kassel.de

High-Performance Computing (HPC) enables solving complex problems from
various scientific fields including key societal problems such as COVID-19. Recently …

Cited by 3

Fault-tolerant orchestration of bags-of-tasks with application-directed checkpointing in a distributed environment

GL Stavrinides, HD Karatza - 2021 International Conference on …, 2021 - ieeexplore.ieee.org

A wide spectrum of applications, ranging from big data analytics to financial risk modeling
and genomics, feature a high degree of parallelism, forming bags-of-tasks. Such …

Cited by 2

[CITATION] Load Balancing, Fault Tolerance, and Resource Elasticity for Asynchronous Many-Task Systems

C Fohry, M Schulz


System-level vs. application-level checkpointing
