Google Scholar

HydEE: Failure containment without event logging for large scale send-deterministic MPI applications

A Guermouche, T Ropars, M Snir… - 2012 IEEE 26th …, 2012 - ieeexplore.ieee.org

High performance computing will probably reach exascale in this decade. At this scale,
mean time between failures is expected to be a few hours. Existing fault tolerant protocols for …

Cited by 75 · All 17 versions

[PDF] hal.science

On the use of cluster-based partial message logging to improve fault tolerance for MPI HPC applications

T Ropars, A Guermouche, B Uçar, E Meneses… - Euro-Par 2011 Parallel …, 2011 - Springer

Fault tolerance is becoming a major concern in HPC systems. The two traditional
approaches for message passing applications, coordinated checkpointing and message …

Cited by 60 · All 12 versions

[PDF] osti.gov

Using simulation to evaluate the performance of resilience strategies at scale

S Levy, B Topp, KB Ferreira, D Arnold, T Hoefler… - … and Simulation: 4th …, 2014 - Springer

Fault-tolerance has been identified as a major challenge for future extreme-scale systems.
Current predictions suggest that, as systems grow in size, failures will occur more frequently …

Cited by 41 · All 34 versions

[PDF] hku.hk

Scalable group-based checkpoint/restart for large-scale message-passing systems

JCY Ho, CL Wang, FCM Lau - 2008 IEEE International …, 2008 - ieeexplore.ieee.org

The ever increasing number of processors used in parallel computers is making fault
tolerance support in large-scale parallel systems more and more important. We discuss the …

Cited by 50 · All 13 versions

[PDF] arxiv.org

Minimum process coordinated checkpointing scheme for ad hoc networks

R Tuli, P Kumar - arXiv preprint arXiv:1111.2208, 2011 - arxiv.org

The wireless mobile ad hoc network (MANET) architecture is one consisting of a set of
mobile hosts capable of communicating with each other without the assistance of base …

Cited by 24 · All 4 versions

[PDF] informatica.si

[PDF] A novel roll-back mechanism for performance enhancement of asynchronous checkpointing and recovery

B Gupta, S Rahmi, Y Yang - Informatica, 2007 - informatica.si

In this paper, we present a high performance recovery algorithm for distributed systems in
which checkpoints are taken asynchronously. It offers fast determination of the recent …

Cited by 25 · All 9 versions

[PDF] academia.edu

Dynamic load balance for optimized message logging in fault tolerant HPC applications

E Meneses, LV Kalé… - 2011 IEEE International …, 2011 - ieeexplore.ieee.org

Computing systems will grow significantly larger in the near future to satisfy the needs of
computational scientists in areas like climate modeling, biophysics and cosmology …

Cited by 19 · All 15 versions

[PDF] illinois.edu

[BOOK] Scalable message-logging techniques for effective fault tolerance in HPC applications

EM Rojas - 2013 - search.proquest.com

An important set of challenges emerge as the High Performance Computing (HPC)
community aims to reach extreme scale. Resilience and energy consumption are two of …

Cited by 14 · All 7 versions

[PDF] srce.hr

A recovery scheme for cluster federations using sender-based message logging

B Gupta, R Nikolaev, R Chirra - Journal of computing and information …, 2011 - hrcak.srce.hr

A cluster federation is a union of clusters and is heterogeneous. Each cluster
contains a certain number of processes. An application running in such a computing …

Cited by 15 · All 15 versions

Second-level algorithms, superrecursivity, and recovery problem in distributed systems

M Burgin, B Gupta - Theory of Computing Systems, 2012 - Springer

In this paper, we analyze network recovery algorithms, which allow computer networks to
properly function in spite of failures. In this analysis, we use methods and tools of the theory …

Cited by 12 · All 8 versions

Hybrid checkpointing for parallel applications in cluster federations
