- Academic Search

Cooperative application/OS DRAM fault recovery

Online-ABFT: An online algorithm based fault tolerance scheme for soft error detection in iterative methods

Z Chen - ACM SIGPLAN Notices, 2013 - dl.acm.org

Soft errors are one-time events that corrupt the state of a computing system but not its overall
functionality. Large supercomputers are especially susceptible to soft errors because of their …

Cited by 222 · Related articles · All 2 versions

[PDF] arxiv.org

Resilience design patterns: A structured approach to resilience at extreme scale

S Hukerikar, C Engelmann - arXiv preprint arXiv:1708.07422, 2017 - arxiv.org

Reliability is a serious concern for future extreme-scale high-performance computing (HPC)
systems. While the HPC community has developed various resilience solutions, the solution …

Cited by 53 · Related articles · All 21 versions · View as HTML

[PDF] uchicago.edu

When is multi-version checkpointing needed?

G Lu, Z Zheng, AA Chien - Proceedings of the 3rd Workshop on Fault …, 2013 - dl.acm.org

The scaling of semiconductor technology and increasing power concerns combined with
system scale make fault management a growing concern in high performance computing …

Cited by 75 · Related articles · All 2 versions

[PDF] upc.edu

Exploiting asynchrony from exact forward recovery for DUE in iterative solvers

L Jaulmes, M Casas, M Moretó, E Ayguadé… - Proceedings of the …, 2015 - dl.acm.org

This paper presents a method to protect iterative solvers from Detected and Uncorrected
Errors (DUE) relying on error detection techniques already available in commodity …

Cited by 45 · Related articles · All 10 versions

[PDF] sciencedirect.com Full View

Versioned distributed arrays for resilience in scientific applications: Global view resilience

A Chien, P Balaji, P Beckman, N Dun, A Fang… - Procedia Computer …, 2015 - Elsevier

Exascale studies project reliability challenges for future high-performance computing (HPC)
systems. We propose the Global View Resilience (GVR) system, a library that enables …

Cited by 42 · Related articles · All 16 versions

[PDF] springer.com

Response of HPC hardware to neutron radiation at the dawn of exascale

A Bustos, AJ Rubio-Montero, R Méndez… - The Journal of …, 2023 - Springer

Every computation presents a small chance that an unexpected phenomenon ruins or
modifies its output. Computers are prone to errors that, although may be very unlikely, are …

Cited by 3 · Related articles · All 5 versions

[PDF] researchgate.net

A block-asynchronous relaxation method for graphics processing units

H Anzt, S Tomov, J Dongarra, V Heuveline - Journal of Parallel and …, 2013 - Elsevier

In this paper, we analyze the potential of asynchronous relaxation methods on Graphics
Processing Units (GPUs). We develop asynchronous iteration algorithms in CUDA and …

Cited by 37 · Related articles · All 8 versions

[PDF] arxiv.org

Resilience design patterns-a structured approach to resilience at extreme scale (version 1.0)

S Hukerikar, C Engelmann - arXiv preprint arXiv:1611.02717, 2016 - arxiv.org

In this document, we develop a structured approach to the management of HPC resilience
based on the concept of resilience-based design patterns. A design pattern is a general …

Cited by 19 · Related articles · All 10 versions · View as HTML

[PDF] tu-dortmund.de

Fault-tolerant finite-element multigrid algorithms with hierarchically compressed asynchronous checkpointing

D Göddeke, M Altenbernd, D Ribbrock - Parallel Computing, 2015 - Elsevier

We analyse novel fault tolerance schemes for data loss in multigrid solvers, which
essentially combine ideas of checkpoint-restart with algorithm-based fault tolerance. To …

Cited by 25 · Related articles · All 8 versions · Library search

[PDF] osti.gov

Numerical analysis of fixed point algorithms in the presence of hardware faults

M Stoyanov, C Webster - SIAM Journal on Scientific Computing, 2015 - SIAM

The exponential growth of computational power of the extreme scale machines over the past
few decades has led to a corresponding decrease in reliability and a sharp increase of the …

Cited by 28 · Related articles · All 7 versions
