- Academic Search

Toward exascale resilience: 2014 update

F Cappello, A Geist, W Gropp, S Kale… - Supercomputing …, 2014 - superfri.org

Resilience is a major roadblock for HPC executions on future exascale systems. These
systems will typically gather millions of CPU cores running up to a billion threads …

Cited by 436 · 14 versions

[PDF] osti.gov

Design, modeling, and evaluation of a scalable multi-level checkpointing system

A Moody, G Bronevetsky, K Mohror… - SC'10: Proceedings …, 2010 - ieeexplore.ieee.org

High-performance computing (HPC) systems are growing more powerful by utilizing more
hardware components. As the system mean-time-before-failure correspondingly drops …

Cited by 835 · 12 versions

[PDF] ucr.edu

HOT SAX: Efficiently finding the most unusual time series subsequence

E Keogh, J Lin, A Fu - … Conference on Data Mining (ICDM'05), 2005 - ieeexplore.ieee.org

In this work, we introduce the new problem of finding time series discords. Time series
discords are subsequences of a longer time series that are maximally different to all the rest …

Cited by 1215 · 15 versions

Online-ABFT: An online algorithm based fault tolerance scheme for soft error detection in iterative methods

Z Chen - ACM SIGPLAN Notices, 2013 - dl.acm.org

Soft errors are one-time events that corrupt the state of a computing system but not its overall
functionality. Large supercomputers are especially susceptible to soft errors because of their …

Cited by 224 · 2 versions

[PDF] arxiv.org

Condition numbers of Gaussian random matrices

Z Chen, JJ Dongarra - SIAM Journal on Matrix Analysis and Applications, 2005 - SIAM

Let $G_{m \times n}$ be an $m \times n$ real random matrix whose elements are independent and identically
distributed standard normal random variables, and let $\kappa_2(G_{m \times n})$ be the 2-norm …

Cited by 272 · 24 versions

[PDF] ucr.edu

Algorithm-based fault tolerance for fail-stop failures

Z Chen, J Dongarra - IEEE Transactions on Parallel and …, 2008 - ieeexplore.ieee.org

Fail-stop failures in distributed environments are often tolerated by checkpointing or
message logging. In this paper, we show that fail-stop process failures in ScaLAPACK matrix …

Cited by 176 · 21 versions

[PDF] ucr.edu

High performance Linpack benchmark: a fault tolerant implementation without checkpointing

T Davies, C Karlsson, H Liu, C Ding… - Proceedings of the …, 2011 - dl.acm.org

The probability that a failure will occur before the end of the computation increases as the
number of processors used in a high performance computing application increases. For long …

Cited by 147 · 4 versions

[PDF] ucr.edu

Algorithm-based recovery for iterative methods without checkpointing

Z Chen - Proceedings of the 20th international symposium on …, 2011 - dl.acm.org

In today's high performance computing practice, fail-stop failures are often tolerated by
checkpointing. While checkpointing is a very general technique and can often be applied to …

Cited by 140 · 5 versions

[PDF] psu.edu

The reliability wall for exascale supercomputing

X Yang, Z Wang, J Xue, Y Zhou - IEEE Transactions on …, 2011 - ieeexplore.ieee.org

Reliability is a key challenge to be understood to turn the vision of exascale supercomputing
into reality. Inevitably, large-scale supercomputing systems, especially those at the …

Cited by 113 · 10 versions

[PDF] researchgate.net

Algorithm-based checkpoint-free fault tolerance for parallel matrix computations on volatile resources

Z Chen, J Dongarra - Proceedings 20th IEEE International …, 2006 - ieeexplore.ieee.org

As the size of today's high performance computers increases from hundreds, to thousands,
and even tens of thousands of processors, node failures in these computers are becoming …

Cited by 117 · 17 versions


Fault tolerant high performance computing by a coding approach

Toward exascale resilience: 2014 update
