- Academic Search

Detection and correction of silent data corruption for large-scale high-performance computing

D Fiala, F Mueller, C Engelmann… - SC'12: Proceedings …, 2012 - ieeexplore.ieee.org

Faults have become the norm rather than the exception for high-end computing clusters.
Exacerbating this situation, some of these faults remain undetected, manifesting themselves …

Cited by 393, 30 versions

[PDF] ncsu.edu

Combining partial redundancy and checkpointing for HPC

J Elliott, K Kharbas, D Fiala, F Mueller… - 2012 IEEE 32nd …, 2012 - ieeexplore.ieee.org

Today's largest High Performance Computing (HPC) systems exceed one Petaflops (10^15
floating point operations per second) and exascale systems are projected within seven …

Cited by 210, 20 versions

[PDF] researchgate.net

Software approaches for resilience of high performance computing systems: a survey

J Jia, Y Liu, G Zhang, Y Gao, D Qian - Frontiers of Computer Science, 2023 - Springer

With the scaling up of high-performance computing systems in recent years, their reliability
has been descending continuously. Therefore, system resilience has been regarded as one …

Cited by 9, 6 versions

[PDF] christian-engelmann.info

[PDF] Redundant execution of HPC applications with MR-MPI

C Engelmann, S Böhm - Proceedings of the 10th IASTED …, 2011 - christian-engelmann.info

This paper presents a modular-redundant Message Passing Interface (MPI) solution, MR-MPI,
for transparently executing high-performance computing (HPC) applications in a …

Cited by 81, 15 versions

[PDF] researchgate.net

Reachability testing: An approach to testing concurrent software

GH Hwang, KC Tai, TL Huang - International Journal of Software …, 1995 - World Scientific

Concurrent programs are more difficult to test than sequential programs because of
non-deterministic behavior. An execution of a concurrent program non-deterministically …

Cited by 132, 13 versions

[PDF] academia.edu

[PDF] Process Migration for Resilient Applications

K McGill, S Taylor - Dartmouth College, 2011 - academia.edu

The notion of resiliency is concerned with constructing mission-critical distributed
applications that are able to operate through a wide variety of failures, errors, and malicious …

Cited by 56, 2 versions

Proactive process-level live migration and back migration in HPC environments

C Wang, F Mueller, C Engelmann, SL Scott - Journal of Parallel and …, 2012 - Elsevier

As the number of nodes in high-performance computing environments keeps increasing,
faults are becoming common place. Reactive fault tolerance (FT) often does not scale due to …

Cited by 62, 6 versions

[PDF] acm.org

Replication is more efficient than you think

A Benoit, T Herault, VL Fèvre, Y Robert - Proceedings of the International …, 2019 - dl.acm.org

This paper revisits replication coupled with checkpointing for fail-stop errors. Replication
enables the application to survive many fail-stop errors, thereby allowing for longer …

Cited by 18, 18 versions

[PDF] proquest.com

[BOOK] Coordinated checkpoint/restart process fault tolerance for MPI applications on HPC systems

J Hursey - 2010 - search.proquest.com

Scientists use advanced computing techniques to assist in answering the complex questions
at the forefront of discovery. The High Performance Computing (HPC) scientific applications …

Cited by 43, 6 versions

[PDF] iisc.ac.in

Fault tolerance on large scale systems using adaptive process replication

C George, S Vadhiyar - IEEE Transactions on Computers, 2014 - ieeexplore.ieee.org

Exascale systems of the future are predicted to have mean time between failures (MTBF) of
less than one hour. At such low MTBFs, employing periodic checkpointing alone will result in …

Cited by 29, 7 versions

Volpexmpi: An MPI library for execution of parallel applications on volatile nodes

Detection and correction of silent data corruption for large-scale high-performance computing
