Detection and correction of silent data corruption for large-scale high-performance computing
Faults have become the norm rather than the exception for high-end computing clusters.
Exacerbating this situation, some of these faults remain undetected, manifesting themselves …
Combining partial redundancy and checkpointing for HPC
Today's largest High Performance Computing (HPC) systems exceed one Petaflops (10^15
floating point operations per second) and exascale systems are projected within seven …
Software approaches for resilience of high performance computing systems: a survey
With the scaling up of high-performance computing systems in recent years, their reliability
has been descending continuously. Therefore, system resilience has been regarded as one …
[PDF][PDF] Redundant execution of HPC applications with MR-MPI
This paper presents a modular-redundant Message Passing Interface (MPI) solution, MR-MPI,
for transparently executing high-performance computing (HPC) applications in a …
Reachability testing: An approach to testing concurrent software
GH Hwang, KC Tai, TL Huang - International Journal of Software …, 1995 - World Scientific
Concurrent programs are more difficult to test than sequential programs because of
non-deterministic behavior. An execution of a concurrent program non-deterministically …
[PDF][PDF] Process Migration for Resilient Applications
K McGill, S Taylor - Dartmouth College, 2011 - academia.edu
The notion of resiliency is concerned with constructing mission-critical distributed
applications that are able to operate through a wide variety of failures, errors, and malicious …
Proactive process-level live migration and back migration in HPC environments
As the number of nodes in high-performance computing environments keeps increasing,
faults are becoming common place. Reactive fault tolerance (FT) often does not scale due to …
Replication is more efficient than you think
This paper revisits replication coupled with checkpointing for fail-stop errors. Replication
enables the application to survive many fail-stop errors, thereby allowing for longer …
[BOOK][B] Coordinated checkpoint/restart process fault tolerance for MPI applications on HPC systems
J Hursey - 2010 - search.proquest.com
Scientists use advanced computing techniques to assist in answering the complex questions
at the forefront of discovery. The High Performance Computing (HPC) scientific applications …
Fault tolerance on large scale systems using adaptive process replication
C George, S Vadhiyar - IEEE Transactions on Computers, 2014 - ieeexplore.ieee.org
Exascale systems of the future are predicted to have mean time between failures (MTBF) of
less than one hour. At such low MTBFs, employing periodic checkpointing alone will result in …