Google Scholar

Addressing failures in exascale computing

M Snir, RW Wisniewski, JA Abraham… - … Journal of High …, 2014 - journals.sagepub.com

We present here a report produced by a workshop on 'Addressing failures in exascale
computing' held in Park City, Utah, 4–11 August 2012. The charter of this workshop was to …

Cited by 539

[Book] Fault tolerance techniques for high-performance computing

J Dongarra, T Herault, Y Robert - 2015 - Springer

This chapter provides an introduction to resilience methods. The emphasis is on
checkpointing, the de-facto standard technique for resilience in High Performance …

Cited by 269

Post-failure recovery of MPI communication capability: Design and rationale

W Bland, A Bouteiller, T Herault… - … Journal of High …, 2013 - journals.sagepub.com

As supercomputers are entering an era of massive parallelism where the frequency of faults
is increasing, the MPI Standard remains distressingly vague on the consequence of failures …

Cited by 276

The SIMNET virtual world architecture

J Calvin, A Dickens, B Gaines… - Proceedings of IEEE …, 1993 - ieeexplore.ieee.org

Many tools and techniques have been developed to address specific aspects of interacting
in a virtual world. Few have been designed with an architecture that allows large numbers of …

Cited by 299

A user-level InfiniBand-based file system and checkpoint strategy for burst buffers

K Sato, K Mohror, A Moody, T Gamblin… - 2014 14th IEEE/ACM …, 2014 - ieeexplore.ieee.org

Checkpoint/Restart is an indispensable fault tolerance technique commonly used by
high-performance computing applications that run continuously for hours or days at a time …

Cited by 84

Unified model for assessing checkpointing protocols at extreme‐scale

G Bosilca, A Bouteiller, E Brunet… - Concurrency and …, 2014 - Wiley Online Library

In this paper, we present a unified model for several well‐known checkpoint/restart
protocols. The proposed model is generic enough to encompass both extremes of the …

Cited by 84

Local rollback for resilient MPI applications with application-level checkpointing and message logging

N Losada, G Bosilca, A Bouteiller, P González… - Future Generation …, 2019 - Elsevier

The resilience approach generally used in high-performance computing (HPC) relies on
coordinated checkpoint/restart, a global rollback of all the processes that are running the …

Cited by 34

Systems and methods for fault tolerant communications

R Knight - US Patent 9,424,149, 2016 - Google Patents

Apparatuses, systems and methods are disclosed for tolerating fault in a communications
grid. Specifically, various techniques and systems are provided for detecting a fault or failure …

Cited by 48

Efficient synchronization under global EDF scheduling on multiprocessors

UMC Devi, H Leontyev… - … Euromicro Conference on …, 2006 - ieeexplore.ieee.org

We consider coordinating accesses to shared data structures in multiprocessor real-time
systems scheduled under preemptive global EDF. To our knowledge, prior work on global …

Cited by 99

HydEE: Failure containment without event logging for large scale send-deterministic MPI applications

A Guermouche, T Ropars, M Snir… - 2012 IEEE 26th …, 2012 - ieeexplore.ieee.org

High performance computing will probably reach exascale in this decade. At this scale,
mean time between failures is expected to be a few hours. Existing fault tolerant protocols for …

Cited by 75

Correlated set coordination in fault tolerant message logging protocols
