Google Scholar

Addressing failures in exascale computing

M Snir, RW Wisniewski, JA Abraham… - … Journal of High …, 2014 - journals.sagepub.com

We present here a report produced by a workshop on 'Addressing failures in exascale
computing' held in Park City, Utah, 4–11 August 2012. The charter of this workshop was to …

Cited by 539 · All 20 versions

[PDF] utk.edu

[BOOK] Fault tolerance techniques for high-performance computing

J Dongarra, T Herault, Y Robert - 2015 - Springer

This chapter provides an introduction to resilience methods. The emphasis is on
checkpointing, the de-facto standard technique for resilience in High Performance …

Cited by 269 · All 22 versions

[PDF] nist.gov

Reliability in grid computing systems

C Dabrowski - Concurrency and Computation: Practice and …, 2009 - Wiley Online Library

In recent years, grid technology has emerged as an important tool for solving
compute-intensive problems within the scientific community and in industry. To further the …

Cited by 124 · All 7 versions

[PDF] unipd.it

Cloak and dagger: from two permissions to complete control of the UI feedback loop

Y Fratantonio, C Qian, SP Chung… - 2017 IEEE Symposium …, 2017 - ieeexplore.ieee.org

The effectiveness of the Android permission system fundamentally hinges on the user's
correct understanding of the capabilities of the permissions being granted. In this paper, we …

Cited by 163 · All 9 versions

[PDF] hal.science

Uncoordinated checkpointing without domino effect for send-deterministic MPI applications

A Guermouche, T Ropars, E Brunet… - … Parallel & Distributed …, 2011 - ieeexplore.ieee.org

As reported by many recent studies, the mean time between failures of future post-petascale
supercomputers is likely to reduce, compared to the current situation. The most popular fault …

Cited by 183 · All 16 versions

[PDF] rutgers.edu

Exploring automatic, online failure recovery for scientific applications at extreme scales

M Gamell, DS Katz, H Kolla, J Chen… - SC'14: Proceedings …, 2014 - ieeexplore.ieee.org

Application resilience is a key challenge that must be addressed in order to realize the
exascale vision. Process/node failures, an important class of failures, are typically handled …

Cited by 130 · All 8 versions

[PDF] hal.science

Bi-objective scheduling algorithms for optimizing makespan and reliability on heterogeneous systems

JJ Dongarra, E Jeannot, E Saule, Z Shi - Proceedings of the nineteenth …, 2007 - dl.acm.org

We tackle the problem of scheduling task graphs onto a heterogeneous set of machines,
where each processor has a probability of failure governed by an exponential law. The goal …

Cited by 186 · All 36 versions

[PDF] sciencedirect.com

Fault tolerance of MPI applications in exascale systems: The ULFM solution

N Losada, P González, MJ Martín, G Bosilca… - Future Generation …, 2020 - Elsevier

The growth in the number of computational resources used by high-performance computing
(HPC) systems leads to an increase in failure rates. Fault-tolerant techniques will become …

Cited by 59 · All 7 versions

[PDF] psu.edu

The reliability wall for exascale supercomputing

X Yang, Z Wang, J Xue, Y Zhou - IEEE Transactions on …, 2011 - ieeexplore.ieee.org

Reliability is a key challenge to be understood to turn the vision of exascale supercomputing
into reality. Inevitably, large-scale supercomputing systems, especially those at the …

Cited by 111 · All 9 versions

[PDF] christian-engelmann.de

[PDF] The case for modular redundancy in large-scale high performance computing systems

C Engelmann, HH Ong… - Proceedings of the 8th …, 2009 - christian-engelmann.de

Recent investigations into resilience of large-scale high-performance computing (HPC)
systems showed a continuous trend of decreasing reliability and availability. Newly installed …

MPICH-V project: A multiprotocol automatic fault-tolerant MPI