Task-level resilience: checkpointing vs. supervision

J Posner, L Reitz, C Fohry - International Journal of Networking and …, 2022 - jstage.jst.go.jp
With the advent of exascale computing, issues such as application irregularity and
permanent hardware failure are growing in importance. Irregularity is often addressed by …

Assessing the use cases of persistent memory in high-performance scientific computing

Y Fridman, Y Snir, M Rusanovsky, K Zvi… - 2021 IEEE/ACM 11th …, 2021 - ieeexplore.ieee.org
As the High Performance Computing (HPC) world moves towards the Exa-Scale era, huge
amounts of data should be analyzed, manipulated and stored. In the traditional stor …

Checkpointing vs. supervision resilience approaches for dynamic independent tasks

J Posner, L Reitz, C Fohry - 2021 IEEE International Parallel …, 2021 - ieeexplore.ieee.org
With the advent of exascale computing, issues such as application irregularity and
permanent hardware failure are growing in importance. Irregularity is often addressed by …

Application-based fault tolerance for numerical linear algebra at large scale

DA Torres González - European Conference on Parallel Processing, 2021 - Springer
Large scale architectures provide us with high computing power, but as the size of the
systems grows, computation units are more likely to fail. Fault-tolerant mechanisms have …

[CITATION][C] Performance improvement using a multi-checkpointing technique in micro-batch streaming systems

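박규리, 박성용 - Proceedings of the Korean Institute of Information Scientists and Engineers (KIISE) Conference, 2023 - dbpia.co.kr
Abstract: In today's big-data environments, LSM-tree-based key-value stores have been
adopted as the state store of streaming systems for stateful real-time stream processing. In micro-batch streaming systems, …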