Predictive reliability and fault management in exascale systems: State of the art and perspectives

R Canal, C Hernandez, R Tornero, A Cilardo… - ACM Computing …, 2020 - dl.acm.org
Performance and power constraints come together with Complementary Metal Oxide
Semiconductor technology scaling in future Exascale systems. Technology scaling makes …

[BOOK] Parallel programming for modern high performance computing systems

P Czarnul - 2018 - books.google.com
In view of the growing presence and popularity of multicore and manycore processors,
accelerators, and coprocessors, as well as clusters using such computing devices, the …

[PDF] Static analysis-based approaches for secure software development

M Siavvas, E Gelenbe, D Kehagias… - Security in Computer …, 2018 - library.oapen.org
Software security is a matter of major concern for software development enterprises that
wish to deliver highly secure software products to their customers. Static analysis is …

CRUM: Checkpoint-restart support for CUDA's unified memory

R Garg, A Mohan, M Sullivan… - 2018 IEEE International …, 2018 - ieeexplore.ieee.org
Unified Virtual Memory (UVM) was recently introduced with CUDA version 8 and the Pascal
GPU. The older CUDA programming style is akin to older large-memory UNIX applications …

Checkpoint restart support for heterogeneous hpc applications

K Parasyris, K Keller… - 2020 20th IEEE/ACM …, 2020 - ieeexplore.ieee.org
As we approach the era of exa-scale computing, fault tolerance is of growing importance.
The increasing number of cores as well as the increased complexity of modern …

CRAC: checkpoint-restart architecture for CUDA with streams and UVM

T Jain, G Cooperman - SC20: International Conference for High …, 2020 - ieeexplore.ieee.org
The share of the top 500 supercomputers with NVIDIA GPUs is now over 25% and continues
to grow. While fault tolerance is a critical issue for supercomputing, there does not currently …

MANA for MPI: MPI-agnostic network-agnostic transparent checkpointing

R Garg, G Price, G Cooperman - … of the 28th international symposium on …, 2019 - dl.acm.org
Transparently checkpointing MPI for fault tolerance and load balancing is a long-standing
problem in HPC. The problem has been complicated by the need to provide checkpoint …

Asymmetric resilience: Exploiting task-level idempotency for transient error recovery in accelerator-based systems

J Leng, A Buyuktosunoglu, R Bertran… - … Symposium on High …, 2020 - ieeexplore.ieee.org
Accelerators make the task of building systems that are resilient against transient errors like
voltage noise and soft errors hard. Architects integrate accelerators into the system as black …

Distributed configuration, authorization and management in the cloud-based internet of things

M Henze, B Wolters, R Matzutt… - 2017 IEEE Trustcom …, 2017 - ieeexplore.ieee.org
Network-based deployments within the Internet of Things increasingly rely on the cloud-
controlled federation of individual networks to configure, authorize, and manage devices …

Capturing snapshots of offload applications on many-core coprocessors

CH Li, G Coviello, S Chakradhar, A Rezaei - US Patent 10,678,550, 2020 - Google Patents
Methods are provided. A method includes capturing a snapshot of an offload process being
executed by one or more many-core processors. The offload process is in signal …