Survey on redundancy based-fault tolerance methods for processors and hardware accelerators-trends in quantum computing, heterogeneous systems and reliability

S Venkatesha, R Parthasarathi - ACM Computing Surveys, 2024 - dl.acm.org
Rapid progress in CMOS technology since the late 1990s has increased the vulnerability of
processors toward faults. Subsequently, the focus of computer architects has shifted toward …

Addressing failures in exascale computing

M Snir, RW Wisniewski, JA Abraham… - … Journal of High …, 2014 - journals.sagepub.com
We present here a report produced by a workshop on 'Addressing failures in exascale
computing' held in Park City, Utah, 4–11 August 2012. The charter of this workshop was to …

[BOOK][B] Architecture design for soft errors

S Mukherjee - 2011 - books.google.com
Architecture Design for Soft Errors provides a comprehensive description of the architectural
techniques to tackle the soft error problem. It covers the new methodologies for quantitative …

Understanding the propagation of hard errors to software and implications for resilient system design

ML Li, P Ramachandran, SK Sahoo, SV Adve… - ACM Sigplan …, 2008 - dl.acm.org
With continued CMOS scaling, future shipped hardware will be increasingly vulnerable to
in-the-field faults. To be broadly deployable, the hardware reliability solution must incur low …

Configurable isolation: building high availability systems with commodity multi-core processors

N Aggarwal, P Ranganathan, NP Jouppi… - ACM SIGARCH …, 2007 - dl.acm.org
High availability is an increasingly important requirement for enterprise systems, often
valued more than performance. Systems designed for high availability typically use …

Architectural semantics for practical transactional memory

A McDonald, JW Chung, BD Carlstrom… - ACM SIGARCH …, 2006 - dl.acm.org
Transactional Memory (TM) simplifies parallel programming by allowing for parallel
execution of atomic tasks. Thus far, TM systems have focused on implementing transactional …

Architectures for online error detection and recovery in multicore processors

D Gizopoulos, M Psarakis, SV Adve… - … , Automation & Test …, 2011 - ieeexplore.ieee.org
The huge investment in the design and production of multicore processors may be put at risk
because the emerging highly miniaturized but unreliable fabrication technologies will …

Using likely program invariants to detect hardware errors

SK Sahoo, ML Li, P Ramachandran… - … and Networks With …, 2008 - ieeexplore.ieee.org
In the near future, hardware is expected to become increasingly vulnerable to faults due to
continuously decreasing feature size. Software-level symptoms have previously been used …

mSWAT: Low-cost hardware fault detection and diagnosis for multicore systems

SK Sastry Hari, ML Li, P Ramachandran… - Proceedings of the …, 2009 - dl.acm.org
Continued technology scaling is resulting in systems with billions of devices. Unfortunately,
these devices are prone to failures from various sources, resulting in even commodity …

Software-based online detection of hardware defects: mechanisms, architectural support, and evaluation

K Constantinides, O Mutlu, T Austin… - 40th Annual IEEE …, 2007 - ieeexplore.ieee.org
As silicon process technology scales deeper into the nanometer regime, hardware defects
are becoming more common. Such defects are bound to hinder the correct operation of …