Memory errors in modern systems: The good, the bad, and the ugly

V Sridharan, N DeBardeleben, S Blanchard… - ACM SIGARCH …, 2015 - dl.acm.org
Several recent publications have shown that hardware faults in the memory subsystem are
commonplace. These faults are predicted to become more frequent in future systems that …

DRAM errors in the wild: a large-scale field study

B Schroeder, E Pinheiro, WD Weber - ACM SIGMETRICS Performance …, 2009 - dl.acm.org
Errors in dynamic random access memory (DRAM) are a common form of hardware failure
in modern compute clusters. Failures are costly both in terms of hardware replacement costs …

Revisiting memory errors in large-scale production data centers: Analysis and modeling of new trends from the field

J Meza, Q Wu, S Kumar, O Mutlu - 2015 45th Annual IEEE/IFIP …, 2015 - ieeexplore.ieee.org
Computing systems use dynamic random-access memory (DRAM) as main memory. As
prior works have shown, failures in DRAM devices are an important source of errors in …

A study of DRAM failures in the field

V Sridharan, D Liberty - SC'12: Proceedings of the International …, 2012 - ieeexplore.ieee.org
Most modern computer systems use dynamic random access memory (DRAM) as a main
memory store. Recent publications have confirmed that DRAM errors are a common source …

AVATAR: A variable-retention-time (VRT) aware refresh for DRAM systems

MK Qureshi, DH Kim, S Khan, PJ Nair… - 2015 45th Annual …, 2015 - ieeexplore.ieee.org
Multirate refresh techniques exploit the non-uniformity in retention times of DRAM cells to
reduce the DRAM refresh overheads. Such techniques rely on accurate profiling of retention …

Cosmic rays don't strike twice: understanding the nature of DRAM errors and the implications for system design

AA Hwang, IA Stefanovici, B Schroeder - ACM SIGPLAN Notices, 2012 - dl.acm.org
Main memory is one of the leading hardware causes for machine crashes in today's
datacenters. Designing, evaluating and modeling systems that are resilient against memory …

Design-induced latency variation in modern DRAM chips: Characterization, analysis, and latency reduction mechanisms

D Lee, S Khan, L Subramanian, S Ghose… - Proceedings of the …, 2017 - dl.acm.org
Variation has been shown to exist across the cells within a modern DRAM chip. Prior work
has studied and exploited several forms of variation, such as manufacturing-process-or …

Feng shui of supercomputer memory: Positional effects in DRAM and SRAM faults

V Sridharan, J Stearley, N DeBardeleben… - Proceedings of the …, 2013 - dl.acm.org
Several recent publications confirm that faults are common in high-performance computing
systems. Therefore, further attention to the faults experienced by such computing systems is …

A systematic study of DDR4 DRAM faults in the field

MV Beigi, Y Cao, S Gurumurthi… - … Symposium on High …, 2023 - ieeexplore.ieee.org
This paper presents a study of DDR4 DRAM faults in a large fleet of commodity servers,
covering several billion memory device-hours of data. The goal of this study is to understand …

A unified coded deep neural network training strategy based on Generalized PolyDot codes

S Dutta, Z Bai, H Jeong, TM Low… - 2018 IEEE International …, 2018 - ieeexplore.ieee.org
This paper has two main contributions. First, we propose a novel coding technique,
Generalized PolyDot, for matrix-vector products that advances on existing techniques for …