System-level hardware failure prediction using deep learning

X Sun, K Chakrabarty, R Huang, Y Chen… - Proceedings of the 56th …, 2019 - dl.acm.org
Disk and memory faults are the leading causes of server breakdown. A proactive solution is
to predict such hardware failure at the runtime and then isolate the hardware at risk and …

DRAM failure prediction in AIOps: Empirical evaluation, challenges and opportunities

Z Wu, H Xu, G Pang, F Yu, Y Wang, S Jian… - arXiv preprint arXiv …, 2021 - arxiv.org
DRAM failure prediction is a vital task in AIOps, which is crucial to maintain the reliability and
sustainable service of large-scale data centers. However, limited work has been done on …

An in-depth correlative study between DRAM errors and server failures in production data centers

Z Cheng, S Han, PPC Lee, X Li… - 2022 41st International …, 2022 - ieeexplore.ieee.org
Dynamic Random Access Memory (DRAM) errors are prevalent and lead to server failures
in production data centers. However, little is known about the correlation between DRAM …

A case for transparent reliability in DRAM systems

M Patel, T Shahroodi, A Manglik, AG Yaglikci… - arXiv preprint arXiv …, 2022 - arxiv.org
Today's systems have diverse needs that are difficult to address using one-size-fits-all
commodity DRAM. Unfortunately, although system designers can theoretically adapt …

A survey on AI for storage

Y Liu, H Wang, K Zhou, CH Li, R Wu - CCF Transactions on High …, 2022 - Springer
Storage, as a core function and fundamental component of computers, provides services for
saving and reading digital data. The increasing complexity of data operations and storage …

Cost-aware prediction of uncorrected DRAM errors in the field

I Boixaderas, D Zivanovic, S Moré… - … Conference for High …, 2020 - ieeexplore.ieee.org
This paper presents and evaluates a method to predict DRAM uncorrected errors, a leading
cause of hardware failures in large-scale HPC clusters. The method uses a random forest …

From correctable memory errors to uncorrectable memory errors: What error bits tell

C Li, Y Zhang, J Wang, H Chen, X Liu… - … Conference for High …, 2022 - ieeexplore.ieee.org
Uncorrectable memory errors are one of the major failure causes in datacenters. In this
paper, we present an empirical study correlating correctable errors (CEs) and uncorrectable …

HiMFP: Hierarchical intelligent memory failure prediction for cloud service reliability

Q Yu, W Zhang, P Notaro, S Haeri… - 2023 53rd Annual …, 2023 - ieeexplore.ieee.org
In large-scale datacenters, memory failure is one of the leading causes of server crashes,
and uncorrectable error (UCE) is the major fault type indicating defects of memory modules …

Predicting uncorrectable memory errors for proactive replacement: An empirical study on large-scale field data

X Du, C Li, S Zhou, M Ye, J Li - 2020 16th European …, 2020 - ieeexplore.ieee.org
Uncorrectable memory errors are the leading causes of server failures in datacenters.
Predicting uncorrectable errors (UEs) using the historical correctable error (CE) information …

Workload-aware DRAM error prediction using machine learning

L Mukhanov, K Tovletoglou… - 2019 IEEE …, 2019 - ieeexplore.ieee.org
The aggressive scaling of technology may have helped to meet the growing demand for
higher memory capacity and density, but has also made DRAM cells more prone to errors …