Making disk failure predictions SMARTer!

S Lu, B Luo, T Patel, Y Yao, D Tiwari… - 18th USENIX Conference …, 2020 - usenix.org
Disk drives are one of the most commonly replaced hardware components and continue to
pose challenges for accurate failure prediction. In this work, we present analysis and …

NOVA-Fortis: A fault-tolerant non-volatile main memory file system

J Xu, L Zhang, A Memaripour… - Proceedings of the 26th …, 2017 - dl.acm.org
Emerging fast, persistent memories will enable systems that combine conventional DRAM
with large amounts of non-volatile main memory (NVMM) and provide huge increases in …

Fail-slow at scale: Evidence of hardware performance faults in large production systems

HS Gunawi, RO Suminto, R Sears, C Golliher… - ACM Transactions on …, 2018 - dl.acm.org
Fail-slow hardware is an under-studied failure mode. We present a study of 114 reports of
fail-slow hardware incidents, collected from large-scale cluster deployments in 14 …

Remote data checking for network coding-based distributed storage systems

B Chen, R Curtmola, G Ateniese, R Burns - Proceedings of the 2010 …, 2010 - dl.acm.org
Remote Data Checking (RDC) is a technique by which clients can establish that data
outsourced at untrusted servers remains intact over time. RDC is useful as a prevention tool …

The tail at store: A revelation from millions of hours of disk and SSD deployments

M Hao, G Soundararajan… - … USENIX Conference on …, 2016 - usenix.org
We study storage performance in over 450,000 disks and 4,000 SSDs over 87 days for an
overall total of 857 million (disk) and 7 million (SSD) drive hours. We find that storage …

Perseus: A Fail-Slow detection framework for cloud storage systems

R Lu, E Xu, Y Zhang, F Zhu, Z Zhu, M Wang… - … USENIX Conference on …, 2023 - usenix.org
The newly-emerging "fail-slow" failures plague both software and hardware where the victim
components are still functioning yet with degraded performance. To address this problem …

Redundancy does not imply fault tolerance: Analysis of distributed storage reactions to file-system faults

A Ganesan, R Alagappan, AC Arpaci-Dusseau… - ACM Transactions on …, 2017 - dl.acm.org
We analyze how modern distributed storage systems behave in the presence of file-system
faults such as data corruption and read and write errors. We characterize eight popular …

Pangolin: A Fault-Tolerant persistent memory programming library

L Zhang, S Swanson - … Annual Technical Conference (USENIX ATC 19), 2019 - usenix.org
Non-volatile main memory (NVMM) allows programmers to build complex, persistent,
pointer-based data structures that can offer substantial performance gains over conventional …

Enabling data integrity protection in regenerating-coding-based cloud storage: Theory and implementation

HCH Chen, PPC Lee - IEEE transactions on parallel and …, 2013 - ieeexplore.ieee.org
To protect outsourced data in cloud storage against corruptions, adding fault tolerance to
cloud storage, along with efficient data integrity checking and recovery procedures …

Failure analysis of virtual and physical machines: Patterns, causes and characteristics

R Birke, I Giurgiu, LY Chen… - 2014 44th Annual …, 2014 - ieeexplore.ieee.org
In today's commercial data centers, the computation density grows continuously as the
number of hardware components and workloads in units of virtual machines increase. The …