A survey of machine learning for computer architecture and systems
It has been a long time that computer architecture and systems are optimized for efficient
execution of machine learning (ML) models. Now, it is time to reconsider the relationship …
A survey of AIOps methods for failure management
Modern society is increasingly moving toward complex and distributed computing systems.
The increase in scale and complexity of these systems challenges O&M teams that perform …
Making disk failure predictions SMARTer!
Disk drives are one of the most commonly replaced hardware components and continue to
pose challenges for accurate failure prediction. In this work, we present analysis and …
Disk failure prediction in data centers via online learning
Disk failure has become a major concern with the rapid expansion of storage systems in
data centers. Based on SMART (Self-Monitoring, Analysis and Reporting Technology) …
Improving service availability of cloud systems by predicting disk error
High service availability is crucial for cloud systems. A typical cloud system uses a large
number of physical hard disk drives. Disk errors are one of the most important reasons that …
Lessons and actions: What we learned from 10k SSD-Related storage system failures
Modern datacenters increasingly use flash-based solid state drives (SSDs) for high
performance and low energy cost. However, SSD introduces more complex failure modes …
Cluster storage systems gotta have HeART: improving storage efficiency by exploiting disk-reliability heterogeneity
Large-scale cluster storage systems typically consist of a heterogeneous mix of storage
devices with significantly varying failure rates. Despite such differences among devices …
Multi-view feature-based SSD failure prediction: What, when, and why
Y Zhang, W Hao, B Niu, K Liu, S Wang, N Liu… - … USENIX Conference on …, 2023 - usenix.org
Solid state drives (SSDs) play an important role in large-scale data centers. SSD failures
affect the stability of storage systems and cause additional maintenance overhead. To …
An empirical study of the impact of data splitting decisions on the performance of AIOps solutions
AIOps (Artificial Intelligence for IT Operations) leverages machine learning models to help
practitioners handle the massive data produced during the operations of large-scale …
Tiger: Disk-Adaptive redundancy without placement restrictions
Large-scale cluster storage systems use redundancy (via erasure coding) to ensure data
durability. Disk-adaptive redundancy—dynamically tailoring the redundancy scheme to …