A review of data centers energy consumption and reliability modeling

KMU Ahmed, MHJ Bollen, M Alvarez - IEEE access, 2021 - ieeexplore.ieee.org
Enhancing the efficiency and the reliability of the data center are the technical challenges for
maintaining the quality of services for the end-users in the data center operation. The energy …

Task failure prediction in cloud data centers using deep learning

J Gao, H Wang, H Shen - IEEE transactions on services …, 2020 - ieeexplore.ieee.org
A large-scale cloud data center needs to provide high service reliability and availability with
low failure occurrence probability. However, current large-scale cloud data centers still face …

A systematic mapping study in AIOps

P Notaro, J Cardoso, M Gerndt - International Conference on Service …, 2020 - Springer
IT systems of today are becoming larger and more complex, rendering their human
supervision more difficult. Artificial Intelligence for IT Operations (AIOps) has been proposed …

A survey of AIOps methods for failure management

P Notaro, J Cardoso, M Gerndt - ACM Transactions on Intelligent …, 2021 - dl.acm.org
Modern society is increasingly moving toward complex and distributed computing systems.
The increase in scale and complexity of these systems challenges O&M teams that perform …

Characteristics of co-allocated online services and batch jobs in internet data centers: a case study from Alibaba cloud

C Jiang, G Han, J Lin, G Jia, W Shi, J Wan - IEEE Access, 2019 - ieeexplore.ieee.org
In order to reduce power and energy costs, giant cloud providers now mix online and batch
jobs on the same cluster. Although the co-allocation of such jobs improves machine …

Hora: Architecture-aware online failure prediction

T Pitakrat, D Okanović, A Van Hoorn… - Journal of Systems and …, 2018 - Elsevier
Complex software systems experience failures at runtime even though a lot of effort is put
into the development and operation. Reactive approaches detect these failures after they …

Cloud-native computing: A survey from the perspective of services

S Deng, H Zhao, B Huang, C Zhang… - Proceedings of the …, 2024 - ieeexplore.ieee.org
The development of cloud computing delivery models inspires the emergence of cloud-
native computing. Cloud-native computing, as the most influential development principle for …

Towards thermal-aware workload distribution in cloud data centers based on failure models

J Li, Y Deng, Y Zhou, Z Zhang, G Min… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Increasing workload conditions lead to a significant surge in power consumption and
computing node failures in data centers. The existing workload distribution strategies …

Predicting faults in high performance computing systems: An in-depth survey of the state-of-the-practice

D Jauk, D Yang, M Schulz - … of the International Conference for High …, 2019 - dl.acm.org
As we near exascale, resilience remains a major technical hurdle. Any technique with the
goal of achieving resilience suffers from having to be reactive, as failures can appear at any …

Adaptive impact-driven detection of silent data corruption for HPC applications

S Di, F Cappello - IEEE Transactions on Parallel and …, 2016 - ieeexplore.ieee.org
For exascale HPC applications, silent data corruption (SDC) is one of the most dangerous
problems because there is no indication that there are errors during the execution. We …