A review of data centers energy consumption and reliability modeling
KMU Ahmed, MHJ Bollen, M Alvarez - IEEE access, 2021 - ieeexplore.ieee.org
Enhancing the efficiency and the reliability of the data center are the technical challenges for
maintaining the quality of services for the end-users in the data center operation. The energy …
Task failure prediction in cloud data centers using deep learning
A large-scale cloud data center needs to provide high service reliability and availability with
low failure occurrence probability. However, current large-scale cloud data centers still face …
A systematic mapping study in AIOps
IT systems of today are becoming larger and more complex, rendering their human
supervision more difficult. Artificial Intelligence for IT Operations (AIOps) has been proposed …
A survey of aiops methods for failure management
Modern society is increasingly moving toward complex and distributed computing systems.
The increase in scale and complexity of these systems challenges O&M teams that perform …
Characteristics of co-allocated online services and batch jobs in internet data centers: a case study from Alibaba cloud
In order to reduce power and energy costs, giant cloud providers now mix online and batch
jobs on the same cluster. Although the co-allocation of such jobs improves machine …
Hora: Architecture-aware online failure prediction
Complex software systems experience failures at runtime even though a lot of effort is put
into the development and operation. Reactive approaches detect these failures after they …
Cloud-native computing: A survey from the perspective of services
The development of cloud computing delivery models inspires the emergence of cloud-
native computing. Cloud-native computing, as the most influential development principle for …
Towards thermal-aware workload distribution in cloud data centers based on failure models
Increasing workload conditions lead to a significant surge in power consumption and
computing node failures in data centers. The existing workload distribution strategies …
Predicting faults in high performance computing systems: An in-depth survey of the state-of-the-practice
As we near exascale, resilience remains a major technical hurdle. Any technique with the
goal of achieving resilience suffers from having to be reactive, as failures can appear at any …
Adaptive impact-driven detection of silent data corruption for HPC applications
For exascale HPC applications, silent data corruption (SDC) is one of the most dangerous
problems because there is no indication that there are errors during the execution. We …