Research and Technologies for next-generation high-temperature data centers–State-of-the-arts and future perspectives

Y Zhang, K Shan, X Li, H Li, S Wang - Renewable and Sustainable Energy …, 2023 - Elsevier
Data centers have attracted increasing attention worldwide over the last decades due to
their high energy consumption. Cooling accounts for about 30–40% of the total energy …

A survey on automated log analysis for reliability engineering

S He, P He, Z Chen, T Yang, Y Su, MR Lyu - ACM computing surveys …, 2021 - dl.acm.org
Logs are semi-structured text generated by logging statements in software source code. In
recent decades, software logs have become imperative in the reliability assurance …

Task failure prediction in cloud data centers using deep learning

J Gao, H Wang, H Shen - IEEE transactions on services …, 2020 - ieeexplore.ieee.org
A large-scale cloud data center needs to provide high service reliability and availability with
low failure occurrence probability. However, current large-scale cloud data centers still face …

Oobleck: Resilient distributed training of large models using pipeline templates

I Jang, Z Yang, Z Zhang, X Jin… - Proceedings of the 29th …, 2023 - dl.acm.org
Oobleck enables resilient distributed training of large DNN models with guaranteed fault
tolerance. It takes a planning-execution co-design approach, where it first generates a set of …

A review of data centers energy consumption and reliability modeling

KMU Ahmed, MHJ Bollen, M Alvarez - IEEE access, 2021 - ieeexplore.ieee.org
Enhancing the efficiency and the reliability of the data center are the technical challenges for
maintaining the quality of services for the end-users in the data center operation. The energy …

Workflowsim: A toolkit for simulating scientific workflows in distributed environments

W Chen, E Deelman - … IEEE 8th international conference on E …, 2012 - ieeexplore.ieee.org
Simulation is one of the most popular evaluation methods in scientific workflow studies.
However, existing workflow simulators fail to provide a framework that takes into …

Addressing failures in exascale computing

M Snir, RW Wisniewski, JA Abraham… - … Journal of High …, 2014 - journals.sagepub.com
We present here a report produced by a workshop on 'Addressing failures in exascale
computing' held in Park City, Utah, 4–11 August 2012. The charter of this workshop was to …

Why does the cloud stop computing? lessons from hundreds of service outages

HS Gunawi, M Hao, RO Suminto, A Laksono… - Proceedings of the …, 2016 - dl.acm.org
We conducted a cloud outage study (COS) of 32 popular Internet services. We analyzed
1247 headline news and public post-mortem reports that detail 597 unplanned outages that …

Memory errors in modern systems: The good, the bad, and the ugly

V Sridharan, N DeBardeleben, S Blanchard… - ACM SIGARCH …, 2015 - dl.acm.org
Several recent publications have shown that hardware faults in the memory subsystem are
commonplace. These faults are predicted to become more frequent in future systems that …

Exascale computing and big data

DA Reed, J Dongarra - Communications of the ACM, 2015 - dl.acm.org
Communications of the ACM, July 2015, Vol. 58, No. 7, p. 56. Contributed articles.
Illustration by Peter Bollinger. DOI:10.1145/2699414 …