Assess and summarize: Improve outage understanding with large language models

P Jin, S Zhang, M Ma, H Li, Y Kang, L Li, Y Liu… - Proceedings of the 31st …, 2023 - dl.acm.org
Cloud systems have become increasingly popular in recent years due to their flexibility and
scalability. Each time cloud computing applications and services hosted on the cloud are …

MonitorAssistant: Simplifying cloud service monitoring via large language models

Z Yu, M Ma, C Zhang, S Qin, Y Kang, C Bansal… - … Proceedings of the …, 2024 - dl.acm.org
In large-scale cloud service systems, monitoring metric data and conducting anomaly
detection is an important way to maintain reliability and stability. However, great disparity …

CODEC: Cost-effective duration prediction system for deadline scheduling in the cloud

H Li, M Ma, Y Liu, S Qin, B Qiao, R Yao… - 2023 IEEE 34th …, 2023 - ieeexplore.ieee.org
Modern cloud platforms allow customers to flexibly allocate or release computing resources.
One crucial scenario is how to drive existing VMs to a specific state by a given deadline in a …

Can We Trust Auto-Mitigation? Improving Cloud Failure Prediction with Uncertain Positive Learning

H Li, M Ma, Y Liu, P Zhao, S Li, Z Li… - 2024 IEEE 35th …, 2024 - ieeexplore.ieee.org
In the rapidly expanding domain of cloud computing, a variety of software services have
been deployed in the cloud. To ensure the reliability of cloud services, prior studies focus on …

Methods for the adaptive provisioning of resources to iterative batch jobs

D Scheinert - 2024 - depositonce.tu-berlin.de
In light of continuously growing amounts of data as well as the proliferation of machine
learning use cases, batch processing of data remains an important procedure. Batch …