IT infrastructure anomaly detection and failure handling: A systematic literature review focusing on datasets, log preprocessing, machine & deep learning approaches …

DA Bhanage, AV Pawar, K Kotecha - IEEE Access, 2021 - ieeexplore.ieee.org
Nowadays, reliability assurance is crucial in components of IT infrastructures. Unavailability
of any element or connection results in downtime and triggers monetary and performance …

New frontiers in IoT: Networking, systems, reliability, and security challenges

S Bagchi, TF Abdelzaher, R Govindan… - IEEE Internet of …, 2020 - ieeexplore.ieee.org
The field of IoT has blossomed and is positively influencing many application domains. In
this article, we bring out the unique challenges this field poses to research in computer …

Adaptive fault diagnosis

Q Zhu, T Tung, Q Xie - US Patent 9,298,525, 2016 - Google Patents
According to an example, an adaptive fault diagnosis system may include a memory storing
machine readable instructions to receive metrics and events from an enterprise system, and …

CloudPD: Problem determination and diagnosis in shared dynamic clouds

B Sharma, P Jayachandran, A Verma… - 2013 43rd Annual …, 2013 - ieeexplore.ieee.org
In this work, we address problem determination in virtualized clouds. We show that high
dynamism, resource sharing, frequent reconfiguration, high propensity to faults and …

AMPT-GA: automatic mixed precision floating point tuning for GPU applications

PV Kotipalli, R Singh, P Wood, I Laguna… - Proceedings of the ACM …, 2019 - dl.acm.org
Mixed precision computations improve high performance computing throughput for
applications that can tolerate decreased mathematical precision in their computations …

Network anomaly detection and identification based on deep learning methods

M Zhu, K Ye, CZ Xu - Cloud Computing–CLOUD 2018: 11th International …, 2018 - Springer
Network anomaly detection is the process of determining when network behavior has
deviated from the normal behavior. The detection of abnormal events in large dynamic …

An industrial case study of automatically identifying performance regression-causes

THD Nguyen, M Nagappan, AE Hassan… - Proceedings of the 11th …, 2014 - dl.acm.org
Even the addition of a single extra field or control statement in the source code of a
large-scale software system can lead to performance regressions. Such regressions can …

A review on software fault detection and prevention mechanism in software development activities

B Dhanalaxmi, GA Naidu, K Anuradha - Journal of Computer …, 2015 - academia.edu
The need of distributed and complex commercial applications in enterprise demands error
free and quality application systems. This makes it extremely important in software …

Non-intrusive anomaly detection with streaming performance metrics and logs for DevOps in public clouds: a case study in AWS

D Sun, M Fu, L Zhu, G Li, Q Lu - IEEE Transactions on Emerging …, 2016 - ieeexplore.ieee.org
Public clouds are a style of computing platforms, where scalable and elastic Information
Technology-enabled capabilities are provided as a service to external customers using …

Linking resource usage anomalies with system failures from cluster log data

E Chuah, A Jhumka… - 2013 IEEE 32nd …, 2013 - ieeexplore.ieee.org
Bursts of abnormally high use of resources are thought to be an indirect cause of failures in
large cluster systems, but little work has systematically investigated the role of high resource …