IT infrastructure anomaly detection and failure handling: A systematic literature review focusing on datasets, log preprocessing, machine & deep learning approaches …
Nowadays, reliability assurance is crucial in components of IT infrastructures. Unavailability
of any element or connection results in downtime and triggers monetary and performance …
New frontiers in IoT: Networking, systems, reliability, and security challenges
The field of IoT has blossomed and is positively influencing many application domains. In
this article, we bring out the unique challenges this field poses to research in computer …
Adaptive fault diagnosis
Q Zhu, T Tung, Q Xie - US Patent 9,298,525, 2016 - Google Patents
According to an example, an adaptive fault diagnosis system may include a memory storing
machine readable instructions to receive metrics and events from an enterprise system, and …
CloudPD: Problem determination and diagnosis in shared dynamic clouds
In this work, we address problem determination in virtualized clouds. We show that high
dynamism, resource sharing, frequent reconfiguration, high propensity to faults and …
AMPT-GA: automatic mixed precision floating point tuning for GPU applications
Mixed precision computations improve high performance computing throughput for
applications that can tolerate decreased mathematical precision in their computations …
Network anomaly detection and identification based on deep learning methods
Network anomaly detection is the process of determining when network behavior has
deviated from the normal behavior. The detection of abnormal events in large dynamic …
An industrial case study of automatically identifying performance regression-causes
Even the addition of a single extra field or control statement in the source code of a large-
scale software system can lead to performance regressions. Such regressions can …
A review on software fault detection and prevention mechanism in software development activities
The need of distributed and complex commercial applications in enterprise demands error
free and quality application systems. This makes it extremely important in software …
Non-intrusive anomaly detection with streaming performance metrics and logs for DevOps in public clouds: a case study in AWS
Public clouds are a style of computing platforms, where scalable and elastic Information
Technology-enabled capabilities are provided as a service to external customers using …
Linking resource usage anomalies with system failures from cluster log data
Bursts of abnormally high use of resources are thought to be an indirect cause of failures in
large cluster systems, but little work has systematically investigated the role of high resource …