An evaluation study on log parsing and its use in log mining

P He, J Zhu, S He, J Li, MR Lyu - 2016 46th annual IEEE/IFIP …, 2016 - ieeexplore.ieee.org
Logs, which record runtime information of modern systems, are widely utilized by developers
(and operators) in system development and maintenance. Due to the ever-increasing size of …

Towards automated log parsing for large-scale log data analysis

P He, J Zhu, S He, J Li, MR Lyu - IEEE Transactions on …, 2017 - ieeexplore.ieee.org
Logs are widely used in system management for dependability assurance because they are
often the only data available that record detailed system runtime behaviors in production …

Landscape of automated log analysis: A systematic literature review and mapping study

Ł Korzeniowski, K Goczyła - IEEE Access, 2022 - ieeexplore.ieee.org
Logging is a common practice in software engineering to provide insights into working
systems. The main uses of log files have always been failure identification and root cause …

Log-based software monitoring: a systematic mapping study

J Cândido, M Aniche, A Van Deursen - PeerJ Computer Science, 2021 - peerj.com
Modern software development and operations rely on monitoring to understand how
systems behave in production. The data provided by application logs and runtime …

Lessons learned from the analysis of system failures at petascale: The case of Blue Waters

C Di Martino, Z Kalbarczyk, RK Iyer… - 2014 44th Annual …, 2014 - ieeexplore.ieee.org
This paper provides an analysis of failures and their impact for Blue Waters, the Cray hybrid
(CPU/GPU) supercomputer at the University of Illinois at Urbana-Champaign. The analysis …

Desh: deep learning for system health prediction of lead times to failure in HPC

A Das, F Mueller, C Siegel, A Vishnu - Proceedings of the 27th …, 2018 - dl.acm.org
Today's large-scale supercomputers encounter faults on a daily basis. Exascale systems are
likely to experience even higher fault rates due to increased component count and density …

Measuring and understanding extreme-scale application resilience: A field study of 5,000,000 HPC application runs

C Di Martino, W Kramer, Z Kalbarczyk… - 2015 45th Annual IEEE …, 2015 - ieeexplore.ieee.org
This paper presents an in-depth characterization of the resiliency of more than 5 million HPC
application runs completed during the first 518 production days of Blue Waters, a 13.1 …

Failure prediction for HPC systems and applications: Current situation and open issues

A Gainaru, F Cappello, M Snir… - … International journal of …, 2013 - journals.sagepub.com
As large-scale systems evolve towards post-petascale computing, it is crucial to focus on
providing fault-tolerance strategies that aim to minimize fault's effects on applications. By far …

Big data meets HPC log analytics: Scalable approach to understanding systems at extreme scale

BH Park, S Hukerikar, R Adamson… - … on Cluster Computing …, 2017 - ieeexplore.ieee.org
Today's high-performance computing (HPC) systems are heavily instrumented, generating
logs containing information about abnormal events, such as critical conditions, faults, errors …

LogDiver: A tool for measuring resilience of extreme-scale systems and applications

C Di Martino, S Jha, W Kramer, Z Kalbarczyk… - Proceedings of the 5th …, 2015 - dl.acm.org
This paper presents LogDiver, a tool for the analysis of application-level resiliency in
extreme-scale computing systems. The tool has been implemented to handle data …