An evaluation study on log parsing and its use in log mining
Logs, which record runtime information of modern systems, are widely utilized by developers
(and operators) in system development and maintenance. Due to the ever-increasing size of …
Towards automated log parsing for large-scale log data analysis
Logs are widely used in system management for dependability assurance because they are
often the only data available that record detailed system runtime behaviors in production …
Landscape of automated log analysis: A systematic literature review and mapping study
Ł Korzeniowski, K Goczyła - IEEE Access, 2022 - ieeexplore.ieee.org
Logging is a common practice in software engineering to provide insights into working
systems. The main uses of log files have always been failure identification and root cause …
Log-based software monitoring: a systematic mapping study
Modern software development and operations rely on monitoring to understand how
systems behave in production. The data provided by application logs and runtime …
Lessons learned from the analysis of system failures at petascale: The case of Blue Waters
This paper provides an analysis of failures and their impact for Blue Waters, the Cray hybrid
(CPU/GPU) supercomputer at the University of Illinois at Urbana-Champaign. The analysis …
Desh: deep learning for system health prediction of lead times to failure in HPC
Today's large-scale supercomputers encounter faults on a daily basis. Exascale systems are
likely to experience even higher fault rates due to increased component count and density …
Measuring and understanding extreme-scale application resilience: A field study of 5,000,000 HPC application runs
This paper presents an in-depth characterization of the resiliency of more than 5 million HPC
application runs completed during the first 518 production days of Blue Waters, a 13.1 …
Failure prediction for HPC systems and applications: Current situation and open issues
As large-scale systems evolve towards post-petascale computing, it is crucial to focus on
providing fault-tolerance strategies that aim to minimize the effects of faults on applications. By far …
Big data meets HPC log analytics: Scalable approach to understanding systems at extreme scale
Today's high-performance computing (HPC) systems are heavily instrumented, generating
logs containing information about abnormal events, such as critical conditions, faults, errors …
LogDiver: A tool for measuring resilience of extreme-scale systems and applications
This paper presents LogDiver, a tool for the analysis of application-level resiliency in
extreme-scale computing systems. The tool has been implemented to handle data …