The replication package for "A Story About Cohesion and Separation: Label-Free Metric for Log Parser Evaluation".
PMSS (Parser Medoid Silhouette Score) is a novel label-free, template-level log parsing metric. It evaluates both parser grouping and template quality through medoid silhouette analysis with Levenshtein distance, and runs in linear time in the general case. To highlight its relationship with label-based template-level metrics (i.e., FGA and FTA), we compared their evaluation outcomes on the standard corrected Loghub 2.0 dataset. According to the results, the log parsers with the highest PMSS show a 2.1% average relative difference from the optimal FGA, and 9.8% from the optimal FTA. PMSS is also significantly correlated with both label-based metrics.
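To make the idea concrete, a medoid-silhouette-style score over clusters of parsed templates can be sketched as follows. This is a minimal illustration with a pure-Python Levenshtein distance; the actual PMSS implementation lives in PMSS_evaluation.py and may differ in details (tie-breaking, normalization, preprocessing):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def medoid(cluster):
    """Template minimizing total edit distance to the rest of its cluster."""
    return min(cluster, key=lambda t: sum(levenshtein(t, u) for u in cluster))

def medoid_silhouette(clusters):
    """Mean silhouette where cohesion/separation use distances to medoids only.

    Using medoids instead of all pairwise distances is what keeps the
    computation close to linear in the number of templates.
    """
    medoids = [medoid(c) for c in clusters]
    scores = []
    for k, cluster in enumerate(clusters):
        for t in cluster:
            a = levenshtein(t, medoids[k])              # cohesion: own medoid
            b = min(levenshtein(t, m)                   # separation: nearest other medoid
                    for j, m in enumerate(medoids) if j != k)
            scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / len(scores)

clusters = [["Connected to <*>", "Connected to <*>:<*>"],
            ["Failed password for <*>"]]
print(medoid_silhouette(clusters))  # in [-1, 1]; higher = tighter, better-separated clusters
```

Scores near 1 indicate templates that sit close to their own cluster's medoid and far from other clusters' medoids.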
- python>=3.8
- chardet==5.1.0
- ipython==8.12.0
- matplotlib==3.7.2
- natsort==8.4.0
- numpy==1.24.4
- pandas==2.0.3
- regex==2022.3.2
- scipy
- tqdm==4.65.0
- rpy2
- editdistance
The original Loghub 2.0 dataset can be downloaded directly from Zenodo. After obtaining it (ground-truth V1), store the files under the full_dataset folder and run template_correction.py to generate the remaining four corrected ground-truth versions (V2: standard corrected Loghub 2.0; V3: LogBatcher; V4: UNLEASH; V5: LUNAR).
RQ1: How do inconsistencies among ground-truth versions of log data influence the reliability of log parsing evaluation results?
As shown in the following table, the correction rules are implemented differently across versions. Moreover, although two rules (DV and CV) appear in all correction functions, their implementation details differ.
We also investigated the template difference ratio at the template and message levels. Because LUNAR's correction rules are substantially extended, its ground-truth version differs considerably from the other four at both the template and message levels.
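As a toy illustration of what a template correction rule can look like, the sketch below merges consecutive `<*>` placeholders into a single one. This specific rule is a hypothetical example chosen for illustration, not necessarily the paper's DV or CV rule; the real rules are implemented in template_correction.py:

```python
import re

def merge_consecutive_wildcards(template: str) -> str:
    """Collapse runs of adjacent <*> placeholders into one.

    Hypothetical example of a ground-truth correction rule; the actual
    rules (and their per-version variants) live in template_correction.py.
    """
    return re.sub(r"<\*>(\s*<\*>)+", "<*>", template)

print(merge_consecutive_wildcards("Deleted <*> <*> blocks"))  # Deleted <*> blocks
```

Small differences in how such rules are written (e.g., whether whitespace between placeholders is absorbed) are exactly what makes the versions diverge.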
The min-max score differences among ground-truth versions for each parser on every dataset are shown in the following table. The corrections do not affect GA and FGA. However, a parser's PA and FTA can vary substantially across ground-truth versions, and parsers with better parsing performance (i.e., LogBatcher and LUNAR) are more sensitive to the changes.
We also found that the discrepancy in PA and FTA scores across versions can lead to different optimal tools on the same dataset. These optimal-tool inconsistencies, caused by score shifts, make it difficult to compare parser effectiveness and select a parser.
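The min-max spread across versions can be reproduced along these lines. This is a sketch assuming a long-form table with parser/dataset/version/score columns and made-up numbers; the real column names and values come from the files under label_based_evaluations:

```python
import pandas as pd

# Hypothetical long-form table: one row per (parser, dataset, version) score.
scores = pd.DataFrame({
    "parser":  ["LogBatcher"] * 3 + ["Drain"] * 3,
    "dataset": ["HDFS"] * 6,
    "version": ["V1", "V2", "V5"] * 2,
    "FTA":     [0.80, 0.74, 0.62, 0.41, 0.40, 0.39],
})

# Min-max FTA spread across ground-truth versions for each parser/dataset pair.
spread = (scores.groupby(["parser", "dataset"])["FTA"]
                .agg(lambda s: s.max() - s.min())
                .rename("fta_spread"))
print(spread)
```

With these toy numbers the better-performing parser shows the larger spread, mirroring the sensitivity pattern reported above.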
After obtaining all ground-truth versions, run template_differences.py to compare the template differences at the template and message levels. Then, obtain the parsing results with one of the following options:
Option 1: The parsing results of the six log parsers (Drain, Preprocessed-Drain, LILAC, LibreLog, LogBatcher, and LUNAR) evaluated in our study are available on Zenodo. To evaluate these results directly, download the files and unzip them into the result folder. Then change to the benchmark directory and run run_all_full.sh.
Option 2: Run the parsers with their official code under default settings to parse Loghub 2.0. After obtaining the results, move them to the result folder and run run_all_full.sh under the benchmark directory.
!!! IMPORTANT !!! This study includes four LLM-based log parsers. Due to the stochastic nature of LLMs, you may obtain different evaluation outcomes if you generate the parsing results yourself via Option 2.
Run label_based_metrics_comparison.py to obtain the comparison results in RQ1.
The FGA and FTA obtained from the standard corrected Loghub 2.0 ground-truth and the PMSS of each parser are shown in the following table. PMSS selects the same optimal parser as FGA on 7 of the 14 studied datasets, and the parsers achieving the highest PMSS or FGA differ by only 2.1% on average in terms of FGA. On the other hand, because PMSS is computed from semantic-structural similarity rather than boolean template identity, it selects the same optimal parser as FTA on only 3 datasets. These differences in evaluation strategy lead to some divergent patterns between PMSS, FGA, and FTA.
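The "average relative difference" comparison boils down to a calculation like the following. The per-parser scores here are made up for illustration; the real per-dataset values are in label_based_evaluations and PMSS_evaluation:

```python
# Hypothetical scores of three parsers on a single dataset.
fga  = {"Drain": 0.90, "LILAC": 0.95, "LUNAR": 0.93}   # label-based FGA
pmss = {"Drain": 0.61, "LILAC": 0.66, "LUNAR": 0.70}   # label-free PMSS

best_by_fga  = max(fga,  key=fga.get)    # parser an FGA user would pick
best_by_pmss = max(pmss, key=pmss.get)   # parser a PMSS user would pick

# Relative FGA loss from selecting the parser via PMSS instead of FGA.
rel_diff = (fga[best_by_fga] - fga[best_by_pmss]) / fga[best_by_fga]
print(best_by_pmss, rel_diff)
```

Averaging this relative loss over all datasets yields the 2.1% figure reported for FGA (and analogously 9.8% for FTA).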
According to Spearman’s rank correlation analysis, PMSS is significantly correlated with both FGA and FTA.
Run PMSS_evaluation.py to obtain the PMSS of the six log parsers, then run correlation_analysis.py to conduct Spearman’s rank correlation analysis between PMSS and the label-based metrics.
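The correlation step itself reduces to a call like the one below (toy score vectors for illustration; correlation_analysis.py operates on the real per-dataset results):

```python
from scipy.stats import spearmanr

# Toy example: PMSS and FGA scores of the same five parsers on one dataset.
pmss = [0.62, 0.71, 0.55, 0.80, 0.68]
fga  = [0.81, 0.88, 0.74, 0.93, 0.85]

# Spearman's rho compares rank orders, so monotone agreement is enough.
rho, p_value = spearmanr(pmss, fga)
print(rho, p_value)  # identical rank order here, so rho == 1.0
```

Because Spearman's rho is rank-based, PMSS does not need to match the label-based scores in magnitude, only in how it orders the parsers.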
With the guidance of labels, FGA and FTA generally evaluate faster than PMSS on most datasets. In contrast, the computation time of PMSS depends on multiple factors (e.g., message length, the number of inferred variables, and data preprocessing), but remains linear in most cases. The PMSS computation time for all six tools is available in PMSS_evaluation/full_time.csv.
├── benchmark
│   ├── evaluation                      # Configurations for the parsers
│   ├── logparser
│   │   ├── utils
│   │   ├── evaluation_only.py          # No parsing, only evaluation
│   │   └── __init__.py
│   └── run_all_full.sh                 # Script for running label-based evaluations
├── figures
├── full_dataset                        # Store original Loghub 2.0 here; corrected ground-truths will also be stored here
├── label_based_evaluations             # Parsers' GA, PA, FGA, and FTA results (for RQ1 and RQ2)
├── PMSS_evaluation                     # Overall PMSS scores and template EMSS scores (under each tool folder) on all datasets
│   ├── Drain
│   ├── LibreLog
│   ├── LILAC
│   ├── LogBatcher
│   ├── LUNAR
│   ├── Preprocessed_Drain
│   ├── full_PMSS.csv
│   └── full_time.csv
├── result                              # Store the parsing results here
├── template_comparisons                # Template differences across the five versions
├── correlation_analysis.py             # Code for PMSS and FGA/FTA correlation analysis
├── label_based_metrics_comparison.py   # Code for label-based metric result comparison
├── plot_time_consumption.py            # Code for time-consumption plotting
├── PMSS_evaluation.py                  # Code for PMSS calculation
├── template_correction.py              # Code for ground-truth template correction
├── template_differences.py             # Code for ground-truth template comparison
└── README.md