mooselab/Label-Free-Metric-for-Log-Parser-Evaluation

Label-Free Metric for Log Parser Evaluation

The replication package for "A Story About Cohesion and Separation: Label-Free Metric for Log Parser Evaluation".

Introduction

PMSS (Parser Medoid Silhouette Score) is a novel label-free, template-level log parsing metric. It evaluates both parser grouping and template quality through medoid silhouette analysis with Levenshtein distance, and generally runs in linear time. To highlight its relationship with label-based template-level metrics (i.e., FGA and FTA), we compared their evaluation outcomes on the standard corrected Loghub 2.0 dataset. According to the results, the log parsers with the highest PMSS differ from the optimal FGA by 2.1% on average (relative difference), and from the optimal FTA by 9.8%. PMSS is also significantly ($p < 10^{-8}$) and positively correlated with both FGA and FTA, with Spearman coefficients of 0.648 and 0.587, respectively. Our label-free metric provides a valuable evaluation alternative when ground-truths are inconsistent or no labeled data is available.
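To illustrate the underlying idea, the sketch below computes a simplified medoid-silhouette score over template clusters using Levenshtein distance. It is an illustrative approximation under assumed details, not the exact PMSS formula from the paper; the example clusters and the pure-Python edit distance (the repository depends on the editdistance package instead) are assumptions made to keep the snippet self-contained.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (one row at a time)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def medoid(cluster):
    """Member minimizing the summed distance to all other members."""
    return min(cluster, key=lambda x: sum(levenshtein(x, y) for y in cluster))

def medoid_silhouette(clusters):
    """Simplified medoid silhouette: for each message, a = distance to its
    own cluster's medoid, b = distance to the nearest other medoid,
    score = (b - a) / max(a, b). Requires at least two clusters."""
    medoids = [medoid(c) for c in clusters]
    scores = []
    for k, cluster in enumerate(clusters):
        for msg in cluster:
            a = levenshtein(msg, medoids[k])
            b = min(levenshtein(msg, m) for i, m in enumerate(medoids) if i != k)
            scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

clusters = [["error at node1", "error at node2"],
            ["user login ok", "user logout ok"]]
print(medoid_silhouette(clusters))
```

A score near 1 means messages sit close to their own cluster's medoid and far from every other medoid (good grouping and separation); scores near 0 or negative indicate poorly separated clusters.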

Dependencies

  • python>=3.8
  • chardet==5.1.0
  • ipython==8.12.0
  • matplotlib==3.7.2
  • natsort==8.4.0
  • numpy==1.24.4
  • pandas==2.0.3
  • regex==2022.3.2
  • scipy
  • tqdm==4.65.0
  • rpy2
  • editdistance

Dataset

The original Loghub 2.0 data is downloaded directly from Zenodo. After obtaining the original Loghub 2.0 dataset (ground-truth V1), store the files under the full_dataset folder and run template_correction.py to create the remaining four versions of corrected ground-truths (V2: standard corrected Loghub 2.0; V3: LogBatcher; V4: UNLEASH; V5: LUNAR).

RQ1: How do inconsistencies among ground-truth versions of log data influence the reliability of log parsing evaluation results?

Findings

As shown in the following table, the correction rules are implemented differently across versions. Moreover, although two rules (DV and CV) are applied in all versions, their implementation details differ.


We also investigated the template difference ratio at the template and message levels. Given that the correction rules are substantially extended in LUNAR, its ground-truth version differs considerably from the other four versions at both the template and message levels.


The min-max score differences among ground-truth versions for each parser on every dataset are shown in the following table. The corrections do not affect GA and FGA. However, parsers' PA and FTA can vary considerably across ground-truth versions, and parsers with better parsing performance (i.e., LogBatcher and LUNAR) are more sensitive to the changes.


We also found that the discrepancies in PA and FTA scores across versions can lead to different optimal tools on the same dataset. These optimal-tool inconsistencies, caused by score value shifts, make it difficult to compare parser effectiveness and select a parser.


Replicating the results

After obtaining all ground-truth versions, run template_differences.py to compare the template differences at the template and message levels. Then obtain the parsing results with one of the following options:

Option 1: The parsing results of the six log parsers (Drain, Preprocessed-Drain, LILAC, LibreLog, LogBatcher, and LUNAR) evaluated in our study are available on Zenodo. To directly evaluate these results, download the files and unzip them to the result folder. Afterward, change directory to benchmark and run run_all_full.sh.

Option 2: Run the parsers with their official code under default settings to parse Loghub 2.0. After obtaining the results, move them to the result folder and run run_all_full.sh under the benchmark directory.

!!! IMPORTANT !!! We included four LLM-based log parsers in this study. Due to the stochastic nature of LLMs, you may obtain different evaluation outcomes if the parsing results are generated with Option 2.

Run label_based_metrics_comparison.py to obtain the comparison results in RQ1.

RQ2: What is the relationship between PMSS and FGA-FTA?

Findings

The FGA-FTA scores obtained from the standard corrected Loghub 2.0 ground-truth and the PMSS of each parser are shown in the following table. PMSS selects the same optimal parser as FGA in 7 of the 14 studied datasets, and the log parsers achieving the highest PMSS or FGA differ by only 2.1% on average in terms of FGA. On the other hand, because PMSS is calculated with semantic-structural similarity instead of boolean exact matching, it selects the same optimal parser as FTA on only 3 datasets. The differences in evaluation strategies lead to some divergent patterns between PMSS, FGA, and FTA.


According to Spearman’s $\rho$ analyses, PMSS shows strong positive correlations with both FGA and FTA. The correlations between PMSS–FGA and PMSS–FTA are statistically significant ($p < 10^{-8}$), with coefficients of 0.648 and 0.587, respectively.
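The Spearman analysis amounts to ranking both score lists (averaging tied ranks) and taking the Pearson correlation of the ranks. The repository's correlation_analysis.py presumably relies on scipy.stats.spearmanr; the stdlib-only sketch below is an illustrative equivalent, and the sample score values are made up for demonstration.

```python
from statistics import mean

def rank(values):
    """Ascending ranks starting at 1, with ties assigned their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                      # extend over a run of tied values
        avg = (i + j) / 2 + 1           # average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = rank(x), rank(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical PMSS and FGA scores for four parsers on one dataset:
pmss = [0.71, 0.55, 0.83, 0.62]
fga = [0.88, 0.79, 0.95, 0.81]
print(spearman_rho(pmss, fga))
```

Because only ranks matter, the coefficient is insensitive to monotone rescaling of either metric, which is why it suits comparing metrics measured on different scales.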

Replicating the results

Run PMSS_evaluation.py to obtain PMSS of the six log parsers, then run correlation_analysis.py to conduct Spearman’s $\rho$ analyses between PMSS and FTA-FGA obtained on the standard corrected Loghub 2.0 ground-truth (V2).

RQ3: How efficient is the calculation of PMSS in comparison to FGA and FTA?

Findings

With the guidance of labels, FGA and FTA generally evaluate faster than PMSS on most datasets. In contrast, the computation time of PMSS depends on multiple factors (e.g., message length, the number of inferred variables, and the required preprocessing), but remains linear in most cases. The PMSS computation times for all six tools are available in PMSS_evaluation/full_time.csv.


Folder Structure

```
├── benchmark
│   ├── evaluation                # Configurations for the parsers
│   │   ├── logparser
│   │   ├── utils
│   │   ├── evaluation_only.py    # No parsing, only evaluation
│   │   └── __init__.py
│   └── run_all_full.sh           # Script for running label-based evaluations
├── figures
├── full_dataset                  # Store original Loghub 2.0 here; corrected ground-truths are also stored here
├── label_based_evaluations       # Parsers' GA, PA, FGA, and FTA results (for RQ1 and RQ2)
├── PMSS_evaluation               # Overall PMSS scores and template EMSS scores (under each tool folder) on all datasets
│   ├── Drain
│   ├── LibreLog
│   ├── LILAC
│   ├── LogBatcher
│   ├── LUNAR
│   ├── Preprocessed_Drain
│   ├── full_PMSS.csv
│   └── full_time.csv
├── result                        # Store the parsing results here
├── template_comparisons          # Template differences across the five versions
├── correlation_analysis.py       # Code for PMSS and FGA-FTA correlation analysis
├── label_based_metrics_comparison.py  # Code for label-based metric result comparison
├── plot_time_consumption.py      # Code for time consumption plotting
├── PMSS_evaluation.py            # Code for PMSS calculation
├── template_correction.py        # Code for ground-truth template correction
├── template_differences.py       # Code for ground-truth template comparison
└── README.md
```
