The replication package for "A Story About Cohesion and Separation: Label-Free Metric for Log Parser Evaluation".
PMSS (Parser Medoid Silhouette Score) is a novel label-free, template-level log parsing metric. It evaluates both parser grouping and template quality through medoid silhouette analysis with Levenshtein distance, and runs in linear time in the general case. To highlight its relationship with label-based template-level metrics (i.e., FGA and FTA), we compared their evaluation outcomes on the standard corrected Loghub 2.0 dataset. According to the results, the log parsers with the highest PMSS show a 2.1% average relative difference from the optimal FGA, and 9.8% from the optimal FTA. PMSS is also significantly correlated with both label-based metrics.
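To make the idea concrete, a medoid-silhouette-style score over clusters of parsed templates can be sketched as follows. This is a minimal illustration with a pure-Python Levenshtein distance; the actual PMSS implementation lives in PMSS_evaluation.py and may differ in details (tie-breaking, normalization, preprocessing):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def medoid(cluster):
    """Template minimizing total edit distance to the rest of its cluster."""
    return min(cluster, key=lambda t: sum(levenshtein(t, u) for u in cluster))

def medoid_silhouette(clusters):
    """Mean silhouette where cohesion/separation use distances to medoids only.

    Using medoids instead of all pairwise distances is what keeps the
    computation close to linear in the number of templates.
    """
    medoids = [medoid(c) for c in clusters]
    scores = []
    for k, cluster in enumerate(clusters):
        for t in cluster:
            a = levenshtein(t, medoids[k])              # cohesion: own medoid
            b = min(levenshtein(t, m)                   # separation: nearest other medoid
                    for j, m in enumerate(medoids) if j != k)
            scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / len(scores)

clusters = [["Connected to <*>", "Connected to <*>:<*>"],
            ["Failed password for <*>"]]
print(medoid_silhouette(clusters))  # in [-1, 1]; higher = tighter, better-separated clusters
```

Scores near 1 indicate templates that sit close to their own cluster's medoid and far from other clusters' medoids.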
- python>=3.8
- chardet==5.1.0
- ipython==8.12.0
- matplotlib==3.7.2
- natsort==8.4.0
- numpy==1.24.4
- pandas==2.0.3
- regex==2022.3.2
- scipy
- tqdm==4.65.0
- rpy2
- editdistance
The original Loghub 2.0 dataset can be downloaded directly from Zenodo. After obtaining it (ground-truth V1), store the files under the full_dataset folder and run template_correction.py to generate the remaining four corrected ground-truth versions (V2: standard corrected Loghub 2.0; V3: LogBatcher; V4: UNLEASH; V5: LUNAR).
RQ1: How do inconsistencies among ground-truth versions of log data influence the reliability of log parsing evaluation results?
As shown in the following table, the correction rules are implemented differently across versions. Moreover, although two rules (DV and CV) appear in all correction functions, their implementation details differ.
We also investigated the template difference ratio at the template and message levels. Because LUNAR's correction rules are substantially extended, its ground-truth version differs considerably from the other four at both the template and message levels.
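As a toy illustration of what a template correction rule can look like, the sketch below merges consecutive `<*>` placeholders into a single one. This specific rule is a hypothetical example chosen for illustration, not necessarily the paper's DV or CV rule; the real rules are implemented in template_correction.py:

```python
import re

def merge_consecutive_wildcards(template: str) -> str:
    """Collapse runs of adjacent <*> placeholders into one.

    Hypothetical example of a ground-truth correction rule; the actual
    rules (and their per-version variants) live in template_correction.py.
    """
    return re.sub(r"<\*>(\s*<\*>)+", "<*>", template)

print(merge_consecutive_wildcards("Deleted <*> <*> blocks"))  # Deleted <*> blocks
```

Small differences in how such rules are written (e.g., whether whitespace between placeholders is absorbed) are exactly what makes the versions diverge.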
The min-max score differences among ground-truth versions for each parser on every dataset are shown in the following table. The corrections do not affect GA and FGA. However, a parser's PA and FTA can vary substantially across ground-truth versions, and parsers with better parsing performance (i.e., LogBatcher and LUNAR) are more sensitive to the changes.
We also found that the discrepancy in PA and FTA scores across versions can lead to different optimal tools on the same dataset. These optimal-tool inconsistencies, caused by score shifts, make it difficult to compare parser effectiveness and select a parser.
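The min-max spread across versions can be reproduced along these lines. This is a sketch assuming a long-form table with parser/dataset/version/score columns and made-up numbers; the real column names and values come from the files under label_based_evaluations:

```python
import pandas as pd

# Hypothetical long-form table: one row per (parser, dataset, version) score.
scores = pd.DataFrame({
    "parser":  ["LogBatcher"] * 3 + ["Drain"] * 3,
    "dataset": ["HDFS"] * 6,
    "version": ["V1", "V2", "V5"] * 2,
    "FTA":     [0.80, 0.74, 0.62, 0.41, 0.40, 0.39],
})

# Min-max FTA spread across ground-truth versions for each parser/dataset pair.
spread = (scores.groupby(["parser", "dataset"])["FTA"]
                .agg(lambda s: s.max() - s.min())
                .rename("fta_spread"))
print(spread)
```

With these toy numbers the better-performing parser shows the larger spread, mirroring the sensitivity pattern reported above.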
After obtaining all ground-truth versions, run template_differences.py to compare the template differences at the template and message levels. Then, obtain the parsing results with one of the following options:
Option 1: The parsing results of the six log parsers (Drain, Preprocessed-Drain, LILAC, LibreLog, LogBatcher, and LUNAR) evaluated in our study are available on Zenodo. To evaluate these results directly, download the files and unzip them into the result folder. Then change to the benchmark directory and run run_all_full.sh.
Option 2: Run the parsers with their official code under default settings to parse Loghub 2.0. After obtaining the results, move them to the result folder and run run_all_full.sh under the benchmark directory.
!!! IMPORTANT !!! This study includes four LLM-based log parsers. Due to the stochastic nature of LLMs, you may obtain different evaluation outcomes if you generate the parsing results yourself via Option 2.
Run label_based_metrics_comparison.py to obtain the comparison results in RQ1.
The FGA and FTA obtained from the standard corrected Loghub 2.0 ground-truth and the PMSS of each parser are shown in the following table. PMSS selects the same optimal parser as FGA on 7 of the 14 studied datasets, and the parsers achieving the highest PMSS or FGA differ by only 2.1% on average in terms of FGA. On the other hand, because PMSS is computed from semantic-structural similarity rather than boolean template identity, it selects the same optimal parser as FTA on only 3 datasets. These differences in evaluation strategy lead to some divergent patterns between PMSS, FGA, and FTA.
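The "average relative difference" comparison boils down to a calculation like the following. The per-parser scores here are made up for illustration; the real per-dataset values are in label_based_evaluations and PMSS_evaluation:

```python
# Hypothetical scores of three parsers on a single dataset.
fga  = {"Drain": 0.90, "LILAC": 0.95, "LUNAR": 0.93}   # label-based FGA
pmss = {"Drain": 0.61, "LILAC": 0.66, "LUNAR": 0.70}   # label-free PMSS

best_by_fga  = max(fga,  key=fga.get)    # parser an FGA user would pick
best_by_pmss = max(pmss, key=pmss.get)   # parser a PMSS user would pick

# Relative FGA loss from selecting the parser via PMSS instead of FGA.
rel_diff = (fga[best_by_fga] - fga[best_by_pmss]) / fga[best_by_fga]
print(best_by_pmss, rel_diff)
```

Averaging this relative loss over all datasets yields the 2.1% figure reported for FGA (and analogously 9.8% for FTA).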
According to Spearman’s rank correlation analysis, PMSS is significantly correlated with both FGA and FTA.
Run PMSS_evaluation.py to obtain the PMSS of the six log parsers, then run correlation_analysis.py to conduct Spearman’s rank correlation analysis between PMSS and the label-based metrics.
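The correlation step itself reduces to a call like the one below (toy score vectors for illustration; correlation_analysis.py operates on the real per-dataset results):

```python
from scipy.stats import spearmanr

# Toy example: PMSS and FGA scores of the same five parsers on one dataset.
pmss = [0.62, 0.71, 0.55, 0.80, 0.68]
fga  = [0.81, 0.88, 0.74, 0.93, 0.85]

# Spearman's rho compares rank orders, so monotone agreement is enough.
rho, p_value = spearmanr(pmss, fga)
print(rho, p_value)  # identical rank order here, so rho == 1.0
```

Because Spearman's rho is rank-based, PMSS does not need to match the label-based scores in magnitude, only in how it orders the parsers.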
With the guidance of labels, FGA and FTA generally evaluate faster than PMSS on most datasets. In contrast, the computation time of PMSS depends on multiple factors (e.g., message length, the number of inferred variables, and data preprocessing), but remains linear in most cases. The PMSS computation time for all six tools is available in PMSS_evaluation/full_time.csv.
├── benchmark
│   ├── evaluation                      # Configurations for the parsers
│   ├── logparser
│   │   ├── utils
│   │   ├── evaluation_only.py          # No parsing, only evaluation
│   │   └── __init__.py
│   └── run_all_full.sh                 # Script for running label-based evaluations
├── figures
├── full_dataset                        # Store original Loghub 2.0 here; corrected ground-truths will also be stored here
├── label_based_evaluations             # Parsers' GA, PA, FGA, and FTA results (for RQ1 and RQ2)
├── PMSS_evaluation                     # Overall PMSS scores and template EMSS scores (under each tool folder) on all datasets
│   ├── Drain
│   ├── LibreLog
│   ├── LILAC
│   ├── LogBatcher
│   ├── LUNAR
│   ├── Preprocessed_Drain
│   ├── full_PMSS.csv
│   └── full_time.csv
├── result                              # Store the parsing results here
├── template_comparisons                # Template differences across the five versions
├── correlation_analysis.py             # Code for PMSS and FGA/FTA correlation analysis
├── label_based_metrics_comparison.py   # Code for label-based metric result comparison
├── plot_time_consumption.py            # Code for time-consumption plotting
├── PMSS_evaluation.py                  # Code for PMSS calculation
├── template_correction.py              # Code for ground-truth template correction
├── template_differences.py             # Code for ground-truth template comparison
└── README.md