A lightweight and practical framework for improving the compression efficiency of structured network logs by constructing compression-friendly field sequences.
- Analyzes column entropy to determine optimal ordering
- Supports multiple compression algorithms (gzip, bz2, lzma, etc.)
- Parallel processing for efficient computation
- Sampling mechanism for large datasets
- Configurable parameters for different optimization strategies
- Clone this repository
- Ensure you have Python 3.11 installed
- Install the required dependencies: pip install pandas termcolor
- For denum compression verification, download and compile the denum code from: https://github.com/gaiusyu/Denum
python CFLseq.py -s <source_dir> -d <dest_dir> [options]
- -s/--src_dir: Directory containing input CSV files
- -d/--dst_dir: Output directory for reordered files
- -split/--sepChar: CSV delimiter character (default: ',')
- -ct/--kernel: Compression type (gzip, bz2, denum_bz2, denum_7z, denum_gzip, lzma) (default: 'gzip')
- -num_samples/--num_samples: Number of samples per file (default: 15)
- -sample_size/--size_of_each_sample: Size of each sample (default: 1000)
- -field_num/--searchNumFields: Number of fields for brute-force search (default: 6)
- -seed/--random_seed: Random seed for reproducibility (default: 0)
- -c/--cpu_num: Number of CPUs to use (default: 30)
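As a usage illustration, the sketch below assembles a CFLseq.py command line from Python using only the flags documented above. The helper name build_cflseq_command and the paths ./logs and ./reordered are hypothetical, chosen for the example.

```python
import shlex


def build_cflseq_command(src_dir, dst_dir, kernel="gzip", field_num=6, seed=0, cpu_num=30):
    """Assemble the argument list for CFLseq.py (hypothetical helper, not part of the tool)."""
    return [
        "python", "CFLseq.py",
        "-s", str(src_dir),             # directory containing input CSV files
        "-d", str(dst_dir),             # output directory for reordered files
        "-ct", kernel,                  # compression type
        "-field_num", str(field_num),   # number of fields for brute-force search
        "-seed", str(seed),             # random seed
        "-c", str(cpu_num),             # number of CPUs
    ]


# Example: bz2 compression, a 5-field search, and 8 CPUs (paths are placeholders).
cmd = build_cflseq_command("./logs", "./reordered", kernel="bz2", field_num=5, cpu_num=8)
print(shlex.join(cmd))

# To launch the tool, pass the list to subprocess.run(cmd, check=True).
```

Building the command as a list avoids shell quoting problems when directory names contain spaces.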
- Sampling: Randomly samples input files and rows to create a representative sample file
- Entropy Analysis: Calculates entropy for each column to identify low-information columns
- Friend Finding: Identifies columns that have low joint entropy (optional)
- Brute-force Search: Tests permutations of column orderings to find the most compressible sequence
- Reordering: Applies the optimal ordering to all input files
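The entropy analysis and brute-force search steps above can be illustrated with a minimal sketch. This is not the code in CFLseq.py: the function names, the in-memory toy data, and the choice to permute every column and measure only gzip output size are assumptions made for the example.

```python
import gzip
import itertools
import math
from collections import Counter


def column_entropy(values):
    """Shannon entropy, in bits, of the empirical value distribution of one column."""
    counts = Counter(values)
    total = len(values)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def compressed_size(rows, order, sep=","):
    """Bytes needed to gzip the rows after permuting their columns by `order`."""
    text = "\n".join(sep.join(row[i] for i in order) for row in rows)
    return len(gzip.compress(text.encode("utf-8")))


def best_order(rows):
    """Exhaustively test every column permutation and keep the smallest result."""
    num_cols = len(rows[0])
    return min(
        itertools.permutations(range(num_cols)),
        key=lambda order: compressed_size(rows, order),
    )


# Toy sample: source address, protocol, port, status code.
rows = [
    [f"10.0.0.{i % 4}", "tcp" if i % 2 else "udp", str(1000 + i), "200"]
    for i in range(200)
]

# Low-entropy columns carry little information per row.
for col in range(len(rows[0])):
    print(col, round(column_entropy([row[col] for row in rows]), 3))

order = best_order(rows)
size = compressed_size(rows, order)
print("best order:", order, "compressed bytes:", size)
```

Because the search is exhaustive, the selected ordering is never larger than the original ordering on the sample; whether the gain carries over to the full files depends on how representative the sample is.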
The tool creates a directory structure in the output directory organized by:
- Random seed used
- Compression algorithm
- Reordered CSV files with optimal column ordering
- The brute-force search can be computationally expensive for large numbers of columns
- Adjust searchNumFields to control the scope of the search
- Use more CPUs (cpu_num) to speed up processing
- Sampling parameters can be adjusted to balance accuracy and performance
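To see why the number of searched fields matters, the sketch below counts the orderings an exhaustive search over n fields would have to test, and how many each worker would handle if the work were split evenly. Both the exhaustive n! search and the even split are assumptions for illustration; the tool's actual scheduling may differ.

```python
import math


def permutations_to_test(num_fields):
    """Number of distinct orderings of `num_fields` columns (n factorial)."""
    return math.factorial(num_fields)


def permutations_per_cpu(num_fields, cpu_num):
    """Upper bound on orderings per worker, assuming an even split across CPUs."""
    return math.ceil(permutations_to_test(num_fields) / cpu_num)


for n in (4, 6, 8, 10):
    print(n, permutations_to_test(n), permutations_per_cpu(n, cpu_num=30))
# 4 24 1
# 6 720 24
# 8 40320 1344
# 10 3628800 120960
```

With the default of 6 fields there are 720 orderings, but each additional field multiplies the count, so adding CPUs only offsets a small increase in the number of fields.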