Note: This repository was archived by the owner on May 3, 2026, and is now read-only.

CFseq

A lightweight and practical framework for improving the compression efficiency of structured network logs by constructing compression-friendly field sequences.

Features

  • Analyzes column entropy to determine optimal ordering
  • Supports multiple compression algorithms (gzip, bz2, lzma, etc.)
  • Parallel processing for efficient computation
  • Sampling mechanism for large datasets
  • Configurable parameters for different optimization strategies
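The sampling mechanism can be sketched roughly as follows. This is an illustrative sketch, not CFseq's actual code: `draw_samples` and the synthetic `rows` are hypothetical names, with `num_samples`, `sample_size`, and `seed` mirroring the CLI parameters described under Usage below.

```python
import random

# Hypothetical sketch: draw a fixed number of row samples from a large
# log so that column analysis runs on a small, representative subset.
def draw_samples(rows, num_samples=15, sample_size=1000, seed=0):
    rng = random.Random(seed)  # seeded for reproducibility
    samples = []
    for _ in range(num_samples):
        k = min(sample_size, len(rows))
        samples.append(rng.sample(rows, k))
    return samples

# Synthetic structured-log rows for illustration.
rows = [f"host{i % 7},GET,/index,{200 + i % 3}" for i in range(10_000)]
samples = draw_samples(rows)
print(len(samples), len(samples[0]))  # 15 1000
```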

Installation

  1. Clone this repository
  2. Ensure you have Python 3.11 installed
  3. Install the required dependencies: pip install pandas termcolor
  4. For denum compression verification, download and compile the denum code from: https://github.com/gaiusyu/Denum

Usage

python CFLseq.py -s <source_dir> -d <dest_dir> [options]

Required Arguments

  • -s/--src_dir: Directory containing input CSV files
  • -d/--dst_dir: Output directory for reordered files

Optional Arguments

  • -split/--sepChar: CSV delimiter character (default: ',')
  • -ct/--kernel: Compression type (gzip, bz2, denum_bz2, denum_7z, denum_gzip, lzma) (default: 'gzip')
  • -num_samples/--num_samples: Number of samples per file (default: 15)
  • -sample_size/--size_of_each_sample: Size of each sample (default: 1000)
  • -field_num/--searchNumFields: Number of fields for brute-force search (default: 6)
  • -seed/--random_seed: Random seed for reproducibility (default: 0)
  • -c/--cpu_num: Number of CPUs to use (default: 30)

Algorithm

  1. Sampling: Randomly samples input files and rows to create a representative sample file.
  2. Entropy Analysis: Calculates entropy for each column to identify low-information columns.
  3. Friend Finding: Identifies columns that have low joint entropy (optional).
  4. Brute-force Search: Tests permutations of column orderings to find the most compressible sequence.
  5. Reordering: Applies the optimal ordering to all input files.
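Steps 2 and 4 can be sketched as below, assuming comma-joined text rows and gzip as the compression kernel. `column_entropy` and `compressed_size` are hypothetical helpers written for this example, not CFseq's actual implementation.

```python
import gzip
import itertools
import math
from collections import Counter

def column_entropy(values):
    """Shannon entropy (bits) of a column's value distribution."""
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def compressed_size(rows, order):
    """gzip-compressed size of the rows with columns permuted by `order`."""
    text = "\n".join(",".join(row[i] for i in order) for row in rows)
    return len(gzip.compress(text.encode()))

# Synthetic sample: a unique-id column, an alternating column, a constant column.
rows = [(f"id{i}", "GET" if i % 2 else "POST", "200") for i in range(500)]
n_cols = len(rows[0])

# Step 2: per-column entropy highlights low-information columns.
entropies = [column_entropy([r[i] for r in rows]) for i in range(n_cols)]
print("entropies:", [round(e, 2) for e in entropies])

# Step 4: test every permutation and keep the one that compresses best.
best = min(itertools.permutations(range(n_cols)),
           key=lambda order: compressed_size(rows, order))
print("best order:", best)
```

In the real tool the permutation search runs only over the `searchNumFields` selected fields, which keeps the factorial search space manageable.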

Output

The tool writes results into a directory structure under the output directory, organized by:

  • Random seed used
  • Compression algorithm

Each resulting directory contains the reordered CSV files with the optimal column ordering applied.

Performance Notes

  • The brute-force search can be computationally expensive for large numbers of columns
  • Adjust searchNumFields to control the scope of the search
  • Use more CPUs (cpu_num) to speed up processing
  • Sampling parameters can be adjusted to balance accuracy and performance
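The brute-force step tests one candidate per ordering of the searched fields, so its cost grows factorially with searchNumFields: the default of 6 means 6! = 720 orderings, while 10 fields would already mean over 3.6 million. A quick illustration:

```python
import math

# Number of permutations the brute-force step must test for k searched fields.
for k in (4, 6, 8, 10):
    print(f"{k} fields -> {math.factorial(k):>9,} orderings")
```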
