CFseq

A lightweight and practical framework for improving the compression efficiency of structured network logs by constructing compression-friendly field sequences.

Features

Analyzes column entropy to determine optimal ordering
Supports multiple compression algorithms (gzip, bz2, lzma, etc.)
Parallel processing for efficient computation
Sampling mechanism for large datasets
Configurable parameters for different optimization strategies

Installation

Clone this repository
Ensure you have Python 3.11 installed
Install the required dependencies: pip install pandas termcolor
For Denum compression verification, download and compile Denum from: https://github.com/gaiusyu/Denum

Usage

python CFLseq.py -s <source_dir> -d <dest_dir> [options]

Required Arguments

-s/--src_dir: Directory containing input CSV files
-d/--dst_dir: Output directory for reordered files

Optional Arguments

-split/--sepChar: CSV delimiter character (default: ',')
-ct/--kernel: Compression type (gzip, bz2, denum_bz2, denum_7z, denum_gzip, lzma) (default: 'gzip')
-num_samples/--num_samples: Number of samples per file (default: 15)
-sample_size/--size_of_each_sample: Size of each sample (default: 1000)
-field_num/--searchNumFields: Number of fields for brute-force search (default: 6)
-seed/--random_seed: Random seed for reproducibility (default: 0)
-c/--cpu_num: Number of CPUs to use (default: 30)
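
The options above can be mirrored in a small argparse parser. This is a hypothetical reconstruction built only from the flags and defaults listed in this README, not the actual parser in CFLseq.py; the function name build_parser and the help strings are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented CFLseq.py options."""
    parser = argparse.ArgumentParser(prog="CFLseq.py")
    # Required arguments
    parser.add_argument("-s", "--src_dir", required=True,
                        help="Directory containing input CSV files")
    parser.add_argument("-d", "--dst_dir", required=True,
                        help="Output directory for reordered files")
    # Optional arguments, with the defaults stated in the README
    parser.add_argument("-split", "--sepChar", default=",",
                        help="CSV delimiter character")
    parser.add_argument("-ct", "--kernel", default="gzip",
                        help="Compression type, e.g. gzip, bz2, lzma, denum_bz2")
    parser.add_argument("-num_samples", "--num_samples", type=int, default=15)
    parser.add_argument("-sample_size", "--size_of_each_sample", type=int, default=1000)
    parser.add_argument("-field_num", "--searchNumFields", type=int, default=6)
    parser.add_argument("-seed", "--random_seed", type=int, default=0)
    parser.add_argument("-c", "--cpu_num", type=int, default=30)
    return parser


# Example: bz2 compression on 4 CPUs, everything else left at its default.
args = build_parser().parse_args(["-s", "logs", "-d", "out", "-ct", "bz2", "-c", "4"])
print(args.src_dir, args.dst_dir, args.kernel, args.cpu_num, args.searchNumFields)
```

The directory names logs and out are placeholders. Note that the default of 30 CPUs may exceed what a typical workstation has, so passing -c explicitly is usually sensible.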

Algorithm

Sampling: Randomly samples input files and rows to create a representative sample file.
Entropy Analysis: Calculates the entropy of each column to identify low-information columns.
Friend Finding (optional): Identifies columns that have low joint entropy with one another.
Brute-force Search: Tests permutations of column orderings to find the most compressible sequence.
Reordering: Applies the optimal ordering to all input files.
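
The steps above can be sketched on a toy table. This is a minimal illustration under stated assumptions, not the code in CFLseq.py: the function names, the toy data, the use of Shannon entropy in bits, and the exhaustive search over all columns with gzip are all choices made for this example.

```python
import gzip
import itertools
import math
import random
from collections import Counter


def column_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of one column."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def joint_entropy(col_a, col_b):
    """Entropy of two columns taken together, i.e. of their value pairs."""
    return column_entropy(list(zip(col_a, col_b)))


def compressed_size(rows, order, sep=","):
    """Size in bytes of the gzip-compressed rows, with columns written in `order`."""
    text = "\n".join(sep.join(row[i] for i in order) for row in rows)
    return len(gzip.compress(text.encode("utf-8")))


def best_order(rows):
    """Try every column permutation and keep the one that compresses smallest."""
    columns = range(len(rows[0]))
    return min(itertools.permutations(columns),
               key=lambda order: compressed_size(rows, order))


# Toy log table (assumed data): counter-like timestamp, protocol, port, status.
rng = random.Random(0)
rows = [
    [str(1700000000 + i), rng.choice(["tcp", "udp"]), rng.choice(["80", "443", "53"]), "OK"]
    for i in range(2000)
]
sample = rng.sample(rows, 500)  # row sampling, analogous to the Sampling step

entropies = [column_entropy([row[i] for row in sample]) for i in range(4)]
order = best_order(sample)
print("entropy per column:", [round(h, 2) for h in entropies])
print("best order:", order, "size:", compressed_size(sample, order))
print("original size:", compressed_size(sample, (0, 1, 2, 3)))
```

Because the search includes the original ordering, the selected ordering can never compress worse than the original on the sample. Whether that gain carries over to the full files depends on how representative the sample is.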

Output

The tool creates a directory structure in the output directory, organized by:

Random seed used
Compression algorithm

Within that structure, it writes the reordered CSV files with the optimal column ordering.

Performance Notes

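The brute-force search can be computationally expensive for large numbers of columns, because the number of permutations grows factorially with the number of fields searched
Adjust searchNumFields (-field_num) to control the scope of the search
Use more CPUs (cpu_num) to speed up processing
Sampling parameters (num_samples, size_of_each_sample) can be adjusted to balance accuracy and performance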
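
To make the cost of the search concrete, the snippet below counts the orderings of k fields. It assumes the search is exhaustive over all permutations of the searchNumFields fields, which this README suggests but does not state outright; the function name search_space is hypothetical.

```python
import math


def search_space(num_fields):
    """Number of distinct orderings of `num_fields` columns."""
    return math.factorial(num_fields)


for k in (4, 6, 8, 10):
    print(k, search_space(k))
# 4 24
# 6 720
# 8 40320
# 10 3628800
```

Under that assumption, the default of 6 fields means 720 candidate orderings, each requiring one compression of the sample, while 10 fields would require more than 3.6 million.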
