Automatically identify and classify pages by their template type using HTML structure analysis and machine learning clustering.
- Analyzes HTML structure (tags, classes, IDs, meta tags)
- Uses TF-IDF vectorization for feature extraction
- Applies K-Means clustering to identify template patterns
- Identifies common structural patterns per cluster
- Exports results with page type classifications
- Identify different page templates on a website (PDP, PLP, blog, etc.)
- Audit template usage across large sites
- Find pages with unusual/broken templates
- Group pages for template-specific SEO recommendations

`pip install -r requirements.txt`

- Export URLs from Screaming Frog (or create a CSV with an `Address` column)
- Update configuration variables in the script:
  - `INPUT_FILE`: Path to your CSV file
  - `OUTPUT_FILE`: Where to save results
  - `N_CLUSTERS`: Number of template types to identify
- Run the script:
  `python template_fingerprinting.py`

| Variable | Default | Description |
|---|---|---|
| `INPUT_FILE` | `./urls.csv` | CSV file with 'Address' column |
| `OUTPUT_FILE` | `./classified_urls.csv` | Output file path |
| `N_CLUSTERS` | `5` | Number of template types to detect |
| `TIMEOUT` | `10` | HTTP request timeout in seconds |

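
As a rough illustration, the settings above could sit at the top of the script as plain constants, with the `Address` column read through the standard `csv` module. The variable names and defaults are taken from the table; `read_addresses` is a hypothetical helper written for this sketch, and the actual script may load its input differently.

```python
import csv

# Names and defaults mirror the configuration table.
INPUT_FILE = "./urls.csv"
OUTPUT_FILE = "./classified_urls.csv"
N_CLUSTERS = 5
TIMEOUT = 10  # HTTP request timeout in seconds


def read_addresses(lines):
    """Return non-empty 'Address' values from CSV lines (hypothetical helper).

    `lines` is any iterable of CSV text lines, such as an open file.
    """
    reader = csv.DictReader(lines)
    if "Address" not in (reader.fieldnames or []):
        raise ValueError("input CSV needs an 'Address' column")
    # Short rows yield None for missing cells, so guard before stripping.
    return [
        row["Address"].strip()
        for row in reader
        if (row["Address"] or "").strip()
    ]


# Possible usage (assumes INPUT_FILE exists on disk):
# with open(INPUT_FILE, newline="", encoding="utf-8-sig") as handle:
#     urls = read_addresses(handle)
```

Opening the file with `utf-8-sig` is a defensive choice in case the export starts with a byte-order mark; it is harmless for files without one.
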
The script generates a CSV file with:
- Original URL data
- `Cluster`: Numeric cluster ID
- `Page Type`: Human-readable type label (Type 0, Type 1, etc.)

Console output includes top features for each cluster to help identify what each template type represents.
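
The output columns can be pictured with a small sketch, assuming the cluster labels arrive as one integer per input row. `add_classification` and `to_csv_text` are hypothetical helpers for illustration, and the example.com URLs are made up; only the `Cluster` and `Page Type` column names and the `Type N` label format come from the description above.

```python
import csv
import io


def add_classification(rows, labels):
    """Return copies of rows with 'Cluster' and 'Page Type' columns appended."""
    if len(rows) != len(labels):
        raise ValueError("need exactly one cluster label per row")
    return [
        {**row, "Cluster": int(label), "Page Type": f"Type {int(label)}"}
        for row, label in zip(rows, labels)
    ]


def to_csv_text(rows):
    """Serialise classified rows as CSV text, keeping the original column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Illustrative input rows and labels (not real crawl data).
rows = [
    {"Address": "https://example.com/p/1"},
    {"Address": "https://example.com/blog/a"},
]
classified = add_classification(rows, [0, 2])
print(to_csv_text(classified))
# Address,Cluster,Page Type
# https://example.com/p/1,0,Type 0
# https://example.com/blog/a,2,Type 2
```
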
- Feature Extraction: For each URL, the script fetches HTML and extracts:
  - Tag counts (e.g., `div:15`, `article:1`)
  - CSS class names
  - ID attributes
  - Meta tag properties
- Vectorization: Features are converted to TF-IDF vectors
- Clustering: K-Means groups similar page structures
- Analysis: Top features per cluster help identify template types

Lee Foot - eCommerce SEO Consultant