By Lee Foot | 21st October 2025 | LeeFoot.com
A Python script that uses Claude AI to automatically identify and extract major content blocks from web pages, generating XPath selectors for each element.
This tool crawls a list of URLs, extracts the HTML content, and uses Claude Haiku to intelligently identify major content sections (hero sections, feature blocks, carousels, etc.) along with robust XPath expressions to select them. Perfect for web scraping, content analysis, or building automated testing selectors.
- AI-Powered Extraction: Uses Claude Haiku to intelligently identify content blocks
- Frequency Analysis: Automatically counts how often each XPath appears across all pages
- Incremental Saving: Saves progress every N rows to prevent data loss
- Batch Processing: Process multiple URLs with automatic rate limiting
- Post-Processing: Combines all results and standardizes naming by XPath
- Debug Mode: Test with a small subset of URLs before full run
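At its core, the script sends each page's filtered HTML to Claude Haiku and expects a JSON array of content blocks back. A minimal sketch of the response-parsing step (the `parse_blocks` helper is hypothetical; the field names match the output columns described below, but the real script's schema handling may differ):

```python
import json

def parse_blocks(response_text, url):
    """Parse a JSON-array response from the model into flat CSV rows.

    Assumes the model was asked to return objects with 'name', 'xpath',
    and 'notes' keys; malformed responses are skipped gracefully.
    """
    try:
        blocks = json.loads(response_text)
    except json.JSONDecodeError:
        return []
    return [
        {
            "url": url,
            "name": b.get("name", ""),
            "xpath": b.get("xpath", ""),
            "notes": b.get("notes", ""),
        }
        for b in blocks
        if isinstance(b, dict)
    ]

sample = '[{"name": "Hero Section", "xpath": "//div[@class=\'hero-banner\']", "notes": "Top banner"}]'
rows = parse_blocks(sample, "https://example.com")
print(rows[0]["xpath"])  # //div[@class='hero-banner']
```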
- Python 3.7+
- Anthropic API key (Claude)
1. Download the script `extract_content_blocks.py`
2. Install required packages:

   ```
   pip install requests beautifulsoup4 anthropic pandas
   ```

3. Get your Anthropic API key from https://console.anthropic.com/
4. Open the script and add your API key:

   ```python
   # Your API key
   api_key = "YOUR CLAUDE KEY HERE"  # Replace with your actual API key
   ```

Project structure:

```
zb_extract_content_blocks/
│
├── input/
│   └── urls.txt                             # Your list of URLs to process
│
├── output/
│   ├── content_blocks_*.csv                 # Individual batch files
│   └── combined_output_with_frequency.csv   # Final processed results
│
└── extract_content_blocks.py                # Main script
```
Create a file at input/urls.txt with one URL per line:
```
https://example.com
https://example.com/products
https://example.com/about
https://anothersite.com
```
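Reading that file can be as simple as the sketch below (a hypothetical `load_urls` helper; the real script may read the file differently):

```python
def load_urls(path):
    """Read one URL per line, skipping blank lines and # comments."""
    with open(path, encoding="utf-8") as f:
        return [
            line.strip()
            for line in f
            if line.strip() and not line.strip().startswith("#")
        ]
```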
Edit these settings at the top of extract_content_blocks.py:
```python
# DEBUG MODE: Set to True to process only 2 URLs for testing
DEBUG_MODE = False

# INCREMENTAL SAVE: Save progress every N rows
SAVE_EVERY_N_ROWS = 50

# Update paths if needed
INPUT_FILE = r"C:\python_scripts\zb_extract_content_blocks\input\urls.txt"
OUTPUT_DIR = r"C:\python_scripts\zb_extract_content_blocks\output"
```

Run the script:

```
python extract_content_blocks.py
```

The script creates timestamped CSV files as it processes URLs:

```
content_blocks_20240121_143022_a1b2c3d4.csv
content_blocks_20240121_143145_e5f6g7h8.csv
```
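The incremental-save step can be sketched like this (a hypothetical `save_batch` helper that mimics the timestamped filename pattern above; the real script's writer may differ):

```python
import csv
import uuid
from datetime import datetime

def save_batch(rows, output_dir):
    """Write accumulated rows to a timestamped CSV so progress survives a crash."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = f"{output_dir}/content_blocks_{stamp}_{uuid.uuid4().hex[:8]}.csv"
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["url", "name", "xpath", "notes"])
        writer.writeheader()
        writer.writerows(rows)
    return path
```

Calling `save_batch` every `SAVE_EVERY_N_ROWS` rows means a mid-run failure loses at most one batch.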
Each contains:
| Column | Description |
|---|---|
| `url` | Source URL |
| `name` | Descriptive name of the content block |
| `xpath` | XPath selector for the element |
| `notes` | Brief description |
After processing all URLs, the script generates:
combined_output_with_frequency.csv
Contains all results with additional columns:
| Column | Description |
|---|---|
| `url` | Source URL |
| `name` | Standardized name (consistent for same XPath) |
| `xpath` | XPath selector |
| `notes` | Description |
| `frequency` | How many times this XPath appears across all pages |
Sorted by frequency (most common XPaths first).
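The frequency and name-standardization pass might look like this with pandas (a sketch with toy data; the real script's post-processing may differ):

```python
import pandas as pd

# Toy results from two crawled pages
df = pd.DataFrame([
    {"url": "https://a.com", "name": "Hero", "xpath": "//div[@class='hero']", "notes": ""},
    {"url": "https://b.com", "name": "Hero Banner", "xpath": "//div[@class='hero']", "notes": ""},
    {"url": "https://a.com", "name": "Footer", "xpath": "//footer", "notes": ""},
])

# Count how often each XPath occurs across all pages
df["frequency"] = df.groupby("xpath")["xpath"].transform("size")

# Standardize naming: reuse the first-seen name for every row sharing an XPath
df["name"] = df.groupby("xpath")["name"].transform("first")

# Most common XPaths first
df = df.sort_values("frequency", ascending=False)
```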
```
Top 10 XPaths by frequency:
================================================================================
  45x - Hero Section
        //div[@class='hero-banner']
  38x - Navigation Menu
        //nav[@id='main-navigation']
  35x - Footer
        //footer[@class='site-footer']
  28x - Feature Grid
        //section[@class='features']
```
For testing, enable debug mode to process only the first 2 URLs:
```python
DEBUG_MODE = True
```

This is useful for:
- Testing your API key setup
- Verifying URL format
- Checking output structure
- Estimating API costs
The script includes a 1-second delay between requests to be respectful to web servers:
```python
time.sleep(1)  # In fetch_webpage()
```

Adjust as needed based on the target site's policies.
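Putting the delay together with error handling, the fetch step might look like the following sketch (the injectable `session` parameter is an assumption added here for testability, not part of the original script):

```python
import time
import requests

def fetch_webpage(url, session=None, delay=1.0, timeout=30):
    """Fetch a page politely; return HTML text, or None on any request error."""
    session = session or requests.Session()
    try:
        resp = session.get(
            url,
            timeout=timeout,
            headers={"User-Agent": "Mozilla/5.0 (content-block extractor)"},
        )
        resp.raise_for_status()
    except requests.RequestException as exc:
        print(f"Skipping {url}: {exc}")
        return None
    time.sleep(delay)  # be respectful to the target server
    return resp.text
```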
The script uses prompt caching to reduce costs:
- HTML content is cached
- System prompt is cached
- Subsequent similar requests use cached tokens at ~90% discount
Console output shows token usage for each request:
```
Tokens - Input: 1234, Cache creation: 5678, Cache read: 9012, Output: 345
```
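For reference, Anthropic's prompt-caching API marks cacheable blocks with `cache_control`. A sketch of the request shape (the prompt text, placeholder HTML, and model name here are illustrative assumptions, not the script's actual values):

```python
# Cache the system prompt: identical across every request in the run
system_blocks = [
    {
        "type": "text",
        "text": "You identify major content blocks in HTML and return JSON.",
        "cache_control": {"type": "ephemeral"},
    }
]

# Cache the bulky HTML content; keep the short instruction uncached
user_blocks = [
    {
        "type": "text",
        "text": "<html>...page markup...</html>",
        "cache_control": {"type": "ephemeral"},
    },
    {"type": "text", "text": "List the major content blocks with XPaths."},
]

# The actual call would look roughly like:
# client.messages.create(
#     model="claude-haiku-4-5",
#     system=system_blocks,
#     messages=[{"role": "user", "content": user_blocks}],
#     max_tokens=1024,
# )
```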
- Invalid URLs are skipped with error messages
- Network errors are caught and logged
- JSON parsing errors are handled gracefully
- Progress is saved incrementally to prevent data loss
With Claude Haiku 4.5 (current pricing):
- Input: $1 per million tokens
- Output: $5 per million tokens
- Cached input: ~$0.10 per million tokens (90% discount)
Average cost per URL: $0.002 - $0.015 depending on page size
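Those rates translate into a simple per-URL estimate (a sketch using the rates above; cache-write surcharges are ignored for simplicity, and the token counts below are illustrative):

```python
# USD per million tokens, per the pricing listed above
INPUT_RATE, OUTPUT_RATE, CACHED_RATE = 1.00, 5.00, 0.10

def estimate_cost(input_tokens, cached_tokens, output_tokens):
    """Rough per-request cost in USD."""
    return (input_tokens * INPUT_RATE
            + cached_tokens * CACHED_RATE
            + output_tokens * OUTPUT_RATE) / 1_000_000

# e.g. a mid-sized page: 2,000 fresh input tokens, 8,000 cached, 400 output
cost = estimate_cost(2_000, 8_000, 400)
print(f"${cost:.4f}")  # $0.0048
```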
The script uses prompt caching to significantly reduce costs on repeated similar requests.
**"No module named 'anthropic'"**

```
pip install anthropic
```

**"401 Unauthorized" or API authentication errors**

- Ensure you've replaced `"YOUR CLAUDE KEY HERE"` with your actual API key
- Verify your API key is valid at https://console.anthropic.com/

**"403 Forbidden" errors when fetching URLs**

- Check if the website allows scraping (check robots.txt)
- Some sites block automated requests

**Large HTML pages timing out**

- Increase the timeout in `fetch_webpage()`: `timeout=60`
- Consider filtering more aggressively in `filter_html()`
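A minimal `filter_html()` along these lines might look like the sketch below (the exact set of tags the script strips is an assumption):

```python
from bs4 import BeautifulSoup

def filter_html(html):
    """Strip bulky, non-structural elements before sending HTML to the model."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "noscript", "svg", "iframe"]):
        tag.decompose()
    return str(soup)
```

Removing scripts, styles, and embedded SVG typically shrinks the payload substantially, which cuts both input tokens and the chance of a timeout.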
- Always test with `DEBUG_MODE = True` first
- Keep your API key secure - don't share your script with the key in it
- Respect robots.txt and terms of service
- Use appropriate rate limiting for target sites
- Monitor API usage and costs in the Anthropic console
MIT License - feel free to use and modify as needed.
Contributions are welcome! Please feel free to submit a Pull Request.
- Built with Anthropic's Claude API
- Uses BeautifulSoup for HTML parsing
- Powered by Python
For issues or questions:
- Check the Troubleshooting section above
- Review Anthropic's documentation
- Open an issue on GitHub
Note: This tool is for educational and legitimate web scraping purposes only. Always ensure you have permission to scrape websites and comply with their terms of service and robots.txt files.