crawl4ai version
0.8.6
Expected Behavior
When converting a page with clear document structure in HTML, the markdown output should preserve:
- Heading hierarchy: `h1` through `h6` levels mapped to `#` through `######` consistently
- Table structure: HTML `<table>` elements converted to valid GitHub-flavored markdown tables (or a documented alternative)
Alternatively, provide explicit configuration options or separate commands so users can choose between:
- Fast/minimal markdown (current behavior, smaller output)
- Structure-preserving markdown (accurate headings, proper tables)
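To make the structure-preserving expectation concrete, here is a minimal sketch of the mapping I have in mind, written against the Python standard library only. It is not crawl4ai's converter and does not use any crawl4ai API; the names `StructureSketch` and `to_structured_markdown` are made up for illustration, and the sketch ignores everything except headings and simple (non-nested) tables.

```python
from html.parser import HTMLParser


class StructureSketch(HTMLParser):
    """Illustrative only: emits headings and simple tables as markdown."""

    # h1..h6 map to one to six '#' characters.
    HEADINGS = {f"h{n}": "#" * n for n in range(1, 7)}

    def __init__(self):
        super().__init__()
        self.lines = []   # finished markdown lines
        self.buf = None   # text chunks of the heading/cell being collected
        self.row = None   # cells of the table row being collected
        self.rows = 0     # rows already emitted for the current table

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS or tag in ("th", "td"):
            self.buf = []
        elif tag == "tr":
            self.row = []
        elif tag == "table":
            self.rows = 0

    def handle_data(self, data):
        if self.buf is not None:
            self.buf.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self.buf is not None:
            self.lines.append(f"{self.HEADINGS[tag]} {self._text()}")
        elif tag in ("th", "td") and self.row is not None and self.buf is not None:
            # Escape pipes so cell text cannot break the table layout.
            self.row.append(self._text().replace("|", "\\|"))
        elif tag == "tr" and self.row:
            self.lines.append("| " + " | ".join(self.row) + " |")
            if self.rows == 0:
                # GFM requires a delimiter row right after the header row.
                self.lines.append("| " + " | ".join("---" for _ in self.row) + " |")
            self.rows += 1
            self.row = None

    def _text(self):
        # Collapse whitespace and stop collecting.
        text = " ".join("".join(self.buf).split())
        self.buf = None
        return text


def to_structured_markdown(html: str) -> str:
    parser = StructureSketch()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)
```

Fed a fixture with `h1`, `h2`, `h3` and one two-column table, this yields three ATX headings at matching depths followed by a pipe-delimited table with a `---` delimiter row, which is the shape of output I would expect from a "semantic" mode.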
Suggested naming:
- Config approach: `markdown.mode: "compact" | "semantic"` or `markdown.preserve_structure: true`
- Separate commands: `md-lite` / `md-semantic` (CLI), `crawl4ai_md` / `crawl4ai_md_semantic` (MCP)
Suggested resolution
Option A — Configuration flags:
    {
      "crawler_config": {
        "markdown": {
          "mode": "semantic",
          "preserve_headings": true,
          "preserve_tables": "gfm"
        }
      }
    }
Option B — Separate commands/endpoints:
| Use case | CLI flag | MCP tool name |
| --- | --- | --- |
| Fast, minimal | `-o markdown` or `-o md-lite` | `crawl4ai_md` |
| Structure-preserving | `-o md-semantic` | `crawl4ai_md_semantic` |
Current Behavior
For the same URL and crawl settings, markdown output loses or degrades structural information that is preserved in HTML:
| Element | HTML Output | Markdown Output |
| --- | --- | --- |
| Headings | Clear `h1` > `h2` > `h3` nesting | Levels flattened or inconsistent |
| Tables | Valid `<table>` with rows/columns | Flattened to lists, paragraphs, or lost entirely |
Users must switch to HTML output and manually extract structure, defeating markdown's purpose as a readable, structured format.
Is this reproducible?
Yes
Inputs Causing the Bug
- **Documentation pages** with hierarchical sections (single `h1`, multiple `h2`/`h3` levels)
- **Data pages** with comparison tables, specification tables, or course lists in `<table>` markup
- **Any public URL** (or local HTML fixture) containing structured content
Steps to Reproduce
1. Identify a page with known heading hierarchy (`h1` → `h2` → `h3`) and at least one HTML table
2. Crawl with **markdown** output:
   `crwl 'https://example.invalid/test-page' -o markdown`
3. Crawl the **same page** with **HTML** output:
   `crwl 'https://example.invalid/test-page' -o html`
4. Compare:
   - Count heading levels in markdown vs HTML DOM
   - Check if table structure is preserved as `|` delimited markdown
5. Observe: markdown flattens headings and loses table formatting
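The comparison in step 4 can be automated. The following is a rough sketch, standard library only, that counts heading levels and tables on both sides; `html_structure`, `markdown_structure`, and `compare` are hypothetical helper names, not part of crawl4ai. It only recognises ATX headings (`#` style) and GFM delimiter rows, and it skips fenced code blocks, so treat the numbers as an approximation.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class _HtmlStructure(HTMLParser):
    """Counts heading start tags per level and <table> start tags."""

    def __init__(self):
        super().__init__()
        self.headings = Counter()
        self.tables = 0

    def handle_starttag(self, tag, attrs):
        if re.fullmatch(r"h[1-6]", tag):
            self.headings[int(tag[1])] += 1
        elif tag == "table":
            self.tables += 1


def html_structure(html):
    parser = _HtmlStructure()
    parser.feed(html)
    parser.close()
    return parser.headings, parser.tables


# ATX heading: up to 3 spaces, 1-6 '#', then whitespace or end of line.
ATX = re.compile(r"^ {0,3}(#{1,6})(?:[ \t]+|$)")
# GFM table delimiter row, e.g. "| --- | :---: |".
DELIM = re.compile(r"^\s*\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)*\|?\s*$")
# Opening or closing code fence (three or more backticks or tildes).
FENCE = re.compile(r"^\s*(?:`{3,}|~{3,})")


def markdown_structure(md):
    headings, tables, in_fence = Counter(), 0, False
    for line in md.splitlines():
        if FENCE.match(line):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = ATX.match(line)
        if match:
            headings[len(match.group(1))] += 1
        elif "|" in line and DELIM.match(line):
            # One delimiter row per table; a bare "---" rule has no pipe.
            tables += 1
    return headings, tables


def compare(html, md):
    html_h, html_t = html_structure(html)
    md_h, md_t = markdown_structure(md)
    return {
        # Levels that appear less often in markdown than in the HTML.
        "heading_levels_lost": sorted(l for l in html_h if md_h[l] < html_h[l]),
        "tables_lost": max(html_t - md_t, 0),
    }
```

Passing the contents of the HTML and markdown outputs for the same page to `compare` gives a small dict; any non-empty `heading_levels_lost` or non-zero `tables_lost` is the structure loss this issue describes.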
Code snippets
**CLI comparison:**
    # Markdown - structure loss visible here
    crwl 'https://example.invalid/test-page' -o markdown > output.md

    # HTML - structure preserved
    crwl 'https://example.invalid/test-page' -o html > output.html
**HTTP API:**
    # Markdown request
    curl -sS 'http://localhost:PORT/crawl' \
      -H 'Content-Type: application/json' \
      -d '{
        "urls": ["https://example.invalid/test-page"],
        "crawler_config": { "cache_mode": "bypass" }
      }' | jq -r '.results[0].markdown.raw_markdown'

    # HTML request for comparison
    curl -sS 'http://localhost:PORT/crawl' \
      -H 'Content-Type: application/json' \
      -d '{
        "urls": ["https://example.invalid/test-page"],
        "crawler_config": { "cache_mode": "bypass" }
      }' | jq -r '.results[0].html'
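Since the two requests above send the same body and differ only in the `jq` filter, a single saved response should be enough to get both sides of the comparison. A small sketch, assuming the response shape implied by those filters (`results[i].markdown.raw_markdown` and `results[i].html`); `split_result` is a hypothetical helper name and I have not verified any other fields of the response:

```python
import json


def split_result(response_text, index=0):
    """Return (raw_markdown, html) for one result of a /crawl response.

    Assumes the shape used by the jq filters in this report; missing
    fields come back as empty strings instead of raising.
    """
    payload = json.loads(response_text)
    result = payload["results"][index]
    markdown = (result.get("markdown") or {}).get("raw_markdown", "")
    return markdown, result.get("html", "")
```

Saving the `curl` output to a file and passing its contents to `split_result` avoids crawling the page twice, which also rules out differences caused by the page changing between requests.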
OS
Debian GNU/Linux 12 (bookworm)
Python version
3.12.13
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
No response