[Bug]: Markdown export loses heading hierarchy and table structure #1964

@klemens-u

Description

crawl4ai version

0.8.6

Expected Behavior

When converting a page with clear document structure in HTML, the markdown output should preserve:

  1. Heading hierarchy: h1 through h6 levels mapped to # through ###### consistently
  2. Table structure: HTML <table> elements converted to valid GitHub-flavored markdown tables (or a documented alternative)
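
To make the expected mapping concrete, here is a minimal sketch of the output I would expect for headings and simple tables. It uses only Python's standard-library `html.parser`; the class name `StructureToMarkdown` and the helper `convert` are hypothetical and are not part of crawl4ai. It ignores everything except `h1` to `h6` and flat tables (no nesting, no `colspan`/`rowspan`), so it illustrates the desired result rather than proposing an implementation.

```python
from html.parser import HTMLParser


class StructureToMarkdown(HTMLParser):
    """Convert only headings and simple tables; all other content is ignored."""

    def __init__(self):
        super().__init__()
        self.lines = []             # emitted markdown lines
        self._heading_level = None  # 1..6 while inside an <hN> element
        self._text = []             # text collected for the current heading/cell
        self._row = None            # cells of the <tr> being read
        self._rows = []             # rows of the <table> being read
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._heading_level = int(tag[1])
            self._text = []
        elif tag == "table":
            self._rows = []
        elif tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True
            self._text = []

    def handle_data(self, data):
        if self._heading_level is not None or self._in_cell:
            self._text.append(data)

    def handle_endtag(self, tag):
        text = " ".join("".join(self._text).split())  # collapse whitespace
        if self._heading_level is not None and tag == f"h{self._heading_level}":
            # hN maps to N leading '#' characters
            self.lines.append("#" * self._heading_level + " " + text)
            self._heading_level = None
        elif tag in ("td", "th") and self._in_cell:
            if self._row is not None:
                self._row.append(text.replace("|", "\\|"))  # escape pipes
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self._rows.append(self._row)
            self._row = None
        elif tag == "table" and self._rows:
            # First row is treated as the header, followed by the GFM delimiter row
            header, *body = self._rows
            self.lines.append("| " + " | ".join(header) + " |")
            self.lines.append("| " + " | ".join("---" for _ in header) + " |")
            for row in body:
                self.lines.append("| " + " | ".join(row) + " |")
            self._rows = []


def convert(html: str) -> str:
    parser = StructureToMarkdown()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)


SAMPLE = """
<h1>Guide</h1>
<h2>Install</h2>
<h3>Linux</h3>
<table>
  <tr><th>Plan</th><th>Price</th></tr>
  <tr><td>Basic</td><td>10</td></tr>
</table>
"""

print(convert(SAMPLE))
```

For the sample above, the three headings come out as `# Guide`, `## Install`, `### Linux`, and the table becomes a three-line pipe table with a `| --- | --- |` delimiter row.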

Alternatively, provide explicit configuration options or separate commands so users can choose between:

  • Fast/minimal markdown (current behavior, smaller output)
  • Structure-preserving markdown (accurate headings, proper tables)

Suggested naming:

  • Config approach: markdown.mode: "compact" | "semantic" or markdown.preserve_structure: true
  • Separate commands: md-lite / md-semantic (CLI), crawl4ai_md / crawl4ai_md_semantic (MCP)

Suggested resolution

Option A — Configuration flags:

{
  "crawler_config": {
    "markdown": {
      "mode": "semantic",
      "preserve_headings": true,
      "preserve_tables": "gfm"
    }
  }
}

Option B — Separate commands/endpoints:

| Use case | CLI flag | MCP tool name |
| --- | --- | --- |
| Fast, minimal | `-o markdown` or `-o md-lite` | `crawl4ai_md` |
| Structure-preserving | `-o md-semantic` | `crawl4ai_md_semantic` |

Current Behavior

For the same URL and crawl settings, markdown output loses or degrades structural information that is preserved in HTML:

| Element | HTML Output | Markdown Output |
| --- | --- | --- |
| Headings | Clear `h1` > `h2` > `h3` nesting | Levels flattened or inconsistent |
| Tables | Valid `<table>` with rows/columns | Flattened to lists, paragraphs, or lost entirely |

Users must switch to HTML output and manually extract structure, defeating markdown's purpose as a readable, structured format.

Is this reproducible?

Yes

Inputs Causing the Bug

- **Documentation pages** with hierarchical sections (single `h1`, multiple `h2`/`h3` levels)
- **Data pages** with comparison tables, specification tables, or course lists in `<table>` markup
- **Any public URL** (or local HTML fixture) containing structured content

Steps to Reproduce

1. Identify a page with a known heading hierarchy (`h1` > `h2` > `h3`) and at least one HTML table
2. Crawl with **markdown** output:
   
   crwl 'https://example.invalid/test-page' -o markdown
   
3. Crawl the **same page** with **HTML** output:
   
   crwl 'https://example.invalid/test-page' -o html
   
4. Compare:
   - Count heading levels in markdown vs HTML DOM
   - Check if table structure is preserved as `|` delimited markdown
5. Observe: markdown flattens headings and loses table formatting
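
The comparison in step 4 can be automated. The following is a rough sketch using only the Python standard library; the function names (`html_structure`, `markdown_structure`, `compare`) are my own and not part of crawl4ai. It assumes ATX-style (`#`) headings in the markdown and detects tables by their GFM delimiter row, so setext headings and other table syntaxes are not counted. Note that the raw HTML may contain headings in navigation or footer regions that a content filter legitimately drops, so a mismatch is a hint to inspect, not proof by itself.

```python
import re
from html.parser import HTMLParser


class _StructureCounter(HTMLParser):
    """Record the sequence of heading levels and the number of <table> tags."""

    def __init__(self):
        super().__init__()
        self.heading_levels = []
        self.table_count = 0

    def handle_starttag(self, tag, attrs):
        if re.fullmatch(r"h[1-6]", tag):
            self.heading_levels.append(int(tag[1]))
        elif tag == "table":
            self.table_count += 1


def html_structure(html: str):
    counter = _StructureCounter()
    counter.feed(html)
    counter.close()
    return counter.heading_levels, counter.table_count


# ATX heading: up to 3 spaces, 1-6 '#', then whitespace or end of line
ATX_HEADING = re.compile(r"^ {0,3}(#{1,6})(?:[ \t]+|$)")
# GFM delimiter row such as "| --- | :---: |"
DELIMITER_ROW = re.compile(r"^\s*\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)*\|?\s*$")


def markdown_structure(markdown: str):
    levels, tables, in_fence = [], 0, False
    for line in markdown.splitlines():
        # Skip fenced code so '#' comments are not mistaken for headings
        if line.lstrip().startswith(("```", "~~~")):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = ATX_HEADING.match(line)
        if match:
            levels.append(len(match.group(1)))
        elif "|" in line and DELIMITER_ROW.match(line):
            tables += 1  # one delimiter row per pipe table
    return levels, tables


def compare(html: str, markdown: str) -> dict:
    html_levels, html_tables = html_structure(html)
    md_levels, md_tables = markdown_structure(markdown)
    return {
        "headings_match": html_levels == md_levels,
        "html_headings": html_levels,
        "markdown_headings": md_levels,
        "html_tables": html_tables,
        "markdown_tables": md_tables,
    }
```

To use it with the two crawls above, redirect each command's output to a file, read both files as text, and pass them to `compare(html, markdown)`.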

Code snippets

**CLI comparison:**

# Markdown - structure loss visible here
crwl 'https://example.invalid/test-page' -o markdown > output.md

# HTML - structure preserved
crwl 'https://example.invalid/test-page' -o html > output.html


**HTTP API:**

# Markdown request
curl -sS 'http://localhost:PORT/crawl' \
  -H 'Content-Type: application/json' \
  -d '{
    "urls": ["https://example.invalid/test-page"],
    "crawler_config": { "cache_mode": "bypass" }
  }' | jq -r '.results[0].markdown.raw_markdown'

# HTML request for comparison
curl -sS 'http://localhost:PORT/crawl' \
  -H 'Content-Type: application/json' \
  -d '{
    "urls": ["https://example.invalid/test-page"],
    "crawler_config": { "cache_mode": "bypass" }
  }' | jq -r '.results[0].html'
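
For scripted checks, the same two requests can be issued from Python. This is a sketch only: the `/crawl` path, the payload, and the response fields (`results[0].markdown.raw_markdown`, `results[0].html`) are copied from the curl commands above, while the function names and the `base_url` parameter are placeholders of mine. Since both curl commands send an identical payload, a single request is enough to obtain both outputs.

```python
import json
from urllib import request


def build_payload(url: str) -> dict:
    # Same body as in the curl commands above
    return {"urls": [url], "crawler_config": {"cache_mode": "bypass"}}


def extract_outputs(response: dict):
    # Mirrors the two jq filters: raw markdown and HTML of the first result
    first = response["results"][0]
    return first["markdown"]["raw_markdown"], first["html"]


def crawl(base_url: str, url: str, timeout: float = 60) -> dict:
    # base_url is a placeholder, e.g. "http://localhost:<port>"
    req = request.Request(
        base_url.rstrip("/") + "/crawl",
        data=json.dumps(build_payload(url)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

Calling `extract_outputs(crawl(base_url, page_url))` then yields the markdown and HTML strings for a side-by-side check.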

OS

Debian GNU/Linux 12 (bookworm)

Python version

3.12.13

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

No response

Labels

Bug (Something isn't working), Needs Triage (Needs attention of maintainers)
