feat: Rust native XML submission parser via PyO3 #3014

Open: mberg wants to merge 8 commits into main from rusty

mberg commented Feb 16, 2026

Summary

Replaces the Python minidom-based XML submission parser with a single-pass Rust native extension, eliminating 6+ redundant XML parses per submission. Feature-flagged for safe rollout.

Problem

The current XFormInstanceParser uses Python's minidom to build a full DOM tree (2-3x memory of raw XML), then recursively traverses it multiple times. During a single Instance.save(), the same XML is parsed 6+ times by separate calls to get_dict() from save(), _set_geom(), get_expected_media(), and get_full_dict(). For large submissions with deeply nested repeat groups, this is a significant CPU and memory bottleneck.

Solution

A Rust native extension (onadata_xml) compiled via PyO3/maturin that:

  • Parses XML in a single pass using quick-xml (SAX-style, no DOM tree)
  • Returns all data at once via a SubmissionResult object: nested dict, flat dict, attributes, UUID, deprecated UUID, submission date, geopoints, SHA256 checksum
  • Applies numeric conversion inline during parsing (no separate recursive pass)
  • Extracts geopoints during parsing (no separate get_values_matching_key traversal)
  • Computes SHA256 natively via the sha2 crate
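
To make the single-pass idea concrete, here is a minimal Python sketch of the same technique using the standard library's event-driven expat parser. It is not the Rust implementation: the function name `parse_submission_sketch`, the returned keys, and the choice to hash the raw XML bytes are all assumptions made for illustration. It builds the nested dict while streaming, promotes duplicate sibling nodes to a list, and never materializes a DOM tree.

```python
import hashlib
import xml.parsers.expat


def parse_submission_sketch(xml_bytes):
    """Build a nested dict from XML in one streaming pass (hypothetical helper)."""
    root = {}
    stack = [root]    # dicts for the currently open elements
    text_parts = []   # character data seen since the last start/end tag

    def start(name, attrs):
        child = {}
        parent = stack[-1]
        if name in parent:
            # Duplicate sibling: promote the existing value to a list.
            if not isinstance(parent[name], list):
                parent[name] = [parent[name]]
            parent[name].append(child)
        else:
            parent[name] = child
        stack.append(child)
        text_parts.clear()

    def end(name):
        node = stack.pop()
        parent = stack[-1]
        if not node:
            # Leaf element: swap the empty placeholder dict for its text.
            text = "".join(text_parts).strip()
            value = text if text else None
            if isinstance(parent[name], list):
                parent[name][-1] = value
            else:
                parent[name] = value
        text_parts.clear()

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = text_parts.append
    parser.Parse(xml_bytes, True)

    # Assumption: the checksum is taken over the raw XML bytes.
    return {"dict": root, "checksum": hashlib.sha256(xml_bytes).hexdigest()}
```

The sketch ignores attributes, namespaces, and mixed content, all of which the description above says the Rust parser handles.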

A Python wrapper class (RustXFormInstanceParser) provides an identical interface to the existing XFormInstanceParser, making it a drop-in replacement.
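
One way such a drop-in wrapper could be shaped is sketched below. The class name `CachedParserSketch`, its method names, and the dict-shaped result are assumptions for illustration; the real `RustXFormInstanceParser` and `SubmissionResult` may look different. The parse function is injected so the sketch runs without the compiled extension.

```python
class CachedParserSketch:
    """Parse once on construction, then serve every accessor from the cache."""

    def __init__(self, xml_string, parse_fn):
        # parse_fn stands in for the native parse_submission() call.
        self._result = parse_fn(xml_string)

    def to_dict(self):
        # Nested dict keyed by the root node name (assumed result layout).
        return self._result["dict"]

    def to_flat_dict(self):
        # Flattened dict with path-style keys (assumed result layout).
        return self._result["flat_dict"]

    def get_attributes(self):
        # Attributes collected during the same single parse.
        return self._result["attributes"]
```

Because every accessor reads from `self._result`, callers such as `save()` or `_set_geom()` can share one parser instance instead of each triggering a fresh parse.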

Estimated Performance Impact

| Operation | Current (Python) | With Rust | Estimated improvement |
| --- | --- | --- | --- |
| XML parse + dict build | ~15 ms/submission | ~1-3 ms | 5-15x faster |
| Redundant re-parses eliminated | 6+ passes | 1 pass | 6x fewer passes |
| Memory (no DOM tree) | 2-3x XML size | ~1x XML size | 2-3x less |
| SHA256 hash | Python hashlib | Native sha2 | 2-5x faster |

Changes

Rust Crate: rust/onadata_xml/ (1,805 lines)

| File | Description |
| --- | --- |
| src/parser.rs | SAX-based XML-to-dict parser. Handles repeat groups, encrypted media, CDATA sections, auto-list on duplicate nodes, xpath computation, attribute collection, UUID/date extraction. |
| src/flatten.rs | Iterative stack-based dict flattening (replaces recursive _flatten_dict_nest_repeats) |
| src/numeric.rs | String-to-int/float conversion matching Python's numeric_checker |
| src/geom.rs | Recursive geopoint extraction from nested dict structures |
| src/lib.rs | PyO3 module entry. Wires all modules into parse_submission() returning SubmissionResult |
| Cargo.toml | Dependencies: pyo3 0.23, quick-xml 0.37, sha2 0.10 |
| pyproject.toml | maturin build configuration |
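
The iterative, stack-based flattening that flatten.rs is described as doing can be illustrated in Python roughly as follows. This is a sketch of the technique only: the function name `flatten_iterative`, the `/` separator, and the way repeats are kept as lists of flattened dicts are assumptions, and the exact key layout of `_flatten_dict_nest_repeats` may differ.

```python
def flatten_iterative(nested, sep="/"):
    """Flatten a nested dict with an explicit stack instead of recursion."""
    flat = {}
    # Each entry: (key prefix, dict still to walk, dict that receives output).
    stack = [("", nested, flat)]
    while stack:
        prefix, node, out = stack.pop()
        for key, value in node.items():
            path = f"{prefix}{sep}{key}" if prefix else key
            if isinstance(value, dict):
                # Group: keep writing into the same output dict.
                stack.append((path, value, out))
            elif isinstance(value, list):
                # Repeat: each item gets its own flattened dict in a list.
                repeats = []
                out[path] = repeats
                for item in value:
                    if isinstance(item, dict):
                        flat_item = {}
                        repeats.append(flat_item)
                        stack.append((path, item, flat_item))
                    else:
                        repeats.append({path: item})
            else:
                out[path] = value
    return flat
```

Using an explicit stack avoids hitting the recursion limit on deeply nested repeat groups, which is the case the Problem section singles out.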

35 Rust unit tests covering: simple forms, nested repeats, encrypted media, CDATA, empty nodes, entity attributes, namespace variants (orx:), auto-list conversion, xpath computation, numeric edge cases (NaN, negative, float), geopoint parsing.
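
For readers unfamiliar with the numeric edge cases listed above, a minimal Python sketch of string-to-number conversion is shown here. The function name `to_number_or_text` is hypothetical, and leaving non-finite values such as NaN as text is an assumption; the existing `numeric_checker` may handle them differently, which is exactly the kind of divergence the parity tests need to pin down.

```python
import math


def to_number_or_text(value):
    """Return an int or float when the string parses as one, else the string."""
    try:
        return int(value)
    except ValueError:
        pass
    try:
        number = float(value)
    except ValueError:
        return value
    # Assumption: NaN and infinities stay as text rather than becoming floats.
    return number if math.isfinite(number) else value
```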

Python Integration (197 lines)

| File | Change |
| --- | --- |
| xform_instance_parser.py | Added RustXFormInstanceParser class with identical interface to XFormInstanceParser |
| instance.py | Modified _set_parser(), _set_geom(), _set_uuid() to use Rust parser when feature flag is on. Added shadow mode comparison logging. |
| settings/common.py | Added USE_RUST_XML_PARSER and RUST_XML_PARSER_SHADOW_MODE flags (both False by default) |
| tests/test_rust_parsing.py | 5 integration tests: parity tests (nested repeats, encrypted forms), UUID extraction, geom extraction, full submission round-trip |

Documentation

| File | Description |
| --- | --- |
| docs/plans/2026-02-16-rust-port-analysis.md | Full analysis of all CPU/memory-intensive code paths and Rust porting opportunities |
| docs/plans/2026-02-16-rust-xml-parser-design.md | Design document for this implementation |
| docs/plans/2026-02-16-rust-xml-parser-plan.md | Step-by-step implementation plan |

How to Activate

# In your Django settings override:

# Option 1: Shadow mode (safe - runs both parsers, logs differences)
RUST_XML_PARSER_SHADOW_MODE = True

# Option 2: Full switch (after shadow mode validates parity)
USE_RUST_XML_PARSER = True

Rollout Strategy

  1. Shadow mode in staging - Set RUST_XML_PARSER_SHADOW_MODE = True. Both parsers run, and any differences are logged to the onadata.rust_parser_shadow logger. Production behavior is unchanged.
  2. Feature flag on - After shadow mode confirms parity, set USE_RUST_XML_PARSER = True.
  3. Remove Python parser - After production validation, remove old code path.
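
The flag handling behind this rollout could look roughly like the sketch below. The function `parse_with_rollout` and its parameters are hypothetical; only the logger name comes from this description. It assumes the full switch takes precedence when both flags are set, and that a failure of the Rust parser in shadow mode is logged rather than raised.

```python
import logging

logger = logging.getLogger("onadata.rust_parser_shadow")


def parse_with_rollout(xml, python_parse, rust_parse,
                       use_rust=False, shadow_mode=False):
    """Pick a parser based on the two feature flags (illustrative only)."""
    if use_rust:
        # Full switch: only the Rust parser runs.
        return rust_parse(xml)
    result = python_parse(xml)
    if shadow_mode:
        # Shadow mode: run the Rust parser too, but never return its output.
        try:
            if rust_parse(xml) != result:
                logger.warning("Rust parser output differs from Python parser")
        except Exception:
            logger.exception("Rust parser failed in shadow mode")
    return result
```

In shadow mode the caller always receives the Python result, so a mismatch or crash in the new parser shows up only in the log.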

Build Requirements

  • Rust toolchain (rustc 1.70+)
  • maturin (pip install maturin)
  • Build: cd rust/onadata_xml && maturin develop (or maturin build --release for wheels)

CI Changes Needed

  • Install Rust toolchain in CI (rustup)
  • Run cargo test in rust/onadata_xml/ as a separate CI step
  • Run maturin develop before Python test suite

Test plan

  • cargo test in rust/onadata_xml/ passes (35 tests)
  • Python import works: from onadata_xml import parse_submission
  • Existing test_parsing.py tests pass with flag OFF (no regression)
  • New test_rust_parsing.py parity tests pass with flag ON
  • Full submission round-trip works with USE_RUST_XML_PARSER=True
  • Shadow mode logs no differences on representative sample of real submissions
  • Memory usage is lower for large submissions (>100KB XML)
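
For the memory item, one possible way to get a baseline for the Python side is the standard library's `tracemalloc`, sketched below. The helper name `peak_bytes` is hypothetical. Note that `tracemalloc` only sees allocations made through Python's allocators, so it would not count memory allocated inside the Rust extension; comparing the two parsers fairly would need a process-level measure such as resident set size.

```python
import tracemalloc
from xml.dom import minidom


def peak_bytes(fn, *args):
    """Return the peak Python-level allocation (in bytes) while fn runs."""
    tracemalloc.start()
    try:
        fn(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


# Synthetic submission with many repeated nodes (made-up data).
sample = b"<data>" + b"<item>1</item>" * 1000 + b"</data>"
dom_peak = peak_bytes(minidom.parseString, sample)
print(f"minidom peak: {dom_peak} bytes for {len(sample)} bytes of XML")
```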

🤖 Generated with Claude Code

mberg and others added 8 commits February 16, 2026 07:47
Identifies 5 high-impact areas where porting Python to Rust (via PyO3)
would yield significant performance gains: export generation (10-50x),
XML submission parsing (5-20x), data aggregation (3-10x), encryption
helpers (2-5x), and CSV import (3-8x).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Approach C: Rust single-pass parser (quick-xml + PyO3) with a Python
cached wrapper class that's a drop-in replacement for XFormInstanceParser.
Feature-flagged with shadow mode for safe rollout.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
9 tasks covering: crate scaffolding, XML-to-dict parser, dict flattening,
numeric conversion, geopoint extraction, Python wrapper, feature flag
integration, shadow mode, and end-to-end tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- parser.rs: SAX-based XML-to-dict using quick-xml, handles repeats,
  encrypted media, CDATA, auto-list on duplicate nodes, xpath computation
- flatten.rs: iterative dict flattening matching _flatten_dict_nest_repeats
- numeric.rs: string-to-int/float conversion matching numeric_checker
- geom.rs: recursive geopoint extraction from nested dicts
- lib.rs: parse_submission() returning SubmissionResult with dict, flat_dict,
  attributes, uuid, deprecated_uuid, submission_date, geom_points, checksum

35 Rust tests passing. Python smoke tests verify output parity.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>