Identifies 5 high-impact areas where porting Python to Rust (via PyO3) would yield significant performance gains: export generation (10-50x), XML submission parsing (5-20x), data aggregation (3-10x), encryption helpers (2-5x), and CSV import (3-8x). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Approach C: Rust single-pass parser (quick-xml + PyO3) with a Python cached wrapper class that's a drop-in replacement for XFormInstanceParser. Feature-flagged with shadow mode for safe rollout. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
9 tasks covering: crate scaffolding, XML-to-dict parser, dict flattening, numeric conversion, geopoint extraction, Python wrapper, feature flag integration, shadow mode, and end-to-end tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- parser.rs: SAX-based XML-to-dict using quick-xml, handles repeats, encrypted media, CDATA, auto-list on duplicate nodes, xpath computation
- flatten.rs: iterative dict flattening matching _flatten_dict_nest_repeats
- numeric.rs: string-to-int/float conversion matching numeric_checker
- geom.rs: recursive geopoint extraction from nested dicts
- lib.rs: parse_submission() returning SubmissionResult with dict, flat_dict, attributes, uuid, deprecated_uuid, submission_date, geom_points, checksum

35 Rust tests passing. Python smoke tests verify output parity.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
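For orientation, the string-to-number conversion that numeric.rs is said to mirror could look roughly like the Python sketch below. The function name `to_number` is hypothetical, and the NaN handling (returning the original string) is an assumption; the real `numeric_checker` may treat that case differently.

```python
import math


def to_number(value):
    """Return an int or float when the string parses as one, else the input unchanged."""
    try:
        return int(value)
    except (TypeError, ValueError):
        pass
    try:
        number = float(value)
    except (TypeError, ValueError):
        return value
    # Assumption: NaN is not a useful numeric value, so keep the original string.
    return value if math.isnan(number) else number
```

Trying `int` before `float` keeps whole numbers as integers, which matters when parity with the existing Python output is checked value by value.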
Summary
Replaces the Python minidom-based XML submission parser with a single-pass Rust native extension, eliminating 6+ redundant XML parses per submission. Feature-flagged for safe rollout.
Problem
The current `XFormInstanceParser` uses Python's `minidom` to build a full DOM tree (2-3x the memory of the raw XML), then recursively traverses it multiple times. During a single `Instance.save()`, the same XML is parsed 6+ times by separate calls to `get_dict()` from `save()`, `_set_geom()`, `get_expected_media()`, and `get_full_dict()`. For large submissions with deeply nested repeat groups, this is a significant CPU and memory bottleneck.

Solution
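To make the single-pass, no-DOM idea concrete, here is a minimal Python sketch built on the standard library's expat bindings. It is not the Rust implementation from this PR; the function name `parse_to_dict` is hypothetical, and attributes, namespaces, and xpath computation are left out on purpose.

```python
import xml.parsers.expat


def parse_to_dict(xml_text):
    """Build a nested dict from XML in one streaming pass, without a DOM tree."""
    root = {}
    stack = [root]    # dicts for the elements that are currently open
    text_parts = []   # character data seen since the last start/end tag

    def start(name, attrs):
        child = {}
        parent = stack[-1]
        if name in parent:
            # Duplicate sibling: promote the entry to a list (auto-list on repeats).
            existing = parent[name]
            if isinstance(existing, list):
                existing.append(child)
            else:
                parent[name] = [existing, child]
        else:
            parent[name] = child
        stack.append(child)
        text_parts.clear()

    def end(name):
        node = stack.pop()
        parent = stack[-1]
        text = "".join(text_parts).strip()
        text_parts.clear()
        if not node:
            # Leaf element: swap the empty placeholder dict for its text (or None).
            value = text or None
            slot = parent[name]
            if isinstance(slot, list):
                slot[-1] = value
            else:
                parent[name] = value

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = text_parts.append
    parser.Parse(xml_text, True)
    return root
```

The handlers only ever hold the stack of open elements, so memory use follows the output dict rather than a separate DOM copy of the document.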
A Rust native extension (`onadata_xml`) compiled via PyO3/maturin that:

- Parses the XML in a single pass with `quick-xml` (SAX-style, no DOM tree)
- Returns a `SubmissionResult` object: nested dict, flat dict, attributes, UUID, deprecated UUID, submission date, geopoints, SHA256 checksum
- Extracts geopoints (replacing the `get_values_matching_key` traversal)
- Computes the SHA256 checksum with the `sha2` crate

A Python wrapper class (`RustXFormInstanceParser`) provides an identical interface to the existing `XFormInstanceParser`, making it a drop-in replacement.

Estimated Performance Impact
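One rough way to gauge the cost of the redundant parses on a given machine is a small timing harness like the one below. It only compares six `minidom` parses against one and says nothing about the Rust extension itself; the sample document shape and the helper names are made up for illustration.

```python
import timeit
from xml.dom import minidom


def build_sample(repeats=200):
    """Generate a synthetic submission with a repeat group of the given size."""
    rows = "".join(
        f"<child><name>n{i}</name><age>{i}</age></child>" for i in range(repeats)
    )
    return f"<data>{rows}</data>"


def parse_n_times(xml_text, n):
    """Parse the same document n times, as repeated get_dict() calls would."""
    for _ in range(n):
        minidom.parseString(xml_text)


def measure(xml_text, parses=6, number=20):
    """Return (seconds for `parses` parses, seconds for one parse), each over `number` runs."""
    repeated = timeit.timeit(lambda: parse_n_times(xml_text, parses), number=number)
    single = timeit.timeit(lambda: parse_n_times(xml_text, 1), number=number)
    return repeated, single
```

Calling `measure(build_sample())` returns two wall-clock totals; the ratio between them gives a lower bound on what parse-once caching alone could save, before any gain from the native parser.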
Changes
Rust Crate: `rust/onadata_xml/` (1,805 lines)

- `src/parser.rs`: SAX-based XML-to-dict parser using quick-xml
- `src/flatten.rs`: iterative dict flattening (matches `_flatten_dict_nest_repeats`)
- `src/numeric.rs`: string-to-int/float conversion matching `numeric_checker`
- `src/geom.rs`: recursive geopoint extraction from nested dicts
- `src/lib.rs`: `parse_submission()` returning `SubmissionResult`
- `Cargo.toml`
- `pyproject.toml`

35 Rust unit tests covering: simple forms, nested repeats, encrypted media, CDATA, empty nodes, entity attributes, namespace variants (… or `orx:`), auto-list conversion, xpath computation, numeric edge cases (NaN, negative, float), geopoint parsing.

Python Integration (197 lines)
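The wrapper listed in this section might be shaped roughly like the sketch below: parse once, then serve every accessor from the cached result. The class name `CachedSubmissionParser`, the accessor names, and the use of a plain dict for the result are assumptions for illustration; `parse_fn` stands in for `onadata_xml.parse_submission` so the sketch runs without the compiled extension.

```python
class CachedSubmissionParser:
    """Hypothetical wrapper: one parse per submission, shared by all accessors."""

    def __init__(self, xml_text, parse_fn):
        self._xml_text = xml_text
        self._parse_fn = parse_fn   # stand-in for the native parse function
        self._result = None

    def _parsed(self):
        # Parse lazily on first access and reuse the result afterwards.
        if self._result is None:
            self._result = self._parse_fn(self._xml_text)
        return self._result

    def to_dict(self):
        return self._parsed()["dict"]

    def to_flat_dict(self):
        return self._parsed()["flat_dict"]

    def get_attributes(self):
        return self._parsed()["attributes"]
```

Keeping the cache inside the wrapper means callers such as `_set_geom()` can keep asking for the dict independently without triggering another parse.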
- `xform_instance_parser.py`: `RustXFormInstanceParser` class with an identical interface to `XFormInstanceParser`
- `instance.py`: `_set_parser()`, `_set_geom()`, `_set_uuid()` use the Rust parser when the feature flag is on. Added shadow mode comparison logging.
- `settings/common.py`: `USE_RUST_XML_PARSER` and `RUST_XML_PARSER_SHADOW_MODE` flags (both `False` by default)
- `tests/test_rust_parsing.py`

Documentation
- `docs/plans/2026-02-16-rust-port-analysis.md`
- `docs/plans/2026-02-16-rust-xml-parser-design.md`
- `docs/plans/2026-02-16-rust-xml-parser-plan.md`

How to Activate
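A minimal settings override might look like the following, assuming the project uses Django's dictConfig-style `LOGGING` setting. The two flag names and the logger name come from this PR; the handler name `console` and the `INFO` level are assumptions, and a real deployment may configure logging differently.

```python
# Local settings override (sketch). Both flags default to False in settings/common.py.
USE_RUST_XML_PARSER = False          # keep serving results from the Python parser
RUST_XML_PARSER_SHADOW_MODE = True   # run the Rust parser alongside and log differences

# Assumed logging setup so shadow-mode differences are visible somewhere.
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        "console": {"class": "logging.StreamHandler"},
    },
    "loggers": {
        "onadata.rust_parser_shadow": {
            "handlers": ["console"],
            "level": "INFO",
        },
    },
}
```

Once shadow-mode logs stay clean, switching `USE_RUST_XML_PARSER` to `True` would make the Rust parser the one whose output is used.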
Rollout Strategy
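The shadow-mode step of this rollout can be sketched as follows: the existing parser's result is always the one returned, and the new parser runs only for comparison. The function `parse_with_shadow` and its two callable parameters are hypothetical; only the logger name is taken from this PR.

```python
import logging

shadow_logger = logging.getLogger("onadata.rust_parser_shadow")


def parse_with_shadow(xml_text, python_parse, rust_parse):
    """Return the existing parser's output; run the new parser only to compare."""
    expected = python_parse(xml_text)
    try:
        candidate = rust_parse(xml_text)
    except Exception:
        # A failure in the shadow parser must never affect the request.
        shadow_logger.exception("Rust parser raised in shadow mode")
        return expected
    if candidate != expected:
        shadow_logger.warning(
            "Parser mismatch: python=%r rust=%r", expected, candidate
        )
    return expected
```

Because the return value never depends on the shadow parser, a mismatch or crash shows up only in the log, which is what makes this step safe to enable first.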
- Set `RUST_XML_PARSER_SHADOW_MODE = True`. Both parsers run, and differences are logged to the `onadata.rust_parser_shadow` logger. Zero risk to production behavior.
- Set `USE_RUST_XML_PARSER = True`.

Build Requirements
- `maturin` (`pip install maturin`)
- `cd rust/onadata_xml && maturin develop` (or `maturin build --release` for wheels)

CI Changes Needed
- Install the Rust toolchain (`rustup`)
- Run `cargo test` in `rust/onadata_xml/` as a separate CI step
- Run `maturin develop` before the Python test suite

Test plan
- `cargo test` in `rust/onadata_xml/` passes (35 tests)
- … `from onadata_xml import parse_submission`
- `test_parsing.py` tests pass with flag OFF (no regression)
- `test_rust_parsing.py` parity tests pass with flag ON
- … `USE_RUST_XML_PARSER=True`

🤖 Generated with Claude Code