Healthcare Claims Data Quality Intelligence Platform

A production-grade EDI 837 claim validation engine that enforces field-level rules, payer-specific adjudication requirements, and revenue integrity checks across multi-payer healthcare claim submissions.

Built from direct experience with real-world RCM operations — specifically the failure patterns that cause the most preventable revenue loss: NPI errors, diagnosis pointer mismatches, timely filing violations, and payer-specific field gaps.

The Problem This Solves

Healthcare billing teams lose significant revenue to preventable technical denials — claims rejected by payers not because care wasn't medically necessary, but because a field was wrong, missing, or didn't match payer-specific formatting rules.

The standard workflow at most practices:

1. Submit claim to payer portal
2. Claim gets rejected (sometimes days later)
3. Billing team manually reviews rejection reason
4. Resubmit, now delayed by days or weeks

This platform moves validation upstream — before submission. Every claim runs through a rule engine that catches critical errors at the point of ingestion, not after denial.

Architecture

┌─────────────────────┐
│   Input Claims CSV   │  ← EDI 837 field-mapped records
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│  Validation Engine   │  ← Base rules + payer-specific config
│  (engine.py)         │
│                      │
│  ┌───────────────┐   │
│  │  Base Rules   │   │  ← 8 universal EDI 837 field validators
│  └───────────────┘   │
│  ┌───────────────┐   │
│  │ Payer Config  │   │  ← Per-payer JSON rules (BCBS/Aetna/Cigna/Humana/Medicare)
│  └───────────────┘   │
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│  ValidationResult    │  ← VALID / FLAGGED / REJECTED + error list
│  per claim           │
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│     Reporter         │  ← 3 output formats
│                      │
│  • Full report CSV   │  ← one row per claim
│  • Error breakdown   │  ← one row per error
│  • Summary JSON      │  ← KPIs and revenue at risk
└─────────────────────┘

Validation Rules

Base EDI 837 Rules (Universal — all payers)

Rule	EDI Segment	Error Code	Severity
Billing NPI present and valid (Luhn check)	NM1*85	NPI_001–004	CRITICAL
Rendering NPI present and valid	NM1*82	NPI_005–006	CRITICAL
Primary diagnosis code ICD-10 format	HI	DX_001–002	CRITICAL
Diagnosis pointer references valid position	SV1	DX_003–004	CRITICAL/WARNING
Date of service not future, within timely filing	DTP*472	DATE_001–004	CRITICAL/WARNING
Procedure code valid CPT or HCPCS format	SV1	PROC_001–002	CRITICAL
Billed amount greater than zero	CLM	AMT_001–002	CRITICAL
Subscriber/Member ID present	NM1*IL	SUB_001	CRITICAL
Place of service valid CMS code	CLM	POS_001–002	CRITICAL/WARNING
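
As an illustration of how two of the rules in this table could be written, here is a minimal sketch of the ICD-10 format check and the diagnosis pointer check. The field names (`diagnosis_codes`, `diagnosis_pointer`), the function name, and the regular expression are assumptions made for this example, not the contents of rules.py. The error codes DX_002 and DX_004 are the ones shown in the sample run results.

```python
import re

# Assumed ICD-10-CM shape: letter, digit, alphanumeric, then up to four more
# alphanumerics. The decimal point is optional in this sketch.
ICD10_PATTERN = re.compile(r"^[A-Z][0-9][0-9A-Z](\.?[0-9A-Z]{1,4})?$")


def check_diagnosis(claim: dict) -> list[str]:
    """Return error codes for the diagnosis fields of one claim (hypothetical field names)."""
    errors = []
    codes = claim.get("diagnosis_codes", [])
    primary = codes[0].strip().upper() if codes else ""
    if not ICD10_PATTERN.match(primary):
        errors.append("DX_002")  # ICD-10 format violation
    pointer = claim.get("diagnosis_pointer", 0)
    # Pointers are treated as 1-based and must reference a diagnosis on the claim.
    if not 1 <= pointer <= len(codes):
        errors.append("DX_004")  # Invalid diagnosis pointer
    return errors
```

A claim with `["E11.9"]` and pointer 1 passes both checks; a pointer of 3 on a claim with one diagnosis yields DX_004.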

Payer-Specific Rules (from JSON config)

Payer	Timely Filing	Additional Requirements
BCBS	365 days	Taxonomy code required
Aetna	180 days	Subscriber ID: 9 numeric digits
Cigna	90 days	Prior auth number, Subscriber ID: U + 8 digits
Humana	365 days	Taxonomy code, Subscriber ID: H + 8 digits + letter
Medicare	365 days	Taxonomy code (NUCC), CLIA for lab claims
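
The timely filing windows and subscriber ID formats in this table could be checked roughly as follows. This is a sketch under assumptions: the dictionary layout, key names, and function name are hypothetical, the regular expressions are one reading of the formats listed above, and a claim is treated as late only when more days have passed than the window allows. The error codes come from the sample run results.

```python
import re
from datetime import date

# Values copied from the table above; the dict layout and key names are assumptions.
PAYER_RULES = {
    "BCBS": {"timely_filing_days": 365, "subscriber_id_pattern": None},
    "Aetna": {"timely_filing_days": 180, "subscriber_id_pattern": r"[0-9]{9}"},
    "Cigna": {"timely_filing_days": 90, "subscriber_id_pattern": r"U[0-9]{8}"},
    "Humana": {"timely_filing_days": 365, "subscriber_id_pattern": r"H[0-9]{8}[A-Za-z]"},
    "Medicare": {"timely_filing_days": 365, "subscriber_id_pattern": None},
}


def check_payer_rules(payer: str, subscriber_id: str,
                      service_date: date, submit_date: date) -> list[str]:
    """Return payer-specific error codes for one claim."""
    errors = []
    rules = PAYER_RULES[payer]
    # Days elapsed between the date of service and the submission date.
    if (submit_date - service_date).days > rules["timely_filing_days"]:
        errors.append("PAYER_TF_001")  # Timely filing limit exceeded
    pattern = rules["subscriber_id_pattern"]
    if pattern and not re.fullmatch(pattern, subscriber_id):
        errors.append("PAYER_SUB_001")  # Subscriber ID format mismatch
    return errors
```

For example, a Cigna claim with a date of service of 2024-01-01 submitted on 2024-04-15 is 105 days old and exceeds the 90-day window.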

Sample Run Results

Validated against 1,200 synthetic claims across 5 payers:

============================================================
  CLAIMS VALIDATION SUMMARY
============================================================
  Total Claims:       1,200
  Valid:              668   (55.7%)
  Flagged:            55    (4.6%)
  Rejected:           477   (39.8%)
  Pass Rate:          55.67%
  Total Billed:       $2,707,848.63
  Revenue at Risk:    $1,164,308.52
  Risk %:             43.0%

  TOP ERROR CODES:
    PAYER_TF_001:  259  Timely filing limit exceeded
    NPI_002:        66  NPI contains non-numeric characters
    DX_004:         48  Invalid diagnosis pointer
    NPI_003:        48  NPI wrong length
    NPI_006:        42  Rendering NPI invalid
    DX_002:         41  ICD-10 format violation
    PROC_002:       27  Invalid CPT/HCPCS code
    AMT_002:        26  Zero billed amount
    PAYER_SUB_001:  21  Subscriber ID format mismatch
    DATE_003:       19  Future date of service
============================================================
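
The percentages in the summary can be reproduced from the counts and dollar totals. The sketch below assumes Pass Rate is valid claims over total claims and Risk % is revenue at risk over total billed, which matches the printed figures; the variable names are illustrative only.

```python
# Figures copied from the summary above.
total_claims, valid, flagged, rejected = 1200, 668, 55, 477
total_billed = 2_707_848.63
revenue_at_risk = 1_164_308.52

# The three statuses account for every claim.
assert valid + flagged + rejected == total_claims

pass_rate = round(100 * valid / total_claims, 2)
risk_pct = round(100 * revenue_at_risk / total_billed, 1)

print(f"Pass Rate: {pass_rate}%")  # 55.67%
print(f"Risk %:    {risk_pct}%")   # 43.0%
```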

Project Structure

healthcare-claims-dq-platform/
│
├── claims_validator/
│   ├── __init__.py       # Package exports
│   ├── models.py         # ValidationResult, ValidationError, ClaimStatus
│   ├── rules.py          # 8 base EDI 837 validation rule functions
│   ├── engine.py         # Orchestration — batch processing + payer config loader
│   └── reporter.py       # CSV/JSON output generation
│
├── payer_configs/
│   ├── bcbs.json         # BCBS-specific rules
│   ├── aetna.json        # Aetna-specific rules
│   ├── cigna.json        # Cigna-specific rules
│   ├── humana.json       # Humana-specific rules
│   ├── medicare.json     # Medicare-specific rules
│   └── default.json      # Fallback for unknown payers
│
├── data/
│   ├── generate_sample_data.py   # Synthetic 837 data generator
│   └── sample_output/            # Validation reports land here
│
├── tests/
│   └── test_rules.py     # 20 unit tests across all rule functions
│
├── main.py               # CLI entry point
└── requirements.txt

Tech Stack and Why

Technology	Why Used
Python	Industry standard for healthcare data pipelines
Pandas	Vectorized processing of bulk claim batches
Dataclasses	Clean, typed data models without boilerplate
Enums	Type-safe status and severity values
JSON configs	Payer rules updatable without code changes
Python logging	Audit trail for every validation run
pytest	Unit test coverage on all rule functions

Design Decisions

Why separate payer configs from rule logic: Payer requirements change constantly. BCBS adds a required field, Cigna changes its timely filing window, a new payer contract comes in. With JSON configs, rule updates require zero Python changes: the engine reads the payer files at runtime.
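
A minimal sketch of how a loader could resolve a payer name to its JSON file and fall back to default.json for unknown payers. The function name `load_payer_config`, the `timely_filing_days` key, and the default value of 365 are assumptions for illustration; the actual loader in engine.py may differ.

```python
import json
import tempfile
from pathlib import Path


def load_payer_config(payer: str, config_dir: Path) -> dict:
    """Load <payer>.json from config_dir, falling back to default.json."""
    path = config_dir / f"{payer.strip().lower()}.json"
    if not path.is_file():
        path = config_dir / "default.json"
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


# Demo against a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    config_dir = Path(tmp)
    (config_dir / "cigna.json").write_text(
        json.dumps({"timely_filing_days": 90}), encoding="utf-8")
    (config_dir / "default.json").write_text(
        json.dumps({"timely_filing_days": 365}), encoding="utf-8")  # hypothetical default
    cigna_cfg = load_payer_config("Cigna", config_dir)
    unknown_cfg = load_payer_config("Unknown Payer", config_dir)
```

Because the lookup is driven by the file name, dropping a new JSON file into the directory is enough for the loader to find it.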

Why dataclasses for models: Every claim produces a ValidationResult carrying its claim ID, status, errors, and revenue impact. Dataclasses give us typed structures with auto-generated __init__ and __repr__, and they can be frozen where immutability is wanted, so every model is defined the same way without hand-written boilerplate.
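
The models could look roughly like the sketch below. ValidationResult, ValidationError, and ClaimStatus are the names listed for models.py; the field names, the `Severity` enum, and the rule that any CRITICAL error rejects a claim while WARNING-only claims are flagged are assumptions made for this example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):  # hypothetical name for the severity enum
    CRITICAL = "CRITICAL"
    WARNING = "WARNING"


class ClaimStatus(Enum):
    VALID = "VALID"
    FLAGGED = "FLAGGED"
    REJECTED = "REJECTED"


@dataclass(frozen=True)
class ValidationError:
    code: str
    message: str
    severity: Severity


@dataclass
class ValidationResult:
    claim_id: str
    billed_amount: float
    errors: list[ValidationError] = field(default_factory=list)

    @property
    def status(self) -> ClaimStatus:
        # Assumed mapping: any CRITICAL error rejects, WARNING-only flags.
        if any(e.severity is Severity.CRITICAL for e in self.errors):
            return ClaimStatus.REJECTED
        return ClaimStatus.FLAGGED if self.errors else ClaimStatus.VALID
```

Deriving the status from the error list, rather than storing it, keeps the two from drifting apart when errors are appended.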

Why the Luhn algorithm for NPI validation: The NPI standard adopted by CMS defines the 10th digit of every NPI as a check digit, calculated from the first 9 digits with the Luhn checksum algorithm (applied with the 80840 prefix). Payers and clearinghouses run this check on incoming claims. Our engine runs the same check, so claims with invalid NPIs never leave the validation layer.
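
A minimal sketch of the NPI check digit test. It applies the standard Luhn algorithm to the NPI prefixed with 80840, the prefix used when computing NPI check digits. The function name `is_valid_npi` is hypothetical and is not taken from rules.py.

```python
def is_valid_npi(npi: str) -> bool:
    """Return True if npi is 10 ASCII digits with a valid Luhn check digit."""
    if len(npi) != 10 or not (npi.isascii() and npi.isdigit()):
        return False
    digits = [int(ch) for ch in "80840" + npi]
    total = 0
    # Walk right to left; double every second digit, starting with the
    # digit immediately left of the check digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0


print(is_valid_npi("1234567893"))  # True
print(is_valid_npi("1234567890"))  # False
```

Length and character checks run before the checksum, so malformed values fail early without reaching the arithmetic.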

Why Open/Closed Principle for rules: The BASE_RULES list in rules.py is the extension point. Adding a new validation rule means adding one function and registering it in that list. The engine, reporter, and models don't change, so the rule set can grow without modifying code that already works.

How to Run

# Clone and set up
git clone https://github.com/lp07/healthcare-claims-dq-platform.git
cd healthcare-claims-dq-platform
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Generate synthetic data and validate
python main.py --generate

# Validate an existing claims file
python main.py --input path/to/your/claims.csv

# Run tests
python -m pytest tests/test_rules.py -v

Output Files

After each run, three files are written to data/sample_output/:

claims_validation_report_[timestamp].csv — Full claim-level results. Columns: claim_id, payer, status, billed_amount, error_count, error_codes, revenue_at_risk.

error_breakdown_[timestamp].csv — Per-error detail for every flagged/rejected claim.

summary_[timestamp].json — Aggregate KPIs: total claims, pass rate, total billed, revenue at risk, top error codes.

Extending the Platform

Add a new payer: Create payer_configs/newpayer.json with timely filing days, subscriber ID pattern, and required fields. No Python changes needed.

Add a new validation rule:

1. Write a function in rules.py that takes a claim dict and returns a list of ValidationError objects
2. Add the function to the BASE_RULES list at the bottom of rules.py
3. Write a test in tests/test_rules.py
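
The steps above can be sketched as follows. The ValidationError class here is a stand-in for the one in models.py, and the claim field names, function names, and `run_rules` helper are assumptions for illustration. Only the BASE_RULES name and the error codes AMT_002 and SUB_001 come from this README.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ValidationError:  # stand-in for the class in models.py
    code: str
    message: str
    severity: str


def validate_billed_amount(claim: dict) -> list[ValidationError]:
    """Flag claims whose billed amount is zero (hypothetical field name)."""
    if float(claim.get("billed_amount", 0)) == 0:
        return [ValidationError("AMT_002", "Zero billed amount", "CRITICAL")]
    return []


def validate_subscriber_id(claim: dict) -> list[ValidationError]:
    """Flag claims with a missing subscriber/member ID (hypothetical field name)."""
    if not str(claim.get("subscriber_id", "")).strip():
        return [ValidationError("SUB_001", "Subscriber ID missing", "CRITICAL")]
    return []


# Registering a rule means appending its function to this list.
BASE_RULES: list[Callable[[dict], list[ValidationError]]] = [
    validate_billed_amount,
    validate_subscriber_id,
]


def run_rules(claim: dict) -> list[ValidationError]:
    """Apply every registered rule and collect the errors."""
    errors: list[ValidationError] = []
    for rule in BASE_RULES:
        errors.extend(rule(claim))
    return errors
```

Since every rule shares one signature, the loop in `run_rules` never needs to know which rules exist.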

Connect to a real data source: Replace the CSV load in main.py with a SQL query, API call, or S3 read. The engine accepts any pandas DataFrame.

Domain Context

Built on direct experience processing claims across BCBS, Aetna, Cigna, Humana, and Medicare. The error patterns in this platform reflect real denial root causes observed in production RCM operations — particularly the NPI validation failures and payer-specific timely filing violations that account for the majority of preventable technical denials.

All data in this repository is synthetically generated. No real patient, provider, or payer data is used.
