Fix bad-chromosome KeyError + last 7,612 drift (v20 §14.4 P1)#36

Merged: lucapinello merged 1 commit into chorus-applications from fix/2026-04-21-chrom-validation-and-borzoi-count on Apr 21, 2026

@lucapinello
Contributor

Problem

The v20 Linux/CUDA audit ran LegNet against genomics edge cases and surfaced:

    oracle.predict(('chrZZ', 100, 300), [...])
    # → KeyError: 'H'

A chromosome not in the reference FASTA crashed deep inside the oracle's one-hot encoder with a message that told the user nothing. pysam's internal KeyError(chrom) was propagating past GenomeRef.slop() all the way to the PyTorch transform.

Fix

Catch KeyError in GenomeRef.slop() and re-raise as IntervalException with an actionable message (chorus/core/interval.py:51-62). After the fix:

    IntervalException: Chromosome 'chrZZ' not found in <path>/hg38.fa.
    Check that the chromosome name matches the reference (hg38 uses
    'chr1'..'chr22', 'chrX', 'chrY', 'chrM').

Single-chokepoint fix in GenomeRef.slop. Every wide-window prediction path (Enformer, Borzoi, ChromBPNet, Sei, LegNet, AlphaGenome) runs through slop() to size the input interval, so one catch covers them all.
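
The catch-and-re-raise pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code in chorus/core/interval.py: `FakeFasta`, its `get_reference_length` method, and the `slop` signature are all assumptions made so the example runs on its own, and `IntervalException` is redefined locally as a stand-in.

```python
class IntervalException(Exception):
    """Local stand-in for the project's IntervalException."""


class FakeFasta:
    """Hypothetical stand-in for an indexed FASTA handle.

    Unknown chromosome names raise KeyError, which is the behaviour
    the PR attributes to the underlying reference lookup.
    """

    def __init__(self, lengths, filename):
        self._lengths = dict(lengths)
        self.filename = filename

    def get_reference_length(self, chrom):
        return self._lengths[chrom]  # KeyError(chrom) if the name is unknown


def slop(fasta, chrom, start, end, pad):
    """Widen [start, end) by `pad` on each side, clamped to the chromosome."""
    try:
        chrom_len = fasta.get_reference_length(chrom)
    except KeyError as err:
        # Translate the opaque KeyError into an error the user can act on.
        raise IntervalException(
            f"Chromosome {chrom!r} not found in {fasta.filename}. "
            "Check that the chromosome name matches the reference "
            "(hg38 uses 'chr1'..'chr22', 'chrX', 'chrY', 'chrM')."
        ) from err
    return chrom, max(0, start - pad), min(chrom_len, end + pad)
```

Chaining with `from err` keeps the original KeyError available as `__cause__` for debugging while the user sees only the actionable message.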

predict_variant_effect on the same bad chrom was already covered by extract_sequence's explicit check — verified still works.

Regression test

tests/test_prediction_methods.py::test_bad_chromosome_gives_actionable_error exercises both paths:

  • GenomeRef.slop directly (the old crash site)
  • predict_variant_effect (already-covered path, verified parity)
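
A test in this spirit might look like the sketch below. It is not the test in tests/test_prediction_methods.py: `slop_sketch` and its arguments are hypothetical, the exception class is a local stand-in, and plain `try`/`except` is used so the example runs without pytest (in a pytest suite, `pytest.raises` would be the natural choice).

```python
class IntervalException(Exception):
    """Local stand-in for the project's exception class."""


def slop_sketch(known_lengths, fasta_path, chrom, start, end, pad):
    """Hypothetical slop(): pad an interval, translating a missing-chromosome KeyError."""
    try:
        chrom_len = known_lengths[chrom]
    except KeyError as err:
        raise IntervalException(
            f"Chromosome {chrom!r} not found in {fasta_path}."
        ) from err
    return chrom, max(0, start - pad), min(chrom_len, end + pad)


def test_bad_chromosome_gives_actionable_error():
    lengths = {"chr1": 248_956_422}
    try:
        slop_sketch(lengths, "hg38.fa", "chrZZ", 100, 300, pad=1000)
    except IntervalException as exc:
        # The message should name both the bad chromosome and the reference.
        message = str(exc)
        assert "chrZZ" in message
        assert "hg38.fa" in message
    else:
        raise AssertionError("expected IntervalException for unknown chromosome")


def test_valid_chromosome_still_slops():
    lengths = {"chr1": 248_956_422}
    result = slop_sketch(lengths, "hg38.fa", "chr1", 100, 300, pad=1000)
    assert result == ("chr1", 0, 1300)


test_bad_chromosome_gives_actionable_error()
test_valid_chromosome_still_slops()
```

If the KeyError were to leak through untranslated, the first test would fail with that KeyError rather than pass silently, which is the regression being guarded against.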

Also lands

In scripts/build_backgrounds_borzoi.py:4, the docstring said "7,612 Borzoi tracks"; the real count is 7,611. This was the last 7,612 left in live code, and the change matches v17's fix to scripts/README.md. It was originally in PR #35, which was closed as superseded by the other agent's v19/v20 work.

Test plan

  • pytest tests/ --ignore=tests/test_smoke_predict.py -q → 335 passed / 1 skipped (9m 16s; +1 new test)
  • Manual reproducer: oracle.predict(('chrZZ', ...)) now raises IntervalException with actionable text
  • Manual reproducer: oracle.predict_variant_effect('chrZZ:100-300', ...) still raises InvalidRegionError with named chrom
  • grep -rn '7,612' scripts/ chorus/ --include='*.py' --include='*.md' → empty

🤖 Generated with Claude Code

## Context

The v20 audit on a Linux/CUDA host ran genomics edge cases with
LegNet and surfaced one new P1:

    oracle.predict(('chrZZ', 100, 300), [...])
    → KeyError: 'H'

A chromosome not in the reference FASTA crashed deep inside the
oracle's one-hot encoder with a message that told the user nothing
about what went wrong. pysam's internal KeyError(chrom) was slipping
past GenomeRef.slop() and propagating all the way to the PyTorch
transform.

## Fix

Catch pysam's KeyError in `GenomeRef.slop()` (chorus/core/interval.py:51-62)
and re-raise as `IntervalException` with a message that names the
bad chromosome, the FASTA path, and reminds the user of the
canonical hg38 chromosome set. After the fix:

    IntervalException: Chromosome 'chrZZ' not found in <path>/hg38.fa.
    Check that the chromosome name matches the reference (hg38 uses
    'chr1'..'chr22', 'chrX', 'chrY', 'chrM').

`predict_variant_effect` on the same bad chrom was already covered
because it goes through `extract_sequence` which has an explicit
"Chromosome X not found" check — verified still works.

Scope: single-chokepoint fix in GenomeRef.slop. Downstream oracles
don't need individual checks because every wide-window prediction
path runs through slop() to size the input interval.

## Regression test

`tests/test_prediction_methods.py::test_bad_chromosome_gives_actionable_error`
covers both paths:
- GenomeRef.slop directly (the old crash site)
- predict_variant_effect (already-covered path, verified parity)

MockOracle._predict short-circuits raw inputs to random data, so the
test exercises GenomeRef.slop as a unit rather than through the mock
oracle chain.
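
To illustrate why the mock path cannot catch this bug, here is a small sketch under stated assumptions. `MockOracleSketch` is hypothetical and only imitates the behaviour described above (returning random data without consulting the reference); it is not the project's MockOracle.

```python
import random


class MockOracleSketch:
    """Hypothetical mock: returns random values and never touches the genome."""

    def _predict(self, region, assay_ids):
        # The region is ignored, so no chromosome lookup ever happens.
        return {assay: [random.random() for _ in range(4)] for assay in assay_ids}


oracle = MockOracleSketch()
# A nonsense chromosome passes through without any error.
out = oracle._predict(("chrZZ", 100, 300), ["assay_a"])
```

Because a mock like this succeeds on any region, a test routed through it would pass with or without the fix, which is why the regression test targets slop directly.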

## Also lands

In scripts/build_backgrounds_borzoi.py:4, the docstring said "7,612
Borzoi tracks"; the real count is 7,611. This was the last 7,612 in
live code and matches the v17 scripts/README.md fix. (Originally in
PR #35, which was closed as superseded.)

Tests: 335 passed / 1 skipped (fast suite), up from 334.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
lucapinello merged commit 4c416e4 into chorus-applications on Apr 21, 2026
1 check passed
lucapinello deleted the fix/2026-04-21-chrom-validation-and-borzoi-count branch on April 22, 2026 18:07