Fix bad-chromosome KeyError + last 7,612 drift (v20 §14.4 P1) #36
Merged
lucapinello merged 1 commit into chorus-applications on Apr 21, 2026
## Context
The v20 audit on a Linux/CUDA host ran genomics edge cases with
LegNet and surfaced one new P1:
    oracle.predict(('chrZZ', 100, 300), [...])
    → KeyError: 'H'
A chromosome not in the reference FASTA crashed deep inside the
oracle's one-hot encoder with a message that told the user nothing
about what went wrong. pysam's internal KeyError(chrom) was slipping
past GenomeRef.slop() and propagating all the way to the PyTorch
transform.
## Fix
Catch pysam's KeyError in `GenomeRef.slop()` (chorus/core/interval.py:51-62)
and re-raise as `IntervalException` with a message that names the
bad chromosome, the FASTA path, and reminds the user of the
canonical hg38 chromosome set. After the fix:
    IntervalException: Chromosome 'chrZZ' not found in <path>/hg38.fa.
    Check that the chromosome name matches the reference (hg38 uses
    'chr1'..'chr22', 'chrX', 'chrY', 'chrM').
`predict_variant_effect` on the same bad chrom was already covered
because it goes through `extract_sequence` which has an explicit
"Chromosome X not found" check — verified still works.
Scope: single-chokepoint fix in GenomeRef.slop. Downstream oracles
don't need individual checks because every wide-window prediction
path runs through slop() to size the input interval.
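
The catch-and-re-raise pattern described above can be sketched as follows. This is a minimal illustration, not the code in chorus/core/interval.py: the `slop()` signature, the `FakeFasta` stand-in (used instead of a real pysam handle so the sketch runs without pysam), and the local `IntervalException` class are all assumptions made for the example.

```python
class IntervalException(Exception):
    """Stand-in for chorus's IntervalException so the sketch is self-contained."""


class FakeFasta:
    """Hypothetical stand-in for a FASTA handle: unknown names raise KeyError."""

    def __init__(self, lengths):
        self._lengths = dict(lengths)

    def get_reference_length(self, chrom):
        # A plain dict lookup raises KeyError(chrom) for unknown chromosomes,
        # which mirrors the failure mode described in the PR.
        return self._lengths[chrom]


class GenomeRef:
    """Simplified sketch; the real class and its slop() signature may differ."""

    def __init__(self, fasta_path, fasta):
        self.fasta_path = fasta_path
        self.fasta = fasta

    def slop(self, chrom, start, end, extension):
        try:
            chrom_len = self.fasta.get_reference_length(chrom)
        except KeyError as exc:
            # Translate the low-level KeyError into an actionable error that
            # names the bad chromosome and the FASTA it was looked up in.
            raise IntervalException(
                f"Chromosome {chrom!r} not found in {self.fasta_path}. "
                "Check that the chromosome name matches the reference "
                "(hg38 uses 'chr1'..'chr22', 'chrX', 'chrY', 'chrM')."
            ) from exc
        # Widen the interval and clamp it to the chromosome bounds.
        return chrom, max(0, start - extension), min(chrom_len, end + extension)


ref = GenomeRef("hg38.fa", FakeFasta({"chr1": 1000}))
print(ref.slop("chr1", 100, 300, 50))  # ('chr1', 50, 350)
```

Raising with `from exc` keeps the original `KeyError` attached as `__cause__`, so the low-level detail stays available in the traceback while the user sees the actionable message first.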
## Regression test
`tests/test_prediction_methods.py::test_bad_chromosome_gives_actionable_error`
covers both paths:
- GenomeRef.slop directly (the old crash site)
- predict_variant_effect (already-covered path, verified parity)
MockOracle._predict short-circuits raw inputs to random data, so the
test exercises GenomeRef.slop as a unit rather than through the mock
oracle chain.
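
A unit-level test of this shape might look like the sketch below. It is not the test in tests/test_prediction_methods.py: it uses the standard-library `unittest` so it runs without pytest or chorus, and the `slop` function, its signature, and the mocked `get_reference_length` call are hypothetical stand-ins for the real `GenomeRef.slop`.

```python
import unittest
from unittest import mock


class IntervalException(Exception):
    """Stand-in for the chorus exception so the sketch runs without chorus."""


def slop(fasta, fasta_path, chrom, start, end, extension):
    """Hypothetical stand-in for GenomeRef.slop with the fix applied."""
    try:
        chrom_len = fasta.get_reference_length(chrom)
    except KeyError as exc:
        raise IntervalException(
            f"Chromosome {chrom!r} not found in {fasta_path}."
        ) from exc
    return chrom, max(0, start - extension), min(chrom_len, end + extension)


class BadChromosomeTest(unittest.TestCase):
    def test_bad_chromosome_gives_actionable_error(self):
        # Inject a FASTA handle whose lookup fails the way an unknown
        # chromosome would, without touching a real reference file.
        fasta = mock.Mock()
        fasta.get_reference_length.side_effect = KeyError("chrZZ")

        with self.assertRaises(IntervalException) as ctx:
            slop(fasta, "hg38.fa", "chrZZ", 100, 300, 50)

        # The message should name both the chromosome and the FASTA path.
        message = str(ctx.exception)
        self.assertIn("chrZZ", message)
        self.assertIn("hg38.fa", message)
        # The bare KeyError must not leak through to the caller.
        self.assertNotIsInstance(ctx.exception, KeyError)


def run_tests():
    """Run the test case programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(BadChromosomeTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Asserting on the message content, not only the exception type, is what guards the "actionable" part of the fix against later regressions.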
## Also lands
scripts/build_backgrounds_borzoi.py:4 — docstring said "7,612 Borzoi
tracks"; real count is 7,611. Last 7,612 in live code. Matches v17
scripts/README.md fix. (Originally in PR #35 which was closed as
superseded.)
Tests: 335 passed / 1 skipped (fast suite), up from 334.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Problem
The v20 Linux/CUDA audit ran LegNet against genomics edge cases and surfaced:

    oracle.predict(('chrZZ', 100, 300), [...])
    → KeyError: 'H'

A chromosome not in the reference FASTA crashed deep inside the oracle's one-hot encoder with a message that told the user nothing. pysam's internal `KeyError(chrom)` was propagating past `GenomeRef.slop()` all the way to the PyTorch transform.

## Fix
Catch `KeyError` in `GenomeRef.slop()` and re-raise as `IntervalException` with an actionable message (chorus/core/interval.py:51-62). After the fix:

    IntervalException: Chromosome 'chrZZ' not found in <path>/hg38.fa.
    Check that the chromosome name matches the reference (hg38 uses
    'chr1'..'chr22', 'chrX', 'chrY', 'chrM').

Single-chokepoint fix in `GenomeRef.slop`. Every wide-window prediction path (Enformer, Borzoi, ChromBPNet, Sei, LegNet, AlphaGenome) runs through `slop()` to size the input interval, so one catch covers them all.

`predict_variant_effect` on the same bad chrom was already covered by `extract_sequence`'s explicit check; verified still works.

## Regression test
`tests/test_prediction_methods.py::test_bad_chromosome_gives_actionable_error` exercises both paths:
- `GenomeRef.slop` directly (the old crash site)
- `predict_variant_effect` (already-covered path, verified parity)

## Also lands
`scripts/build_backgrounds_borzoi.py:4`: docstring said "7,612 Borzoi tracks"; real count is 7,611. Last `7,612` in live code. Matches v17's fix to `scripts/README.md`. Originally in PR #35, which was closed as superseded by the other agent's v19/v20 work.

## Test plan
- `pytest tests/ --ignore=tests/test_smoke_predict.py -q` → 335 passed / 1 skipped (9m 16s; +1 new test)
- `oracle.predict(('chrZZ', ...))` now raises `IntervalException` with actionable text
- `oracle.predict_variant_effect('chrZZ:100-300', ...)` still raises `InvalidRegionError` with named chrom
- `grep -rn '7,612' scripts/ chorus/ --include='*.py' --include='*.md'` → empty

🤖 Generated with Claude Code