Fix bad-chromosome KeyError + last 7,612 drift (v20 §14.4 P1) by lucapinello · Pull Request #36 · pinellolab/chorus

lucapinello · 2026-04-21T20:04:10Z

Problem

The v20 Linux/CUDA audit ran LegNet against genomics edge cases and surfaced:

oracle.predict(('chrZZ', 100, 300), [...])
# → KeyError: 'H'

A chromosome not in the reference FASTA crashed deep inside the oracle's one-hot encoder with a message that told the user nothing. pysam's internal KeyError(chrom) was propagating past GenomeRef.slop() all the way to the PyTorch transform.

Fix

Catch KeyError in GenomeRef.slop() and re-raise as IntervalException with an actionable message (chorus/core/interval.py:51-62). After the fix:

IntervalException: Chromosome 'chrZZ' not found in <path>/hg38.fa.
Check that the chromosome name matches the reference (hg38 uses
'chr1'..'chr22', 'chrX', 'chrY', 'chrM').

Single-chokepoint fix in GenomeRef.slop. Every wide-window prediction path (Enformer, Borzoi, ChromBPNet, Sei, LegNet, AlphaGenome) runs through slop() to size the input interval, so one catch covers them all.
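A minimal sketch of the catch-and-re-raise pattern described above. This is not the code in chorus/core/interval.py: `FakeFasta` is a hypothetical substitute for the pysam handle, the `slop` signature is assumed, and `IntervalException` is redefined locally so the block runs on its own.

```python
class IntervalException(Exception):
    """Local stand-in for the chorus exception of the same name."""


class FakeFasta:
    """Hypothetical substitute for a FASTA handle; unknown contigs raise KeyError."""

    def __init__(self, lengths):
        self._lengths = dict(lengths)

    def get_reference_length(self, chrom):
        return self._lengths[chrom]  # raises KeyError(chrom) for an unknown name


class GenomeRef:
    """Simplified stand-in; the real class has more responsibilities."""

    def __init__(self, fasta, path):
        self.fasta = fasta
        self.path = path

    def slop(self, chrom, start, end, extend):
        """Widen [start, end) by `extend` on each side, clipped to the contig."""
        try:
            chrom_len = self.fasta.get_reference_length(chrom)
        except KeyError as exc:
            # Translate the low-level lookup failure into a domain error that
            # names the chromosome and the FASTA path; keep the cause chained.
            raise IntervalException(
                f"Chromosome {chrom!r} not found in {self.path}. "
                "Check that the chromosome name matches the reference."
            ) from exc
        return chrom, max(0, start - extend), min(chrom_len, end + extend)


ref = GenomeRef(FakeFasta({"chr1": 1000}), "hg38.fa")
widened = ref.slop("chr1", 100, 300, 200)
```

Chaining with `from exc` keeps the original `KeyError` in the traceback for debugging while the user-facing message stays actionable.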

predict_variant_effect on the same bad chrom was already covered by extract_sequence's explicit check — verified still works.

Regression test

tests/test_prediction_methods.py::test_bad_chromosome_gives_actionable_error exercises both paths:

GenomeRef.slop directly (the old crash site)
predict_variant_effect (already-covered path, verified parity)
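The shape of such a two-path check might look like the sketch below. It is not the test in tests/test_prediction_methods.py: `assert_actionable_error`, `fake_slop`, and `fake_predict_variant_effect` are hypothetical names, and both exception classes are local stand-ins for the chorus ones.

```python
def assert_actionable_error(call, expected_type, chrom):
    """Run call() and check it fails with expected_type in a message naming chrom."""
    try:
        call()
    except expected_type as exc:
        assert chrom in str(exc), f"message does not name {chrom!r}: {exc}"
        return exc
    except KeyError as exc:
        # The regression being guarded against: a raw KeyError reaching the caller.
        raise AssertionError(f"raw KeyError leaked: {exc!r}") from exc
    raise AssertionError(f"{expected_type.__name__} was not raised")


# Hypothetical stand-ins so the sketch runs without chorus installed.
class IntervalException(Exception):
    pass


class InvalidRegionError(Exception):
    pass


def fake_slop(chrom):
    raise IntervalException(f"Chromosome {chrom!r} not found in hg38.fa.")


def fake_predict_variant_effect(region):
    chrom = region.split(":")[0]
    raise InvalidRegionError(f"Chromosome {chrom} not found")


def test_bad_chromosome_gives_actionable_error():
    # Path 1: the interval-sizing call (the old crash site).
    assert_actionable_error(lambda: fake_slop("chrZZ"), IntervalException, "chrZZ")
    # Path 2: the variant-effect call (already covered; checked for parity).
    assert_actionable_error(
        lambda: fake_predict_variant_effect("chrZZ:100-300"),
        InvalidRegionError,
        "chrZZ",
    )
```

Checking the message text as well as the exception type is what makes the test guard the "actionable" part of the fix.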

Also lands

scripts/build_backgrounds_borzoi.py:4: the docstring said "7,612 Borzoi tracks"; the real count is 7,611. This was the last remaining occurrence of 7,612 in live code, and the correction matches v17's fix to scripts/README.md. The change was originally in PR #35, which was closed as superseded by the other agent's v19/v20 work.

Test plan

pytest tests/ --ignore=tests/test_smoke_predict.py -q → 335 passed / 1 skipped (9m 16s; +1 new test)
Manual reproducer: oracle.predict(('chrZZ', ...)) now raises IntervalException with actionable text
Manual reproducer: oracle.predict_variant_effect('chrZZ:100-300', ...) still raises InvalidRegionError with named chrom
grep -rn '7,612' scripts/ chorus/ --include='*.py' --include='*.md' → empty

🤖 Generated with Claude Code

## Context

The v20 audit on a Linux/CUDA host ran genomics edge cases with LegNet and surfaced one new P1:

    oracle.predict(('chrZZ', 100, 300), [...])
    → KeyError: 'H'

A chromosome not in the reference FASTA crashed deep inside the oracle's one-hot encoder with a message that told the user nothing about what went wrong. pysam's internal KeyError(chrom) was slipping past GenomeRef.slop() and propagating all the way to the PyTorch transform.

## Fix

Catch pysam's KeyError in `GenomeRef.slop()` (chorus/core/interval.py:51-62) and re-raise as `IntervalException` with a message that names the bad chromosome, the FASTA path, and reminds the user of the canonical hg38 chromosome set. After the fix:

    IntervalException: Chromosome 'chrZZ' not found in <path>/hg38.fa.
    Check that the chromosome name matches the reference (hg38 uses
    'chr1'..'chr22', 'chrX', 'chrY', 'chrM').

`predict_variant_effect` on the same bad chrom was already covered because it goes through `extract_sequence`, which has an explicit "Chromosome X not found" check. Verified that it still works.

Scope: single-chokepoint fix in GenomeRef.slop. Downstream oracles don't need individual checks because every wide-window prediction path runs through slop() to size the input interval.

## Regression test

`tests/test_prediction_methods.py::test_bad_chromosome_gives_actionable_error` covers both paths:

- GenomeRef.slop directly (the old crash site)
- predict_variant_effect (already-covered path, verified parity)

MockOracle._predict short-circuits raw inputs to random data, so the test exercises GenomeRef.slop unit-wise rather than through the mock oracle chain.

## Also lands

scripts/build_backgrounds_borzoi.py:4: docstring said "7,612 Borzoi tracks"; real count is 7,611. Last 7,612 in live code. Matches v17 scripts/README.md fix. (Originally in PR #35, which was closed as superseded.)

Tests: 335 passed / 1 skipped (fast suite), up from 334.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

lucapinello merged commit 4c416e4 into chorus-applications Apr 21, 2026
1 check passed

lucapinello mentioned this pull request Apr 21, 2026

v21 fresh-install audit: data caches purged + re-downloaded, no findings #37 (Closed, 6 tasks)

lucapinello deleted the fix/2026-04-21-chrom-validation-and-borzoi-count branch April 22, 2026 18:07
