Fix off-by-one in predict_variant_effect ref-allele check by lucapinello · Pull Request #32 · pinellolab/chorus

lucapinello · 2026-04-21T12:35:33Z

Problem

The ref-allele sanity check at chorus/core/base.py:322 was reading the base one position to the right of what the user asked for. So

predict_variant_effect(
    genomic_region="chr1:109274000-109275000",
    variant_position="chr1:109274968",
    alleles=["G", "T"],       # rs12740374, as in dbSNP/UCSC
    ...
)

triggered

WARNING  Provided reference allele 'G' does not match the genome at this
         position ('T'). Chorus will use the provided allele.

even though G at chr1:109274968 is exactly what dbSNP/UCSC/IGV/extract_sequence('chr1:109274968-109274968') all say. The substitution at line 335 then placed the ref/alt base at pos+1, so every shipped walkthrough was computing predictions at the wrong coordinate. The regulatory signal happens to be coherent across ±1 bp on rs12740374, rs1421085, rs1427407 etc., so effect directions still reproduced — which is why this has been shipping.

Root cause

genomic_region / variant_position strings follow dbSNP/UCSC/IGV and chorus.utils.extract_sequence — 1-based inclusive.
GenomeRef uses 0-based half-open (documented at interval.py:26).
predict_variant_effect handed the 1-based values straight to GenomeRef and to ref2query, conflating the two conventions.

Fix

Convert region_start and var_pos to 0-based at the point we build the Interval. That's it — just two -1s with an explanatory comment.

Verification

Tested against 7 famous SNPs — for every one, the internal ref-check now agrees with extract_sequence and dbSNP:

rsID	Gene	hg38	dbSNP ref	internal (before)	internal (after)
rs12740374	SORT1	chr1:109274968	G	T	G ✓
rs1421085	FTO	chr16:53767042	T	A	T ✓
rs1427407	BCL11A	chr2:60718347	G	C	G ✓
rs4988235	LCT	chr2:135851076	G (+ strand)	C	G ✓
rs334	HBB	chr11:5227002	T (+ strand)	C	T ✓
rs6025	F5 Leiden	chr1:169519049	T (+ strand)	C	T ✓
rs7412	APOE	chr19:44908822	C	G	C ✓

Also works for the MCP _auto_region case (a 2 bp region exactly spanning the variant) — the old code silently read pos+1; the fix correctly lands on pos.

New regression test

tests/test_prediction_methods.py::test_variant_position_is_1_based builds a synthetic genome with a single G anchor at 1-based position 100,000 and asserts:

no warning when the user passes ref='G'
warning does fire when the user passes ref='T' (proves the check still catches real mismatches — we haven't just silenced it)

Scope

Only predict_variant_effect is touched. predict_region_replacement and predict_region_insertion_at also treat their string coordinates as 0-based half-open internally; changing those would silently shift existing sequence-engineering results by 1 bp, so that's deferred to a separate PR with example-output regeneration.

Downstream consequence

Variant effect predictions will now substitute the ref/alt base at the correct 1-based position. Effect sizes in the shipped walkthroughs may drift by a small amount relative to the currently committed example_output.* files (±1 bp shift in where the substitution lands). Recommend regenerating those in a follow-up once this merges — the effect direction and magnitudes should stay within CPU non-determinism for AlphaGenome, and will likely shift more noticeably for ChromBPNet (1 bp native resolution) where a 1-bp shift lands on a different bin.

Test plan

pytest tests/ --ignore=tests/test_smoke_predict.py -q → 334 passed / 1 skipped (up from 333 — the new test)
Manual check against 7 famous SNPs — all consistent
Regression test validates both the fixed path and the still-working mismatch path

🤖 Generated with Claude Code

The ref-allele sanity check at chorus/core/base.py:322 was reading the base one position to the right of what the user asked for, so predict_variant_effect( genomic_region="chr1:109274000-109275000", variant_position="chr1:109274968", alleles=["G", "T"], # rs12740374, dbSNP/UCSC ...) triggered WARNING Provided reference allele 'G' does not match the genome at this position ('T'). Chorus will use the provided allele. even though the user-supplied ref allele is exactly what dbSNP/UCSC/IGV say is at that position. The substitution at line 335 then placed the ref/alt base one position off the user's intended coordinate, so every shipped walkthrough was computing predictions at pos+1 rather than pos. The regulatory signal happens to be coherent across ±1 bp on rs12740374 et al., so effect *directions* still reproduce — but the check was broken and the warning was misleading. Root cause: `genomic_region`/`variant_position` strings follow dbSNP/ UCSC/IGV and `chorus.utils.extract_sequence` — 1-based inclusive — but `GenomeRef` uses 0-based half-open (documented at interval.py:26). predict_variant_effect was handing the 1-based values straight to GenomeRef and to `ref2query`, conflating the two conventions. Fix: convert `region_start` and `var_pos` to 0-based at the point we build the Interval. Verified against 7 famous SNPs (rs12740374, rs1421085, rs1427407, rs4988235, rs334, rs6025, rs7412) — internal ref-check now agrees with extract_sequence for all of them and no warning fires when the user supplies the correct dbSNP ref allele. Also includes a regression test (tests/test_prediction_methods.py:: test_variant_position_is_1_based) that builds a synthetic genome with a single 'G' anchor at 1-based position 100,000 and asserts: - no warning when user passes ref='G' - warning DOES fire when user passes ref='T' (proves the check still catches real mismatches — we haven't just silenced it) Scope: this PR only fixes predict_variant_effect. predict_region_replacement and predict_region_insertion_at treat their coordinates as 0-based half- open internally; changing that would silently shift existing users' sequence replacements by 1 bp, so it's deferred to a separate PR with full example-output regeneration. Consequence for shipped walkthroughs: variant effect predictions will now substitute the ref/alt base at the correct 1-based position. Effect sizes may drift by a small amount (1-bp shift relative to existing committed numbers); regenerate example_output.* as follow-up. Tests: 334 passed / 1 skipped (fast suite, 8m 34s), up from 333. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ed position Pushback from user on v23 initial report: "did you run all the notebooks? run all the MCP walkthroughs? check tracks, findings?" v23 had only installed enformer + run pytest. Re-opened the audit to cover the missing pieces. ## Notebooks run end-to-end (scorched-earth env) | Notebook | Cells | Errors | Warnings | |---|---|---|---| | single_oracle_quickstart.ipynb | 49 | 0 | 0 | | advanced_multi_oracle_analysis.ipynb (pre-fix) | 127 (57 code) | 0 | **1** | | advanced_multi_oracle_analysis.ipynb (post-fix) | 127 (57 code) | 0 | **0** | | comprehensive_oracle_showcase.ipynb | aborts cell 9 | 1 (expected) | — | comprehensive aborts on `No module named 'borzoi_pytorch'` — Borzoi env not installed in this scorched-earth scope. Expected. ## P2 fix landed in this PR examples/notebooks/advanced_multi_oracle_analysis.ipynb cell 67 had `first_G_position_in_int = 108`. That hardcoded offset was calibrated to the pre-v19 off-by-one in predict_variant_effect — 108 only pointed at the G of `CCAGAGGGC` because the ref-check was reading one position to the right of what the user said. Post-PR #32, the code correctly reads the base at the user-given position, and interval-offset 108 is the A in `CCAGAGGGC` — not the first G. The warning Provided reference allele 'G' does not match the genome at this position ('A'). Chorus will use the provided allele. was scientifically real: Chorus was substituting G at the A position and predicting "mutating the A before the motif" while the notebook claimed it was predicting "mutating the first G of the CTCF motif". Shipped notebook text said one thing; the actual computation tested another. Fix: `108 → 109` so variant_pos lands on 1-based chr2:246676 = the first G of CCAGAGGGC. Verified via extract_sequence. Re-ran the notebook post-fix: 0 errors, 0 warnings. ## MCP server end-to-end Spawned chorus-mcp over stdio via fastmcp.Client + StdioTransport. 3 tool calls succeeded: list_oracles (6 oracles, correct specs), list_tracks('enformer') (4 assay types + 1267 cell types), oracle_status ({"loaded_oracles":[]}). ## Extra oracles installed - chorus setup --oracle chrombpnet → ✓ 9m 2s (env + ATAC:K562 fold 0 from ENCODE + background + hg38 already present); marker written. - chorus setup --oracle legnet → ✓ ~2m (tiny weights); marker written. Both end with chorus health → Healthy. ## Walkthrough spot-check scripts/regenerate_multioracle.py --oracle chrombpnet reproduces the committed chrombpnet effect size for rs12740374 G>T within 2e-6 — CPU non-determinism; reverted the regen. ## Docs consistency Every tool name referenced in examples/walkthroughs/**/README.md (9 unique) exists in the MCP registry. No orphans. ## Deferred (not exercised in this v23 scope) - borzoi, sei, alphagenome setup/predict/walkthroughs — need `chorus setup --oracle all` (2–4h) + HF_TOKEN for AG - comprehensive_oracle_showcase.ipynb (needs all 6 oracles) - AG-primary walkthroughs (variant_analysis/SORT1/BCL11A/FTO, validation/CEBP/TERT, discovery/SORT1, causal/SORT1_locus, sequence_engineering, batch_scoring) — previously verified in v21/v22 Fast suite in fresh env: 338 passed / 2 skipped (unchanged). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… findings (#41) * v23 scorched-earth install audit: full zero→working Enformer in 14 min User explicitly authorized the destructive end-to-end install test: "scorched-to-the-earth install, remove all the chorus envs, then reinstall from scratch as a user following the docs" ## What was nuked - 7 mamba envs: chorus, chorus-enformer, chorus-borzoi, chorus-chrombpnet, chorus-sei, chorus-legnet, chorus-alphagenome - ~/.chorus/backgrounds/ (1.5 GB) - genomes/hg38.* (3.1 GB) - downloads/* (43 GB) - **Total reclaimed: ~47 GB + all envs** ## Install, exactly as README prescribes mamba env create -f environment.yml 2m 30s pip install -e . 3 s chorus setup --oracle enformer 10m 54s Total wall clock zero → "enformer ready": ~13 min 30 s (dominated by 9m 26s UCSC hg38 download — network, not code). ## Verification - chorus health → ✓ enformer: Healthy (6.5 s; was 720 s hang pre-3735ea5) - Real Python prediction on GATA1 TSS (chrX:48777634-48790694, 2 tracks) → DNASE:K562 mean=0.4842 max=22.4620, CAGE:K562 mean=0.5957 max=120.8120. Within CPU non-determinism of committed notebook values (0.4841/0.5953, 22.4569/120.7759). - pytest in fresh-built env: 338 passed / 2 skipped in 45.67 s (2 skipped = integration tests correctly guarding on missing .chorus_setup_v1 for non-enformer oracles) ## Findings None. The documented install path works exactly as claimed — no manual fix-ups, no missing deps, no surprises. README fresh-install instructions are accurate. ## Scope notes - Only chorus + chorus-enformer rebuilt; the other 5 per-oracle envs are still deleted. `chorus setup --oracle all` would add 2-4 h and 80 GB across TF, PyTorch ×3, JAX envs. - HF gate not exercised (not AlphaGenome here); already verified in v22. Artefacts: audits/2026-04-23_v23_scorched_earth/logs/ (8 files) plus report.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * v23 addendum: notebooks + MCP + P2 fix for pre-v19-calibrated hardcoded position Pushback from user on v23 initial report: "did you run all the notebooks? run all the MCP walkthroughs? check tracks, findings?" v23 had only installed enformer + run pytest. Re-opened the audit to cover the missing pieces. ## Notebooks run end-to-end (scorched-earth env) | Notebook | Cells | Errors | Warnings | |---|---|---|---| | single_oracle_quickstart.ipynb | 49 | 0 | 0 | | advanced_multi_oracle_analysis.ipynb (pre-fix) | 127 (57 code) | 0 | **1** | | advanced_multi_oracle_analysis.ipynb (post-fix) | 127 (57 code) | 0 | **0** | | comprehensive_oracle_showcase.ipynb | aborts cell 9 | 1 (expected) | — | comprehensive aborts on `No module named 'borzoi_pytorch'` — Borzoi env not installed in this scorched-earth scope. Expected. ## P2 fix landed in this PR examples/notebooks/advanced_multi_oracle_analysis.ipynb cell 67 had `first_G_position_in_int = 108`. That hardcoded offset was calibrated to the pre-v19 off-by-one in predict_variant_effect — 108 only pointed at the G of `CCAGAGGGC` because the ref-check was reading one position to the right of what the user said. Post-PR #32, the code correctly reads the base at the user-given position, and interval-offset 108 is the A in `CCAGAGGGC` — not the first G. The warning Provided reference allele 'G' does not match the genome at this position ('A'). Chorus will use the provided allele. was scientifically real: Chorus was substituting G at the A position and predicting "mutating the A before the motif" while the notebook claimed it was predicting "mutating the first G of the CTCF motif". Shipped notebook text said one thing; the actual computation tested another. Fix: `108 → 109` so variant_pos lands on 1-based chr2:246676 = the first G of CCAGAGGGC. Verified via extract_sequence. Re-ran the notebook post-fix: 0 errors, 0 warnings. ## MCP server end-to-end Spawned chorus-mcp over stdio via fastmcp.Client + StdioTransport. 3 tool calls succeeded: list_oracles (6 oracles, correct specs), list_tracks('enformer') (4 assay types + 1267 cell types), oracle_status ({"loaded_oracles":[]}). ## Extra oracles installed - chorus setup --oracle chrombpnet → ✓ 9m 2s (env + ATAC:K562 fold 0 from ENCODE + background + hg38 already present); marker written. - chorus setup --oracle legnet → ✓ ~2m (tiny weights); marker written. Both end with chorus health → Healthy. ## Walkthrough spot-check scripts/regenerate_multioracle.py --oracle chrombpnet reproduces the committed chrombpnet effect size for rs12740374 G>T within 2e-6 — CPU non-determinism; reverted the regen. ## Docs consistency Every tool name referenced in examples/walkthroughs/**/README.md (9 unique) exists in the MCP registry. No orphans. ## Deferred (not exercised in this v23 scope) - borzoi, sei, alphagenome setup/predict/walkthroughs — need `chorus setup --oracle all` (2–4h) + HF_TOKEN for AG - comprehensive_oracle_showcase.ipynb (needs all 6 oracles) - AG-primary walkthroughs (variant_analysis/SORT1/BCL11A/FTO, validation/CEBP/TERT, discovery/SORT1, causal/SORT1_locus, sequence_engineering, batch_scoring) — previously verified in v21/v22 Fast suite in fresh env: 338 passed / 2 skipped (unchanged). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Fix 2 P1 Sei bugs + install all 6 + all 3 notebooks pass (v23 addendum 2) User pushback #2 ("don't be lazy") prompted finishing the scorched- earth install for the remaining 3 oracles (borzoi, sei, alphagenome). Two real P1 bugs surfaced in the Sei setup path and were fixed here. ## P1-A — Sei setup: shutil.SameFileError on fresh install chorus setup --oracle sei from zero hit: ✗ prefetch failed for sei: - weights: SameFileError: PosixPath('.../seqclass_info.txt') and PosixPath('.../seqclass_info.txt') are the same file Root cause: chorus/oracles/sei.py:595-596 did `shutil.copy(info_file_path, self.get_classes_names())`. get_classes_names() falls back to the packaged source path when the download_dir cache doesn't exist yet — which is exactly the state on first install. So src == dst, SameFileError. ## P1-B — Sei setup: cache-not-materialized on re-run After fixing P1-A (only copy when src != dst), a re-run produced "✓ sei ready" but chorus health still reported **Not installed**. Root cause: load_pretrained_model short-circuits the download block when get_classes_names().exists() returns True. On re-install, the cache file doesn't exist but the packaged fallback does — so the .exists() check returns True via fallback, _download_sei_model is skipped, and the one-time copy into downloads/sei/model/ never runs. But the health probe (chorus/core/weights_probe.py::_probe_sei) looks for downloads/sei/model/seqclass_info.txt specifically. ## Fix chorus/oracles/sei.py: 1. Extracted the copy into _materialize_cached_seqclass_info() — idempotent helper that: - early-returns if the cache file already exists - early-returns if source doesn't exist (defensive) - mkdirs parent, compares resolved paths, skips if same file 2. Calls it at the END of load_pretrained_model regardless of whether _download_sei_model ran. Guarantees the probe target exists whenever an oracle loads successfully. 3. Also calls it from _download_sei_model to catch the first-install path deterministically. Verified: `chorus setup --oracle sei` on a purged state writes downloads/sei/model/seqclass_info.txt; chorus health → ✓ sei: Healthy. Re-run on existing install: same result. ## All 6 oracles + all 3 notebooks verified Final chorus health: **every oracle Healthy**. | Notebook | Cells | Errors | Warnings | |---|---|---|---| | single_oracle_quickstart.ipynb | 49 | 0 | 0 | | comprehensive_oracle_showcase.ipynb | 59 | 0 | 0 | | advanced_multi_oracle_analysis.ipynb (post-108→109 fix) | 127 | 0 | 0 | Fast test suite with all 6 envs: 339 passed / 1 skipped. ## Artefacts 10 more log files documenting borzoi/sei/AG installs, the sei retry after each fix, the final health check, the comprehensive notebook run, and the full pytest run. Plus comprehensive_oracle_showcase_fresh.ipynb (3000+ lines of executed output). ## Security note HF_TOKEN + LDLINK_TOKEN the user pasted this session live only in the conversation transcript — not in any on-disk file or commit. Logs grep-redact any hf_<token> pattern; no AKIA-style keys anywhere. Token use was session-scoped export; unset after each invocation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: lp698 <lp698@dimm2fv07n65x.partners.org> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

lucapinello merged commit b6aed3d into chorus-applications Apr 21, 2026
1 check passed

This was referenced Apr 21, 2026

v18 fresh full audit: reusable AUDIT_CHECKLIST + 3 HF gate fixes #33

Closed

v21 fresh-install audit: data caches purged + re-downloaded, no findings #37

Closed

lucapinello deleted the fix/2026-04-21-ref-allele-off-by-one branch April 22, 2026 18:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix off-by-one in predict_variant_effect ref-allele check#32

Fix off-by-one in predict_variant_effect ref-allele check#32
lucapinello merged 1 commit intochorus-applicationsfrom
fix/2026-04-21-ref-allele-off-by-one

lucapinello commented Apr 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

lucapinello commented Apr 21, 2026

Problem

Root cause

Fix

Verification

New regression test

Scope

Downstream consequence

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant