v27 scorched-earth: track-ID validator P0 + MCP --help + walkthrough #48

Merged
lucapinello merged 1 commit into main from
fix/2026-04-24-v27-track-id-validator
Apr 25, 2026

@lucapinello
Contributor

Summary

Fresh-from-zero install audit (v27): deleted all 7 envs + 24.8 GB caches, reinstalled following README TLDR exactly as a new user would. Found one P0 plus two P1 docs items, all fixed in this PR.

P0 — track-ID validator rejected FANTOM CAGE identifiers

The v26 guard (#44) only treated ENCFF* strings as identifier candidates; everything else fell through to description lookup. CNhs11250 (FANTOM CAGE) is a valid identifier but doesn't start with ENCFF, so it was classified as a description, didn't match anything, and got rejected.

The shipped single_oracle_quickstart.ipynb uses ['ENCFF413AHU', 'CNhs11250'] and broke for every new user on cell In[8]:

InvalidAssayError: Enformer does not recognise these track IDs: ['CNhs11250']

Fix: try get_track_by_identifier first unconditionally; only fall back to description lookup if identifier lookup returns None. Same fix in chorus/oracles/borzoi.py (identical bug, identical code path). Regression test pinned in tests/test_prediction_methods.py.
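The identifier-first order described above can be sketched as follows. This is a minimal illustration, not the code in chorus/oracles: `ToyTrackMetadata`, `validate_assay_ids`, and the track descriptions are hypothetical stand-ins, and only the two method names, the error name, and the track IDs come from this PR.

```python
from __future__ import annotations

from typing import Optional


class InvalidAssayError(ValueError):
    """Stand-in for the error shown in the traceback above."""


class ToyTrackMetadata:
    """Hypothetical in-memory track table (identifier -> description)."""

    def __init__(self, tracks: dict[str, str]) -> None:
        self._tracks = tracks

    def get_track_by_identifier(self, identifier: str) -> Optional[dict]:
        description = self._tracks.get(identifier)
        if description is None:
            return None
        return {"identifier": identifier, "description": description}

    def get_tracks_by_description(self, text: str) -> list[dict]:
        needle = text.lower()
        return [
            {"identifier": ident, "description": desc}
            for ident, desc in self._tracks.items()
            if needle in desc.lower()
        ]


def validate_assay_ids(metadata: ToyTrackMetadata, assay_ids: list[str]) -> None:
    """Raise InvalidAssayError if any entry is neither an ID nor a description."""
    unknown = []
    for assay_id in assay_ids:
        # 1. Identifier lookup runs unconditionally, with no prefix check.
        if metadata.get_track_by_identifier(assay_id) is not None:
            continue
        # 2. Description lookup is only a fallback.
        if metadata.get_tracks_by_description(assay_id):
            continue
        unknown.append(assay_id)
    if unknown:
        raise InvalidAssayError(
            f"Enformer does not recognise these track IDs: {unknown}"
        )


# Descriptions here are made up for the example.
metadata = ToyTrackMetadata({"ENCFF413AHU": "DNase:K562", "CNhs11250": "CAGE:K562"})
validate_assay_ids(metadata, ["ENCFF413AHU", "CNhs11250"])  # no longer raises
```

Dropping the prefix check means any new identifier scheme in the metadata table is accepted without touching the validator, at the cost of one extra lookup for description strings.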

P1 — chorus-mcp --help listed 20 tools, server registers 22

Hand-maintained list missed discover_variant and fine_map_causal_variant. Reorganised into 4 logical groups with explicit (22) tag so future drift is easy to spot.
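One way to catch this kind of drift automatically is to compare the help-text groups against the registered tool names. The sketch below assumes both are available as plain Python collections; `help_drift` and the group assignments are hypothetical, and apart from `discover_variant` and `fine_map_causal_variant` the tool names are illustrative only.

```python
from __future__ import annotations


def help_drift(
    help_groups: dict[str, list[str]], registered: set[str]
) -> tuple[list[str], list[str]]:
    """Return (registered but not listed, listed but not registered)."""
    listed = {name for names in help_groups.values() for name in names}
    missing = sorted(registered - listed)
    stale = sorted(listed - registered)
    return missing, stale


# Illustrative subset only; the real server registers 22 tools.
registered_tools = {
    "list_oracles",
    "analyze_variant",
    "discover_variant",
    "fine_map_causal_variant",
}
help_groups_before = {
    "Discovery": ["list_oracles"],
    "Analyze": ["analyze_variant"],
}

missing, stale = help_drift(help_groups_before, registered_tools)
print("missing from --help:", missing)
print("listed but not registered:", stale)
```

A check like this could run in the fast test suite so that the "(22)" tag never has to be verified by hand.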

P1 — dead anchor in docs/MCP_WALKTHROUGH.md

../README.md#mcp-server-ai-assistant-integration resolved to no heading. The README slug is #mcp-server. Link updated.
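Dead anchors like this can be found mechanically by slugging the target file's headings and checking each link fragment against them. The sketch below uses a simplified approximation of GitHub's heading slug rules (lowercase, drop punctuation, spaces to hyphens); the function names and the sample README text are hypothetical, and the sketch ignores duplicate headings and headings inside code fences.

```python
from __future__ import annotations

import re


def github_slug(heading: str) -> str:
    """Approximate GitHub's anchor slug for a heading."""
    slug = heading.strip().lower()
    slug = re.sub(r"[^\w\- ]", "", slug)  # keep word chars, hyphens, spaces
    return slug.replace(" ", "-")


def heading_slugs(markdown: str) -> set[str]:
    """Collect slugs for every ATX heading (# ... ######) in the text."""
    slugs = set()
    for line in markdown.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)$", line)
        if match:
            slugs.add(github_slug(match.group(1)))
    return slugs


def dead_anchors(fragments: list[str], target_markdown: str) -> list[str]:
    """Return the fragments that match no heading in the target file."""
    known = heading_slugs(target_markdown)
    return [frag for frag in fragments if frag.lstrip("#") not in known]


# Made-up README excerpt for the example.
readme_text = "# Chorus\n\n## MCP server\n\nSome text.\n"
print(dead_anchors(["#mcp-server-ai-assistant-integration", "#mcp-server"], readme_text))
```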

Why this matters

Without the scorched-earth run, the v26-introduced P0 in _validate_assay_ids would have shipped to every new user. Previous "audits" (v22-v26) all kept some warm state (downloads cache, env, genome) so this code path was never exercised on a real first call.

Verification

| Check | Result |
|---|---|
| Delete 7 envs + 24.8 GB caches | clean |
| mamba env create -f environment.yml (Step 1) | 5 min, clean |
| chorus setup --oracle all (Step 2) | 60 min, 6/6 ✓ |
| README Step 3 Python snippet | WT mean 0.468, 3 alts scored |
| chorus list / chorus health | 6/6 ✓ Healthy, exit 0 |
| single_oracle_quickstart.ipynb (post-fix) | exit 0, all cells |
| MCP E2E (test_mcp_e2e_list_oracles_and_analyze_variant -m integration) | 1 passed, 280s |
| Multi-oracle smoke (Enformer + ChromBPNet) | both finite |
| Discovery smoke (discover_variant_effects SORT1 rs12740374) | OK |
| Fast pytest | 340 passed, 1 skipped |

Deferred to a follow-up PR

Doc-drift items captured in the audit report but not fixed here:

  • 3 notebook md cells with redundant chorus genome download hg38 steps + outdated track-count tables
  • audits/AUDIT_CHECKLIST.md ambiguous track-count rows

🤖 Generated with Claude Code

…anchor

Greenfield install verified end-to-end after deleting all 7 envs +
24.8 GB of caches. README TLDR Steps 1-4 work; quickstart notebook
executes; MCP E2E test (4 min) passes; 6/6 oracles ✓ Healthy.

P0 — track-ID validator rejected FANTOM CAGE identifiers:

`_validate_assay_ids` in Enformer + Borzoi only treated `ENCFF*`
strings as identifier candidates; everything else fell through to
description-substring lookup. FANTOM CAGE IDs like `CNhs11250` are
valid identifiers (resolved by `get_track_by_identifier`) but don't
start with `ENCFF`, so they were classified as descriptions.
`get_tracks_by_description("CNhs11250")` returns empty → guard raised
`InvalidAssayError`. The shipped quickstart notebook uses
`['ENCFF413AHU', 'CNhs11250']` and broke for every new user on
cell In[8].

Fix: try `get_track_by_identifier` first unconditionally; fall back
to description lookup only if identifier lookup returns None. Same
fix in borzoi.py (identical bug, identical code path). Regression
test added in tests/test_prediction_methods.py pinning the FANTOM
CAGE behaviour explicitly.

P1 — `chorus-mcp --help` listed 20 tools, FastMCP registers 22:

Hand-maintained list missed `discover_variant` and
`fine_map_causal_variant`. Reorganised into 4 logical groups
(Discovery / Lifecycle / Predict / Analyze) with explicit "(22)" tag.

P1 — dead anchor in docs/MCP_WALKTHROUGH.md:

`../README.md#mcp-server-ai-assistant-integration` → nothing.
README slug is `#mcp-server`. Updated.

Audit report: audits/2026-04-24_v27_scorched_earth.md.

Tests: 340 passed, 1 skipped on fast suite. Quickstart notebook
executes clean post-fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lucapinello lucapinello merged commit 2ecd998 into main Apr 25, 2026
1 check passed
@lucapinello lucapinello deleted the fix/2026-04-24-v27-track-id-validator branch April 25, 2026 01:15
lucapinello added a commit that referenced this pull request Apr 26, 2026
After audit/2026-04-26-bpnet-cdfs-complete merged into main, replayed
the full README quickstart from a fresh-install state (deleted 7 envs +
~53 GB of caches/downloads/genomes). Verified:

- Default `chorus setup --oracle chrombpnet` is back to the v27 fast
  path (K562 + HepG2 DNase only, ~16 min, 3.5 GB) — no longer the
  silent 30 GB / 3-hour download.
- `--all-chrombpnet` opt-in flag is properly advertised in
  `chorus setup --help`.
- 786-track NPZ auto-downloads from HF (commit c1e5fc1) on
  `chorus setup --oracle chrombpnet`.
- All 6 oracles ✓ Healthy in 67 min total.
- README Step 3 snippet: WT mean 0.468, 3 alts.
- All 3 shipped notebooks execute clean (single_oracle_quickstart,
  advanced_multi_oracle_analysis, comprehensive_oracle_showcase).
- 4/4 integration tests pass (MCP E2E, SEI+LegNet CDF download,
  ChromBPNet fresh single-model download).
- Fast pytest: 340 passed, 1 skipped.

HTML walkthrough render audit (playwright on 18 shipped HTMLs):
  - 18/18 loaded with 0 JS errors / 0 page errors
  - 0/18 missing the glossary block
  - 17/18 with valid IGV browser block (the one without is
    batch_sort1_locus_scoring — by design, batch reports show a
    multi-variant table, not per-variant tracks)
  - All formula badges (log2FC / lnFC / Δ) and percentile columns
    present where applicable.

Default disk footprint after install: ~25 GB (matches new README
claim). The 60 GB figure only applies to --all-chrombpnet opt-in.

This is the second consecutive scorched-earth audit (v28 was the
first) to return clean. Six bug-fix PRs (#48, #49, #50, #51, #52,
#53) plus the audit-fix branch all hold up end-to-end.

No findings. Code is shipping-ready.

Audit + screenshots: audits/2026-04-26_v29_scorched_earth/

Co-authored-by: lp698 <lp698@dimm2fv07n65x.partners.org>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>