audit: rebuild ChromBPNet CDFs against chrombpnet_nobias (0.3.0+ default) by lucapinello · Pull Request #70 · pinellolab/chorus

lucapinello · 2026-04-30T10:21:41Z

Summary

Built a fresh `chrombpnet_pertrack.npz` (786 tracks, 78.6 MB, sha256 `be61e9e8...`) on Linux/CUDA (2× A100 80 GB) following the handoff at `audits/2026-04-29_chrombpnet_cdf_rebuild/HANDOFF.md`. NPZ validated and ready for upload to `lucapinello/chorus-backgrounds`; upload itself is blocked on a write-scope HF token (the cached read-only token returned 403).

Headline numbers

| Check | Result |
|---|---|
| Track count | 786 (22 ATAC + 20 DNASE + 744 CHIP); matches handoff's expected count |
| Effect / summary / perbin reservoirs | full at 9609 / 29004 / 928128; zero failed model loads |
| All CDFs monotone (786 rows × 3 CDFs) | ✅ |
| NaN / Inf in any CDF | ✅ none |
| Wall total | ~10 h with parallel-GPU phases (vs handoff estimate 13-25 h) |
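
For readers who want to repeat the monotonicity and NaN/Inf checks, here is a minimal sketch of the idea. It assumes each CDF family can be pulled out of the NPZ as a 2-D array of shape `(n_tracks, n_points)`; the NPZ's actual key names are not given in this PR, so the function takes a plain array and `validate_cdfs` is a hypothetical helper, not part of chorus.

```python
import numpy as np


def validate_cdfs(cdfs: np.ndarray) -> dict:
    """Check a (n_tracks, n_points) CDF matrix for finiteness and monotonicity."""
    # Any NaN or +/-Inf anywhere in the matrix fails the finiteness check.
    all_finite = bool(np.isfinite(cdfs).all())
    # Successive differences within each row must be >= 0 for a monotone CDF.
    # A NaN difference compares False, so a row containing NaN is also flagged.
    monotone_rows = (np.diff(cdfs, axis=1) >= 0).all(axis=1)
    return {
        "n_tracks": int(cdfs.shape[0]),
        "all_finite": all_finite,
        "n_non_monotone": int((~monotone_rows).sum()),
    }


# Toy input (random numbers, not data from the rebuild): 4 sorted rows.
good = np.sort(np.random.default_rng(0).random((4, 16)), axis=1)
bad = good.copy()
bad[2, 5] = np.nan  # corrupt one entry to show the failure mode

report_good = validate_cdfs(good)
report_bad = validate_cdfs(bad)
print(report_good)
print(report_bad)
```

On the real file, one would call this once per CDF family (effect, summary, perbin) and expect `n_tracks == 786`, `all_finite` true and `n_non_monotone == 0` each time.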

Magnitude shift vs. 0.2.x CDFs (matches handoff prediction of 5-30 %)

| Track | old p95 | new p95 | shift % |
|---|---|---|---|
| ATAC:K562 | 0.1621 | 0.1976 | +21.9% |
| DNASE:HepG2 | 0.1884 | 0.2436 | +29.3% |
| ATAC:HepG2 | 0.1332 | 0.1517 | +13.8% |
| ATAC:GM12878 | 0.1692 | 0.1934 | +14.3% |
| DNASE:K562 | 0.1462 | 0.1659 | +13.5% |
| CHIP:K562:REST | 0.0529 | 0.0529 | 0.0% |
| CHIP:HepG2:CTCF | 0.0829 | 0.0828 | −0.1% |

ATAC/DNase shifts reflect the bias correction (`chrombpnet_nobias` strips the enzymatic motif preferences the bias-aware variant carried). CHIP/BPNet tracks unchanged because BPNet's catalogue is already nobias-equivalent.
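
The shift column appears to be the relative change in the p95 value, `(new - old) / old × 100`. That formula is an assumption, but it reproduces the reported rows. A small sketch using three rows from the table (`pct_shift` is a hypothetical helper name):

```python
def pct_shift(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# (old p95, new p95) pairs copied from the table in this PR.
rows = {
    "ATAC:K562": (0.1621, 0.1976),
    "DNASE:HepG2": (0.1884, 0.2436),
    "CHIP:K562:REST": (0.0529, 0.0529),
}

shifts = {track: round(pct_shift(old, new), 1) for track, (old, new) in rows.items()}
for track, shift in shifts.items():
    print(f"{track}: {shift:+.1f}%")
# ATAC:K562: +21.9%
# DNASE:HepG2: +29.3%
# CHIP:K562:REST: +0.0%
```

The table's p95 values are rounded to four decimals, so recomputing from them can differ from the reported shift in the last digit (the ATAC:HepG2 row comes out at about 13.9 % this way).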

What's blocking the upload

The cached HF token in this env (`HfApi().whoami() → lucapinello, auths: []`) is read-only:

```
huggingface_hub.errors.HfHubHTTPError: 403 Forbidden: you must use a write token to upload to a repository.
```

To finish the rebuild: maintainer re-logs in with a write-scope token (or sets `HF_TOKEN` to one) and runs the upload command from `report.md` "Upload status". The NPZ at `~/.chorus/backgrounds/chrombpnet_pertrack.npz` is unmodified since the merge step — same SHA256 as recorded in the report.
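
A minimal sketch of the SHA256 comparison, using only the standard library. The expected digest is the full value quoted in this PR's test plan and the path is the one named above; `sha256_of` is a hypothetical helper, not something chorus ships.

```python
import hashlib
from pathlib import Path

EXPECTED_SHA256 = "be61e9e8f9b919b43c599b7fbc9deb74f8f1e6dc1da5e2cdb92036a85bf13205"


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks so a ~78 MB NPZ is never held in memory whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


npz_path = Path.home() / ".chorus" / "backgrounds" / "chrombpnet_pertrack.npz"
if npz_path.exists():
    actual = sha256_of(npz_path)
    print("sha256 matches" if actual == EXPECTED_SHA256 else f"MISMATCH: {actual}")
```

The same function can be pointed at a fresh download of the file after upload to confirm that the published copy is byte-identical.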

Two script bugs surfaced (worth filing as follow-ups; recoverable)

1. `--part variants/baselines --assay X` overwrites the interim NPZ. Following the handoff's documented sequence (`--assay ATAC_DNASE` then `--assay CHIP`) silently lost the 42 ATAC/DNase tracks during the merge: the second invocation replaced the interim file rather than appending. Caught by the post-merge spot-check (744 ≠ 786). Recovered via a Phase 1 redo (~50 min) + `--part merge-incremental`. The script should either append-on-write or refuse to overwrite without `--force`.
2. `--gpu N` silently overrides outer `CUDA_VISIBLE_DEVICES`. `scripts/build_backgrounds_chrombpnet.py:200` does `os.environ["CUDA_VISIBLE_DEVICES"] = str(args.gpu)` unconditionally, so the parallel-launch pattern `CUDA_VISIBLE_DEVICES=1 ... --gpu 0` puts the second job on physical GPU 0 instead of 1. Hit during the Phase 1 redo and OOM'd. The script should respect a pre-set `CUDA_VISIBLE_DEVICES` if present and only set it from `--gpu N` when no env var was provided.
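
The precedence suggested for the `--gpu` bug can be sketched in a few lines. This is an illustration of the rule "outer env var wins, `--gpu` is only a default", not the script's actual code; `resolve_visible_devices` is a hypothetical helper and the dictionaries stand in for `os.environ`.

```python
import os


def resolve_visible_devices(gpu: int, environ=None) -> str:
    """Set CUDA_VISIBLE_DEVICES from --gpu only if the caller has not already set it."""
    env = os.environ if environ is None else environ
    # setdefault writes the key only when it is absent and returns the value in effect.
    return env.setdefault("CUDA_VISIBLE_DEVICES", str(gpu))


# Case 1: launched as `CUDA_VISIBLE_DEVICES=1 ... --gpu 0`; the outer pin survives.
outer_pinned = {"CUDA_VISIBLE_DEVICES": "1"}
in_effect_pinned = resolve_visible_devices(0, outer_pinned)

# Case 2: no outer pin; --gpu 0 selects the device.
unpinned = {}
in_effect_unpinned = resolve_visible_devices(0, unpinned)

print(in_effect_pinned, in_effect_unpinned)  # 1 0
```

With the outer variable set to `1`, the process sees a single device that it enumerates as device 0, and that device is physical GPU 1. An empty string also counts as "set" under this rule, which hides all GPUs, so a stricter variant might treat an empty value separately.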

Files

- `audits/2026-04-29_chrombpnet_cdf_rebuild/report.md`: full audit narrative
- 9 build logs (`bg_chrombpnet_*.log`): per-phase, including the failed first-redo for evidence
- The NPZ itself is not in this PR (78 MB); its sha256 is captured in the report so the maintainer can verify post-upload

Test plan

- Maintainer authenticates with a write-scope HF token and runs the upload command in `report.md`
- After upload, verify the NPZ on HF has sha256 `be61e9e8f9b919b43c599b7fbc9deb74f8f1e6dc1da5e2cdb92036a85bf13205`
- Spot-check that `chorus.analysis.normalization.get_pertrack_normalizer("chrombpnet")` picks up the new NPZ (auto-download from HF)
- Ranking sanity: pick one large-effect SNP at a known regulatory element (e.g. SORT1 rs12740374) and confirm that relative cell-type rankings are preserved between old and new CDFs
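
One way to make the ranking-sanity step mechanical is to compare the order of tracks under the old and new percentiles. The sketch below assumes the per-track percentiles for the chosen SNP are already in hand as plain dictionaries; the numbers are made-up placeholders, not results from this rebuild, and `rank_order` / `rankings_preserved` are hypothetical helpers.

```python
def rank_order(scores: dict) -> list:
    """Track names sorted from highest to lowest score (ties keep insertion order)."""
    return sorted(scores, key=scores.get, reverse=True)


def rankings_preserved(old: dict, new: dict) -> bool:
    """True when both score sets put the tracks in the same order."""
    return rank_order(old) == rank_order(new)


# Placeholder percentiles for one SNP (illustrative values only).
old_pct = {"ATAC:HepG2": 0.99, "ATAC:K562": 0.71, "ATAC:GM12878": 0.40}
new_pct = {"ATAC:HepG2": 0.98, "ATAC:K562": 0.66, "ATAC:GM12878": 0.35}

preserved = rankings_preserved(old_pct, new_pct)
print(rank_order(new_pct), preserved)
```

Exact order equality is strict; if near-ties between tracks are expected, a rank correlation would be a gentler criterion.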

🤖 Generated with Claude Code

…ult)

Built a fresh `chrombpnet_pertrack.npz` (786 tracks, 78.6 MB, sha256 be61e9e8...) on Linux/CUDA (2× A100 80 GB) following the handoff at audits/2026-04-29_chrombpnet_cdf_rebuild/HANDOFF.md. NPZ validated and ready for upload; upload itself is blocked on a write-scope HF token (the cached read-only token returned 403). See report.md "Upload status" for the maintainer command.

Headline numbers:
- 786 tracks (22 ATAC + 20 DNASE + 744 CHIP); matches expected count
- All CDFs monotone, no NaN/Inf
- Every reservoir filled to spec (effect 9609, summary 29004, perbin 928128); zero failed model loads
- ~10 h wall total with parallel-GPU phases (vs handoff estimate 13-25 h)

Magnitude shift vs 0.2.x CDFs (matches handoff prediction of 5-30 % on ambiguous cases / ~0 % on stable ones):
- ATAC/DNase tracks: +13.5 % to +29.3 % at p95; bias correction stripped the enzymatic motif preferences the bias-aware variant carried
- CHIP/BPNet tracks: ~0 %; BPNet's catalogue is already nobias-equivalent

Two follow-ups surfaced (filing as separate issues):
- F1: `--part variants/baselines --assay X` overwrites the interim NPZ rather than appending. Following the handoff verbatim caused a recoverable (~50 min cost) silent data loss; recovered via Phase 1 redo + `--part merge-incremental`.
- F2: `--gpu N` arg silently overrides outer `CUDA_VISIBLE_DEVICES`, which mis-routed parallel jobs onto the same physical GPU.

Files:
- report.md: full audit narrative, spot-check, diff, follow-ups
- bg_chrombpnet_*.log (9 files): per-phase build logs

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The cached HF token in the rebuild box was read-only; maintainer supplied a write-scope token after the first-pass 403. NPZ uploaded as commit 008483049f8ba75701190db3c17077343c52beb5 on lucapinello/chorus-backgrounds. Verified by re-downloading from HF and comparing sha256 — matches the locally-built file (be61e9e8...) exactly. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…streamline README top (#76)

* fix(cdf-rebuild): protect interim NPZ + honour CUDA_VISIBLE_DEVICES; streamline README top

Three changes from the 0.4.0 follow-up triage. Single PR because the HANDOFF.md tweak depends on both script fixes.

## Bug A (P1, #71/#73): refuse-to-overwrite the interim NPZ

`scripts/build_backgrounds_chrombpnet.py` was writing the interim NPZ unconditionally with `np.savez_compressed(interim_path, ...)`. The documented two-pass flow (`--assay ATAC_DNASE` then `--assay CHIP`) silently overwrote the first pass's 42-track interim with the second pass's 744 CHIP tracks, producing a 744-track final NPZ instead of the expected 786. This was caught by the post-merge spot-check during PR #70 (~50 min GPU recovery).

Conservative fix per the user's pick: refuse to overwrite without `--force`, naming the conflicting track-id sets in the SystemExit message. The new `_check_interim_compatibility()` helper runs before each interim write site and:

- returns silently if the path doesn't exist (first run);
- returns silently if the existing track set equals the new set (idempotent re-run with no data loss);
- raises SystemExit with a diff naming `len(only_existing)`, `len(only_new)`, and 3 example track-ids from each side, plus pointing at `--part merge` / `merge-incremental` / `--force`;
- returns silently with `--force`.

## Bug B (P2, #72/#74): honour pre-set CUDA_VISIBLE_DEVICES

`load_models_and_setup()` was clobbering `os.environ["CUDA_VISIBLE_DEVICES"]` with `--gpu N` unconditionally, so the documented parallel-launch pattern (outer `CUDA_VISIBLE_DEVICES=N` per terminal pinning the physical GPU) didn't work: both terminals landed on physical GPU 0, fighting for memory until one OOMed.

Trivial pattern per the user's pick: only set the env var when nothing was already set:

    if "CUDA_VISIBLE_DEVICES" not in os.environ:
        os.environ["CUDA_VISIBLE_DEVICES"] = str(args.gpu)

Plus a one-line `--help` clarification on `--gpu`: "No-op if CUDA_VISIBLE_DEVICES is already set in the calling shell."

## README streamlining

User feedback: the first two sections of "🚀 Get running in one lunch break" were too dense for a first-timer. Specifically the ~150-word Prerequisite paragraph after Step 1 (Miniforge, ~28 GB, both AlphaGenome backends, ChromBPNet streaming, lazy downloads, --all-chrombpnet, ENCODE fallback) and the 5-line Step 2 blockquote duplicating backend detail.

Restructure (top section now is just): four-step intro, "Before you start" mini-section with three bullets (Miniforge link, ~28 GB, platforms), Steps 1-4 each terse and copy-pasteable. The detail moves to existing deeper sections: the Step 2 blockquote was already duplicated in the deeper "Two AlphaGenome backends" subsection (same content, more detail), so it is simply dropped; the Prerequisite paragraph's disk-usage detail moves into a new `#### Disk usage breakdown` subsection at the top of `Installation — detailed`. Anchor links (`#disk-usage-breakdown`, `#two-alphagenome-backends`, `#where-the-oracle-weights-come-from`) all resolve.

## HANDOFF.md

Adds two short notes so future maintainers don't trip over the new behaviour: a callout between Phase 1 and Phase 2 reminding to run `--part merge` (or pass `--force`) before re-running with a different `--assay`; and a one-line note in the parallel-launch section explaining the `CUDA_VISIBLE_DEVICES` precedence.

## Tests

`pytest -m "not integration and not slow"` → 368 passed, 1 skipped, 5 deselected; same count as main, no regression.

Bug A's SystemExit-vs-overwrite branches are exercised by the helper's pure function; a smoke test that builds two tiny interims and asserts the diff message is left as a follow-up (it would need conda-env infra in CI to actually run the script, which is out of scope for this PR).

Closes #71, #72, #73, #74.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* README: drop the "Want to start in 2 minutes?" callout from Step 2

User feedback as part of the streamlining pass. The callout was an escape hatch into a one-oracle-only install for impatient readers, but it sits between Step 2 (the canonical full setup) and Step 3 (the runnable snippet that uses Enformer anyway), and a first-time reader following the linear flow doesn't need it interrupting the narrative. Anyone who specifically wants the lightweight starter will find `chorus setup --oracle <name>` documented in `Installation — detailed → Setting up oracle environments one-by-one` (already present, unchanged).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* README: more energetic Step 2/3/4 + reorder What-to-read-next

User feedback on the streamlined top:

- Step 2: title was a heavy spec ("Download all 6 oracles + hg38 + backgrounds"); replaced with "Get every oracle, weight, and reference — batteries included". Body trimmed of the multi-GB-tarball / CDF detail that new users don't care about.
- Step 3: title was utilitarian ("Predict — wild-type + SNP effect in one block"); replaced with "Your first prediction — score a SNP at the β-globin locus". Added two intro sentences above the code block explaining what the snippet does and why this prediction shape (one wild-type signal + N counter-factual variants) is the canonical chorus pattern.
- Step 4: biggest reorder. Old version led with "ships an MCP server with 22 tools, here's the full list" before any natural-language example. New version flips that: lead with one bash command + three concrete prompts the user can paste into Claude Code, then mention the 22-tool catalogue at the end as the deeper read. Title rewritten to convey "complex analyses without coding" ("Skip the code — drive chorus from Claude in plain English").
- What-to-read-next: reordered so the first two bullets are the discovery/exploration paths (Notebooks, Worked application examples; both prompt-driven) instead of the API recipes. API slipped to fourth.

No code or anchor changes; all internal links still resolve.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* README: tighten "What chorus is" + punch up application examples + Notebooks + MCP intros

Second-pass review. The audit was: where does the reader's interest flag and why? Five spots fixed:

- What chorus is: was three redundant bullet lists. The cover tagline already names the six oracles, and the "Pick an oracle" table right below has the per-oracle stats. The "Key features" sub-bullet list duplicated the lunch-break tour AND the later Key features section. Replaced with two short paragraphs that say what the cover doesn't: percentile-grounded outputs (`+0.45 log2FC → 0.962 effect %ile`), per-oracle conda isolation, chorus-controlled HF mirrors, 22-tool MCP.
- Worked application examples: title was a heading, now leads. New title: "Worked application examples — seven things you can do today." Intro now opens with the magic ("every example was generated end-to-end by Claude Code talking to chorus's MCP server, no code written by hand") instead of burying it.
- Notebooks: was "Three notebooks are provided, from introductory to advanced". Replaced with "Three sittings, zero to confident" + a sentence describing the user's actual progression. Per-notebook descriptions rewritten in second-person ("what you'll build") with plain-English summaries: the "I get it now" notebook, the "graduate-level" notebook.
- Pick an oracle: moved the wall-of-prose paragraph about CUDA / Apple Metal / tensorflow-metal / PyTorch MPS / JAX-Metal-falls-back-to-CPU out of "Pick an oracle". Replaced with a one-line "GPU detection is automatic" pointer to a new Platform & GPU support table inside Installation — detailed, where someone actually deciding what to install can find it.
- MCP server: was "Chorus includes an MCP server that lets AI assistants like Claude directly load oracles, predict variant effects, and analyze gene expression — all through natural language conversation." Lukewarm. Replaced with a one-liner that ties back to Step 4 of the lunch-break tour (which the reader just finished, and which is what hyped them in the first place) and previews what's in the rest of the section.

Net diff: -7 LOC. Doc is shorter, less redundant, and punches harder where it matters.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
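
The four branches listed under Bug A above can be sketched as follows. This is an illustration of the described behaviour, not the helper that was merged: the function name, the `track_ids` key and the message wording are assumptions, since the interim NPZ's real layout is not shown in this PR.

```python
import os
import tempfile

import numpy as np


def check_interim_compatibility_sketch(path, new_track_ids, force=False, key="track_ids"):
    """Refuse to overwrite an interim NPZ whose track set differs from the new one."""
    if force or not os.path.exists(path):
        return  # --force given, or first run: nothing to protect
    with np.load(path, allow_pickle=False) as npz:
        existing = set(npz[key].tolist())  # `key` is a hypothetical array name
    new = set(new_track_ids)
    if existing == new:
        return  # idempotent re-run, no data would be lost
    only_existing = sorted(existing - new)
    only_new = sorted(new - existing)
    raise SystemExit(
        f"refusing to overwrite {path}: "
        f"{len(only_existing)} track(s) only in existing (e.g. {only_existing[:3]}), "
        f"{len(only_new)} track(s) only in new (e.g. {only_new[:3]}); "
        "run --part merge / merge-incremental first, or pass --force"
    )


refused, message = False, ""
with tempfile.TemporaryDirectory() as tmp:
    interim = os.path.join(tmp, "interim.npz")
    np.savez_compressed(interim, track_ids=np.array(["ATAC:K562", "DNASE:HepG2"]))
    # Same track set: returns silently.
    check_interim_compatibility_sketch(interim, ["DNASE:HepG2", "ATAC:K562"])
    # Different track set with force: returns silently.
    check_interim_compatibility_sketch(interim, ["CHIP:K562:REST"], force=True)
    # Different track set without force: refuses.
    try:
        check_interim_compatibility_sketch(interim, ["CHIP:K562:REST"])
    except SystemExit as exc:
        refused, message = True, str(exc)

print(refused)
print(message)
```

Comparing sets rather than file sizes or timestamps is what lets an identical re-run pass while a second `--assay` pass is stopped before it can replace the first pass's tracks.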

lucapinello and others added 2 commits April 30, 2026 10:20

lucapinello marked this pull request as ready for review April 30, 2026 10:46

This was referenced Apr 30, 2026

scripts/build_backgrounds_chrombpnet.py: --part variants/baselines --assay X overwrites the interim NPZ #71

Closed

scripts/build_backgrounds_chrombpnet.py: --gpu N silently overrides outer CUDA_VISIBLE_DEVICES #72

Closed

lucapinello merged commit 8cf1cd9 into main Apr 30, 2026
1 check passed

lucapinello merged 2 commits into `main` from `audit/2026-04-29-chrombpnet-cdf-rebuild`
