
audit: rebuild ChromBPNet CDFs against chrombpnet_nobias (0.3.0+ default) #70

Merged: lucapinello merged 2 commits into main from audit/2026-04-29-chrombpnet-cdf-rebuild on Apr 30, 2026

@lucapinello (Contributor)

Summary

Built a fresh `chrombpnet_pertrack.npz` (786 tracks, 78.6 MB, sha256 `be61e9e8...`) on Linux/CUDA (2× A100 80 GB) following the handoff at `audits/2026-04-29_chrombpnet_cdf_rebuild/HANDOFF.md`. NPZ validated and ready for upload to `lucapinello/chorus-backgrounds`; upload itself is blocked on a write-scope HF token (the cached read-only token returned 403).

Headline numbers

| Check | Result |
|---|---|
| Track count | 786 (22 ATAC + 20 DNASE + 744 CHIP); matches the handoff's expected count |
| Effect / summary / perbin reservoirs | Full at 9609 / 29004 / 928128; zero failed model loads |
| All CDFs monotone | Yes (786 rows × 3 CDFs) |
| NaN / Inf in any CDF | ✅ None |
| Wall total | ~10 h with parallel-GPU phases (vs. handoff estimate of 13-25 h) |
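
The monotonicity and NaN/Inf checks above can be sketched as follows. This is a minimal illustration, not the validation code used for the rebuild: it assumes each CDF is available as rows of floats (one row per track) and reads "monotone" as non-decreasing. The function names are hypothetical, and the NPZ's actual key names and layout are not shown in this PR.

```python
import math
from typing import Iterable, Sequence


def cdf_row_is_valid(row: Sequence[float]) -> bool:
    """Return True if every value is finite and the row never decreases."""
    if not all(math.isfinite(v) for v in row):
        return False  # NaN or +/-Inf present
    return all(a <= b for a, b in zip(row, row[1:]))


def count_invalid_rows(rows: Iterable[Sequence[float]]) -> int:
    """Count rows that fail the finite + non-decreasing check."""
    return sum(1 for row in rows if not cdf_row_is_valid(row))
```

On NumPy arrays the same check would typically be written with `np.isfinite` and `np.diff`; the pure-Python form is used here only to keep the sketch dependency-free.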

Magnitude shift vs. 0.2.x CDFs (matches handoff prediction of 5-30 %)

| Track | Old p95 | New p95 | Shift % |
|---|---|---|---|
| ATAC:K562 | 0.1621 | 0.1976 | +21.9% |
| DNASE:HepG2 | 0.1884 | 0.2436 | +29.3% |
| ATAC:HepG2 | 0.1332 | 0.1517 | +13.8% |
| ATAC:GM12878 | 0.1692 | 0.1934 | +14.3% |
| DNASE:K562 | 0.1462 | 0.1659 | +13.5% |
| CHIP:K562:REST | 0.0529 | 0.0529 | 0.0% |
| CHIP:HepG2:CTCF | 0.0829 | 0.0828 | −0.1% |

ATAC/DNase shifts reflect the bias correction (`chrombpnet_nobias` strips the enzymatic motif preferences the bias-aware variant carried). CHIP/BPNet tracks are unchanged because BPNet's catalogue is already nobias-equivalent.
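
For reference, the shift column appears to be the relative change of the new p95 against the old one. The sketch below assumes that definition (the helper name is hypothetical) and uses values copied from the table. Because the table's p95 values are rounded to four decimals, recomputing from them can differ from the listed shift in the last digit.

```python
def p95_shift_percent(old_p95: float, new_p95: float) -> float:
    """Relative change of the new p95 against the old p95, in percent."""
    return (new_p95 - old_p95) / old_p95 * 100.0


# (old p95, new p95) pairs copied from the table above.
rows = {
    "ATAC:K562": (0.1621, 0.1976),
    "DNASE:HepG2": (0.1884, 0.2436),
    "CHIP:K562:REST": (0.0529, 0.0529),
}

for track, (old, new) in rows.items():
    print(f"{track}: {p95_shift_percent(old, new):+.1f}%")
```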

What's blocking the upload

The cached HF token in this env (`HfApi().whoami() → lucapinello, auths: []`) is read-only:

```
huggingface_hub.errors.HfHubHTTPError: 403 Forbidden: you must use a write token to upload to a repository.
```

To finish the rebuild: maintainer re-logs in with a write-scope token (or sets `HF_TOKEN` to one) and runs the upload command from `report.md` "Upload status". The NPZ at `~/.chorus/backgrounds/chrombpnet_pertrack.npz` is unmodified since the merge step — same SHA256 as recorded in the report.

Two script bugs surfaced (worth filing as follow-ups; recoverable)

  1. `--part variants/baselines --assay X` overwrites the interim NPZ. Following the handoff's documented sequence (`--assay ATAC_DNASE` then `--assay CHIP`) silently lost the 42 ATAC/DNase tracks during the merge — the second invocation replaced the interim file rather than appending. Caught by the post-merge spot-check (744 ≠ 786). Recovered via a Phase 1 redo (~50 min) + `--part merge-incremental`. The script should either append-on-write or refuse to overwrite without `--force`.
  2. `--gpu N` silently overrides outer `CUDA_VISIBLE_DEVICES`. `scripts/build_backgrounds_chrombpnet.py:200` does `os.environ["CUDA_VISIBLE_DEVICES"] = str(args.gpu)` unconditionally, so the parallel-launch pattern `CUDA_VISIBLE_DEVICES=1 ... --gpu 0` puts the second job on physical GPU 0 instead of 1. Hit during the Phase 1 redo and OOM'd. The script should respect a pre-set `CUDA_VISIBLE_DEVICES` if present and only set it from `--gpu N` when no env var was provided.

Files

  • `audits/2026-04-29_chrombpnet_cdf_rebuild/report.md` — full audit narrative
  • 9 build logs (`bg_chrombpnet_*.log`) — per-phase, including the failed first-redo for evidence
  • The NPZ itself is not in this PR (78 MB) — sha256 captured in the report so the maintainer can verify post-upload

Test plan

  • Maintainer authenticates with a write-scope HF token and runs the upload command in `report.md`
  • After upload, verify the NPZ on HF has sha256 `be61e9e8f9b919b43c599b7fbc9deb74f8f1e6dc1da5e2cdb92036a85bf13205`
  • Spot-check that `chorus.analysis.normalization.get_pertrack_normalizer("chrombpnet")` picks up the new NPZ (auto-download from HF)
  • Ranking sanity: pick one large-effect SNP at a known regulatory element (e.g. SORT1 rs12740374) and confirm relative cell-type rankings preserved between old and new CDFs
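
The sha256 check in the test plan could be done with a short script like the one below. It is a sketch using only the standard library; the local path is the one named earlier in this PR, and how the re-downloaded copy is obtained is left out.

```python
import hashlib
from pathlib import Path

# Digest recorded in this PR for the rebuilt NPZ.
EXPECTED = "be61e9e8f9b919b43c599b7fbc9deb74f8f1e6dc1da5e2cdb92036a85bf13205"


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a ~78 MB NPZ is never fully held in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    npz = Path.home() / ".chorus" / "backgrounds" / "chrombpnet_pertrack.npz"
    if npz.exists():
        actual = sha256_of_file(npz)
        print("match" if actual == EXPECTED else f"MISMATCH: {actual}")
```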

🤖 Generated with Claude Code

lucapinello and others added 2 commits April 30, 2026 10:20
…ult)

Built a fresh `chrombpnet_pertrack.npz` (786 tracks, 78.6 MB,
sha256 be61e9e8...) on Linux/CUDA (2× A100 80 GB) following the
handoff at audits/2026-04-29_chrombpnet_cdf_rebuild/HANDOFF.md.

NPZ validated and ready for upload; upload itself is blocked on a
write-scope HF token (the cached read-only token returned 403). See
report.md "Upload status" for the maintainer command.

Headline numbers:
- 786 tracks (22 ATAC + 20 DNASE + 744 CHIP) — matches expected count
- All CDFs monotone, no NaN/Inf
- Every reservoir filled to spec (effect 9609, summary 29004,
  perbin 928128) — zero failed model loads
- ~10 h wall total with parallel-GPU phases (vs handoff estimate 13-25 h)

Magnitude shift vs 0.2.x CDFs (matches handoff prediction of 5-30 % on
ambiguous cases / ~0 % on stable ones):
- ATAC/DNase tracks: +13.5 % to +29.3 % at p95 — bias correction
  stripped the enzymatic motif preferences the bias-aware variant carried
- CHIP/BPNet tracks: ~0 % — BPNet's catalogue is already
  nobias-equivalent

Two follow-ups surfaced (filing as separate issues):
- F1: `--part variants/baselines --assay X` overwrites the interim
  NPZ rather than appending. Following the handoff verbatim caused a
  recoverable (~50 min cost) silent data loss; recovered via Phase 1
  redo + `--part merge-incremental`.
- F2: `--gpu N` arg silently overrides outer `CUDA_VISIBLE_DEVICES`,
  which mis-routed parallel jobs onto the same physical GPU.

Files:
- report.md: full audit narrative, spot-check, diff, follow-ups
- bg_chrombpnet_*.log (9 files): per-phase build logs

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cached HF token in the rebuild box was read-only; maintainer supplied
a write-scope token after the first-pass 403. NPZ uploaded as commit
008483049f8ba75701190db3c17077343c52beb5 on lucapinello/chorus-backgrounds.

Verified by re-downloading from HF and comparing sha256 — matches the
locally-built file (be61e9e8...) exactly.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lucapinello lucapinello marked this pull request as ready for review April 30, 2026 10:46
@lucapinello lucapinello merged commit 8cf1cd9 into main Apr 30, 2026
1 check passed
lucapinello added a commit that referenced this pull request Apr 30, 2026
…streamline README top (#76)

* fix(cdf-rebuild): protect interim NPZ + honour CUDA_VISIBLE_DEVICES; streamline README top

Three changes from the 0.4.0 follow-up triage. Single PR because the
HANDOFF.md tweak depends on both script fixes.

## Bug A (P1, #71/#73): refuse-to-overwrite the interim NPZ

`scripts/build_backgrounds_chrombpnet.py` was writing the interim NPZ
unconditionally with `np.savez_compressed(interim_path, ...)`. The
documented two-pass flow (`--assay ATAC_DNASE` then `--assay CHIP`)
silently overwrote the first pass's 42-track interim with the second
pass's 744 CHIP tracks, producing a 744-track final NPZ instead of the
expected 786 — caught by post-merge spot-check during PR #70 (~50 min
GPU recovery).

Conservative fix per the user's pick: refuse to overwrite without
`--force`, naming the conflicting track-id sets in the SystemExit
message. The new `_check_interim_compatibility()` helper runs before
each interim write site and:
  - returns silently if the path doesn't exist (first run);
  - returns silently if the existing track set equals the new set
    (idempotent re-run with no data loss);
  - raises SystemExit with a diff naming `len(only_existing)`,
    `len(only_new)`, and 3 example track-ids from each side, plus
    pointing at `--part merge` / `merge-incremental` / `--force`;
  - returns silently with `--force`.
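
The four branches above can be sketched as a pure function. This is an illustration of the described behaviour, not the source of `_check_interim_compatibility()`, which is not reproduced on this page. The signature is hypothetical: the existing track IDs are passed in directly (`None` meaning no interim file exists yet) rather than read from the NPZ.

```python
from typing import Iterable, Optional


def check_interim_compatibility(
    existing_ids: Optional[Iterable[str]],
    new_ids: Iterable[str],
    force: bool = False,
) -> None:
    """Refuse to replace an interim file whose track set differs.

    existing_ids is None when no interim file exists yet (first run).
    """
    if existing_ids is None or force:
        return  # nothing to lose, or the caller explicitly accepted the loss
    existing, new = set(existing_ids), set(new_ids)
    if existing == new:
        return  # idempotent re-run: same tracks, no data loss
    only_existing = sorted(existing - new)
    only_new = sorted(new - existing)
    raise SystemExit(
        f"Refusing to overwrite interim NPZ: {len(only_existing)} track(s) "
        f"only in the existing file (e.g. {only_existing[:3]}), "
        f"{len(only_new)} only in the new set (e.g. {only_new[:3]}). "
        "Run --part merge / merge-incremental first, or pass --force."
    )
```

Keeping the check free of file I/O is one way to make each branch testable without GPU or conda-env infrastructure.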

## Bug B (P2, #72/#74): honour pre-set CUDA_VISIBLE_DEVICES

`load_models_and_setup()` was clobbering `os.environ["CUDA_VISIBLE_DEVICES"]`
with `--gpu N` unconditionally, so the documented parallel-launch
pattern (outer `CUDA_VISIBLE_DEVICES=N` per terminal pinning the
physical GPU) didn't work — both terminals landed on physical GPU 0,
fighting for memory until one OOMed.

Trivial pattern per the user's pick: only set the env var when nothing
was already set:

    if "CUDA_VISIBLE_DEVICES" not in os.environ:
        os.environ["CUDA_VISIBLE_DEVICES"] = str(args.gpu)

Plus a one-line `--help` clarification on `--gpu`: "No-op if
CUDA_VISIBLE_DEVICES is already set in the calling shell."

## README streamlining

User feedback: the first two sections of "🚀 Get running in one lunch
break" were too dense for a first-timer. Specifically the ~150-word
Prerequisite paragraph after Step 1 (Miniforge, ~28 GB, both AlphaGenome
backends, ChromBPNet streaming, lazy downloads, --all-chrombpnet,
ENCODE fallback) and the 5-line Step 2 blockquote duplicating backend
detail.

Restructure: the top section is now just a four-step intro, a "Before you
start" mini-section with three bullets (Miniforge link, ~28 GB,
platforms), and Steps 1-4, each terse and copy-pasteable. The detail moves
to existing deeper sections. The Step 2 blockquote was already
duplicated in the deeper "Two AlphaGenome backends" subsection (same
content, more detail), so it is simply dropped; the Prerequisite paragraph's
disk-usage detail moves into a new `#### Disk usage breakdown`
subsection at the top of `Installation — detailed`. Anchor links
(`#disk-usage-breakdown`, `#two-alphagenome-backends`,
`#where-the-oracle-weights-come-from`) all resolve.

## HANDOFF.md

Adds two short notes so future maintainers don't trip over the new
behaviour: a callout between Phase 1 and Phase 2 reminding to run
`--part merge` (or pass `--force`) before re-running with a different
`--assay`; and a one-line note in the parallel-launch section
explaining the `CUDA_VISIBLE_DEVICES` precedence.

## Tests

`pytest -m "not integration and not slow"` → 368 passed, 1 skipped, 5
deselected — same count as main, no regression. Bug A's
SystemExit-vs-overwrite branches are exercised by the helper's pure
function; a smoke test that builds two tiny interims and asserts the
diff message is left as a follow-up (would need conda-env infra in CI
to actually run the script — out of scope for this PR).

Closes #71, #72, #73, #74.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* README: drop the "Want to start in 2 minutes?" callout from Step 2

User feedback as part of the streamlining pass. The callout was an
escape hatch into a one-oracle-only install for impatient readers,
but it sits between Step 2 (the canonical full setup) and Step 3
(the runnable snippet that uses Enformer anyway), and a first-time
reader following the linear flow doesn't need it interrupting the
narrative. Anyone who specifically wants the lightweight starter
will find `chorus setup --oracle <name>` documented in
`Installation — detailed → Setting up oracle environments
one-by-one` (already present, unchanged).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* README: more energetic Step 2/3/4 + reorder What-to-read-next

User feedback on the streamlined top:

Step 2 — title was a heavy spec ("Download all 6 oracles + hg38 +
backgrounds"); replaced with "Get every oracle, weight, and
reference — batteries included". Body trimmed of the
multi-GB-tarball / CDF detail that new users don't care about.

Step 3 — title was utilitarian ("Predict — wild-type + SNP effect
in one block"); replaced with "Your first prediction — score a SNP
at the β-globin locus". Added two intro sentences above the code
block explaining what the snippet does and why this prediction
shape (one wild-type signal + N counter-factual variants) is the
canonical chorus pattern.

Step 4 — biggest reorder. Old version led with "ships an MCP
server with 22 tools, here's the full list" before any natural-
language example. New version flips that: lead with one bash
command + three concrete prompts the user can paste into Claude
Code, then mention the 22-tool catalogue at the end as the deeper
read. Title rewritten to convey "complex analyses without coding"
("Skip the code — drive chorus from Claude in plain English").

What-to-read-next reordered so the first two bullets are the
discovery/exploration paths (Notebooks, Worked application
examples — both prompt-driven) instead of the API recipes. API
slipped to fourth.

No code or anchor changes; all internal links still resolve.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* README: tighten "What chorus is" + punch up application examples + Notebooks + MCP intros

Second-pass review. The audit was: where does the reader's interest
flag and why? Five spots fixed:

What chorus is — was three redundant bullet lists. The cover tagline
already names the six oracles, and the "Pick an oracle" table right
below has the per-oracle stats. The "Key features" sub-bullet list
duplicated the lunch-break tour AND the later Key features section.
Replaced with two short paragraphs that say what the cover doesn't:
percentile-grounded outputs (`+0.45 log2FC → 0.962 effect %ile`), per-
oracle conda isolation, chorus-controlled HF mirrors, 22-tool MCP.

Worked application examples — title was a heading, now leads. New
title: "Worked application examples — seven things you can do today."
Intro now opens with the magic ("every example was generated end-to-
end by Claude Code talking to chorus's MCP server, no code written by
hand") instead of burying it.

Notebooks — was "Three notebooks are provided, from introductory to
advanced". Replaced with "Three sittings, zero to confident" + a
sentence describing the user's actual progression. Per-notebook
descriptions rewritten in second-person ("what you'll build") with
plain-English summaries — the "I get it now" notebook, the "graduate-
level" notebook.

Pick an oracle — moved the wall-of-prose paragraph about CUDA / Apple
Metal / tensorflow-metal / PyTorch MPS / JAX-Metal-falls-back-to-CPU
out of "Pick an oracle". Replaced with a one-line "GPU detection is
automatic" pointer to a new Platform & GPU support table inside
Installation — detailed, where someone actually deciding what to
install can find it.

MCP server — was "Chorus includes an MCP server that lets AI
assistants like Claude directly load oracles, predict variant effects,
and analyze gene expression — all through natural language
conversation." Lukewarm. Replaced with a one-liner that ties back to
Step 4 of the lunch-break tour (which the reader just finished, and
which is what hyped them in the first place) and previews what's in
the rest of the section.

Net diff: -7 LOC. Doc is shorter, less redundant, and punches harder
where it matters.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>