Skip to content

fix(cdf-rebuild): protect interim NPZ + honour CUDA_VISIBLE_DEVICES; streamline README top#76

Merged
lucapinello merged 4 commits intomainfrom
fix/cdf-rebuild-tooling-readme-streamline
Apr 30, 2026
Merged

fix(cdf-rebuild): protect interim NPZ + honour CUDA_VISIBLE_DEVICES; streamline README top#76
lucapinello merged 4 commits intomainfrom
fix/cdf-rebuild-tooling-readme-streamline

Conversation

@lucapinello
Copy link
Copy Markdown
Contributor

fix(cdf-rebuild): protect interim NPZ + honour CUDA_VISIBLE_DEVICES; streamline README top

Three changes from the 0.4.0 follow-up triage. Single PR because the HANDOFF.md edits depend on both script fixes.

Closes

#73/#74 are exact duplicates of #71/#72 — opened ~5 min apart by concurrent agent triage. Closing both pairs.

Bug A (P1) — refuse-to-overwrite the interim NPZ

User's pick was the conservative fix: refuse to overwrite without --force, naming the conflicting track-id sets.

  • New --force argparse flag.
  • New _check_interim_compatibility(interim_path, new_track_ids, force, label) helper, called before each np.savez_compressed(interim_path, ...). Returns silently when the file doesn't exist (first run) or when the existing track set equals the new set (idempotent re-run); otherwise raises SystemExit with a diff (len(only_existing), len(only_new), three example track-ids per side) and a pointer at --part merge / --part merge-incremental / --force.
  • Matches existing raise SystemExit("...") validation-error convention in the file (lines 245, 247).

Append-on-write (#71's option (a)) was deferred — better ergonomics but a v0.5+ improvement.

Bug B (P2) — honour pre-set CUDA_VISIBLE_DEVICES

User's pick was the trivial pattern. 3 LOC:

- os.environ["CUDA_VISIBLE_DEVICES"] = str(args.gpu)
+ if "CUDA_VISIBLE_DEVICES" not in os.environ:
+     os.environ["CUDA_VISIBLE_DEVICES"] = str(args.gpu)

Plus a one-line --help clarification on --gpu: "No-op if CUDA_VISIBLE_DEVICES is already set in the calling shell."

When the env var is set, the outer wins — matches every cluster scheduler's precedence convention.

README streamlining

Reader feedback was that the first two sections of "🚀 Get running in one lunch break" were too dense for a first-timer:

  • A ~150-word Prerequisite paragraph under Step 1 (Miniforge, ~28 GB disk, both AlphaGenome backends, ChromBPNet slim mirror, lazy downloads, --all-chrombpnet, ENCODE fallback)
  • A 5-line Step 2 blockquote duplicating the backend-comparison detail that's already in the deeper #### Two AlphaGenome backends section

Restructure:

Before After
Step 1 → install commands → 150-word Prerequisite paragraph Three-bullet "Before you start" mini-section between intro and Step 1 (Miniforge link, ~28 GB, platforms); Step 1 = just install commands
Step 2 blockquote duplicating backend detail Step 2 = just chorus setup + the existing token / 2-min options
(no canonical home for disk-usage detail) New #### Disk usage breakdown subsection at the top of Installation — detailed with a per-bucket table

Anchor links (#disk-usage-breakdown, #two-alphagenome-backends, #where-the-oracle-weights-come-from) all resolve. The detail isn't lost — just moved to the deeper section that's already documented as the home for "edge cases: upgrading, per-oracle setup, token plumbing, manual genome management, and backgrounds."

The first screen now reads as: title → tagline → four-step intro → three-bullet "Before you start" → install commands. No prose-walls, no premature backend complexity.

HANDOFF.md (audits/2026-04-29_chrombpnet_cdf_rebuild/)

Two short notes so future maintainers don't re-trip over the same bugs:

  • A callout between Phase 1 and Phase 2 reminding to run --part merge (or merge-incremental, or pass --force) before re-running with a different --assay.
  • A one-line note in the parallel-launch section explaining the CUDA_VISIBLE_DEVICES/--gpu precedence.

Test plan

  • --help parses cleanly on the modified script. Both new flags visible (--force, updated --gpu help-text).
  • Fast suite green: pytest tests/ -q -m "not integration and not slow"368 passed, 1 skipped, 5 deselected (same count as main — no regression).
  • README anchor links resolve: #disk-usage-breakdown → new line 199; #two-alphagenome-backends → existing line 173; #where-the-oracle-weights-come-from → existing line 310.
  • No content lost: Other ChromBPNet cell types download lazily and 5–8× faster than JAX CPU distinctive strings appear exactly once each (in their new homes).
  • End-to-end --force smoke: would need conda-env infra in CI to actually invoke the script. Helper is a pure function over the existing track-id reservoir code — verified by inspection. Follow-up tracked at the file-level next time someone runs a partial CDF rebuild on an A100 box.

🤖 Generated with Claude Code

lucapinello and others added 4 commits April 30, 2026 08:46
…streamline README top

Three changes from the 0.4.0 follow-up triage. Single PR because the
HANDOFF.md tweak depends on both script fixes.

## Bug A (P1, #71/#73): refuse-to-overwrite the interim NPZ

`scripts/build_backgrounds_chrombpnet.py` was writing the interim NPZ
unconditionally with `np.savez_compressed(interim_path, ...)`. The
documented two-pass flow (`--assay ATAC_DNASE` then `--assay CHIP`)
silently overwrote the first pass's 42-track interim with the second
pass's 744 CHIP tracks, producing a 744-track final NPZ instead of the
expected 786 — caught by post-merge spot-check during PR #70 (~50 min
GPU recovery).

Conservative fix per the user's pick: refuse to overwrite without
`--force`, naming the conflicting track-id sets in the SystemExit
message. The new `_check_interim_compatibility()` helper runs before
each interim write site and:
  - returns silently if the path doesn't exist (first run);
  - returns silently if the existing track set equals the new set
    (idempotent re-run with no data loss);
  - raises SystemExit with a diff naming `len(only_existing)`,
    `len(only_new)`, and 3 example track-ids from each side, plus
    pointing at `--part merge` / `merge-incremental` / `--force`;
  - returns silently with `--force`.

## Bug B (P2, #72/#74): honour pre-set CUDA_VISIBLE_DEVICES

`load_models_and_setup()` was clobbering `os.environ["CUDA_VISIBLE_DEVICES"]`
with `--gpu N` unconditionally, so the documented parallel-launch
pattern (outer `CUDA_VISIBLE_DEVICES=N` per terminal pinning the
physical GPU) didn't work — both terminals landed on physical GPU 0,
fighting for memory until one OOMed.

Trivial pattern per the user's pick: only set the env var when nothing
was already set:

    if "CUDA_VISIBLE_DEVICES" not in os.environ:
        os.environ["CUDA_VISIBLE_DEVICES"] = str(args.gpu)

Plus a one-line `--help` clarification on `--gpu`: "No-op if
CUDA_VISIBLE_DEVICES is already set in the calling shell."

## README streamlining

User feedback: the first two sections of "🚀 Get running in one lunch
break" were too dense for a first-timer. Specifically the ~150-word
Prerequisite paragraph after Step 1 (Miniforge, ~28 GB, both AlphaGenome
backends, ChromBPNet streaming, lazy downloads, --all-chrombpnet,
ENCODE fallback) and the 5-line Step 2 blockquote duplicating backend
detail.

Restructure (top section now is just): four-step intro, "Before you
start" mini-section with three bullets (Miniforge link, ~28 GB,
platforms), Steps 1-4 each terse and copy-pasteable. The detail moves
to existing deeper sections — the Step 2 blockquote was already
duplicated in the deeper "Two AlphaGenome backends" subsection (same
content, more detail), so just drops; the Prerequisite paragraph's
disk-usage detail moves into a new `#### Disk usage breakdown`
subsection at the top of `Installation — detailed`. Anchor links
(`#disk-usage-breakdown`, `#two-alphagenome-backends`,
`#where-the-oracle-weights-come-from`) all resolve.

## HANDOFF.md

Adds two short notes so future maintainers don't trip over the new
behaviour: a callout between Phase 1 and Phase 2 reminding to run
`--part merge` (or pass `--force`) before re-running with a different
`--assay`; and a one-line note in the parallel-launch section
explaining the `CUDA_VISIBLE_DEVICES` precedence.

## Tests

`pytest -m "not integration and not slow"` → 368 passed, 1 skipped, 5
deselected — same count as main, no regression. Bug A's
SystemExit-vs-overwrite branches are exercised by the helper's pure
function; a smoke test that builds two tiny interims and asserts the
diff message is left as a follow-up (would need conda-env infra in CI
to actually run the script — out of scope for this PR).

Closes #71, #72, #73, #74.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User feedback as part of the streamlining pass. The callout was an
escape hatch into a one-oracle-only install for impatient readers,
but it sits between Step 2 (the canonical full setup) and Step 3
(the runnable snippet that uses Enformer anyway), and a first-time
reader following the linear flow doesn't need it interrupting the
narrative. Anyone who specifically wants the lightweight starter
will find `chorus setup --oracle <name>` documented in
`Installation — detailed → Setting up oracle environments
one-by-one` (already present, unchanged).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User feedback on the streamlined top:

Step 2 — title was a heavy spec ("Download all 6 oracles + hg38 +
backgrounds"); replaced with "Get every oracle, weight, and
reference — batteries included". Body trimmed of the
multi-GB-tarball / CDF detail that new users don't care about.

Step 3 — title was utilitarian ("Predict — wild-type + SNP effect
in one block"); replaced with "Your first prediction — score a SNP
at the β-globin locus". Added two intro sentences above the code
block explaining what the snippet does and why this prediction
shape (one wild-type signal + N counter-factual variants) is the
canonical chorus pattern.

Step 4 — biggest reorder. Old version led with "ships an MCP
server with 22 tools, here's the full list" before any natural-
language example. New version flips that: lead with one bash
command + three concrete prompts the user can paste into Claude
Code, then mention the 22-tool catalogue at the end as the deeper
read. Title rewritten to convey "complex analyses without coding"
("Skip the code — drive chorus from Claude in plain English").

What-to-read-next reordered so the first two bullets are the
discovery/exploration paths (Notebooks, Worked application
examples — both prompt-driven) instead of the API recipes. API
slipped to fourth.

No code or anchor changes; all internal links still resolve.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tebooks + MCP intros

Second-pass review. The audit was: where does the reader's interest
flag and why? Five spots fixed:

What chorus is — was three redundant bullet lists. The cover tagline
already names the six oracles, and the "Pick an oracle" table right
below has the per-oracle stats. The "Key features" sub-bullet list
duplicated the lunch-break tour AND the later Key features section.
Replaced with two short paragraphs that say what the cover doesn't:
percentile-grounded outputs (`+0.45 log2FC → 0.962 effect %ile`), per-
oracle conda isolation, chorus-controlled HF mirrors, 22-tool MCP.

Worked application examples — title was a heading, now leads. New
title: "Worked application examples — seven things you can do today."
Intro now opens with the magic ("every example was generated end-to-
end by Claude Code talking to chorus's MCP server, no code written by
hand") instead of burying it.

Notebooks — was "Three notebooks are provided, from introductory to
advanced". Replaced with "Three sittings, zero to confident" + a
sentence describing the user's actual progression. Per-notebook
descriptions rewritten in second-person ("what you'll build") with
plain-English summaries — the "I get it now" notebook, the "graduate-
level" notebook.

Pick an oracle — moved the wall-of-prose paragraph about CUDA / Apple
Metal / tensorflow-metal / PyTorch MPS / JAX-Metal-falls-back-to-CPU
out of "Pick an oracle". Replaced with a one-line "GPU detection is
automatic" pointer to a new Platform & GPU support table inside
Installation — detailed, where someone actually deciding what to
install can find it.

MCP server — was "Chorus includes an MCP server that lets AI
assistants like Claude directly load oracles, predict variant effects,
and analyze gene expression — all through natural language
conversation." Lukewarm. Replaced with a one-liner that ties back to
Step 4 of the lunch-break tour (which the reader just finished, and
which is what hyped them in the first place) and previews what's in
the rest of the section.

Net diff: -7 LOC. Doc is shorter, less redundant, and punches harder
where it matters.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant