Commit 8c2243c

lucapinello, lp698, and claude authored
prep: ChromBPNet CDF rebuild against chrombpnet_nobias (script + handoff) (#67)
* prep(chrombpnet-cdf): wire --model-type flag + CUDA-box handoff doc Prep work for the deferred CDF rebuild against `chrombpnet_nobias` that the 0.3.0 audit at `audits/2026-04-28_chrombpnet_slim_mirror/` flagged as a follow-up. Background: chrombpnet_pertrack.npz on huggingface.co/datasets/lucapinello/chorus-backgrounds was built in 0.2.x against the bias-aware `chrombpnet` variant. After 0.3.0 flipped the default to `chrombpnet_nobias` (bias-corrected), user-facing percentile lookups go: `chrombpnet_nobias` predictions → `chrombpnet` empirical CDFs. The bias systematically shifts the mapping. Two changes in this prep commit, no compute yet: 1. scripts/build_backgrounds_chrombpnet.py: add `--model-type` argparse flag (default `chrombpnet_nobias`, matches 0.3+ chorus default). The build script previously called `oracle.load_pretrained_model( fold=args.fold, **spec)` without specifying model_type, so it inherited the oracle's default — which was `chrombpnet` in 0.2.x and is `chrombpnet_nobias` post-0.3. The new flag makes this explicit: re-running today produces the right CDF, and a future maintainer can pin `--model-type chrombpnet` for ablation against the legacy variant. Help text references the rebuild audit dir. 2. audits/2026-04-29_chrombpnet_cdf_rebuild/HANDOFF.md: agent handoff for running the rebuild on a CUDA box. Estimated wall-clock: ~3-5 h ATAC/DNase + ~10-20 h CHIP/BPNet on A100 (vs ~78 h on M3 Ultra Metal — too slow for the macOS dev machine). Includes the exact `--shard / --shard-of` commands for 2-GPU parallelism, the spot-check before upload, and the upload command (HfApi to lucapinello/chorus-backgrounds, dataset repo). The actual rebuild (compute + upload) gets done on the user's lab CUDA box in a separate run. This commit is just the prep so the script is unambiguous and the handoff is documented. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(handoff): note huggingface_hub install gotcha for older envs --------- Co-authored-by: lp698 <lp698@dimm2fv07n65x.partners.org> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent b3bf94a commit 8c2243c

2 files changed

Lines changed: 229 additions & 1 deletion

audits/2026-04-29_chrombpnet_cdf_rebuild/HANDOFF.md

Lines changed: 210 additions & 0 deletions

@@ -0,0 +1,210 @@
# ChromBPNet CDF rebuild against `chrombpnet_nobias` — agent handoff

You're picking up the ChromBPNet per-track CDF rebuild that the 0.3.0
release (`9e42bfd`) called out as a deferred follow-up. Goal: produce
a fresh `chrombpnet_pertrack.npz` whose percentile lookups match what
`oracle.predict()` returns for the post-0.3 default
(`model_type='chrombpnet_nobias'`), and upload it to
`huggingface.co/datasets/lucapinello/chorus-backgrounds`.

## Why

The CDFs on HF today were built in 0.2.x against the **bias-aware**
`chrombpnet` variant. Chorus 0.3.0 flipped the default to
`chrombpnet_nobias` (bias-corrected). User-facing percentile
assignments now look up `chrombpnet_nobias` predictions against
`chrombpnet` empirical distributions — the bias systematically
shifts the mapping. The audit at
`audits/2026-04-28_chrombpnet_slim_mirror/report.md` flagged this as a
follow-up: "a future release should rebuild the CDFs against
`chrombpnet_nobias` to make the percentiles point at the matching
distribution."
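The mismatch is easy to see with a toy empirical CDF. A minimal sketch with synthetic numbers, assuming for illustration that the bias acts as a roughly constant offset (the real shift is track-dependent):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for a legacy `chrombpnet` (bias-aware) per-track CDF:
# 10k sorted empirical effect sizes, shifted by a hypothetical bias of +0.3.
legacy_cdf = np.sort(rng.normal(loc=0.3, scale=1.0, size=10_000))
# Matching `chrombpnet_nobias` CDF: same signal with the bias removed.
nobias_cdf = legacy_cdf - 0.3

pred = 0.0  # a neutral `chrombpnet_nobias` prediction
# Percentile lookup = rank of the prediction within the sorted empirical CDF.
pct_mismatched = np.searchsorted(legacy_cdf, pred) / legacy_cdf.size
pct_matched = np.searchsorted(nobias_cdf, pred) / nobias_cdf.size
print(f"vs legacy CDF: {pct_mismatched:.2f}, vs rebuilt CDF: {pct_matched:.2f}")
```

A neutral prediction lands near the 50th percentile against the matching distribution but is pushed systematically low against the legacy one, which is exactly the skew the rebuild removes.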
## Hardware needed

- **Linux + CUDA GPU** (A100 or similar; chorus's lab box is fine)
- ~50 GB free disk for the slim-mirror models + interim CDF shards
- The `chorus-chrombpnet` conda env (TensorFlow, CUDA-enabled). Build
  it with `chorus setup --oracle chrombpnet` if you don't have it.

Estimated wall-clock, using the per-model numbers now tracked in
`scripts/build_backgrounds_chrombpnet.py`'s `--assay` help text:

- ATAC/DNase (42 models): ~22 min/model × 42 = **~15 hours on Metal**.
  Should be ~3-5× faster on A100 → **~3-5 hours**.
- CHIP/BPNet (1259 models): ~3 min/model × 1259 = **~63 hours on Metal**.
  ~10-20 hours on A100.

Total CUDA budget: **~13-25 hours** — well suited to an overnight run,
with `--shard` parallelism if you want to split across GPUs.
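Quick back-of-envelope arithmetic behind those windows, assuming the same ~3-5× A100 speedup for both assay families (the text above only states it for ATAC/DNase):

```python
# Per-model Metal timings from the --assay help text, scaled to full runs.
atac_metal_h = 22 * 42 / 60    # ~22 min/model over 42 ATAC/DNase models
chip_metal_h = 3 * 1259 / 60   # ~3 min/model over 1259 CHIP/BPNet models

# Assumed ~3-5x A100 speedup reproduces the quoted windows.
print(f"ATAC/DNase: ~{atac_metal_h:.0f} h Metal, ~{atac_metal_h/5:.0f}-{atac_metal_h/3:.0f} h A100")
print(f"CHIP/BPNet: ~{chip_metal_h:.0f} h Metal, ~{chip_metal_h/5:.0f}-{chip_metal_h/3:.0f} h A100")
```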
## What's already done

- `scripts/build_backgrounds_chrombpnet.py` was updated in commit
  `<this-commit>` to take `--model-type` (default `chrombpnet_nobias`,
  matching the 0.3+ chorus default). Re-running today produces the
  right CDFs without a flag.
- The slim HF mirror at `lucapinello/chorus-chrombpnet-slim` already
  ships fold-0 `chrombpnet_nobias` for all 42 ATAC/DNase models +
  all 744 BPNet/CHIP models (1.49 GB total). The build script will
  auto-fetch from there — no ENCODE tarballs needed for the rebuild.

## Run the build

From the repo root, on the CUDA box:

```bash
cd chorus
git pull origin main  # make sure you have the --model-type flag
mkdir -p logs         # the tee targets below live under logs/

# Confirm the env exists
mamba env list | grep chrombpnet

# IMPORTANT — verify huggingface_hub is in the env. The
# chorus-chrombpnet env yml (post-PR-#60) lists huggingface_hub>=0.20.0,
# but older installs predate that and need a manual install. Without
# huggingface_hub the build script falls back to the ~700 MB-per-model
# ENCODE tarball flow, which turns a 3-5 h ATAC/DNase run into a 30+ h
# run because it re-downloads tarballs we don't need.
mamba run -n chorus-chrombpnet python -c "import huggingface_hub" \
  || mamba run -n chorus-chrombpnet pip install "huggingface_hub>=0.20.0"

# Confirm the slim mirror is reachable + has all 786 models
mamba run -n chorus-chrombpnet python -c "
from chorus.oracles.chrombpnet_source.chrombpnet_globals import (
    iter_unique_models, iter_unique_bpnet_models,
)
print('ATAC/DNASE models:', len(list(iter_unique_models())))
print('CHIP/BPNet models:', len(list(iter_unique_bpnet_models())))
"
# Expect: 42 / 1259 (note: the 744 figure in the slim-mirror manifest is
# de-duped; the 1259 includes per-cell-type duplicates that point at
# the same h5 file in the mirror).

# === Phase 1: ATAC/DNase models (42 models, ~3-5 h on A100) ===
mamba run -n chorus-chrombpnet python scripts/build_backgrounds_chrombpnet.py \
    --part variants --assay ATAC_DNASE --gpu 0 \
    --model-type chrombpnet_nobias \
    2>&1 | tee logs/bg_chrombpnet_variants_atac.log

mamba run -n chorus-chrombpnet python scripts/build_backgrounds_chrombpnet.py \
    --part baselines --assay ATAC_DNASE --gpu 0 \
    --model-type chrombpnet_nobias \
    2>&1 | tee logs/bg_chrombpnet_baselines_atac.log

# === Phase 2: CHIP/BPNet models (1259 models, ~10-20 h on A100) ===
mamba run -n chorus-chrombpnet python scripts/build_backgrounds_chrombpnet.py \
    --part variants --assay CHIP --gpu 0 \
    --model-type chrombpnet_nobias \
    2>&1 | tee logs/bg_chrombpnet_variants_chip.log

mamba run -n chorus-chrombpnet python scripts/build_backgrounds_chrombpnet.py \
    --part baselines --assay CHIP --gpu 0 \
    --model-type chrombpnet_nobias \
    2>&1 | tee logs/bg_chrombpnet_baselines_chip.log

# === Phase 3: merge interim shards into one NPZ ===
mamba run -n chorus python scripts/build_backgrounds_chrombpnet.py --part merge \
    2>&1 | tee logs/bg_chrombpnet_merge.log

# Output: ~/.chorus/backgrounds/chrombpnet_pertrack.npz
```

If you have 2 GPUs available, parallelize the CHIP phase across them:

```bash
# Terminal 1 (GPU 0):
... --part variants  --assay CHIP --gpu 0 --shard 0 --shard-of 2 ...
... --part baselines --assay CHIP --gpu 0 --shard 0 --shard-of 2 ...

# Terminal 2 (GPU 1):
... --part variants  --assay CHIP --gpu 1 --shard 1 --shard-of 2 ...
... --part baselines --assay CHIP --gpu 1 --shard 1 --shard-of 2 ...

# After both finish:
mamba run -n chorus python scripts/build_backgrounds_chrombpnet.py --part merge-shards
```
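The handoff doesn't spell out how `--shard N --shard-of M` partitions the work; a plausible sketch is a deterministic round-robin split over the model list (hypothetical helper — the real script's scheme may differ, but any deterministic disjoint cover gives the same safety property):

```python
def shard_models(models, shard, shard_of):
    """Round-robin split: shard k of M gets every M-th model starting at index k."""
    return [m for i, m in enumerate(models) if i % shard_of == shard]

# Hypothetical model IDs standing in for the 1259 CHIP/BPNet entries.
models = [f"chip_model_{i:04d}" for i in range(1259)]
gpu0 = shard_models(models, 0, 2)
gpu1 = shard_models(models, 1, 2)
print(len(gpu0), len(gpu1))  # the two shards cover all 1259 models, no overlap
```

Because the split depends only on the index, both terminals see the same partition and the merge step can safely assume every model was processed exactly once.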
## Spot-check before upload

```bash
mamba run -n chorus python -c "
import os
import numpy as np
# np.load does not expand '~', so expand it explicitly
path = os.path.expanduser('~/.chorus/backgrounds/chrombpnet_pertrack.npz')
npz = np.load(path, allow_pickle=True)
print('Keys:', list(npz.keys()))
print('Track count:', len(npz['track_ids']))
print('Effect counts dist:', np.percentile(npz['effect_counts'], [0, 25, 50, 75, 100]))
print('Per-track signedness:', npz['signed_flags'][:5], '... (n_signed=', npz['signed_flags'].sum(), ')')
# Sanity: percentiles are monotone
for r in npz['summary_cdfs'][:5]:
    n = r.shape[0]
    p50, p95, p99 = int(.50*n), int(.95*n), int(.99*n)
    assert r[p50] <= r[p95] <= r[p99], 'CDF not monotone'
print('All sanity checks passed')
"
```

Pass criteria:

- 786 tracks (or whatever the current model catalogue is — same count as
  the old NPZ minus any since-removed models)
- All `effect_counts > 0`
- Summary CDFs monotone: p50 ≤ p95 ≤ p99
- No NaN / Inf in any CDF
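The spot-check above samples only three quantiles per track. The last two pass criteria can be checked exhaustively per CDF with two numpy one-liners; a sketch on synthetic data (`summary_cdfs` rows would be the real input):

```python
import numpy as np

def cdf_passes(cdf: np.ndarray) -> bool:
    """Full pass criteria for one per-track CDF: all finite and non-decreasing."""
    return bool(np.isfinite(cdf).all() and (np.diff(cdf) >= 0).all())

good = np.linspace(-2.0, 3.0, 10_000)  # a well-formed synthetic CDF
bad = good.copy()
bad[5_000] = np.nan                    # inject a NaN to show the failure mode
print(cdf_passes(good), cdf_passes(bad))  # True False
```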
## Upload to HF

```bash
mamba run -n chorus python -c "
import os
from huggingface_hub import HfApi
api = HfApi(token=os.environ['HF_TOKEN'])
api.upload_file(
    path_or_fileobj=os.path.expanduser('~/.chorus/backgrounds/chrombpnet_pertrack.npz'),
    path_in_repo='chrombpnet_pertrack.npz',
    repo_id='lucapinello/chorus-backgrounds',
    repo_type='dataset',
    commit_message='Rebuild ChromBPNet CDFs against chrombpnet_nobias (0.3.0+ default)',
)
"
```

(You'll need an HF token with write access to `lucapinello/chorus-backgrounds`
in your env — the same token used for other chorus uploads.)

## Write up the audit

In `audits/2026-04-29_chrombpnet_cdf_rebuild/report.md`, record:

- Hardware (GPU model, distro, RAM)
- Wall-clock per phase (variants + baselines for each assay family)
- Track count + sample percentiles (one or two cell types) showing the
  shift from the old `chrombpnet` CDFs
- Any models that failed to load (the build script logs warnings;
  collect the count and make sure it's small / reproducible)
- Hash of the uploaded NPZ for reproducibility

Then commit the audit + log files on a new branch
`audit/2026-04-29-chrombpnet-cdf-rebuild` and open a PR titled
"audit: rebuild ChromBPNet CDFs against `chrombpnet_nobias` (0.3.0+ default)".

## What to do if something breaks

- **Slim-mirror download fails for a specific model**: log it, skip,
  continue. The `--only-missing` flag lets you fill gaps later. We
  expect 0-1 failures across 786 models; > 5% is a problem worth
  surfacing.
- **TensorFlow OOM mid-run**: drop `--batch-size` from 64 to 32 or 16
  and resume with `--only-missing`. The reservoir state is checkpointed
  per track to interim NPZ files.
- **Want to compare old vs new percentiles for sanity**: pull the
  current `chrombpnet_pertrack.npz` from HF before starting the rebuild
  (`huggingface-cli download lucapinello/chorus-backgrounds chrombpnet_pertrack.npz --repo-type dataset`),
  then diff a few tracks side by side. Expected: rankings preserved on
  large-effect SNPs, magnitudes shifting by 5-30% on ambiguous cases
  (the biology is the same; the mapping just lines up properly now).
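A minimal sketch of that side-by-side diff, using synthetic stand-ins for the two NPZ files (the real comparison would load `summary_cdfs` rows from the downloaded old NPZ and the freshly built one; the constant 0.3 offset here is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
old_cdf = np.sort(rng.normal(0.3, 1.0, 10_000))  # stand-in: legacy `chrombpnet` track CDF
new_cdf = old_cdf - 0.3                          # stand-in: rebuilt `chrombpnet_nobias` CDF

# Compare the same quantiles the spot-check uses.
for q in (0.50, 0.95, 0.99):
    i = int(q * old_cdf.size)
    print(f"p{int(q*100)}: old={old_cdf[i]:+.3f}  new={new_cdf[i]:+.3f}  "
          f"shift={old_cdf[i]-new_cdf[i]:+.3f}")
```

Rank order within a track is untouched by a monotone shift, so large-effect SNPs keep their ordering even while the reported magnitudes move.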
## Related

- chorus PR #59 (0.3.0 release) — the default flip that necessitates this
- `audits/2026-04-28_chrombpnet_slim_mirror/report.md` — flagged the deferred CDF rebuild
- `scripts/build_backgrounds_chrombpnet.py` — the build script (now `--model-type` aware)
- `chorus/analysis/normalization.py:PerTrackNormalizer` — how chorus consumes the CDFs at predict-time

scripts/build_backgrounds_chrombpnet.py

Lines changed: 19 additions & 1 deletion

```diff
@@ -43,6 +43,19 @@
 )
 parser.add_argument("--gpu", type=int, default=0)
 parser.add_argument("--fold", type=int, default=0)
+parser.add_argument(
+    "--model-type",
+    choices=["chrombpnet_nobias", "chrombpnet"],
+    default="chrombpnet_nobias",
+    help="ChromBPNet variant. Default `chrombpnet_nobias` (bias-corrected) "
+    "matches the 0.3+ chorus default — the variant the slim HF mirror "
+    "ships and the one user-facing predictions go through. The legacy "
+    "`chrombpnet` variant (bias-aware) is available for ablation studies "
+    "but produces percentiles that don't match what `oracle.predict()` "
+    "returns for default loads in 0.3+. The pre-0.3 CDFs on HF were built "
+    "against `chrombpnet`; see audits/2026-04-29_chrombpnet_cdf_rebuild/ "
+    "for the rebuild against `chrombpnet_nobias`.",
+)
 parser.add_argument("--n-variants", type=int, default=10000)
 parser.add_argument("--reservoir-size", type=int, default=50000)
 parser.add_argument("--n-cdf-points", type=int, default=10000)
@@ -422,7 +435,12 @@ def build_all_models(do_variants: bool, do_baselines: bool):
             # Pass the spec dict as kwargs — chrombpnet.py accepts:
             #   load_pretrained_model(assay='ATAC', cell_type='K562', fold=...)
             #   load_pretrained_model(assay='CHIP', cell_type='K562', TF='REST', fold=...)
-            oracle.load_pretrained_model(fold=args.fold, **spec)
+            # model_type is pinned to args.model_type (default
+            # `chrombpnet_nobias` post-0.3) so the resulting CDF matches
+            # what `oracle.predict()` returns for default loads.
+            oracle.load_pretrained_model(
+                fold=args.fold, model_type=args.model_type, **spec,
+            )
         except Exception as exc:
             logger.warning("Failed to load %s: %s", tid, str(exc)[:200])
             continue
```
