Commit 8c2243c

lucapinello, lp698, and claude authored
prep: ChromBPNet CDF rebuild against chrombpnet_nobias (script + handoff) (#67)
* prep(chrombpnet-cdf): wire --model-type flag + CUDA-box handoff doc Prep work for the deferred CDF rebuild against `chrombpnet_nobias` that the 0.3.0 audit at `audits/2026-04-28_chrombpnet_slim_mirror/` flagged as a follow-up. Background: chrombpnet_pertrack.npz on huggingface.co/datasets/lucapinello/chorus-backgrounds was built in 0.2.x against the bias-aware `chrombpnet` variant. After 0.3.0 flipped the default to `chrombpnet_nobias` (bias-corrected), user-facing percentile lookups go: `chrombpnet_nobias` predictions → `chrombpnet` empirical CDFs. The bias systematically shifts the mapping. Two changes in this prep commit, no compute yet: 1. scripts/build_backgrounds_chrombpnet.py: add `--model-type` argparse flag (default `chrombpnet_nobias`, matches 0.3+ chorus default). The build script previously called `oracle.load_pretrained_model( fold=args.fold, **spec)` without specifying model_type, so it inherited the oracle's default — which was `chrombpnet` in 0.2.x and is `chrombpnet_nobias` post-0.3. The new flag makes this explicit: re-running today produces the right CDF, and a future maintainer can pin `--model-type chrombpnet` for ablation against the legacy variant. Help text references the rebuild audit dir. 2. audits/2026-04-29_chrombpnet_cdf_rebuild/HANDOFF.md: agent handoff for running the rebuild on a CUDA box. Estimated wall-clock: ~3-5 h ATAC/DNase + ~10-20 h CHIP/BPNet on A100 (vs ~78 h on M3 Ultra Metal — too slow for the macOS dev machine). Includes the exact `--shard / --shard-of` commands for 2-GPU parallelism, the spot-check before upload, and the upload command (HfApi to lucapinello/chorus-backgrounds, dataset repo). The actual rebuild (compute + upload) gets done on the user's lab CUDA box in a separate run. This commit is just the prep so the script is unambiguous and the handoff is documented. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(handoff): note huggingface_hub install gotcha for older envs --------- Co-authored-by: lp698 <lp698@dimm2fv07n65x.partners.org> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent b3bf94a commit 8c2243c

2 files changed

Lines changed: 229 additions & 1 deletion

audits/2026-04-29_chrombpnet_cdf_rebuild/HANDOFF.md

Lines changed: 210 additions & 0 deletions

@@ -0,0 +1,210 @@
# ChromBPNet CDF rebuild against `chrombpnet_nobias` — agent handoff

You're picking up the ChromBPNet per-track CDF rebuild that the 0.3.0
release (`9e42bfd`) called out as a deferred follow-up. Goal: produce
a fresh `chrombpnet_pertrack.npz` whose percentile lookups match what
`oracle.predict()` returns for the post-0.3 default
(`model_type='chrombpnet_nobias'`), and upload it to
`huggingface.co/datasets/lucapinello/chorus-backgrounds`.

## Why

The CDFs on HF today were built in 0.2.x against the **bias-aware**
`chrombpnet` variant. Chorus 0.3.0 flipped the default to
`chrombpnet_nobias` (bias-corrected). User-facing percentile
assignments now look up `chrombpnet_nobias` predictions against
`chrombpnet` empirical distributions — the bias systematically
shifts the mapping. The audit at
`audits/2026-04-28_chrombpnet_slim_mirror/report.md` flagged this as a
follow-up: "a future release should rebuild the CDFs against
`chrombpnet_nobias` to make the percentiles point at the matching
distribution."
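The mismatch is easy to see with a toy empirical CDF. A minimal sketch with synthetic numbers, assuming for illustration that the bias acts as a roughly constant offset (the real shift is track-dependent):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for a legacy `chrombpnet` (bias-aware) per-track CDF:
# 10k sorted empirical effect sizes, shifted by a hypothetical bias of +0.3.
legacy_cdf = np.sort(rng.normal(loc=0.3, scale=1.0, size=10_000))
# Matching `chrombpnet_nobias` CDF: same signal with the bias removed.
nobias_cdf = legacy_cdf - 0.3

pred = 0.0  # a neutral `chrombpnet_nobias` prediction
# Percentile lookup = rank of the prediction within the sorted empirical CDF.
pct_mismatched = np.searchsorted(legacy_cdf, pred) / legacy_cdf.size
pct_matched = np.searchsorted(nobias_cdf, pred) / nobias_cdf.size
print(f"vs legacy CDF: {pct_mismatched:.2f}, vs rebuilt CDF: {pct_matched:.2f}")
```

A neutral prediction lands near the 50th percentile against the matching distribution but is pushed systematically low against the legacy one, which is exactly the skew the rebuild removes.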
## Hardware needed

- **Linux + CUDA GPU** (A100 or similar; chorus's lab box is fine)
- ~50 GB free disk for the slim-mirror models + interim CDF shards
- The `chorus-chrombpnet` conda env (TensorFlow, CUDA-enabled). Build
  it with `chorus setup --oracle chrombpnet` if you don't have it.

Estimated wall-clock, using the per-model numbers now tracked in
`scripts/build_backgrounds_chrombpnet.py`'s `--assay` help text:

- ATAC/DNase (42 models): ~22 min/model × 42 = **~15 hours on Metal**.
  Should be ~3-5× faster on A100 → **~3-5 hours**.
- CHIP/BPNet (1259 models): ~3 min/model × 1259 = **~63 hours on Metal**.
  ~10-20 hours on A100.

Total CUDA budget: **~13-25 hours** — well suited to an overnight run,
with `--shard` parallelism if you want to split across GPUs.
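Quick back-of-envelope arithmetic behind those windows, assuming the same ~3-5× A100 speedup for both assay families (the text above only states it for ATAC/DNase):

```python
# Per-model Metal timings from the --assay help text, scaled to full runs.
atac_metal_h = 22 * 42 / 60    # ~22 min/model over 42 ATAC/DNase models
chip_metal_h = 3 * 1259 / 60   # ~3 min/model over 1259 CHIP/BPNet models

# Assumed ~3-5x A100 speedup reproduces the quoted windows.
print(f"ATAC/DNase: ~{atac_metal_h:.0f} h Metal, ~{atac_metal_h/5:.0f}-{atac_metal_h/3:.0f} h A100")
print(f"CHIP/BPNet: ~{chip_metal_h:.0f} h Metal, ~{chip_metal_h/5:.0f}-{chip_metal_h/3:.0f} h A100")
```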
## What's already done

- `scripts/build_backgrounds_chrombpnet.py` was updated in commit
  `<this-commit>` to take `--model-type` (default `chrombpnet_nobias`,
  matching the 0.3+ chorus default). Re-running today produces the
  right CDFs without a flag.
- The slim HF mirror at `lucapinello/chorus-chrombpnet-slim` already
  ships fold-0 `chrombpnet_nobias` for all 42 ATAC/DNase models +
  all 744 BPNet/CHIP models (1.49 GB total). The build script will
  auto-fetch from there — no ENCODE tarballs needed for the rebuild.

## Run the build

From the repo root, on the CUDA box:

```bash
cd chorus
git pull origin main  # make sure you have the --model-type flag
mkdir -p logs         # the tee targets below live under logs/

# Confirm the env exists
mamba env list | grep chrombpnet

# IMPORTANT — verify huggingface_hub is in the env. The
# chorus-chrombpnet env yml (post-PR-#60) lists huggingface_hub>=0.20.0,
# but older installs predate that and need a manual install. Without
# huggingface_hub the build script falls back to the ~700 MB-per-model
# ENCODE tarball flow, which turns a 3-5 h ATAC/DNase run into a 30+ h
# run because it re-downloads tarballs we don't need.
mamba run -n chorus-chrombpnet python -c "import huggingface_hub" \
  || mamba run -n chorus-chrombpnet pip install "huggingface_hub>=0.20.0"

# Confirm the slim mirror is reachable + has all 786 models
mamba run -n chorus-chrombpnet python -c "
from chorus.oracles.chrombpnet_source.chrombpnet_globals import (
    iter_unique_models, iter_unique_bpnet_models,
)
print('ATAC/DNASE models:', len(list(iter_unique_models())))
print('CHIP/BPNet models:', len(list(iter_unique_bpnet_models())))
"
# Expect: 42 / 1259 (note: the 744 figure in the slim-mirror manifest is
# de-duped; the 1259 includes per-cell-type duplicates that point at
# the same h5 file in the mirror).

# === Phase 1: ATAC/DNase models (42 models, ~3-5 h on A100) ===
mamba run -n chorus-chrombpnet python scripts/build_backgrounds_chrombpnet.py \
    --part variants --assay ATAC_DNASE --gpu 0 \
    --model-type chrombpnet_nobias \
    2>&1 | tee logs/bg_chrombpnet_variants_atac.log

mamba run -n chorus-chrombpnet python scripts/build_backgrounds_chrombpnet.py \
    --part baselines --assay ATAC_DNASE --gpu 0 \
    --model-type chrombpnet_nobias \
    2>&1 | tee logs/bg_chrombpnet_baselines_atac.log

# === Phase 2: CHIP/BPNet models (1259 models, ~10-20 h on A100) ===
mamba run -n chorus-chrombpnet python scripts/build_backgrounds_chrombpnet.py \
    --part variants --assay CHIP --gpu 0 \
    --model-type chrombpnet_nobias \
    2>&1 | tee logs/bg_chrombpnet_variants_chip.log

mamba run -n chorus-chrombpnet python scripts/build_backgrounds_chrombpnet.py \
    --part baselines --assay CHIP --gpu 0 \
    --model-type chrombpnet_nobias \
    2>&1 | tee logs/bg_chrombpnet_baselines_chip.log

# === Phase 3: merge interim shards into one NPZ ===
mamba run -n chorus python scripts/build_backgrounds_chrombpnet.py --part merge \
    2>&1 | tee logs/bg_chrombpnet_merge.log

# Output: ~/.chorus/backgrounds/chrombpnet_pertrack.npz
```

If you have 2 GPUs available, parallelize the CHIP phase across them:

```bash
# Terminal 1 (GPU 0):
... --part variants  --assay CHIP --gpu 0 --shard 0 --shard-of 2 ...
... --part baselines --assay CHIP --gpu 0 --shard 0 --shard-of 2 ...

# Terminal 2 (GPU 1):
... --part variants  --assay CHIP --gpu 1 --shard 1 --shard-of 2 ...
... --part baselines --assay CHIP --gpu 1 --shard 1 --shard-of 2 ...

# After both finish:
mamba run -n chorus python scripts/build_backgrounds_chrombpnet.py --part merge-shards
```
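The handoff doesn't spell out how `--shard N --shard-of M` partitions the work; a plausible sketch is a deterministic round-robin split over the model list (hypothetical helper — the real script's scheme may differ, but any deterministic disjoint cover gives the same safety property):

```python
def shard_models(models, shard, shard_of):
    """Round-robin split: shard k of M gets every M-th model starting at index k."""
    return [m for i, m in enumerate(models) if i % shard_of == shard]

# Hypothetical model IDs standing in for the 1259 CHIP/BPNet entries.
models = [f"chip_model_{i:04d}" for i in range(1259)]
gpu0 = shard_models(models, 0, 2)
gpu1 = shard_models(models, 1, 2)
print(len(gpu0), len(gpu1))  # the two shards cover all 1259 models, no overlap
```

Because the split depends only on the index, both terminals see the same partition and the merge step can safely assume every model was processed exactly once.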
## Spot-check before upload

```bash
mamba run -n chorus python -c "
import os
import numpy as np
# np.load does not expand '~', so expand it explicitly
path = os.path.expanduser('~/.chorus/backgrounds/chrombpnet_pertrack.npz')
npz = np.load(path, allow_pickle=True)
print('Keys:', list(npz.keys()))
print('Track count:', len(npz['track_ids']))
print('Effect counts dist:', np.percentile(npz['effect_counts'], [0, 25, 50, 75, 100]))
print('Per-track signedness:', npz['signed_flags'][:5], '... (n_signed=', npz['signed_flags'].sum(), ')')
# Sanity: percentiles are monotone
for r in npz['summary_cdfs'][:5]:
    n = r.shape[0]
    p50, p95, p99 = int(.50*n), int(.95*n), int(.99*n)
    assert r[p50] <= r[p95] <= r[p99], 'CDF not monotone'
print('All sanity checks passed')
"
```

Pass criteria:

- 786 tracks (or whatever the current model catalogue is — same count as
  the old NPZ minus any since-removed models)
- All `effect_counts > 0`
- Summary CDFs monotone: p50 ≤ p95 ≤ p99
- No NaN / Inf in any CDF
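The spot-check above samples only three quantiles per track. The last two pass criteria can be checked exhaustively per CDF with two numpy one-liners; a sketch on synthetic data (`summary_cdfs` rows would be the real input):

```python
import numpy as np

def cdf_passes(cdf: np.ndarray) -> bool:
    """Full pass criteria for one per-track CDF: all finite and non-decreasing."""
    return bool(np.isfinite(cdf).all() and (np.diff(cdf) >= 0).all())

good = np.linspace(-2.0, 3.0, 10_000)  # a well-formed synthetic CDF
bad = good.copy()
bad[5_000] = np.nan                    # inject a NaN to show the failure mode
print(cdf_passes(good), cdf_passes(bad))  # True False
```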
## Upload to HF

```bash
mamba run -n chorus python -c "
import os
from huggingface_hub import HfApi
api = HfApi(token=os.environ['HF_TOKEN'])
api.upload_file(
    path_or_fileobj=os.path.expanduser('~/.chorus/backgrounds/chrombpnet_pertrack.npz'),
    path_in_repo='chrombpnet_pertrack.npz',
    repo_id='lucapinello/chorus-backgrounds',
    repo_type='dataset',
    commit_message='Rebuild ChromBPNet CDFs against chrombpnet_nobias (0.3.0+ default)',
)
"
```

(You'll need an HF token with write access to `lucapinello/chorus-backgrounds`
in your env — the same token used for other chorus uploads.)

## Write up the audit

In `audits/2026-04-29_chrombpnet_cdf_rebuild/report.md`, record:

- Hardware (GPU model, distro, RAM)
- Wall-clock per phase (variants + baselines for each assay family)
- Track count + sample percentiles (one or two cell types) showing the
  shift from the old `chrombpnet` CDFs
- Any models that failed to load (the build script logs warnings;
  collect the count and make sure it's small / reproducible)
- Hash of the uploaded NPZ for reproducibility

Then commit the audit + log files on a new branch
`audit/2026-04-29-chrombpnet-cdf-rebuild` and open a PR titled
"audit: rebuild ChromBPNet CDFs against `chrombpnet_nobias` (0.3.0+ default)".

## What to do if something breaks

- **Slim-mirror download fails for a specific model**: log it, skip,
  continue. The `--only-missing` flag lets you fill gaps later. We
  expect 0-1 failures across 786 models; > 5% is a problem worth
  surfacing.
- **TensorFlow OOM mid-run**: drop `--batch-size` from 64 to 32 or 16
  and resume with `--only-missing`. The reservoir state is checkpointed
  per track to interim NPZ files.
- **Want to compare old vs new percentiles for sanity**: pull the
  current `chrombpnet_pertrack.npz` from HF before starting the rebuild
  (`huggingface-cli download lucapinello/chorus-backgrounds chrombpnet_pertrack.npz --repo-type dataset`),
  then diff a few tracks side by side. Expected: rankings preserved on
  large-effect SNPs, magnitudes shifting by 5-30% on ambiguous cases
  (the biology is the same; the mapping just lines up properly now).
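A minimal sketch of that side-by-side diff, using synthetic stand-ins for the two NPZ files (the real comparison would load `summary_cdfs` rows from the downloaded old NPZ and the freshly built one; the constant 0.3 offset here is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
old_cdf = np.sort(rng.normal(0.3, 1.0, 10_000))  # stand-in: legacy `chrombpnet` track CDF
new_cdf = old_cdf - 0.3                          # stand-in: rebuilt `chrombpnet_nobias` CDF

# Compare the same quantiles the spot-check uses.
for q in (0.50, 0.95, 0.99):
    i = int(q * old_cdf.size)
    print(f"p{int(q*100)}: old={old_cdf[i]:+.3f}  new={new_cdf[i]:+.3f}  "
          f"shift={old_cdf[i]-new_cdf[i]:+.3f}")
```

Rank order within a track is untouched by a monotone shift, so large-effect SNPs keep their ordering even while the reported magnitudes move.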
## Related

- chorus PR #59 (0.3.0 release) — the default flip that necessitates this
- `audits/2026-04-28_chrombpnet_slim_mirror/report.md` — flagged the deferred CDF rebuild
- `scripts/build_backgrounds_chrombpnet.py` — the build script (now `--model-type` aware)
- `chorus/analysis/normalization.py:PerTrackNormalizer` — how chorus consumes the CDFs at predict-time

scripts/build_backgrounds_chrombpnet.py

Lines changed: 19 additions & 1 deletion

```diff
@@ -43,6 +43,19 @@
 )
 parser.add_argument("--gpu", type=int, default=0)
 parser.add_argument("--fold", type=int, default=0)
+parser.add_argument(
+    "--model-type",
+    choices=["chrombpnet_nobias", "chrombpnet"],
+    default="chrombpnet_nobias",
+    help="ChromBPNet variant. Default `chrombpnet_nobias` (bias-corrected) "
+    "matches the 0.3+ chorus default — the variant the slim HF mirror "
+    "ships and the one user-facing predictions go through. The legacy "
+    "`chrombpnet` variant (bias-aware) is available for ablation studies "
+    "but produces percentiles that don't match what `oracle.predict()` "
+    "returns for default loads in 0.3+. The pre-0.3 CDFs on HF were built "
+    "against `chrombpnet`; see audits/2026-04-29_chrombpnet_cdf_rebuild/ "
+    "for the rebuild against `chrombpnet_nobias`.",
+)
 parser.add_argument("--n-variants", type=int, default=10000)
 parser.add_argument("--reservoir-size", type=int, default=50000)
 parser.add_argument("--n-cdf-points", type=int, default=10000)
@@ -422,7 +435,12 @@ def build_all_models(do_variants: bool, do_baselines: bool):
             # Pass the spec dict as kwargs — chrombpnet.py accepts:
             #   load_pretrained_model(assay='ATAC', cell_type='K562', fold=...)
             #   load_pretrained_model(assay='CHIP', cell_type='K562', TF='REST', fold=...)
-            oracle.load_pretrained_model(fold=args.fold, **spec)
+            # model_type is pinned to args.model_type (default
+            # `chrombpnet_nobias` post-0.3) so the resulting CDF matches
+            # what `oracle.predict()` returns for default loads.
+            oracle.load_pretrained_model(
+                fold=args.fold, model_type=args.model_type, **spec,
+            )
         except Exception as exc:
             logger.warning("Failed to load %s: %s", tid, str(exc)[:200])
             continue
```
