# ChromBPNet CDF rebuild against `chrombpnet_nobias` — agent handoff

You're picking up the ChromBPNet per-track CDF rebuild that the 0.3.0
release (`9e42bfd`) called out as a deferred follow-up. Goal: produce
a fresh `chrombpnet_pertrack.npz` whose percentile lookups match what
`oracle.predict()` returns for the post-0.3 default
(`model_type='chrombpnet_nobias'`), and upload it to
`huggingface.co/datasets/lucapinello/chorus-backgrounds`.

## Why

The CDFs on HF today were built in 0.2.x against the **bias-aware**
`chrombpnet` variant. Chorus 0.3.0 flipped the default to
`chrombpnet_nobias` (bias-corrected). User-facing percentile
assignments now look up `chrombpnet_nobias` predictions against
`chrombpnet` empirical distributions — the bias systematically
shifts the mapping. The audit at
`audits/2026-04-28_chrombpnet_slim_mirror/report.md` flagged this:
"a future release should rebuild the CDFs against `chrombpnet_nobias`
to make the percentiles point at the matching distribution."

## Hardware needed

- **Linux + CUDA GPU** (A100 or similar; chorus's lab box is fine)
- ~50 GB free disk for the slim-mirror models + interim CDF shards
- The `chorus-chrombpnet` conda env (TensorFlow), which includes
  CUDA-enabled TF. Build it with `chorus setup --oracle chrombpnet`
  if you don't have it.

Estimated wall-clock, using the now-tracked numbers in
`scripts/build_backgrounds_chrombpnet.py`'s `--assay` help text:

- ATAC/DNase (42 models): ~22 min/model × 42 = **~15 hours on Metal**.
  Should be ~3-5× faster on A100 → **~3-5 hours**.
- CHIP/BPNet (1259 models): ~3 min/model × 1259 = **~63 hours on Metal**.
  ~10-20 hours on A100.

Total CUDA budget: **~13-25 hours**, well-suited to an overnight run
with `--shard` parallelism if you want to split across GPUs.

## What's already done

- `scripts/build_backgrounds_chrombpnet.py` was updated in commit
  `<this-commit>` to take `--model-type` (default `chrombpnet_nobias`,
  matching the 0.3+ chorus default). Re-running today produces the
  right CDFs without a flag.
- The slim HF mirror at `lucapinello/chorus-chrombpnet-slim` already
  ships fold-0 `chrombpnet_nobias` for all 42 ATAC/DNase models +
  all 744 BPNet/CHIP models (1.49 GB total). The build script will
  auto-fetch from there — no ENCODE tarballs needed for the rebuild.

## Run the build

From the repo root, on the CUDA box:

```bash
cd chorus
git pull origin main   # make sure you have the --model-type flag

# Confirm env exists
mamba env list | grep chrombpnet

# IMPORTANT — verify huggingface_hub is in the env. The
# chorus-chrombpnet env yml (post-PR-#60) lists huggingface_hub>=0.20.0,
# but older installs predate that and need a manual install. Without
# huggingface_hub the build script falls back to the ~700 MB-per-model
# ENCODE tarball flow, which turns a 3-5 h ATAC/DNase run into a 30+ h
# run because it re-downloads tarballs we don't need.
mamba run -n chorus-chrombpnet python -c "import huggingface_hub" \
  || mamba run -n chorus-chrombpnet pip install "huggingface_hub>=0.20.0"

# Confirm slim mirror is reachable + has all 786 models
mamba run -n chorus-chrombpnet python -c "
from chorus.oracles.chrombpnet_source.chrombpnet_globals import (
    iter_unique_models, iter_unique_bpnet_models,
)
print('ATAC/DNASE models:', len(list(iter_unique_models())))
print('CHIP/BPNet models:', len(list(iter_unique_bpnet_models())))
"
# Expect: 42 / 1259 (note: the 744 figure in the slim-mirror manifest is
# de-duped; the 1259 includes per-cell-type duplicates that point at
# the same h5 file in the mirror).

# === Phase 1: ATAC/DNase models (42 models, ~3-5 h on A100) ===
mamba run -n chorus-chrombpnet python scripts/build_backgrounds_chrombpnet.py \
  --part variants --assay ATAC_DNASE --gpu 0 \
  --model-type chrombpnet_nobias \
  2>&1 | tee logs/bg_chrombpnet_variants_atac.log

mamba run -n chorus-chrombpnet python scripts/build_backgrounds_chrombpnet.py \
  --part baselines --assay ATAC_DNASE --gpu 0 \
  --model-type chrombpnet_nobias \
  2>&1 | tee logs/bg_chrombpnet_baselines_atac.log

# === Phase 2: CHIP/BPNet models (1259 models, ~10-20 h on A100) ===
mamba run -n chorus-chrombpnet python scripts/build_backgrounds_chrombpnet.py \
  --part variants --assay CHIP --gpu 0 \
  --model-type chrombpnet_nobias \
  2>&1 | tee logs/bg_chrombpnet_variants_chip.log

mamba run -n chorus-chrombpnet python scripts/build_backgrounds_chrombpnet.py \
  --part baselines --assay CHIP --gpu 0 \
  --model-type chrombpnet_nobias \
  2>&1 | tee logs/bg_chrombpnet_baselines_chip.log

# === Phase 3: merge interim shards into one NPZ ===
mamba run -n chorus python scripts/build_backgrounds_chrombpnet.py --part merge \
  2>&1 | tee logs/bg_chrombpnet_merge.log

# Output: ~/.chorus/backgrounds/chrombpnet_pertrack.npz
```

If you have 2 GPUs available, parallelize the CHIP phase across them:

```bash
# Terminal 1 (GPU 0):
... --part variants --assay CHIP --gpu 0 --shard 0 --shard-of 2 ...
... --part baselines --assay CHIP --gpu 0 --shard 0 --shard-of 2 ...

# Terminal 2 (GPU 1):
... --part variants --assay CHIP --gpu 1 --shard 1 --shard-of 2 ...
... --part baselines --assay CHIP --gpu 1 --shard 1 --shard-of 2 ...

# After both finish:
mamba run -n chorus python scripts/build_backgrounds_chrombpnet.py --part merge-shards
```
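The `--shard N --shard-of M` split presumably partitions the model list deterministically, so no model is processed twice and `merge-shards` sees full coverage. A minimal sketch of that kind of partition (hypothetical; the script's actual logic may differ):

```python
def shard_models(models, shard, shard_of):
    """Assign each model to exactly one shard via a modulo partition."""
    if not 0 <= shard < shard_of:
        raise ValueError("shard must be in [0, shard_of)")
    return [m for i, m in enumerate(models) if i % shard_of == shard]

models = [f"model_{i}" for i in range(1259)]
parts = [shard_models(models, s, shard_of=2) for s in range(2)]

# Every model lands in exactly one shard, so the shards never collide
# on interim files and merge-shards recovers the full catalogue.
assert sorted(parts[0] + parts[1]) == sorted(models)
print(len(parts[0]), len(parts[1]))  # 630 629
```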

## Spot-check before upload

```bash
mamba run -n chorus python -c "
import os
import numpy as np
# np.load does not expand '~', so expand it explicitly
npz = np.load(os.path.expanduser('~/.chorus/backgrounds/chrombpnet_pertrack.npz'), allow_pickle=True)
print('Keys:', list(npz.keys()))
print('Track count:', len(npz['track_ids']))
print('Effect counts dist:', np.percentile(npz['effect_counts'], [0, 25, 50, 75, 100]))
print('Per-track signedness:', npz['signed_flags'][:5], '... (n_signed=', npz['signed_flags'].sum(), ')')
# Sanity: percentiles are monotone
for r in npz['summary_cdfs'][:5]:
    n = r.shape[0]
    p50, p95, p99 = int(.50*n), int(.95*n), int(.99*n)
    assert r[p50] <= r[p95] <= r[p99], 'CDF not monotone'
print('All sanity checks passed')
"
```

Pass criteria:
- 786 tracks (or whatever the current model catalogue is — same count as old NPZ minus any since-removed models)
- All `effect_counts > 0`
- Summary CDFs monotone p50 ≤ p95 ≤ p99
- No NaN / Inf in any CDF
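The spot-check above only samples five tracks; a fuller pass over the whole catalogue could look like the sketch below (key names are assumed to match the spot-check block, so adjust if the real NPZ layout differs):

```python
import numpy as np

def validate_pertrack(npz):
    """Return a list of human-readable failures; empty means pass.

    Assumes the key layout from the spot-check: 'track_ids',
    'effect_counts', and 'summary_cdfs' with rows sorted ascending.
    """
    failures = []
    if not np.all(np.asarray(npz["effect_counts"]) > 0):
        failures.append("some effect_counts are zero")
    for tid, row in zip(npz["track_ids"], npz["summary_cdfs"]):
        row = np.asarray(row, dtype=float)
        if not np.all(np.isfinite(row)):
            failures.append(f"{tid}: NaN/Inf in CDF")
        elif np.any(np.diff(row) < 0):
            failures.append(f"{tid}: CDF not sorted")
    return failures

# Toy stand-in (a real run would np.load the merged NPZ instead):
toy = {
    "track_ids": np.array(["t0", "t1"]),
    "effect_counts": np.array([10, 12]),
    "summary_cdfs": np.array([[0.1, 0.5, 0.9], [0.0, 0.2, 0.2]]),
}
print(validate_pertrack(toy))  # []
```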

## Upload to HF

```bash
mamba run -n chorus python -c "
import os
from huggingface_hub import HfApi
api = HfApi(token=os.environ['HF_TOKEN'])
api.upload_file(
    path_or_fileobj=os.path.expanduser('~/.chorus/backgrounds/chrombpnet_pertrack.npz'),
    path_in_repo='chrombpnet_pertrack.npz',
    repo_id='lucapinello/chorus-backgrounds',
    repo_type='dataset',
    commit_message='Rebuild ChromBPNet CDFs against chrombpnet_nobias (0.3.0+ default)',
)
"
```

(You'll need an HF token with write access to `lucapinello/chorus-backgrounds`
in your env — this is the same token used for other chorus uploads.)

## Write up the audit

In `audits/2026-04-29_chrombpnet_cdf_rebuild/report.md`:

- Hardware (GPU model, distro, RAM)
- Wall-clock per phase (variants + baselines for each assay family)
- Track count + sample percentiles (one or two cell-types) showing the
  shift from the old `chrombpnet` CDFs
- Any models that failed to load (the build script logs warnings;
  collect the count and make sure it's small / reproducible)
- Hash of the uploaded NPZ for reproducibility
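For the reproducibility hash, a streaming SHA-256 keeps memory flat even on a large NPZ (the path in the comment is the default output location from the build above; adjust as needed):

```python
import hashlib

def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large NPZs never load fully."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# e.g. sha256_file(os.path.expanduser('~/.chorus/backgrounds/chrombpnet_pertrack.npz'))
```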

Then commit the audit + log files on a new branch
`audit/2026-04-29-chrombpnet-cdf-rebuild` and open a PR with title
"audit: rebuild ChromBPNet CDFs against `chrombpnet_nobias` (0.3.0+ default)".

## What to do if something breaks

- **Slim-mirror download fails for a specific model**: log it, skip,
  continue. The `--only-missing` flag lets you fill gaps later. We
  expect 0-1 failures across 786 models; > 5% is a problem worth
  surfacing.
- **TensorFlow OOM mid-run**: drop `--batch-size` from 64 to 32 or 16
  and resume with `--only-missing`. The reservoir state is checkpointed
  per-track to interim NPZ files.
- **Want to compare old vs new percentiles for sanity**: pull the
  current `chrombpnet_pertrack.npz` from HF before starting the rebuild
  (`huggingface-cli download lucapinello/chorus-backgrounds chrombpnet_pertrack.npz --repo-type dataset`),
  diff a few tracks side-by-side. Expected: rankings preserved on
  large-effect SNPs, magnitudes shift by 5-30% on ambiguous cases
  (the biology's the same; the mapping just lines up properly now).
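That old-vs-new diff can be sketched as follows, with toy normal draws standing in for the two empirical background arrays (the real arrays come from the old and new NPZ files; nothing here is real model output):

```python
import numpy as np

def compare_track(old_cdf, new_cdf, preds):
    """Percentile of each prediction under the old vs rebuilt CDF."""
    old_cdf, new_cdf = np.sort(old_cdf), np.sort(new_cdf)
    old_p = np.searchsorted(old_cdf, preds) / old_cdf.size
    new_p = np.searchsorted(new_cdf, preds) / new_cdf.size
    return old_p, new_p

# Toy stand-in: the rebuilt distribution is shifted relative to the old
# one, so percentile magnitudes move, but the prediction ranking holds.
rng = np.random.default_rng(1)
old = rng.normal(0.2, 1.0, 50_000)   # pretend bias-aware background
new = rng.normal(0.0, 1.0, 50_000)   # pretend bias-corrected background
preds = np.array([-1.0, 0.5, 2.0])
old_p, new_p = compare_track(old, new, preds)
print(np.round(old_p, 3), np.round(new_p, 3))
```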

## Related

- chorus PR #59 (0.3.0 release) — the default flip that necessitates this
- `audits/2026-04-28_chrombpnet_slim_mirror/report.md` — flagged the deferred CDF rebuild
- `scripts/build_backgrounds_chrombpnet.py` — the build script (now `--model-type` aware)
- `chorus/analysis/normalization.py:PerTrackNormalizer` — how chorus consumes the CDFs at predict-time
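For orientation on how these CDFs get consumed: a percentile lookup against a sorted per-track background array reduces to a `searchsorted`. A hypothetical sketch of what `PerTrackNormalizer` might do at predict-time (the real implementation may differ):

```python
import numpy as np

def percentile_of(cdf_values, prediction):
    """Empirical percentile of `prediction` in a sorted background array."""
    rank = np.searchsorted(cdf_values, prediction, side="right")
    return rank / cdf_values.size

bg = np.sort(np.array([0.1, 0.2, 0.4, 0.7, 0.9]))  # one track's background
print(percentile_of(bg, 0.5))  # 0.6: three of five background values lie below
```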