7 changes: 7 additions & 0 deletions .gitignore
@@ -142,3 +142,10 @@ scripts/internal/

# Per-oracle pickle snapshots for multi-oracle consolidator (regenerate on demand)
examples/walkthroughs/validation/SORT1_rs12740374_multioracle/*.pkl

# DHS vocabulary download (90 MB, fetched via gdown when needed)
annotations/

# Local screenshot sweeps (regenerate via headless Chrome on demand)
screenshots/
examples/walkthroughs/**/screenshot_*.png
36 changes: 31 additions & 5 deletions README.md
@@ -213,15 +213,32 @@ After the first install, to upgrade cleanly:

```bash
cd chorus && git pull
-# Remove oracle envs first (while the chorus CLI is still available):
-chorus remove --oracle enformer
-# Repeat for each oracle you had installed...
+# Remove all oracle envs, weights, and backgrounds in one command:
+chorus cleanup --all
# Then remove the base env:
mamba env remove -n chorus -y
```

Then re-run the Fresh Install steps above.

#### Uninstalling / starting from scratch

```bash
# Preview what will be deleted (safe — no changes):
chorus cleanup --all --dry-run

# Remove everything: all oracle envs, downloaded weights, background CDFs, and genomes:
chorus cleanup --all

# Finer-grained options:
chorus cleanup --oracle enformer # one oracle only (env + weights)
chorus cleanup --oracle all # all oracle envs + weights, keep backgrounds/genomes
chorus cleanup --backgrounds # remove ~/.chorus/backgrounds/*.npz only
chorus cleanup --genomes # remove downloaded reference genomes only
```

The base `chorus` environment itself is not removed by `chorus cleanup` — remove it manually with `mamba env remove -n chorus -y` if you want a complete wipe.

#### Setting up oracle environments one-by-one

Chorus uses isolated conda environments for each oracle to avoid dependency conflicts between TensorFlow, PyTorch, and JAX models.
@@ -258,6 +275,14 @@ chorus health --timeout 300

**Note:** `chorus setup` pre-downloads each oracle's default weights + background CDFs + the `hg38` reference at install time, so subsequent `chorus health` / prediction calls are fast. If you opted out via `--no-weights`, the first prediction will still do a lazy download.

**Slow or unstable connection?** Use `--setup-timeout SECONDS` to cap how long each phase (env build and weight download) is allowed to run before aborting with a clear error:

```bash
chorus setup --oracle borzoi --setup-timeout 3600 # 1-hour cap per phase
```

Default is unlimited. If a phase times out, re-run the same command — mamba and HuggingFace downloads resume from where they left off.

#### Tokens

Two tokens are relevant. `chorus setup` surfaces both so they aren't a mid-prediction surprise:
@@ -1163,7 +1188,7 @@ For each position, the layer-appropriate window-sum is added to the track's rese

At each of the same ~31,500 positions, **32 random bins** from the full output window are added to the perbin reservoir. This captures the per-bin (not per-window) distribution at the track's native resolution (1 bp for ATAC/CAGE/RNA/PRO-CAP/splice; 128 bp for ChIP-Histone/TF in AlphaGenome).
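The "32 random bins per position" scheme above is a form of reservoir sampling: each candidate bin is offered to a fixed-capacity sample so the final reservoir is uniform over every bin ever seen. A minimal sketch, assuming a plain Algorithm-R reservoir (the function name and shape here are illustrative, not the library's actual API):

```python
import random

def reservoir_add(reservoir, items, capacity, seen_count):
    """Offer `items` to a fixed-size reservoir (Algorithm R).

    Illustrative sketch of the per-bin sampling described above: every bin
    value offered has an equal chance of surviving in the final sample,
    regardless of how many bins are seen in total. Returns the updated
    count of items seen so far.
    """
    for x in items:
        seen_count += 1
        if len(reservoir) < capacity:
            reservoir.append(x)          # reservoir not yet full: keep everything
        else:
            j = random.randrange(seen_count)  # uniform index in [0, seen_count)
            if j < capacity:
                reservoir[j] = x         # replace with probability capacity/seen_count
    return seen_count

# Offer batches of bin values (e.g. 32 per position) to one shared reservoir:
reservoir, seen = [], 0
for batch in ([1.0, 2.0], [3.0], [4.0, 5.0, 6.0]):
    seen = reservoir_add(reservoir, batch, capacity=4, seen_count=seen)
assert seen == 6 and len(reservoir) == 4
```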

-The per-bin CDFs are used by `perbin_floor_rescale_batch` to rescale raw IGV bin values onto a uniform `[0, 1.5]` display scale where `1.0` always corresponds to the top-1% genome-wide bin value for that track. This makes overlaid tracks visually comparable across cell types.
+The per-bin CDFs are used by the unified `chorus.analysis._igv_report.rescale_for_display` helper (which all four track-rendering paths — IGV, matplotlib, CoolBox, notebooks — share) to rescale raw bin values onto a uniform `[0, 3.0]` display scale where `1.0` corresponds to the top-1% genome-wide bin value for that track and `3.0` is a hard cap. Signed layers (Borzoi RNA, Sei, LentiMPRA) use the symmetric variant `signed_floor_rescale_batch`, mapping to `[-3.0, +3.0]` with `±1.0 = p99(|effect|)`. This makes overlaid tracks visually comparable across cell types and across renderers.
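The unsigned rescale semantics can be sketched in a few lines. This is a hand-rolled illustration of the behaviour described above (p99 maps to 1.0, a layer-dependent noise floor maps to 0, hard cap at 3.0) — it is not the library's actual implementation, and the function name and signature are made up:

```python
import numpy as np

def floor_rescale(values, background, floor_pct=95.0, cap=3.0):
    """Illustrative sketch of the unsigned display rescale:

    - divide by the genome-wide p99 of the background sample,
      so 1.0 == top-1% bin value for this track
    - zero out anything below the layer noise floor (p90/p95/p85
      depending on layer; floor_pct is a stand-in here)
    - hard-cap the result at `cap` (3x p99)
    """
    values = np.asarray(values, dtype=float)
    p99 = np.percentile(background, 99.0)
    floor = np.percentile(background, floor_pct)
    out = np.where(values < floor, 0.0, values / max(p99, 1e-12))
    return np.clip(out, 0.0, cap)
```

Under this sketch, a bin at exactly p99 renders at `1.0`, a sub-floor bin renders at `0`, and anything past 3× p99 saturates at the cap — which is why overlaid tracks from different cell types land on a comparable visual scale.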

#### Sample sizes per oracle

@@ -1241,7 +1266,8 @@ In the resulting report, every track row gets two extra columns — `Effect %ile
| **Effect percentile** (unsigned) | `[0, 1]` | `0.95` = stronger than 95% of ~10K random SNPs in the same track |
| **Effect percentile** (signed) | `[-1, 1]` | `+0.95` = strongly above-baseline gain; `-0.95` = strongly above-baseline loss |
| **Activity percentile** | `[0, 1]` | `0.95` = reference signal at this site is in the top 5% genome-wide for this track |
-| **IGV per-bin display value** | `[0, 1.5]` | `1.0` = top-1% bin value genome-wide for this track; `0` = below the noise floor |
+| **Display rescale (unsigned)** | `[0, 3.0]` | `1.0` = top-1% bin value genome-wide; `3.0` = hard cap (3× p99); `0` = below the layer floor (p90 / p95 / p85 depending on layer) |
+| **Display rescale (signed layers)** | `[-3.0, +3.0]` | `±1.0` = p99 of `|effect|` genome-wide; symmetric so repressive (negative) signals stay visible. Used by Borzoi RNA, Sei, LentiMPRA. |
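The signed row in the table can likewise be sketched: divide by the genome-wide p99 of the absolute effect and clip symmetrically, so repressive (negative) signal survives instead of being floored to zero. Again a hypothetical sketch — the name mirrors but is not the actual `signed_floor_rescale_batch`:

```python
import numpy as np

def signed_rescale_sketch(effects, background_effects, cap=3.0):
    """Symmetric signed rescale for signed layers (Borzoi RNA / Sei /
    LentiMPRA in the text above): +/-1.0 maps to the genome-wide p99 of
    |effect|, and the result is clipped to [-cap, +cap] so strong
    repressive effects render at the same magnitude as activating ones."""
    effects = np.asarray(effects, dtype=float)
    p99_abs = np.percentile(np.abs(background_effects), 99.0)
    return np.clip(effects / max(p99_abs, 1e-12), -cap, cap)
```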

**Sanity-check rule of thumb:** a *biologically interesting* variant typically shows **effect percentile > 0.95** AND **activity percentile > 0.5** in the same track — i.e. an unusually large effect at a site that already has real regulatory activity.

147 changes: 147 additions & 0 deletions audits/2026-05-08_post_pr79_merge_audit.md
@@ -0,0 +1,147 @@
# Post-PR-#79 merge audit — handoff for Lorenzo

**Date:** 2026-05-08
**Branch:** `fix/post-v040-followups` (commits `0c2b8e6` + `b2cb4e8` on top of merge `63df601`)
**Auditor:** Luca + Claude (Opus 4.7, 1M context)
**Scope:** verify the merged code (Lorenzo's PR #79 + our local "unify track-rescale" follow-ups) is ready to merge into `main`.

---

## TL;DR

✅ **376 tests pass** (warm-state pytest after fresh CDF swap).
✅ **All 18 walkthrough HTMLs render** with peaks visible — programmatic IGV inspection found 0 issues.
✅ **Default-call behaviour is unified**: IGV, matplotlib, CoolBox, and notebooks all produce CDF-rescaled output with no extra params (same `1.0 = p99` semantics, same `3.0` cap).
✅ **README + walkthroughs README + NORMALIZATION_GUIDE + VISUALIZATION_GUIDE re-read**; 2 P1 stale-claim fixes applied (display range `1.5 → 3.0`).
✅ **README links audited** (51 targets) — all resolve. (Two Zenodo URLs return HTTP 403 to scripted HEAD but are valid via API/browser.)
✅ **Branch pushed** to `origin/fix/post-v040-followups`.

🟡 **One deferred item** (not blocking the PR but worth flagging for a follow-up):
the local DHS-augmented ChromBPNet CDF (built 2026-05-07, 42 tracks, 18,672 effect samples / track) was **not uploaded to HF** — it's missing the 744 BPNet/CHIP tracks that the HF-shipped CDF has. The audit was therefore run against the **HF-shipped 786-track CDF**. SORT1 chrombpnet effect under HF CDF: `+0.318 log2FC, ≥99th %ile`. Same qualitative interpretation as the local-DHS run. To ship the DHS augmentation, rebuild **all 786 tracks** with DHS, then upload — see "Deferred work" below.

---

## What changed in this branch (vs `origin/main`)

```
b2cb4e8 chore(examples): regenerate SORT1 chrombpnet + multi-oracle artefacts
0c2b8e6 feat(viz): unify track-rescale across IGV / matplotlib / CoolBox / notebooks
63df601 Merge Lorenzo's PR #79 into fix/post-v040-followups
965d0dd Added updated multioracle examples (Lorenzo)
fc38632 fix: support mixed-resolution tracks… (Lorenzo)
9151338 fix: genome concurrent decompression race + stale ChromBPNet health probe
fecf407 feat: add chorus cleanup command
0e4fb6a feat: add --setup-timeout to chorus setup
85b12ca fix: CHIP strand suffix mismatch in normalization + alphagenome_pt CDF alias
```

### Lorenzo's PR #79 (kept as-is)

| Change | File | Notes |
|---|---|---|
| `_match_track_id` / `_find_matching_cdf` (perbin → summary → effect fallback) | `normalization.py` | LegNet uses `summary_cdfs` for IGV rescale via the fallback |
| `_calculate_track_bin_size` per-oracle dispatch | `_igv_report.py` | chrombpnet `bin=20`, legnet `bin=resolution`, others `window/3000` |
| `aggregation_method` param on `_downsample_to_features` | `_igv_report.py` | mean / max |
| `windowFunction: "max"` IGV WIG hint for high-res oracles | `_igv_report.py`, `multi_oracle_report.py` | Browser-side aggregation |
| `(per-track norm)` LegNet panel-label suffix | `multi_oracle_report.py` | Tells users the values aren't directly comparable |
| `get_max_output_size()` widens multi-oracle region to ~1 Mb | `regenerate_multioracle.py` | |
| New `t_start = variant_pos - (actual_bp_in_array // 2)` IGV formula | `_igv_report.py` | Identical to our parallel local fix |
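The per-oracle bin-size dispatch in the second row of the table reduces to a small lookup. The sketch below is an illustration of the rules as summarized above (fixed 20 bp bins for ChromBPNet, native resolution for LegNet, `window/3000` for the wide-window oracles); the function name echoes but is not the real `_calculate_track_bin_size`, and the defaults are assumptions:

```python
def track_bin_size(oracle_name, window_bp, resolution_bp=1):
    """Hedged sketch of the per-oracle display bin-size dispatch:
    chrombpnet -> fixed 20 bp bins; legnet -> its native resolution;
    everything else downsamples to ~3000 features across the window."""
    if oracle_name == "chrombpnet":
        return 20
    if oracle_name == "legnet":
        return resolution_bp
    return max(1, window_bp // 3000)  # e.g. a ~524 kb Borzoi window -> ~174 bp bins
```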

### Our local follow-ups (one squashed feat commit)

| Change | File | Why |
|---|---|---|
| Single unified helper `rescale_for_display(values, layer, normalizer, oracle_name, assay_id) → (out, cfg)` | `_igv_report.py` | Single source of truth — IGV, matplotlib, CoolBox, notebooks all use it |
| `apply_floor_rescale` returns 4-tuple `(rescaled, ref, alt, signed)` | `_igv_report.py` | So callers can pick symmetric vs unsigned scale_cfg |
| `signed_floor_rescale_batch` — symmetric signed rescale to `[-3, +3]` using `p99(|cdf|)` | `normalization.py` | Borzoi RNA / Sei / LentiMPRA repressive effects now visible (were clipped to 0) |
| `is_signed()` fuzzy track-id matching incl. CHIP `:+`/`:-` strand stripping | `normalization.py` | LegNet `LentiMPRA:HepG2` correctly resolves to `HepG2` row → signed guard fires |
| `OraclePrediction.add()` backfills `track.assay_id` from dict key | `core/result.py` | ChromBPNet tracks with `assay_id=None` now usable by CoolBox/matplotlib auto-load |
| CoolBox `get_coolbox_representation(normalize=True)` auto-loads normalizer | `core/result.py` | `pred[i].get_coolbox_representation()` with no args → CDF-rescaled output |
| matplotlib `render_track_figures(normalize=True)` auto-loads | `_track_figure.py` | Same default behaviour |
| `_has_samples` guard moved inside `_find_matching_cdf` | `normalization.py` | Failed-build perbin rows fall through to summary instead of saturating to `max_value` |
| `max_value` default `1.5 → 3.0` in `perbin_floor_rescale_batch` | `normalization.py` | Matches `_DISPLAY_MAX = 3.0`; was silent clip-bug for any caller without explicit `max_value=` |
| `ChromBPNetOracle.predict_sliding(seq)` | `oracles/chrombpnet.py` | Slides 2114-bp model across arbitrary intervals with cigar substitutions preserved → ChromBPNet visible across the full multi-oracle 1 Mb locus |
| `_predict()` auto-routes wide queries to `predict_sliding` | `oracles/chrombpnet.py` | PR #79's wider `genomic_region` was triggering a pre-existing IndexError in `_predict_direct`'s sliding formula |
| `_calculate_track_bin_size` chrombpnet uses `agg="max"` (was `"mean"`) | `_igv_report.py` | PR #79's docstring said "max pooling preserves peaks for ChromBPNet" but code returned `"mean"` — corrected |
| Lower per-layer floors: `chromatin_accessibility 0.95→0.90`, `promoter_activity 0.95→0.85` | `_igv_report.py` | Peak base/shoulder visible alongside peak top |
| Causal-report IGV (`causal._build_causal_igv`) goes through the unified helper | `analysis/causal.py` | Same path as variant + multi-oracle reports |
| matplotlib symmetric y-axis fallback for signed layers (no normalizer) | `_track_figure.py` | Repressive RNA/Sei/MPRA signal stays visible in zoom-in/out PNGs |
| DHS-vocabulary utilities `load_dhs_vocabulary()` / `sample_dhs_positions()` | `utils/annotations.py` | Used by the (deferred) DHS-augmented CDF rebuild |
| Multi-oracle wide-locus wiring uses `predict_sliding` | `scripts/regenerate_multioracle.py` | |
| Test updates: `test_apply_floor_rescale_passthrough` (4-tuple); `test_perbin_none_for_scalar_oracles` (perbin → summary fallback); new `test_rescale_for_display_unified_helper` | `tests/test_analysis.py` | |
| README display-range fixes | `README.md` | `1.5 → 3.0` at lines 1191, 1269 |
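The `predict_sliding` row above describes tiling a fixed-input (2114 bp) model across an arbitrarily wide interval. A minimal sketch of the window-tiling arithmetic, assuming non-overlapping steps with a clamped final window (the actual `ChromBPNetOracle.predict_sliding` logic, including cigar-substitution handling and stitching, is more involved):

```python
def tile_windows(region_start, region_end, model_window=2114, step=None):
    """Tile [region_start, region_end) with fixed-size model windows.

    Steps by the full window size so interior bases are covered once,
    then appends a final window clamped to end flush with the region
    (so the tail is covered even when the region length is not a
    multiple of the window). Purely illustrative.
    """
    step = step or model_window
    starts = list(range(region_start,
                        max(region_start + 1, region_end - model_window + 1),
                        step))
    if starts[-1] + model_window < region_end:
        starts.append(region_end - model_window)  # clamped final window
    return [(s, s + model_window) for s in starts]
```

For a 1 Mb multi-oracle locus this yields ~475 windows of 2114 bp each, which is why the chrombpnet panel can now span the full region instead of a single centered window.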

---

## Verification matrix

| Check | Result | Notes |
|---|---|---|
| `pytest -m "not integration"` | ✅ 376 passed, 1 skipped, 5 deselected | After swapping in HF-shipped CDF |
| ChromBPNet single-oracle SORT1 regen | ✅ `+0.318 log2FC, ≥99th, Activity 0.603` | Matches expected biology |
| Multi-oracle SORT1 (chrombpnet/legnet/alphagenome + consolidate) | ✅ all 4 artefacts regenerated | LegNet panel: scale `[-3, +3]`, 20,174 negative + 797 positive features |
| All 18 walkthrough HTMLs IGV-parsed | ✅ 0 issues — every panel has data | One file (batch_scoring) is table-only, no IGV expected |
| README link audit (51 links) | ✅ all resolve | 2 Zenodo URLs return 403 to scripted HEAD; valid via API |
| Doc consistency (README, walkthroughs README, NORMALIZATION_GUIDE, VISUALIZATION_GUIDE) | ✅ 2 P1 stale claims fixed | `[0, 1.5]` display-range references → `[0, 3.0]` |

### Default-call behaviour (no extra params required)

| Path | Auto-load mechanism | Verified |
|---|---|---|
| IGV variant report | reads `report._normalizer` | ✅ |
| IGV multi-oracle | reads each `rep._normalizer` | ✅ |
| IGV causal | reads `top_s._variant_report._normalizer` | ✅ |
| matplotlib `render_track_figures(...)` | `normalize=True` → `get_normalizer(first.source_model)` | ✅ |
| CoolBox `track.get_coolbox_representation()` | `normalize=True` → `get_normalizer(self.source_model)` | ✅ |

Opt-out: `normalize=False` for matplotlib + CoolBox; `igv_raw=True` on the variant report.

### CDF flow per oracle (Lorenzo's principled-not-hack concern)

| Oracle | Variant `effect_pctile` | Variant `activity_pctile` | IGV per-bin rescale | Notes |
|---|---|---|---|---|
| ChromBPNet | `effect_cdfs` | `summary_cdfs` | `perbin_cdfs` | All three CDFs read directly via the unified helper |
| LegNet | `effect_cdfs` | `summary_cdfs` | `summary_cdfs` (signed) → symmetric `[-3, +3]` rescale via `signed_floor_rescale_batch` | Was per-track autoscale before; now consistent semantics across oracles |
| AlphaGenome / Enformer / Borzoi | `effect_cdfs` | `summary_cdfs` | `perbin_cdfs` | Lorenzo's fallback never triggers (all three present) |
| Sei | `effect_cdfs` (signed) | `summary_cdfs` | n/a — heatmap, not signal track | |

No CDF is bypassed; no hardcoded thresholds; the DHS-augmented samples (when present) are still doing work at every CDF read.

---

## Deferred work (post-merge follow-ups)

1. **DHS-augmented ChromBPNet CDF — rebuild ALL 786 tracks, then upload to HF.**
The 2026-05-07 rebuild only covered the 42 ATAC/DNASE tracks; it dropped the 744 BPNet/CHIP tracks. To ship the DHS augmentation safely, the same DHS-sampling logic in `scripts/build_backgrounds_chrombpnet.py` needs to apply to the BPNet/CHIP build path too, then `huggingface-cli upload` the resulting NPZ.
The unified rendering code is a **no-op** without DHS augmentation (it just uses whatever CDF is on disk), so this can ship later as a pure dataset update.

2. **Doc P2 polish** — surface the unified `rescale_for_display()` helper in `VISUALIZATION_GUIDE.md` (currently only mentioned in `README.md`); add a paragraph explaining that signed layers now use symmetric `[-3, +3]` rescale.

3. **`scripts/regenerate_examples.py` BCL11A/FTO/SORT1_with_CEBP examples** — Lorenzo's PR commented these out. The HTMLs are still in the repo but the regen script won't refresh them. Decide: re-enable, or remove the stale HTMLs.

---

## How to validate the merge before merging into main

```bash
# 1. Pull the branch
git fetch origin fix/post-v040-followups
git checkout fix/post-v040-followups

# 2. Tests (warm — uses your existing envs + cached CDFs)
mamba run -n chorus pytest tests/ -q -m "not integration"
# Expected: 376 passed

# 3. Regenerate the SORT1 multi-oracle (the canonical demo)
mamba run -n chorus-chrombpnet python scripts/regenerate_multioracle.py --oracle chrombpnet
mamba run -n chorus-legnet python scripts/regenerate_multioracle.py --oracle legnet
mamba run -n chorus-alphagenome python scripts/regenerate_multioracle.py --oracle alphagenome
mamba run -n chorus-alphagenome python scripts/regenerate_multioracle.py --consolidate

# 4. Open the report — visually verify the IGV panel
open examples/walkthroughs/validation/SORT1_rs12740374_multioracle/rs12740374_SORT1_multioracle_report.html
# Look for: chrombpnet panel covers full 1 Mb width (predict_sliding); legnet panel
# shows BOTH positive and negative tails (symmetric signed scale, was clipped to 0).
```

— end of audit —