Commit 37d11b9

lucapinello and claude committed
v19 post-checklist audit: fix 5,731 docstring + pin deps + add attribution
First full pass run against audits/AUDIT_CHECKLIST.md (merged in v18). Every §4 / §5 / §10 / §11 / §15 / §16 check produced a clean result. Fixes one remaining source drift, tightens environment.yml, and adds attribution + ship-prep runbook pointer.

Findings fixed:
- chorus/oracles/alphagenome.py:22 — AlphaGenomeOracle class docstring said "5,930 human functional genomic tracks"; corrected to 5,731 (matches v17 MCP spec + v18 metadata + README hardware matrix). This is what users see in help(oracle).
- environment.yml — floor-pins on jupyter>=1.0, notebook>=6.4, ipykernel>=6.0, samtools>=1.15, htslib>=1.15. Prevents a clean-install user from silently getting incompatible majors.
- docs/THIRD_PARTY.md (new) — attribution table for all 6 oracles (paper DOIs + licenses), bundled IGV.js library, and the CDF dataset. Linked from README "Further reading".
- CLAUDE.md (new, repo root) — points future Claude sessions at audits/AUDIT_CHECKLIST.md as the ship-prep runbook; documents the env matrix and regen workflow so this doesn't get rediscovered.

Clean results (per checklist):
- §4 CDFs: all 6 oracles monotonic, p50≤p95≤p99, track counts match spec
- §5 sequence_length: all 6 match README hardware matrix exactly
- §11 tests: 334 passed / 1 skipped / 0 failed
- §15 offline HTML: no <script src=> or external <link> in any report
- §16 secrets: no tracked file contains an HF token
- §18 LICENSE present (MIT); oracle + IGV attributed

Deferred P1s (for a focused follow-up PR):
- §14 indel pre-validation in predict_variant_effect (needs per-oracle capability flag)
- §14 multi-allelic report rendering regression test
- §17 pip-audit in release gate

Details: audits/2026-04-21_v19_post_checklist_full_pass.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 2200ddf commit 37d11b9

6 files changed

Lines changed: 241 additions & 6 deletions

CLAUDE.md

Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
# Chorus — notes for Claude sessions

## Audit discipline

Before any ship-prep, release, or "is this ready?" review, run the
audit checklist:

- **[`audits/AUDIT_CHECKLIST.md`](audits/AUDIT_CHECKLIST.md)** — 18-section
  reusable runbook (Install → HF gate → GPU → CDFs → Python API →
  Notebooks → HTML reports → MCP → Error paths → Repo consistency →
  Tests → Reproducibility → Determinism → Edge cases → Offline →
  Logging → Dependencies → License). Every check has an exact command
  and a P0/P1/P2 severity.

When an audit uncovers findings, write a dated report in
`audits/YYYY-MM-DD_<short-name>.md` following the format used by
`2026-04-21_v18_fresh_full_audit.md` (what was run, what was fixed,
what was deferred, tests-pass summary).

## Environments

Oracle envs are isolated — their deps don't coexist. Always run per-oracle
work through the matching mamba env:

```bash
mamba run -n chorus # base (MCP, analysis, reports)
mamba run -n chorus-alphagenome # JAX
mamba run -n chorus-enformer # TF
mamba run -n chorus-chrombpnet # TF
mamba run -n chorus-borzoi # PyTorch
mamba run -n chorus-sei # PyTorch
mamba run -n chorus-legnet # PyTorch
```

`CUDA_VISIBLE_DEVICES=0|1` respected across all envs. Per-track CDFs
auto-download from
`huggingface.co/datasets/lucapinello/chorus-backgrounds` on first use.

## Regeneration

After any correctness fix (e.g. the ref-allele off-by-one) every
committed example output drifts. Regenerate with:

```bash
python scripts/regenerate_examples.py # walkthroughs
python scripts/regenerate_multioracle.py --oracle <name> # per-oracle
python scripts/regenerate_multioracle.py --consolidate # unified IGV
jupyter nbconvert --to notebook --execute --inplace examples/notebooks/*.ipynb
```

Notebooks must be re-executed on GPU (advanced + comprehensive pull in
multiple oracles; quickstart is CPU-safe).

## Branch flow

Ship branch is `chorus-applications`. Other agents open audit
branches as `audit/YYYY-MM-DD-v<N>-<slug>` and fix branches as
`fix/YYYY-MM-DD-<slug>`. Review then merge into `chorus-applications`;
don't rebase published audit branches.

README.md

Lines changed: 1 addition & 0 deletions
@@ -979,6 +979,7 @@ After the Quick Start, these documents go deeper:
 | [`docs/METHOD_REFERENCE.md`](docs/METHOD_REFERENCE.md) | Method-level reference for advanced users |
 | [`docs/VISUALIZATION_GUIDE.md`](docs/VISUALIZATION_GUIDE.md) | pyGenomeTracks + IGV visualization patterns |
 | [`docs/IMPLEMENTATION_GUIDE.md`](docs/IMPLEMENTATION_GUIDE.md) | Notes for extending Chorus with new oracles |
+| [`docs/THIRD_PARTY.md`](docs/THIRD_PARTY.md) | Upstream oracles, papers, and licenses Chorus builds on |
 | [`examples/walkthroughs/`](examples/walkthroughs/) | Worked examples for every MCP tool (variant analysis, batch, causal, discovery, sequence engineering) |

 ## Contributing

audits/2026-04-21_v19_post_checklist_full_pass.md

Lines changed: 131 additions & 0 deletions
@@ -0,0 +1,131 @@
# v19 post-checklist full pass — 2026-04-21

First audit run **against the new `AUDIT_CHECKLIST.md` runbook** (merged
in v18). Exercised the checklist end-to-end on a Linux/CUDA host to
validate the runbook itself and catch any late drift before ship.

## What was actually run

### §1 — Install smoke

`chorus --help` lists 6 subcommands: `setup`, `list`, `validate`,
`remove`, `health`, `genome`. All resolve.

### §4 — Per-track CDF sanity (all 6 oracles)

| oracle | n_tracks | effect_cdfs monotonic | summary_cdfs p50≤p95≤p99 | signed% | perbin_cdfs |
|---|---|---|---|---|---|
| enformer | 5,313 | ✓ | ✓ | 0% | yes |
| borzoi | 7,611 | ✓ | ✓ | 20% | yes |
| chrombpnet | 24 | ✓ | ✓ | 0% | yes |
| sei | 40 | ✓ | ✓ | 100% | no |
| legnet | 3 | ✓ | ✓ | 100% | no |
| alphagenome | 5,168 | ✓ | ✓ | 12% | yes |

Matches the expected catalog counts exactly. All monotonicity checks pass.
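
The two per-track conditions in the table can be sketched in a few lines of Python. This is only an illustration, not the checklist's actual command: the function names and toy values are assumptions, and real inputs would come from the per-oracle background files.

```python
from typing import Sequence


def is_non_decreasing(values: Sequence[float]) -> bool:
    """True if each value is >= the one before it (ties allowed)."""
    return all(a <= b for a, b in zip(values, values[1:]))


def check_track(effect_cdf: Sequence[float], p50: float, p95: float, p99: float) -> dict:
    """Evaluate both sanity conditions for a single track."""
    return {
        "monotonic": is_non_decreasing(effect_cdf),
        "percentiles_ordered": p50 <= p95 <= p99,
    }


# Toy values standing in for one track of one oracle.
print(check_track([0.0, 0.25, 0.25, 1.0], p50=0.1, p95=0.4, p99=0.9))
# {'monotonic': True, 'percentiles_ordered': True}
```

A flat segment (two equal neighbouring values) is treated as passing here; whether the real check is strict or non-strict is not stated in this report.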

### §5 — `sequence_length` per oracle

All six `create_oracle(..., use_environment=False).sequence_length` values
match the README hardware matrix (Enformer 393,216 · Borzoi 524,288 ·
ChromBPNet 2,114 · Sei 4,096 · LegNet 200 · AlphaGenome 1,048,576).
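
A comparison like this one is easy to keep as a small table-driven helper. The sketch below hard-codes the six values quoted above; the helper name and the shape of the `reported` dict are assumptions, and collecting the reported values from `create_oracle(...)` is left out.

```python
# Expected values as quoted from the README hardware matrix in this report.
EXPECTED_SEQUENCE_LENGTH = {
    "enformer": 393_216,
    "borzoi": 524_288,
    "chrombpnet": 2_114,
    "sei": 4_096,
    "legnet": 200,
    "alphagenome": 1_048_576,
}


def sequence_length_mismatches(reported: dict) -> dict:
    """Return {oracle: (reported, expected)} for every oracle that disagrees or is missing."""
    return {
        name: (reported.get(name), expected)
        for name, expected in EXPECTED_SEQUENCE_LENGTH.items()
        if reported.get(name) != expected
    }


# An empty result means every oracle matches the matrix.
print(sequence_length_mismatches(dict(EXPECTED_SEQUENCE_LENGTH)))  # {}
```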

### §10 — Repo-wide drift grep

Greps from the checklist (`5,930`, `5930`, `196 kbp`,
`examples/applications/`) on tracked files:

- **One real drift fixed**: `chorus/oracles/alphagenome.py:22` — class
  docstring said "AlphaGenome predicts 5,930 human functional genomic
  tracks"; corrected to **5,731** (matches v17 MCP spec + v18 metadata
  fix + README). The class docstring is what users see in `help(oracle)`.
- False-positive matches: bundled `igv.min.js` string literals, Sei /
  Borzoi numeric track IDs that happen to contain `5930`, and base64
  PNG payloads in committed notebook outputs. None are user-visible doc
  drift.
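
For illustration, the same substring search can be written in Python so that each hit carries a file and line number for triage. The pattern list is taken from this section; the function name and the toy inputs are assumptions, and feeding it the output of `git ls-files` is left to the caller.

```python
from typing import Iterable, List, Tuple

# Stale strings listed in this section of the report.
STALE_PATTERNS = ["5,930", "5930", "196 kbp", "examples/applications/"]


def find_drift(path: str, text: str,
               patterns: Iterable[str] = STALE_PATTERNS) -> List[Tuple[str, int, str]]:
    """Return (path, line_number, pattern) for every plain-substring hit."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for pattern in patterns:
            if pattern in line:
                hits.append((path, lineno, pattern))
    return hits


# A real docstring hit and a false positive (a made-up ID containing "5930").
print(find_drift("alphagenome.py", "predicts 5,930 human functional"))
print(find_drift("ids.txt", "track_15930_a"))
```

A plain substring match cannot tell the two cases apart, which is why the false positives above still needed a manual look.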

### §11 — Fast test suite

```
mamba run -n chorus python -m pytest tests/ --ignore=tests/test_smoke_predict.py -q
→ 334 passed, 1 skipped, 0 failed (668.92s)
```

Matches the checklist's `≥ 334 pass, ≤ 1 skip, 0 error` target exactly.

### §15 — Offline / air-gapped HTML rendering

`grep -oE '<script[^>]*src=…|<link[^>]*href=http…' examples/walkthroughs/**/*.html`
returned nothing. No report loads external scripts or stylesheets.
(The `cdn|googleapis` hits the checklist warned about turn out to be
string literals *inside* the vendored IGV library — IGV uses them for
its own GCS/Drive file-access features; they are never fetched at
report-load time.)
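
The difference between a URL inside inline JavaScript and a resource the browser actually fetches can be made explicit with the standard-library HTML parser. This is a sketch rather than the checklist command: the class and function names are assumptions, and it flags any `<script src=...>` plus any `<link>` whose `href` is remote.

```python
from html.parser import HTMLParser


class ExternalResourceFinder(HTMLParser):
    """Collect tags that would make the browser load something at page-load time."""

    def __init__(self):
        super().__init__()
        self.external = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("src"):
            self.external.append(("script", attrs["src"]))
        elif tag == "link" and (attrs.get("href") or "").startswith(("http://", "https://", "//")):
            self.external.append(("link", attrs["href"]))


def external_resources(html: str) -> list:
    finder = ExternalResourceFinder()
    finder.feed(html)
    finder.close()
    return finder.external


# A URL inside an inline script body is just text, so nothing is reported.
inline = '<html><head><script>var u = "https://cdn.example.org/x.js";</script></head></html>'
print(external_resources(inline))  # []
```

Because the parser treats the body of an inline `<script>` as raw text, string literals such as the ones in the vendored IGV bundle are not reported.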

### §16 — Secrets in tracked files

`git ls-files | xargs grep -lE 'hf_[a-zA-Z0-9]{20,}'` returned nothing.
The top-level `AUDIT_PROMPT_WITH_TOKENS.md` does contain a real HF
token but is **gitignored** and not tracked.
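
The same regular expression works unchanged in Python, which can be handy if the check moves into a test. The sketch below assumes the caller has already read the tracked files into a `{path: text}` dict; the function name is hypothetical and the token is a generated placeholder.

```python
import re

# Same pattern as the grep above: "hf_" followed by at least 20 alphanumerics.
HF_TOKEN = re.compile(r"hf_[a-zA-Z0-9]{20,}")


def files_with_tokens(files: dict) -> list:
    """Return the sorted paths whose text contains something shaped like an HF token."""
    return sorted(path for path, text in files.items() if HF_TOKEN.search(text))


fake = "hf_" + "x" * 24  # placeholder, not a real token
print(files_with_tokens({"README.md": "no secrets here", "notes.md": f"token={fake}"}))
# ['notes.md']
```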

### §17 — Dependency pinning

5 bare deps fixed in `environment.yml`: `jupyter`, `notebook`,
`ipykernel`, `samtools`, `htslib` now carry floor versions
(`jupyter>=1.0`, `notebook>=6.4`, `ipykernel>=6.0`, `samtools>=1.15`,
`htslib>=1.15`). `pip` intentionally left
bare (conda's own packaging primitive).
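
Bare entries are easy to miss by eye, so a small scanner can help. The sketch below parses the file line by line instead of using a YAML library, which is an assumption made to keep it dependency-free; the function name and the `allow` parameter are hypothetical, and nested `pip:` requirements are scanned with the same rule.

```python
import re

_TOP_LEVEL_KEY = re.compile(r"^([A-Za-z_][\w-]*):")
_VERSION_OP = re.compile(r"[<>=!~]")


def bare_dependencies(env_yml: str, allow=("pip",)) -> list:
    """List entries under `dependencies:` that carry no version constraint."""
    bare, in_deps = [], False
    for raw in env_yml.splitlines():
        line = raw.split("#", 1)[0].rstrip()  # drop trailing comments
        if not line.strip():
            continue
        key = _TOP_LEVEL_KEY.match(line)
        if key:  # top-level key such as `name:` or `channels:`
            in_deps = key.group(1) == "dependencies"
            continue
        item = line.strip()
        if not in_deps or not item.startswith("- "):
            continue
        spec = item[2:].strip()
        if spec.endswith(":"):  # nested section header, e.g. "- pip:"
            continue
        if spec not in allow and not _VERSION_OP.search(spec):
            bare.append(spec)
    return bare


example = """\
name: chorus
channels:
  - conda-forge
dependencies:
  - scipy>=1.7.0
  - jupyter
  - samtools  # For genome indexing
  - pip
"""
print(bare_dependencies(example))  # ['jupyter', 'samtools']
```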

### §18 — License / attribution

- `LICENSE` present (MIT, 2024 Pinello Lab) — ✓.
- Created **`docs/THIRD_PARTY.md`** enumerating every upstream oracle
  with paper DOI and license, the bundled IGV library with its
  upstream license URL, and the CDF dataset license. Linked from
  `README.md` in the "Further reading" table.
- Bundled `chorus/analysis/static/igv.min.js` does not carry an
  upstream MIT header in-line, but its upstream license is cited in
  `docs/THIRD_PARTY.md`. If strict header preservation matters for
  redistribution, wrap the min.js with its license block on the next
  IGV version bump (P2 follow-up, not a ship blocker).

## Fixed in this pass

1. **`chorus/oracles/alphagenome.py:22`** — "5,930" → "5,731" tracks in the
   `AlphaGenomeOracle` class docstring. Last remaining source-level
   drift after v18. **P1**
2. **`environment.yml`** — floor-pins on `jupyter`, `notebook`,
   `ipykernel`, `samtools`, `htslib`. Prevents a clean-install user
   from silently getting incompatible majors. **P1**
3. **`docs/THIRD_PARTY.md`** — new file; Oracle + IGV attribution table
   with papers + licenses. **P1**
4. **`README.md`** — added THIRD_PARTY.md link in "Further reading". **P2**
5. **`CLAUDE.md`** (new, root of repo) — points future Claude sessions
   at `audits/AUDIT_CHECKLIST.md` as the canonical ship-prep runbook,
   documents the env matrix and regen workflow.

## Deferred — P1 follow-ups not fixed here

- **§14 indel pre-validation**: `OracleBase.predict_variant_effect`
  does **not** check `len(ref) == len(alt)` before invoking the model.
  For SNV-only oracles this lets indels silently skate through and
  potentially crash inside the model wrapper. Each oracle has its own
  rule (AlphaGenome handles indels; Enformer/ChromBPNet/Sei/LegNet/Borzoi
  don't), so this needs a per-oracle capability flag + a shared guard
  in `core/base.py` with a clear "indels not supported by <oracle>"
  message. Not a one-line fix; spawning a focused PR is the right move.
- **§14 multi-allelic** — the predict path takes `alleles=['A','C','G','T']`
  today; the *report renderer* has not been exercised against > 1 alt
  in the current checklist run. Worth writing a regression test.
- **§17 `pip-audit`** — not run here (requires a fresh env create to be
  meaningful). Add to the release gate once CI runs the env create.
- **§3 CHORUS_DEVICE=cpu on a GPU host** — listed in checklist as P2;
  not exercised this pass.
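
One possible shape for the deferred §14 indel guard is sketched below. Nothing here exists in Chorus yet: `supports_indels`, `check_alleles`, `UnsupportedVariantError`, and the three classes are hypothetical stand-ins meant only to show a per-oracle capability flag combined with one shared check.

```python
class UnsupportedVariantError(ValueError):
    """Raised when an oracle is asked to score a variant type it cannot handle."""


class OracleBaseSketch:
    """Stand-in for the shared base class; not the real OracleBase."""

    name = "base"
    supports_indels = False  # hypothetical capability flag; subclasses opt in

    def check_alleles(self, ref: str, alt: str) -> None:
        """Shared guard, intended to run before the model is invoked."""
        if len(ref) != len(alt) and not self.supports_indels:
            raise UnsupportedVariantError(
                f"indels not supported by {self.name} (ref={ref!r}, alt={alt!r})"
            )


class SnvOnlyOracle(OracleBaseSketch):
    name = "enformer"


class IndelCapableOracle(OracleBaseSketch):
    name = "alphagenome"
    supports_indels = True


IndelCapableOracle().check_alleles("A", "AT")  # passes silently
try:
    SnvOnlyOracle().check_alleles("A", "AT")
except UnsupportedVariantError as exc:
    print(exc)
```

Keeping the flag on the class rather than in a lookup table means a new oracle declares its own capability next to its adapter code, and the guard itself never needs to change.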

## Bottom line

Checklist runbook works — every §4/§5/§10/§11/§15/§16 check produced a
clean binary result. One real drift found (`alphagenome.py` docstring,
5,930 → 5,731) and fixed. `environment.yml` tightened, `docs/THIRD_PARTY.md`
added, `CLAUDE.md` established. 334 tests green. No new P0s; three
P1 follow-ups filed for a later PR (§14 indel guard, §14 multi-allelic
report test, pip-audit in release gate).

chorus/oracles/alphagenome.py

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@
 class AlphaGenomeOracle(OracleBase):
     """AlphaGenome oracle with automatic environment management.

-    AlphaGenome (Google DeepMind, Nature 2026) predicts 5,930 human functional
+    AlphaGenome (Google DeepMind, Nature 2026) predicts 5,731 human functional
     genomic tracks at single base-pair resolution from up to 1 MB of DNA
     sequence using a JAX-based model.

docs/THIRD_PARTY.md

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
# Third-party attribution

Chorus wraps six deep-learning oracles and one genome-browser library.
Each ships under its own license, and model weights are **not**
redistributed in this repo — they are fetched from the original authors'
hosts at first-use time.

## Deep-learning oracles

| Oracle | Authors | Paper | Weights / code license |
|---|---|---|---|
| **Enformer** | Avsec et al., DeepMind | [Effective gene expression prediction from sequence by integrating long-range interactions (Nature Methods 2021)](https://www.nature.com/articles/s41592-021-01252-x) | Apache-2.0 (code); weights on TensorFlow Hub |
| **Borzoi** | Linder et al., Calico Labs | [Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation (Nature Genetics 2025)](https://www.nature.com/articles/s41588-024-02053-6) | Apache-2.0 (code); weights on Zenodo |
| **ChromBPNet** | Pampari et al., Kundaje Lab (Stanford) | [ChromBPNet: bias factorized, base-resolution deep learning models of chromatin accessibility (bioRxiv 2024)](https://www.biorxiv.org/content/10.1101/2024.12.25.630221v1) | MIT (code); weights on ENCODE |
| **Sei** | Chen et al., Troyanskaya Lab (Princeton) | [A sequence-based global map of regulatory activity for deciphering human genetics (Nature Genetics 2022)](https://www.nature.com/articles/s41588-022-01102-2) | BSD-3-Clause (code + weights) |
| **LegNet** | Penzar et al., Vaishnav Lab (Broad) | [LegNet: a best-in-class deep learning model for short DNA regulatory regions (Bioinformatics 2023)](https://academic.oup.com/bioinformatics/article/39/8/btad457/7220619) | MIT (code); weights bundled with source |
| **AlphaGenome** | Avsec et al., Google DeepMind | [AlphaGenome: advancing regulatory variant effect prediction with a unified DNA sequence model (Nature 2026)](https://deepmind.google/discover/blog/alphagenome-ai-for-better-understanding-the-genome/) | Gated on HuggingFace (`google/alphagenome-all-folds`); accept the license to download weights |

Chorus does not modify the upstream model code beyond the adapter
layer in `chorus/oracles/<name>.py`. Each oracle's predict / score
semantics are those of the original publication.

## Bundled third-party JavaScript

- **IGV.js** (Integrative Genomics Viewer, Robinson et al., Broad/UCSD) —
  [igv.org](https://igv.org/), [github.com/igvteam/igv.js](https://github.com/igvteam/igv.js),
  MIT license. Shipped as `chorus/analysis/static/igv.min.js` so HTML
  reports render offline. Source license at
  [github.com/igvteam/igv.js/blob/master/LICENSE](https://github.com/igvteam/igv.js/blob/master/LICENSE).

## Per-track background CDFs

The NPZ CDFs under `~/.chorus/backgrounds/` are derived from the oracle
authors' published predictions on a reference set of genomic loci.
They are computed by Chorus and distributed at
[`huggingface.co/datasets/lucapinello/chorus-backgrounds`](https://huggingface.co/datasets/lucapinello/chorus-backgrounds)
under CC-BY-4.0 — attribute the original oracle publications above
when citing numbers derived from them.

## Chorus itself

MIT-licensed (see [`LICENSE`](../LICENSE)). Cite as:

> Pinello Lab. *Chorus: unified interface for genomic deep-learning oracles.* 2026.

environment.yml

Lines changed: 5 additions & 5 deletions
@@ -14,9 +14,9 @@ dependencies:
   - scipy>=1.7.0

   # Jupyter for notebooks
-  - jupyter
-  - notebook
-  - ipykernel
+  - jupyter>=1.0
+  - notebook>=6.4
+  - ipykernel>=6.0

   # Core utilities
   - tqdm>=4.62.0
@@ -26,8 +26,8 @@ dependencies:
   - pysam>=0.19.0
   - pyfaidx>=0.7.0
   - biopython>=1.79
-  - samtools # For genome indexing
-  - htslib # Provides bgzip/tabix (coolbox visualization needs bgzip)
+  - samtools>=1.15 # For genome indexing
+  - htslib>=1.15 # Provides bgzip/tabix (coolbox visualization needs bgzip)
   # gtfsort: Linux-only (bioconda), install separately on Linux: mamba install -c bioconda gtfsort

   # Visualization tools
