| 1 | +# v19 post-checklist full pass — 2026-04-21 |
| 2 | + |
| 3 | +First audit run **against the new `AUDIT_CHECKLIST.md` runbook** (merged |
| 4 | +in v18). Exercised the checklist end-to-end on a Linux/CUDA host to |
| 5 | +validate the runbook itself and catch any late drift before ship. |
| 6 | + |
| 7 | +## What was actually run |
| 8 | + |
| 9 | +### §1 — Install smoke |
| 10 | + |
| 11 | +`chorus --help` lists 6 subcommands: `setup`, `list`, `validate`, |
| 12 | +`remove`, `health`, `genome`. All resolve. |
| 13 | + |
| 14 | +### §4 — Per-track CDF sanity (all 6 oracles) |
| 15 | + |
| 16 | +| oracle | n_tracks | effect_cdfs monotonic | summary_cdfs p50≤p95≤p99 | signed% | perbin_cdfs | |
| 17 | +|---|---|---|---|---|---| |
| 18 | +| enformer | 5,313 | ✓ | ✓ | 0% | yes | |
| 19 | +| borzoi | 7,611 | ✓ | ✓ | 20% | yes | |
| 20 | +| chrombpnet | 24 | ✓ | ✓ | 0% | yes | |
| 21 | +| sei | 40 | ✓ | ✓ | 100% | no | |
| 22 | +| legnet | 3 | ✓ | ✓ | 100% | no | |
| 23 | +| alphagenome | 5,168 | ✓ | ✓ | 12% | yes | |
| 24 | + |
| 25 | +Matches the expected catalog counts exactly. All monotonicity checks pass. |
| 26 | + |
| 27 | +### §5 — `sequence_length` per oracle |
| 28 | + |
| 29 | +All six `create_oracle(..., use_environment=False).sequence_length` values |
| 30 | +match the README hardware matrix (Enformer 393,216 · Borzoi 524,288 · |
| 31 | +ChromBPNet 2,114 · Sei 4,096 · LegNet 200 · AlphaGenome 1,048,576). |
| 32 | + |
| 33 | +### §10 — Repo-wide drift grep |
| 34 | + |
| 35 | +Greps from the checklist (`5,930`, `5930`, `196 kbp`, |
| 36 | +`examples/applications/`) on tracked files: |
| 37 | + |
| 38 | +- **One real drift fixed**: `chorus/oracles/alphagenome.py:22` — class |
| 39 | + docstring said "AlphaGenome predicts 5,930 human functional genomic |
| 40 | + tracks"; corrected to **5,731** (matches v17 mcp spec + v18 metadata |
| 41 | + fix + README). The class docstring is what users see in `help(oracle)`. |
| 42 | +- False-positive matches: bundled `igv.min.js` string literals, Sei / |
| 43 | + Borzoi numeric track IDs that happen to contain `5930`, and base64 |
| 44 | + PNG payloads in committed notebook outputs. None are user-visible doc |
| 45 | + drift. |
| 46 | + |
| 47 | +### §11 — Fast test suite |
| 48 | + |
| 49 | +``` |
| 50 | +mamba run -n chorus python -m pytest tests/ --ignore=tests/test_smoke_predict.py -q |
| 51 | +→ 334 passed, 1 skipped, 0 failed (668.92s) |
| 52 | +``` |
| 53 | + |
| 54 | +Matches the checklist's `≥ 334 pass, ≤ 1 skip, 0 error` target exactly. |
| 55 | + |
| 56 | +### §15 — Offline / air-gapped HTML rendering |
| 57 | + |
| 58 | +`grep -oE '<script[^>]*src=…|<link[^>]*href=http…' examples/walkthroughs/**/*.html` |
| 59 | +returned nothing. No report loads external scripts or stylesheets. |
| 60 | +(The `cdn|googleapis` hits the checklist warned about turn out to be |
| 61 | +string literals *inside* the vendored IGV library — IGV uses them for |
| 62 | +its own GCS/Drive file-access features; they are never fetched at |
| 63 | +report-load time.) |
| 64 | + |
| 65 | +### §16 — Secrets in tracked files |
| 66 | + |
| 67 | +`git ls-files | xargs grep -lE 'hf_[a-zA-Z0-9]{20,}'` returned nothing. |
| 68 | +The top-level `AUDIT_PROMPT_WITH_TOKENS.md` does contain a real HF |
| 69 | +token but is **gitignored** and not tracked. |
| 70 | + |
| 71 | +### §17 — Dependency pinning |
| 72 | + |
| 73 | +Five of the 6 bare deps in `environment.yml` fixed: `jupyter`, `notebook`, |
| 74 | +`ipykernel`, `samtools`, `htslib` now carry floor versions |
| 75 | +(`>=1.0` / `>=6.4` / `>=6.0` / `>=1.15`). The sixth, `pip`, is intentionally left |
| 76 | +bare (conda's own packaging primitive). |
| 77 | + |
| 78 | +### §18 — License / attribution |
| 79 | + |
| 80 | +- `LICENSE` present (MIT, 2024 Pinello Lab) — ✓. |
| 81 | +- Created **`docs/THIRD_PARTY.md`** enumerating every upstream oracle |
| 82 | + with paper DOI and license, the bundled IGV library with its |
| 83 | + upstream license URL, and the CDF dataset license. Linked from |
| 84 | + `README.md` in the "Further reading" table. |
| 85 | +- Bundled `chorus/analysis/static/igv.min.js` does not carry an |
| 86 | + upstream MIT header in-line, but its upstream license is cited in |
| 87 | + `docs/THIRD_PARTY.md`. If strict header preservation matters for |
| 88 | + redistribution, wrap the min.js with its license block on the next |
| 89 | + IGV version bump (P2 follow-up, not a ship blocker). |
| 90 | + |
| 91 | +## Fixed in this pass |
| 92 | + |
| 93 | +1. **`chorus/oracles/alphagenome.py:22`** — "5,930" → "5,731" tracks in the |
| 94 | + `AlphaGenomeOracle` class docstring. Last remaining source-level |
| 95 | + drift after v18. **P1** |
| 96 | +2. **`environment.yml`** — floor-pins on `jupyter`, `notebook`, |
| 97 | +   `ipykernel`, `samtools`, `htslib`. Prevents a clean-install user |
| 98 | +   from silently resolving to releases older than these floors. **P1** |
| 99 | +3. **`docs/THIRD_PARTY.md`** — new file; Oracle + IGV attribution table |
| 100 | + with papers + licenses. **P1** |
| 101 | +4. **`README.md`** — added THIRD_PARTY.md link in "Further reading". **P2** |
| 102 | +5. **`CLAUDE.md`** (new, root of repo) — points future Claude sessions |
| 103 | + at `audits/AUDIT_CHECKLIST.md` as the canonical ship-prep runbook, |
| 104 | + documents the env matrix and regen workflow. |
| 105 | + |
| 106 | +## Deferred — P1 follow-ups not fixed here |
| 107 | + |
| 108 | +- **§14 indel pre-validation** — `OracleBase.predict_variant_effect` |
| 109 | + does **not** check `len(ref) == len(alt)` before invoking the model. |
| 110 | + For SNV-only oracles this lets indels silently skate through and |
| 111 | + potentially crash inside the model wrapper. Each oracle has its own |
| 112 | + rule (AlphaGenome handles indels; Enformer/ChromBPNet/Sei/LegNet/Borzoi |
| 113 | + don't), so this needs a per-oracle capability flag + a shared guard |
| 114 | + in `core/base.py` with a clear "indels not supported by <oracle>" |
| 115 | + message. Not a one-line fix; spawning a focused PR is the right move. |
| 116 | +- **§14 multi-allelic** — the predict path takes `alleles=['A','C','G','T']` |
| 117 | + today; the *report renderer* has not been exercised against > 1 alt |
| 118 | + in the current checklist run. Worth writing a regression test. |
| 119 | +- **§17 `pip-audit`** — not run here (requires a fresh env create to be |
| 120 | + meaningful). Add to the release gate once CI runs the env create. |
| 121 | +- **§3 CHORUS_DEVICE=cpu on a GPU host** — listed in checklist as P2; |
| 122 | + not exercised this pass. |
| 123 | + |
| 124 | +## Bottom line |
| 125 | + |
| 126 | +Checklist runbook works — every §4/§5/§10/§11/§15/§16 check produced a |
| 127 | +clean binary result. One real drift found (`alphagenome.py` docstring, |
| 128 | +5,930 → 5,731) and fixed. `environment.yml` tightened, `docs/THIRD_PARTY.md` |
| 129 | +added, `CLAUDE.md` established. 334 tests green. No new P0s; three |
| 130 | +P1 follow-ups filed for a later PR (§14 indel guard, §14 multi-allelic |
| 131 | +report test, pip-audit in release gate). |
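
The deferred §14 indel guard (a per-oracle capability flag plus a shared check that fails with a clear "indels not supported by <oracle>" message) might look roughly like the sketch below. It is not taken from the repository: `OracleSketch`, `supports_indels`, `check_alleles`, and `UnsupportedVariantError` are hypothetical names standing in for whatever `OracleBase` in `core/base.py` ends up using.

```python
from dataclasses import dataclass


class UnsupportedVariantError(ValueError):
    """Raised when an oracle is asked to score a variant type it cannot handle."""


@dataclass
class OracleSketch:
    # Hypothetical stand-in for OracleBase; only the guard logic is shown.
    name: str
    supports_indels: bool = False  # hypothetical per-oracle capability flag

    def check_alleles(self, ref: str, alts: list[str]) -> None:
        """Shared guard: reject length-changing alleles unless the oracle opts in."""
        if self.supports_indels:
            return
        for alt in alts:
            if len(alt) != len(ref):
                raise UnsupportedVariantError(
                    f"indels not supported by {self.name}: "
                    f"ref={ref!r} (len {len(ref)}), alt={alt!r} (len {len(alt)})"
                )


enformer = OracleSketch("enformer")                       # SNV-only in this sketch
alphagenome = OracleSketch("alphagenome", supports_indels=True)

enformer.check_alleles("A", ["C", "G"])    # SNVs pass silently
alphagenome.check_alleles("A", ["ACGT"])   # insertion allowed for an indel-capable oracle
try:
    enformer.check_alleles("A", ["ACGT"])  # insertion rejected before any model call
except UnsupportedVariantError as exc:
    print(exc)
```

Calling such a guard at the top of `predict_variant_effect` would turn a possible crash inside the model wrapper into an early, readable error, and the same flag could drive the regression test for the multi-allelic case.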
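
The two §4 conditions (monotonic `effect_cdfs`, and `summary_cdfs` with p50 ≤ p95 ≤ p99) reduce to simple ordering checks per track. The sketch below shows that logic on toy numbers only. How the CDF files are actually laid out and loaded is not described in this report, so the function names and the flat-list input are assumptions.

```python
from typing import Sequence


def is_non_decreasing(values: Sequence[float]) -> bool:
    """Return True if each value is >= the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))


def track_problems(cdf: Sequence[float], p50: float, p95: float, p99: float) -> list[str]:
    """List the sanity checks a single track fails (empty list means clean)."""
    problems = []
    if not is_non_decreasing(cdf):
        problems.append("cdf not monotonic")
    if not (p50 <= p95 <= p99):
        problems.append("percentiles out of order")
    return problems


# Toy data, not real oracle output.
print(track_problems([0.0, 0.2, 0.7, 1.0], p50=0.1, p95=0.4, p99=0.9))  # → []
print(track_problems([0.0, 0.5, 0.3, 1.0], p50=0.1, p95=0.9, p99=0.4))
# → ['cdf not monotonic', 'percentiles out of order']
```

Returning a list of named failures rather than a single boolean keeps the per-oracle table easy to fill in: a ✓ corresponds to an empty list for every track.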