Commit abfc7a3

lucapinello and claude committed
v21 fresh-install Linux/CUDA audit: close deferred §1 + fix 2 P1s
Closes the §1-P0 item deferred since v18 ("truly-fresh mamba env create on Linux/CUDA"). Ran AUDIT_CHECKLIST.md §1, §2, §4, §5, §10, §11, §15, §16, §17, §18 end-to-end against a new mamba env created from environment.yml with HOME redirected to a scratch dir so ~/.chorus, ~/.cache/huggingface, and every other home-default cache was empty at start.

Two real P1s surfaced + fixed:

1. examples/notebooks/comprehensive_oracle_showcase.ipynb cells 1 and 21 said "LegNet ... 230 bp". Library reports sequence_length=200. Updated both cells to "200 bp". Last 230-bp drift in live docs.
2. tests/test_integration.py::test_chrombpnet_fresh_single_model_download failed on fresh env with ModuleNotFoundError: 'tensorflow'. Cascade: no chorus-chrombpnet env exists → use_environment=True falls back to direct load (correct graceful behavior) → base env has no TF → ModelNotLoadedError. Fix: skip guard at top of test that checks EnvironmentManager.environment_exists("chrombpnet") and skips with a "run chorus setup --oracle chrombpnet first" message. Down from 442s fail to 2.24s skip.

All other §1-§18 results clean on fresh env:

- §1 chorus --help + genome download + pip install -e . OK
- §4 CDFs auto-downloaded fresh from HF for all 6 oracles, every monotonic, p50≤p95≤p99, catalog count correct
- §5 sequence_length matches spec for all 6 oracles; unknown oracle raises ValueError with actionable "Available: [...]" list
- §11 fast suite: 332 passed / 1 skipped / 0 failed / 4 deselected (integration) in 43s. 332 + 3 passing integration tests on the main dev env = 335, matching the v20 count.
- §15 zero external <script src=http> / <link href=http> in any shipped walkthrough HTML (offline-safe)
- §16 zero HF tokens in tracked files
- §17 pip-audit clean: "No known vulnerabilities found" (confirms v20 pillow>=12.2.0 pin took effect)
- §18 LICENSE + docs/THIRD_PARTY.md + audits/AUDIT_CHECKLIST.md all present

Remaining deferreds (unchanged from v20): §14 indel pre-validation, §14 multi-allelic report regression test, §14.5 near-telomere with wide-window oracle, §3 CHORUS_DEVICE=cpu on GPU host.

Details: audits/2026-04-21_v21_fresh_install_linux_cuda.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent d23da66 commit abfc7a3

3 files changed

Lines changed: 161 additions & 2 deletions

audits/2026-04-21_v21_fresh_install_linux_cuda.md

Lines changed: 151 additions & 0 deletions
@@ -0,0 +1,151 @@
# v21 fresh-install Linux/CUDA audit — 2026-04-21

This audit closes the §1-P0 item deferred since v18 ("truly-fresh
`mamba env create` on a clean Linux/CUDA box"): the checklist was run
end-to-end on this host with maximally isolated caches.

## Isolation approach

- `SCRATCH=/tmp/chorus-audit-v21/`
- `HOME=$SCRATCH/home` during the audit → `~/.chorus`,
  `~/.cache/huggingface`, and every other "default to home" cache
  writes into the scratch tree, not the user's real home.
- Fresh env created via `mamba env create -f environment.yml --prefix
  $SCRATCH/env` — does not touch
  `/data/pinello/SHARED_SOFTWARE/envs/lp698_envs/chorus-*`.
- Only the HF token was copied into `$SCRATCH/home/.cache/huggingface/`
  so HF-gated auto-downloads could proceed.

Artefacts: [`/tmp/chorus-audit-v21/artifacts/`](/tmp/chorus-audit-v21/artifacts/)
(not committed — reproducible by re-running the audit).

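The isolation described above can be sketched in Python. This is a minimal illustration, not the audit's actual script: `isolated_env` and `home_cache_paths` are hypothetical helper names, and only the `HOME` redirect and the two cache locations come from the audit itself.

```python
import os
from pathlib import Path


def isolated_env(scratch: Path, base_env=None) -> dict:
    """Copy the environment and point HOME at <scratch>/home (created if missing)."""
    env = dict(os.environ if base_env is None else base_env)
    home = scratch / "home"
    home.mkdir(parents=True, exist_ok=True)
    env["HOME"] = str(home)
    return env


def home_cache_paths(env: dict) -> dict:
    """Resolve the home-default caches named in the audit under env['HOME']."""
    home = Path(env["HOME"])
    return {
        "chorus": home / ".chorus",
        "huggingface": home / ".cache" / "huggingface",
    }
```

A child process launched with `subprocess.run(cmd, env=isolated_env(scratch))` would then resolve `~` inside the scratch tree. One caveat worth checking on a given host: variables such as `HF_HOME` or `XDG_CACHE_HOME`, if set in the parent shell, redirect the Hugging Face cache regardless of `HOME`.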
## What was actually run

### §1 — install smoke (fresh env)

- `mamba env create -f environment.yml --prefix $SCRATCH/env`: **PASS** (9 min, 844-package transaction)
- `pip install -e .` inside the new env: **PASS**
- `chorus --help`: **PASS** (6 subcommands)
- `chorus genome download hg38`: **PASS** (fresh 3 GB download, indexed into `genomes/hg38.fa`)

### §2 — HF gate (with real token)

- `HF_TOKEN` honoured via `~/.cache/huggingface/token` in scratch HOME.
- AlphaGenome auto-download worked (see §4).

### §4 — fresh CDF auto-download (all 6 oracles, first-use from HF)

Every CDF NPZ was downloaded fresh from
[`huggingface.co/datasets/lucapinello/chorus-backgrounds`](https://huggingface.co/datasets/lucapinello/chorus-backgrounds)
on first `get_normalizer()` call. No pre-existing `~/.chorus/backgrounds`.

| oracle | n_tracks | effect_cdfs monotonic | summary_cdfs p50≤p95≤p99 | signed% | perbin_cdfs | status |
|---|---|---|---|---|---|---|
| enformer | 5,313 | ✓ | ✓ | 0% | yes | OK |
| borzoi | 7,611 | ✓ | ✓ | 20% | yes | OK |
| chrombpnet | 24 | ✓ | ✓ | 0% | yes | OK |
| sei | 40 | ✓ | ✓ | 100% | no | OK |
| legnet | 3 | ✓ | ✓ | 100% | no | OK |
| alphagenome | 5,168 | ✓ | ✓ | 12% | yes | OK |

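The two per-oracle checks in the table (monotonic effect CDFs, ordered summary percentiles) amount to simple comparisons. The sketch below works on plain Python sequences because the internal layout of the CDF NPZ files is not shown in this audit; the function names are hypothetical.

```python
from typing import Sequence


def is_monotonic_non_decreasing(values: Sequence[float]) -> bool:
    """True when every value is >= its predecessor, as a valid CDF must be."""
    return all(a <= b for a, b in zip(values, values[1:]))


def percentiles_ordered(p50: float, p95: float, p99: float) -> bool:
    """True when the summary percentiles satisfy p50 <= p95 <= p99."""
    return p50 <= p95 <= p99
```

Applied per track, a single failing track would be enough to mark the corresponding table cell as not OK.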
### §5 — Python API sanity

- `sequence_length` for all 6 oracles matches the expected spec.
- `create_oracle('fakeOracle')` raises `ValueError: Unknown oracle: fakeoracle. Available: ['enformer', 'borzoi', 'chrombpnet', 'sei', 'legnet', 'alphagenome']`, a clean, actionable message.

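For illustration, the error behaviour above can be reproduced with a small stand-alone lookup. This is a sketch, not chorus's implementation: `resolve_oracle_name` is a hypothetical helper, and the lower-casing step is inferred only from the `fakeoracle` spelling in the reported message.

```python
AVAILABLE = ["enformer", "borzoi", "chrombpnet", "sei", "legnet", "alphagenome"]


def resolve_oracle_name(name: str) -> str:
    """Normalise an oracle name, or raise ValueError listing the valid choices."""
    key = name.lower()  # assumption: names are compared case-insensitively
    if key not in AVAILABLE:
        raise ValueError(f"Unknown oracle: {key}. Available: {AVAILABLE}")
    return key
```

Listing the valid names in the exception text is what makes the failure actionable: the caller can correct a typo without consulting the docs.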
### §10 — repo-wide drift grep

- **Found 1 live drift, fixed in this PR**: `examples/notebooks/comprehensive_oracle_showcase.ipynb` markdown cells 1 + 21 said `"LegNet … 230 bp"`. Actual input is 200 bp (matches `create_oracle('legnet').sequence_length`). Updated both cells.
- No other drifts in tracked code (false positives: `common_snps_500.bed` coordinate `49325929-49325930`, IGV bundled JS internals, Sei/Borzoi numeric track IDs, base64 PNG payloads in notebook outputs).

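A drift scan of this kind could look roughly like the following. The audit does not show its actual grep command, so the pattern and the file-suffix filter here are assumptions; requiring the `bp` unit is one way to avoid false positives from bare numbers such as coordinates and track IDs.

```python
import re
from pathlib import Path

# Assumed pattern: the stale length followed by its unit.
DRIFT = re.compile(r"\b230\s*bp\b")


def find_drift(root: Path, suffixes=(".md", ".ipynb", ".py")) -> list:
    """Return (relative_path, line_number) for each line mentioning '230 bp'."""
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if DRIFT.search(line):
                hits.append((str(path.relative_to(root)), lineno))
    return hits
```

An empty result from `find_drift(repo_root)` would correspond to the "no other drifts" outcome reported above.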
### §11 — fast test suite (fresh env)

```
pytest tests/ --ignore=tests/test_smoke_predict.py -m "not integration" -q
→ 332 passed, 1 skipped, 4 deselected in 43s
```

(332 passed + 3 successfully running integration tests on the main
dev env = 335, matching the existing-env count from v20.)

### §11 — integration test: found a real P1, fixed here

`tests/test_integration.py::test_chrombpnet_fresh_single_model_download`
failed on the fresh env with `ModuleNotFoundError: No module named
'tensorflow'`. Root cause cascade:

1. User's new env only has `chorus` base env — no `chorus-chrombpnet`.
2. Test instantiates oracle with `use_environment=True`.
3. `create_oracle` sees the env is missing → gracefully falls back to
   `use_environment=False` (correct behaviour, per v5).
4. Direct load tries to `import tensorflow` in the base env → not
   installed → raises `ModelNotLoadedError`.

**Fix (in this PR)**: added a skip guard at the top of the test:

```python
from chorus.core.environment.manager import EnvironmentManager
if not EnvironmentManager().environment_exists("chrombpnet"):
    pytest.skip(
        "chorus-chrombpnet env missing — run `chorus setup --oracle chrombpnet` first. "
        "Without it, the subprocess oracle runner falls back to direct load which needs "
        "TensorFlow in the base env (not installed by default)."
    )
```

Verified: re-run on fresh env → **1 skipped in 2.24s** (down from
442 s fail).

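The root-cause cascade can be modelled as a small decision function. This is a hypothetical sketch for reasoning about the cases, not code from chorus: `plan_load` and its return labels are invented here, and unlike the skip guard in this PR (which checks only whether the env exists) it also considers whether the framework happens to be importable in the base env.

```python
import importlib.util


def module_available(name: str) -> bool:
    """True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


def plan_load(env_exists: bool, use_environment: bool, framework_importable: bool) -> str:
    """Classify what a test run would do under the cascade described above."""
    if use_environment and env_exists:
        return "subprocess"  # dedicated oracle env is used
    if framework_importable:
        return "direct"      # fallback to direct load can succeed
    return "skip"            # fallback would fail on import, so skip early
```

On the fresh env from this audit the inputs would be `plan_load(False, True, module_available("tensorflow"))`, which yields `"skip"` whenever TensorFlow is absent from the base env.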
### §15 — offline HTML rendering

No external `<script src=http…>` / `<link href=http…>` in any shipped
walkthrough HTML — offline-safe. Confirms v19 finding on fresh checkout.

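One way to run such a check with only the standard library is shown below. It is a sketch under assumptions: the audit does not say which tool it used, `external_refs` is a hypothetical name, and treating protocol-relative `//` URLs as external goes slightly beyond the `http` prefix named above.

```python
from html.parser import HTMLParser


class ExternalRefFinder(HTMLParser):
    """Collect <script src> and <link href> values that point off-machine."""

    def __init__(self):
        super().__init__()
        self.external = []

    def handle_starttag(self, tag, attrs):
        wanted = {"script": "src", "link": "href"}.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value and value.lower().startswith(
                ("http://", "https://", "//")
            ):
                self.external.append((tag, value))


def external_refs(html: str) -> list:
    """Return (tag, url) pairs for every external script or stylesheet reference."""
    finder = ExternalRefFinder()
    finder.feed(html)
    finder.close()
    return finder.external
```

A walkthrough file is offline-safe in the sense used here when `external_refs(text)` returns an empty list.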
### §16 — secrets

No HF tokens in tracked files (`git ls-files | xargs grep -lE 'hf_...'`
returns nothing). Confirms v19.

### §17 — pip-audit on fresh env

First attempt ran under the wrong Python (v21 audit-script bug — picked
up `/PHShome/lp698/.local/bin/pip` which is Python 2.7). Fixed by
invoking `$ENV/bin/pip-audit` directly:

```
No known vulnerabilities found
```

`pillow>=12.2.0` pin from v20 took effect; no unresolved CVEs on the
fresh env.

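The wrong-interpreter bug above is a `PATH` lookup problem, and it can be avoided by resolving tools from the env prefix explicitly. The helpers below are a hypothetical sketch (names invented here) that assumes the Linux layout `<prefix>/bin/<tool>` used in the audit.

```python
import shutil
from pathlib import Path


def resolve_in_env(env_prefix: Path, tool: str) -> Path:
    """Return <env_prefix>/bin/<tool>, ignoring PATH; fail loudly if absent."""
    exe = env_prefix / "bin" / tool
    if not exe.is_file():
        raise FileNotFoundError(f"{tool} not found in {env_prefix / 'bin'}")
    return exe


def shadowed_on_path(env_prefix: Path, tool: str):
    """Return the PATH hit for `tool` if it is NOT the env's copy, else None."""
    found = shutil.which(tool)
    expected = env_prefix / "bin" / tool
    if found is not None and Path(found).resolve() != expected.resolve():
        return Path(found)
    return None
```

Calling `shadowed_on_path(env_prefix, "pip")` before an audit run would surface a stray user-level `pip` like the one that caused the first attempt to fail.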
### §18 — LICENSE + attribution

- `LICENSE` MIT — ✓
- `docs/THIRD_PARTY.md` — ✓ (shipped in v19)
- `audits/AUDIT_CHECKLIST.md` — ✓

## Fixed in this PR

1. **`examples/notebooks/comprehensive_oracle_showcase.ipynb`** — LegNet
   "230 bp" → "200 bp" in cell 1 (hardware table) and cell 21 (LegNet
   section header). Matches library `sequence_length=200`.
2. **`tests/test_integration.py`** — skip guard added to
   `test_chrombpnet_fresh_single_model_download` so fresh-install
   users without a `chorus-chrombpnet` env see a helpful skip message
   instead of a TF `ModuleNotFoundError`.

## What the fresh install proved

- §1 install path works end-to-end from a clean state on Linux/CUDA.
- §4 CDF auto-download from HF works fresh for all 6 oracles.
- §5 API is clean.
- §11 fast suite is fully green.
- §17 no CVEs thanks to v20 pillow pin.
- Every cache path correctly respects `HOME` — no escapes into real
  `~/.chorus` or `~/.cache/huggingface`.

Checkbox status of the deferred §1 item: **closed**. Remaining
deferreds are §14 indel pre-validation, §14 multi-allelic report
regression test, §14.5 near-telomere with wide-window oracle,
§3 `CHORUS_DEVICE=cpu` on GPU host.

examples/notebooks/comprehensive_oracle_showcase.ipynb

Lines changed: 2 additions & 2 deletions
@@ -46,7 +46,7 @@
 "| **Borzoi** | PyTorch | 524 kb | 32 bp | 7,611 (DNASE, CAGE, CHIP, ATAC, RNA) |\n",
 "| **ChromBPNet** | TensorFlow | 2 kb | 1 bp | Per-model (DNASE, ATAC) |\n",
 "| **Sei** | PyTorch | 4 kb | 128 bp | 21,907 (histone marks, TF binding, sequence classes) |\n",
-"| **LegNet** | PyTorch | 230 bp | 1 bp | 1 (LentiMPRA) |\n",
+"| **LegNet** | PyTorch | 200 bp | 1 bp | 1 (LentiMPRA) |\n",
 "| **AlphaGenome** | JAX | 1 Mb | 1 bp | 5,731 (DNASE, CAGE, CHIP, ATAC, RNA, PRO-CAP, splice sites) |\n",
 "\n",
 "**Operations demonstrated:**\n",
@@ -926,7 +926,7 @@
 "---\n",
 "## 5. LegNet — PyTorch, LentiMPRA predictions\n",
 "\n",
-"LegNet predicts Lentiviral MPRA (Massively Parallel Reporter Assay) activity from 230 bp sequences. It measures regulatory element activity as a continuous score."
+"LegNet predicts Lentiviral MPRA (Massively Parallel Reporter Assay) activity from 200 bp sequences. It measures regulatory element activity as a continuous score."
 ]
 },
 {

tests/test_integration.py

Lines changed: 8 additions & 0 deletions
@@ -101,6 +101,14 @@ def test_chrombpnet_fresh_single_model_download(tmp_path):
     if not Path(reference_fasta).exists():
         pytest.skip("hg38.fa missing — run `chorus genome download hg38` first")
 
+    from chorus.core.environment.manager import EnvironmentManager
+    if not EnvironmentManager().environment_exists("chrombpnet"):
+        pytest.skip(
+            "chorus-chrombpnet env missing — run `chorus setup --oracle chrombpnet` first. "
+            "Without it, the subprocess oracle runner falls back to direct load which needs "
+            "TensorFlow in the base env (not installed by default)."
+        )
+
     oracle = chorus.create_oracle(
         "chrombpnet", use_environment=True, reference_fasta=reference_fasta,
     )
