Commit 3735ea5

lucapinello and claude committed
Setup prefetch + health classification + token flow

- `chorus setup <oracle>` now pre-downloads weights, backgrounds, and hg38 by default; writes `downloads/<oracle>/.chorus_setup_v1` marker on success. Escape hatches: --no-weights, --no-backgrounds, --no-genome.
- `chorus setup --oracle all` (and bare `chorus setup`) resolves the HF token FIRST (CLI flag → env → huggingface_hub.whoami() → interactive prompt) and halts the whole flow if unresolvable — no partial multi-GB download happens. LDlink prompt is non-blocking.
- `chorus health` gets a cheap on-disk / auth probe before the 120s subprocess. Missing weights/auth → "Not installed — run `chorus setup <oracle>`" (3 s per oracle) instead of "Unhealthy" after a 120 s hang.
- Sei no longer eager-downloads the 3.3 GB Zenodo tarball in __init__; it now lazy-downloads inside load_pretrained_model. list_assay_types / list_class_types fall back to the packaged metadata files in chorus/oracles/sei_source/ so they work without any download.
- LegNet: urllib.urlretrieve → shared download_with_resume helper (Range resume + fcntl lock + tqdm progress).
- `chorus.utils.http.download_with_resume` grows a tqdm progress bar (falls back to periodic log lines when stdout isn't a TTY).
- `chorus.utils.ld` now reads `LDLINK_TOKEN` env + `~/.chorus/config.toml` as fallbacks for the LDlink API token.
- `chorus.create_oracle(..., use_environment=False)` now correctly propagates the flag into the oracle instance.
- README gains a `## Tokens` section documenting HF + LDlink.

Audit: audits/2026-04-23_setup-prefetch-and-health-classification.md (336 passed / 4 deselected / 0 failed on the fast suite).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 5cd7dae commit 3735ea5
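
The resolution order named in the commit message (CLI flag, then environment, then an existing login, then an interactive prompt) can be sketched as a chain of lazy lookups. This is a minimal illustration under stated assumptions, not the code in `chorus/cli/_tokens.py`: `resolve_token`, `cached_lookup`, and `validator` are hypothetical names. In the real flow the validation step would presumably be the `huggingface_hub.whoami()` call the commit message mentions; here it is an injectable callable so the sketch runs offline.

```python
import getpass
import os
from typing import Callable, Optional


def resolve_token(
    cli_token: Optional[str] = None,
    env_vars: tuple = ("HF_TOKEN", "HUGGING_FACE_HUB_TOKEN"),
    cached_lookup: Callable[[], Optional[str]] = lambda: None,
    validator: Callable[[str], bool] = bool,
    interactive: bool = False,
) -> Optional[str]:
    """Return the first candidate token that `validator` accepts, else None."""
    # Each source is a lazy getter, so the prompt only fires when every
    # earlier source is missing or rejected.
    sources = [("--hf-token", lambda: cli_token)]
    for name in env_vars:
        sources.append((f"env {name}", lambda name=name: os.environ.get(name)))
    sources.append(("cached login", cached_lookup))
    if interactive:
        sources.append(("prompt", lambda: getpass.getpass("HuggingFace token: ")))

    for label, getter in sources:
        token = getter()
        if token and validator(token):
            # Report where the token came from, never the token itself.
            print(f"token resolved from {label}")
            return token
    return None
```

A caller that needs the token for a multi-oracle run would check the return value once, before any environment build, and exit non-zero on `None`.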

13 files changed

Lines changed: 1064 additions & 66 deletions

README.md

Lines changed: 35 additions & 7 deletions
@@ -123,16 +123,24 @@ pip install -e .
 # have chorus installed)
 python -m ipykernel install --user --name chorus --display-name "Python 3 (chorus)"

-# 4. Set up at least one oracle environment (see below)
+# 4. Set up at least one oracle environment (see below).
+# `chorus setup` now pre-downloads weights + backgrounds + hg38 by
+# default, so the oracle is ready to predict when the command exits.
 chorus setup --oracle enformer # lightweight CPU-friendly starter

-# 5. Download the reference genome your analyses will need
-chorus genome download hg38
+# Or pre-install every oracle at once (HF token prompted up front):
+# chorus setup --oracle all

 # Verify installation
 python -c "import chorus; print(f'chorus {chorus.__version__}')"
 ```

+> **Tokens.** AlphaGenome is a gated HuggingFace model — `chorus setup`
+> will prompt for `HF_TOKEN` the first time you pull it (or pass
+> `--hf-token`). `LDLINK_TOKEN` is optional and only used by
+> `fine_map_causal_variant`; `chorus setup` will offer a non-blocking
+> prompt. See the [Tokens](#tokens) section below.
+
 > **Two env files, one source of truth.** The root `environment.yml` is
 > what you install. The per-oracle files in `environments/` are consumed
 > internally by `chorus setup --oracle <name>` — you don't install them
@@ -179,7 +187,26 @@ You can check the correctness of installation using the following command:
 chorus health --timeout 300
 ```

-**Note:** The first health check (or first prediction) for each oracle may take several minutes as model weights are downloaded automatically. Subsequent runs will be much faster.
+**Note:** As of the consolidated-setup change, `chorus setup` pre-downloads
+each oracle's default weights + background CDFs + the `hg38` reference at
+install time, so subsequent `chorus health` / prediction calls are fast.
+If you opted out via `--no-weights`, the first prediction will still do a
+lazy download.
+
+### Tokens
+
+Two tokens are relevant. `chorus setup` surfaces both so they aren't a
+mid-prediction surprise:
+
+| Token | When you need it | How `chorus setup` handles it |
+|---|---|---|
+| `HF_TOKEN` (HuggingFace) | Required for **AlphaGenome** — the `google/alphagenome-all-folds` model is gated. | Resolved via `--hf-token` → `HF_TOKEN` / `HUGGING_FACE_HUB_TOKEN` env → existing `huggingface-cli login` → interactive prompt. `chorus setup --oracle all` **halts** the whole flow if no working token can be resolved, so the other 5 oracles aren't built for nothing. |
+| `LDLINK_TOKEN` | Optional — only used by `fine_map_causal_variant` (auto-fetch LD proxies from the NIH LDlink REST API). | Non-blocking prompt during `chorus setup --oracle all`. If provided, stored in `~/.chorus/config.toml`; `chorus.utils.ld` also reads `LDLINK_TOKEN` from env. |
+
+Register an HF read token at <https://huggingface.co/settings/tokens>,
+then accept the model license at
+<https://huggingface.co/google/alphagenome-all-folds>. Register a free
+LDlink token at <https://ldlink.nih.gov/?tab=apiaccess>.

 ### Managing Reference Genomes

@@ -233,10 +260,11 @@ and cached at `~/.chorus/backgrounds/`.

 > **The backgrounds dataset is public — no HuggingFace token required.**
 > `HF_TOKEN` is only needed for the gated AlphaGenome model itself (see
-> the AlphaGenome section below). Causal prioritization with auto-LD-fetch
-> needs a separate free LDlink token (see Troubleshooting).
+> [Tokens](#tokens) above). Causal prioritization with auto-LD-fetch
+> needs a separate free LDlink token.

-To pre-download all backgrounds (optional, avoids the first-use wait):
+`chorus setup --oracle <name>` pulls the relevant backgrounds for you
+automatically (skip with `--no-backgrounds`). To pre-download by hand:

 ```python
 from chorus.analysis.normalization import download_pertrack_backgrounds

audits/2026-04-23_setup-prefetch-and-health-classification.md

Lines changed: 160 additions & 0 deletions
@@ -0,0 +1,160 @@
+# 2026-04-23 — Setup prefetch + health classification + token flow
+
+Author: Claude (session driven by Luca)
+Scope: the consolidated change landing
+`chorus/core/weights_probe.py`, `chorus/cli/_setup_prefetch.py`,
+`chorus/cli/_setup_all.py`, `chorus/cli/_tokens.py`, plus
+modifications to `chorus/cli/main.py`, `chorus/core/environment/runner.py`,
+`chorus/__init__.py`, `chorus/utils/http.py`, `chorus/utils/ld.py`,
+`chorus/oracles/legnet.py`, `chorus/oracles/sei.py`, and `README.md`.
+
+## What was run
+
+Sections of [`AUDIT_CHECKLIST.md`](AUDIT_CHECKLIST.md) that could be
+affected by the change were executed in full; sections that are
+orthogonal (§3 GPU, §4 CDF math, §6 notebooks, §7 HTML reports,
+§8 MCP, §12/§13 reproducibility, §14 genomics edges, §15 offline,
+§17 supply chain, §18 license) were **not** re-run — the change
+doesn't touch those code paths.
+
+## Summary
+
+- 336 passed, 4 deselected (integration), 0 failed in the fast suite.
+`mamba run -n chorus python -m pytest tests/ --ignore=tests/test_smoke_predict.py -m "not integration" -q` → **pass**.
+- `chorus health` on a machine with no setup markers: 6 oracles, 7.2 s
+total, each clearly reports "Not installed — run `chorus setup <oracle>`"
+with the exact missing artifacts. Previously Sei alone hung the
+120 s subprocess timeout.
+- Health → Healthy transition verified end-to-end with a fabricated
+complete state (marker + artifacts).
+- `chorus setup --oracle all` without an HF token + non-TTY stdin halts
+with rc=1 **before any env build** and emits the three token hints
+(`HF_TOKEN`, `--hf-token`, `huggingface-cli login`).
+- `create_oracle('fakeOracle')` still raises `ValueError` naming the six
+valid options — the `kwargs.setdefault("use_environment", False)` change
+in `chorus/__init__.py` is a no-op for unknown oracles (the check
+runs first).
+- Interactive `HF_TOKEN` prompt uses `getpass` (hidden); `LDLINK_TOKEN`
+prompt was switched from `input()` to `getpass` during the audit.
+
+## Per-section findings
+
+### §1 Installation & environment
+- [x] **§1.3** `chorus --help` and every subcommand's `--help` render
+non-empty (setup: 20 lines, health: 8, genome: 12, etc).
+- [x] **§1.6** Idempotency: `chorus setup --oracle enformer --no-weights
+--no-backgrounds --no-genome` on an already-present env returns
+exit 0 and does not rebuild.
+- [x] **§1.9** `~/.chorus/backgrounds/` auto-download: verified by
+running `chorus setup --oracle enformer --no-weights --no-genome`
+which pulled `enformer_pertrack.npz` from HF in 21 s and wrote
+it to the canonical cache.
+- [x] **New** Setup marker convention added: `downloads/<oracle>/.chorus_setup_v1`
+is the proof-of-install signal read by `chorus health` and
+written by `chorus setup` on success. Documented in
+`chorus/core/weights_probe.py` docstring.
+- [x] **New P2, fixed during audit** `--force` now invalidates the
+stale marker up front so a mid-rebuild failure doesn't leave
+the oracle reporting Healthy (see `chorus/cli/main.py` and
+`chorus/cli/_setup_all.py`).
+
+### §2 HuggingFace authentication
+- [x] **§2.1** `HF_TOKEN` env path: verified — `whoami()` succeeds and
+we log the user name without exposing the token.
+- [x] **§2.3** No-token, no-login path raises a single clear message
+that names `HF_TOKEN`, the exact gated repo URL
+(`huggingface.co/google/alphagenome-all-folds`), and the
+`huggingface-cli login` alternative. All three hints present in
+the AlphaGenome error and in the new `chorus setup` halt message.
+- [x] **§2.4** Repo URL consistency: the string
+`huggingface.co/google/alphagenome-all-folds` appears in
+`chorus/oracles/alphagenome.py`, `chorus/oracles/alphagenome_source/templates/load_template.py`,
+`README.md` (three places including the new Tokens section), and
+the new `chorus/cli/_tokens.py`. No drift.
+
+### §5 Python API sanity
+- [x] **§5.1** `create_oracle('<name>', use_environment=False)` works
+for all 6 oracles (verified for legnet under `chorus-legnet` env).
+Invalid name raises `ValueError: Unknown oracle: fakeoracle.
+Available: ['enformer', 'borzoi', 'chrombpnet', 'sei', 'legnet',
+'alphagenome']`.
+- [x] **New behaviour** `use_environment=False` now correctly
+propagates into the oracle instance via
+`kwargs.setdefault("use_environment", False)`. Previously the
+oracle would default to `use_environment=True` inside the
+"direct" branch, which made the `chorus setup` prefetch script
+re-spawn a subprocess back into the env it was already running
+in. Covered by a bespoke smoke test during the audit.
+- [x] **§5.4** `predict_variant_effect` 1-based coordinate regression
+(`tests/test_prediction_methods.py::test_variant_position_is_1_based`)
+still passes (13/13 in test_prediction_methods.py).
+
+### §9 Error messages
+- [x] `create_oracle('fakeOracle')` names the six valid options.
+- [x] AlphaGenome HF token missing → message contains `HF_TOKEN`,
+gated repo URL, and `huggingface-cli login`.
+- [x] Network drop during `download_pertrack_backgrounds` returns 0
+and logs a warning (`tests/test_error_recovery.py::TestDownloadFailurePaths`
+2/2 pass).
+- [x] `chorus setup --oracle all` halt message names `HF_TOKEN`,
+`--hf-token`, and `huggingface-cli login` — all three paths.
+- [x] `test_missing_oracle_env_falls_back_gracefully` still passes.
+- [x] `test_download_with_resume_leaves_partial_and_resumes_on_second_call`
+still passes after the tqdm integration.
+
+### §10 Consistency of claims across the repo
+- [x] Drift grep
+(`grep -rn '5,930\|5930\|196 kbp\|examples/applications/' --include='*.md' --include='*.py' .`
+excluding `audits/`) returns nothing.
+- [x] No TODO/FIXME/WIP markers in any of the 12 changed/new files.
+- [x] README "Tokens" section (new) names both tokens consistently
+with `chorus/cli/_tokens.py` resolution order. LDlink
+Troubleshooting block pre-existed and is now backed by the
+`LDLINK_TOKEN` env var + `~/.chorus/config.toml` fallback added
+to `chorus/utils/ld.py`.
+
+### §11 Test suite
+- [x] **Fast suite** `mamba run -n chorus python -m pytest tests/
+--ignore=tests/test_smoke_predict.py -m "not integration" -q`
+→ 336 passed, 4 deselected, 0 failed, 72.7 s (threshold ≥334).
+- [N/A] Integration suite not run (no release-host access in this
+session). Flagged for the release-host auditor.
+
+### §16 Logging hygiene
+- [x] `grep -rn 'hf_[a-zA-Z0-9]\{20,\}' chorus/ examples/ docs/ audits/`
+returns nothing — no real tokens committed.
+- [x] All logger.info / logger.error calls in `chorus/cli/_tokens.py`
+log token **metadata** (source path, `whoami()` user name,
+success/rejection) but never the token value itself.
+- [x] Interactive prompts use `getpass` (hidden stdin) for both HF and
+LDlink after a polish during the audit.
+
+## Things deferred (not blocking this change)
+- P1 §1.4 Running `chorus setup --oracle <X>` end-to-end on a fresh
+Linux/CUDA host and on macOS-arm64 — requires release hosts we don't
+have in this session.
+- P1 §11 Integration-marked suite — same.
+- P1 §8 MCP E2E for rs12740374 — no changes to MCP code; skipped.
+
+## Files touched
+```
+R chorus/__init__.py (+5 lines)
+R chorus/cli/main.py (+141 / -lines reorganized)
+R chorus/core/environment/runner.py (+19 lines, probe wire-up)
+R chorus/oracles/legnet.py (urlretrieve → download_with_resume)
+R chorus/oracles/sei.py (lazy download + packaged-metadata fallback)
+R chorus/utils/http.py (tqdm integration)
+R chorus/utils/ld.py (LDLINK_TOKEN env + config fallback)
+R README.md (+Tokens section)
++ chorus/cli/_setup_all.py (93 lines)
++ chorus/cli/_setup_prefetch.py (173 lines)
++ chorus/cli/_tokens.py (205 lines)
++ chorus/core/weights_probe.py (140 lines)
+```
+
+## Verdict
+**Green.** Safe to commit the 8 modifications + 4 new files, but
+**do not `git add .` or `git add -A`**: `Untitled.ipynb` is a
+pre-existing untracked stray (Apr 22, not part of this change) and
+must be left out. Recommend a selective `git add` of the 12 files
+listed above.
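
The health classification the audit verifies (a cheap on-disk probe that runs before the expensive subprocess check) can be sketched as follows. This is an illustration under stated assumptions, not the contents of `chorus/core/weights_probe.py`: `classify_oracle`, `required_artifacts`, and `deep_check` are hypothetical names, and the status strings only approximate the real messages. The marker filename `.chorus_setup_v1` and the `downloads/<oracle>/` location come from the audit text.

```python
from pathlib import Path
from typing import Callable, Iterable

MARKER_NAME = ".chorus_setup_v1"


def classify_oracle(
    oracle: str,
    downloads_root: Path,
    required_artifacts: Iterable[str],
    deep_check: Callable[[str], bool],
) -> str:
    """Classify an oracle, running `deep_check` only when files are in place."""
    oracle_dir = Path(downloads_root) / oracle
    missing = [
        name
        for name in (MARKER_NAME, *required_artifacts)
        if not (oracle_dir / name).exists()
    ]
    if missing:
        # Cheap path: a few stat() calls, no model load, no subprocess.
        return f"Not installed: run `chorus setup {oracle}` (missing: {', '.join(missing)})"
    # Expensive path: only reached when the on-disk state looks complete.
    return "Healthy" if deep_check(oracle) else "Unhealthy"
```

Separating "Not installed" from "Unhealthy" this way means a missing download is reported in the time it takes to stat a few paths, instead of surfacing as a timeout.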

chorus/__init__.py

Lines changed: 5 additions & 0 deletions
@@ -190,6 +190,11 @@ def create_oracle(oracle_name: str, use_environment: bool = False, **kwargs):
             "2. Install oracle dependencies in the current environment"
         )
     oracle_class = get_oracle(oracle_name)
+    # Propagate use_environment=False explicitly. Oracle __init__
+    # defaults to use_environment=True, so without this the returned
+    # instance would re-spawn a subprocess back into the same env
+    # when load_pretrained_model() is called.
+    kwargs.setdefault("use_environment", False)
     return oracle_class(**kwargs)

 __all__ = [
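
The effect of the added `kwargs.setdefault(...)` line can be reproduced in isolation. The sketch below uses a hypothetical `FakeOracle` class and a hypothetical `create_direct` helper that stand in for a real oracle class and for the direct branch of `create_oracle`; only the `setdefault` call mirrors the diff.

```python
class FakeOracle:
    """Stand-in for an oracle whose __init__ defaults to use_environment=True."""

    def __init__(self, use_environment: bool = True, **kwargs):
        self.use_environment = use_environment


def create_direct(oracle_class, **kwargs):
    # Fill in False only when the caller did not pass the flag,
    # overriding the class-level default of True.
    kwargs.setdefault("use_environment", False)
    return oracle_class(**kwargs)
```

In `create_oracle` itself, `use_environment` is a named parameter, so it never arrives inside `kwargs` and the `setdefault` always takes effect on this code path.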

chorus/cli/_setup_all.py

Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,98 @@
+"""`chorus setup all` — orchestrates end-to-end setup of every oracle.
+
+Flow:
+  1. Resolve the HuggingFace token (blocking — no fallback). If the
+     user cannot produce a working token, halt BEFORE any env build
+     or download runs, because AlphaGenome cannot proceed without it
+     and the user asked for "all".
+  2. Optionally prompt for an LDlink token (non-blocking).
+  3. For each oracle, build the conda env, pre-download weights,
+     background CDFs, and (once) the hg38 reference, then write the
+     setup-complete marker.
+
+The single-oracle flow (``chorus setup <oracle>``) lives in
+``main.setup_environments``; ``setup all`` is intentionally a separate
+entry so its stricter gating is obvious from the call graph.
+"""
+
+from __future__ import annotations
+
+import logging
+
+logger = logging.getLogger(__name__)
+
+
+def setup_all_oracles(args) -> int:
+    """Implementation of ``chorus setup --oracle all``."""
+    from ..core.environment import EnvironmentManager, EnvironmentRunner
+    from ..core.weights_probe import write_setup_marker
+    from ._setup_prefetch import prefetch_for_oracle
+    from ._tokens import prompt_ldlink_token, resolve_hf_token
+
+    manager = EnvironmentManager()
+    oracles = manager.list_available_oracles()
+    if not oracles:
+        logger.error("No oracle environment definitions found.")
+        return 1
+
+    # HF token gate — must resolve BEFORE we start downloading 10+ GB of
+    # env + weights only to fail on the last oracle.
+    if not args.no_weights:
+        if not resolve_hf_token(
+            cli_token=getattr(args, "hf_token", None),
+            interactive=True,
+        ):
+            logger.error(
+                "`chorus setup all` halted: a working HuggingFace token is "
+                "required for AlphaGenome. Nothing was downloaded. "
+                "Set HF_TOKEN, run 'huggingface-cli login', or pass "
+                "--hf-token and retry."
+            )
+            return 1
+
+    # Non-blocking LDlink prompt.
+    prompt_ldlink_token(interactive=True)
+
+    runner = EnvironmentRunner(manager)
+    success_count = 0
+    for oracle in oracles:
+        logger.info(f"\n=== Setting up {oracle} ===")
+        if args.force:
+            from ..core.weights_probe import setup_marker_path
+            stale = setup_marker_path(oracle)
+            if stale.exists():
+                stale.unlink()
+        if not manager.create_environment(oracle, force=args.force):
+            logger.error(f"✗ Failed to build env for {oracle}")
+            continue
+        logger.info(f"✓ env for {oracle}")
+
+        if args.no_weights and args.no_backgrounds and args.no_genome:
+            logger.info(f"Skipping all data prefetch for {oracle}")
+            success_count += 1
+            continue
+
+        ok, errors = prefetch_for_oracle(
+            oracle,
+            runner,
+            skip_weights=args.no_weights,
+            skip_backgrounds=args.no_backgrounds,
+            skip_genome=args.no_genome,
+        )
+        if not ok:
+            logger.error(f"✗ prefetch failed for {oracle}:")
+            for err in errors:
+                logger.error(f" - {err}")
+            continue
+
+        if args.no_weights:
+            logger.info(
+                f"✓ {oracle} env ready (weights skipped — setup marker NOT written)"
+            )
+        else:
+            write_setup_marker(oracle)
+            logger.info(f"✓ {oracle} ready")
+        success_count += 1
+
+    logger.info(f"\nSetup all complete: {success_count}/{len(oracles)} oracles ready.")
+    return 0 if success_count == len(oracles) else 1
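
The `--force` handling in the loop above deletes the setup marker before the rebuild starts, so a failure partway through cannot leave a stale "installed" signal behind. A minimal sketch of that ordering, using a hypothetical `run_setup` helper and an injectable `build` callable in place of the real environment build and prefetch steps:

```python
from pathlib import Path
from typing import Callable

MARKER_NAME = ".chorus_setup_v1"


def run_setup(oracle_dir: Path, build: Callable[[], bool], force: bool = False) -> bool:
    """Write the marker only after `build` succeeds; drop it first when forcing."""
    marker = oracle_dir / MARKER_NAME
    if force:
        # Invalidate up front: from here on the oracle counts as not installed.
        marker.unlink(missing_ok=True)  # missing_ok requires Python 3.8+
    if not build():
        return False  # no marker is left behind on failure
    oracle_dir.mkdir(parents=True, exist_ok=True)
    marker.touch()
    return True
```

Deleting the marker after a failed build would also work in the simple case, but removing it first covers crashes and interrupts that never reach the cleanup code.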
