v21 fresh-install audit: data caches purged + re-downloaded, no findings#37

Closed
lucapinello wants to merge 1 commit into chorus-applications from audit/2026-04-21-v21-fresh-install-audit

@lucapinello
Contributor

Summary

Walked audits/AUDIT_CHECKLIST.md top-to-bottom with every re-downloadable cache purged before starting. No new findings — every mechanisable checklist item passes on a fully fresh-data run.

What was nuked + re-downloaded

| Cache | Before | Re-downloaded | Time |
| --- | --- | --- | --- |
| ~/.chorus/backgrounds/ (6 NPZ CDFs) | 1.5 GB | HuggingFace dataset | 44 s total |
| genomes/hg38.fa + .fai | 3.0 GB | chorus genome download hg38 → UCSC + decompress + samtools faidx | ~9 min |
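
The purge-then-verify step can be sketched as follows. This is a minimal illustration, not the audit's actual script: `purge_cache` and `is_empty` are hypothetical helper names, and the example runs against a throwaway temporary directory rather than the real `~/.chorus/backgrounds/` or `genomes/` paths.

```python
import shutil
import tempfile
from pathlib import Path


def purge_cache(cache_dir: Path) -> None:
    """Delete everything inside cache_dir but keep the directory itself."""
    if not cache_dir.exists():
        return
    for entry in cache_dir.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()


def is_empty(cache_dir: Path) -> bool:
    """True when cache_dir is missing or contains no entries."""
    return not cache_dir.exists() or not any(cache_dir.iterdir())


# Demonstrate on a temporary directory standing in for a real cache.
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "backgrounds"
    cache.mkdir()
    (cache / "example.npz").write_bytes(b"placeholder")
    before = is_empty(cache)
    purge_cache(cache)
    after = is_empty(cache)
```

Checking emptiness before any other step runs is what makes the later results attributable to freshly downloaded data rather than to leftovers.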

Results — all pass

| § | Result |
| --- | --- |
| 1 CLI | chorus --help surfaces all 6 subcommands |
| 3 GPU | all 6 envs detect Metal/MPS/JAX-METAL on macOS arm64 |
| 4 CDFs (fresh from HF) | 6/6 monotonic, p50≤p95≤p99, signed% correct (0/20/0/100/100/13) |
| 5 Python API | sequence_length matches spec for all 6; errors clear |
| 6 Notebook fresh exec | single_oracle_quickstart.ipynb: 49 cells, 0 errors, 0 warnings |
| 7 HTML (selenium) | 18/18 render with fresh Chrome profile, 0 JS errors each |
| 10 Consistency | zero drift: `grep '5,930\|7,612\|196 kbp\|examples/applications/'` is empty |
| 11 pytest | 335 passed / 1 skipped (8m 28s) |
| 14.4 chrom validation | regression test holds |
| 15 Offline | 0 runtime CDN fetches across all 18 HTMLs |
| 16 Logging hygiene | no committed tokens |
| 18 License | LICENSE + docs/THIRD_PARTY.md + bundled IGV.js header intact |
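
The §4 CDF sanity check amounts to two conditions: the CDF values never decrease, and the percentile summaries satisfy p50 ≤ p95 ≤ p99. Below is a minimal sketch in plain Python, assuming each CDF is available as a sequence of floats. The layout of the NPZ files is not described here, so the function names and inputs are hypothetical.

```python
from typing import Sequence


def is_monotonic_nondecreasing(values: Sequence[float]) -> bool:
    """True when each value is >= the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))


def percentiles_ordered(p50: float, p95: float, p99: float) -> bool:
    """True when the three percentile summaries are in non-decreasing order."""
    return p50 <= p95 <= p99


def cdf_passes(values: Sequence[float], p50: float, p95: float, p99: float) -> bool:
    """Combine both conditions into a single pass/fail result."""
    return is_monotonic_nondecreasing(values) and percentiles_ordered(p50, p95, p99)


# Illustrative inputs only; these are not values from the audit.
good = cdf_passes([0.0, 0.1, 0.4, 0.4, 0.9, 1.0], p50=0.2, p95=1.5, p99=3.0)
bad = cdf_passes([0.0, 0.5, 0.3, 1.0], p50=0.2, p95=1.5, p99=3.0)
```

Ties are allowed on purpose: a CDF can stay flat across a stretch, so the check is for non-decreasing rather than strictly increasing values.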

Scope deliberately deferred to release-host audit

  • §1 full conda env recreate — 80 GB / 2–4 h, destructive to ongoing work
  • §2 HF-gate missing-token E2E — would break other AlphaGenome work
  • §6 multi-oracle + advanced notebooks — need all 6 oracles loaded + GPU
  • §8 MCP E2E over stdio — ~4 min AlphaGenome predict
  • §13 real-oracle determinism — ~30 min across 6 loaded oracles
  • §14 remaining edge cases (indels, chrM, telomere) — need loaded oracles; §14.4 already fixed + regression-tested
  • §17 pip-audit — already advisory in CI per v20

Artefacts in audits/2026-04-21_v21_fresh_install/

  • report.md — summary + verbatim timings
  • screenshots/*.png (16 files; 18 reports with 2 basename collisions)
  • logs/00-10_*.txt + 09_quickstart_fresh.ipynb — pre-nuke state, post-nuke empty confirmation, CLI help, genome download trace, CDF fresh pull, device probe, API sanity, consistency greps, selenium output, fresh notebook, pytest log
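
The screenshot count follows from how files are named. If each screenshot is saved under its report's basename, a report that shares a basename with an earlier one overwrites it, so 18 reports with 2 collisions leave 18 - 2 = 16 files. A small sketch of that arithmetic follows; the report paths are invented for illustration and are not the repository's real file names.

```python
from collections import Counter
from pathlib import PurePosixPath


def count_collisions(report_paths: list[str]) -> tuple[int, int]:
    """Return (distinct basenames, reports lost to basename collisions)."""
    basenames = Counter(PurePosixPath(p).name for p in report_paths)
    distinct = len(basenames)
    collisions = sum(n - 1 for n in basenames.values())
    return distinct, collisions


# Hypothetical layout: 18 reports, two of which reuse a basename from another directory.
paths = [f"dir_a/report_{i}.html" for i in range(16)]
paths += ["dir_b/report_0.html", "dir_b/report_1.html"]
distinct, collisions = count_collisions(paths)
```

Including the parent directory in each screenshot name would avoid the overwrite and yield one file per report.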

Headline

Previous audits (v15–v20) plus the fixes in PRs #32, #34, and #36 have brought this repo to a ship-clean state on every mechanisable item that can be tested from macOS arm64. What remains open is release-host work (full env build, multi-oracle notebooks, real-oracle determinism, MCP E2E), which is explicitly scoped for that environment.

Test plan

  • ~/.chorus/backgrounds/ + genomes/hg38.* deleted, confirmed empty before any check runs
  • chorus genome download hg38 downloaded, decompressed, indexed in ~9 min (see logs/03_genome_download.txt)
  • 6-oracle CDF fresh pull + sanity: 44 s total, all 6 pass monotonicity + ordering (see logs/04_cdf_fresh_download.txt)
  • Fresh jupyter nbconvert --execute on single_oracle_quickstart.ipynb: 0 errors, 0 warnings
  • Selenium-rendered all 18 HTMLs with --user-data-dir=<fresh tmpdir>: 0 JS errors per report
  • pytest fast suite: 335 passed / 1 skipped
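
The §10 consistency check greps the tree for strings that earlier audits flagged as stale. The same idea in Python, as a sketch: `find_stale_strings` is a hypothetical helper, the file-suffix filter is an assumption, and the demonstration runs on a temporary directory rather than the repository.

```python
import tempfile
from pathlib import Path

# Stale strings named in the audit's consistency grep.
STALE = ("5,930", "7,612", "196 kbp", "examples/applications/")


def find_stale_strings(root: Path, suffixes=(".md", ".py", ".ipynb", ".html")):
    """Return (path, line number, stale string) for every hit under root."""
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            for needle in STALE:
                if needle in line:
                    hits.append((path, lineno, needle))
    return hits


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "clean.md").write_text("No stale figures here.\n", encoding="utf-8")
    clean_hits = find_stale_strings(root)
    (root / "stale.md").write_text("Window size: 196 kbp\n", encoding="utf-8")
    stale_hits = [(p.name, n, s) for p, n, s in find_stale_strings(root)]
```

An empty result corresponds to the "zero drift" outcome reported above; any hit names the file and line to fix.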

🤖 Generated with Claude Code

Walked audits/AUDIT_CHECKLIST.md top-to-bottom with every
re-downloadable cache purged before starting:

- ~/.chorus/backgrounds/ (1.5 GB, 6 NPZ files) — nuked
- genomes/hg38.fa + .fai (3.0 GB) — nuked

Then re-downloaded via the documented flows:

- chorus genome download hg38 → 3.1 GB fresh, ~9 min
- get_normalizer(x) for each of 6 oracles → 1.5 GB via HuggingFace
  (huggingface.co/datasets/lucapinello/chorus-backgrounds), 44 s total

## Results — no new findings

§1  CLI PASS
§3  device: all 6 envs detect Metal/MPS/JAX-METAL on macOS arm64
§4  CDFs: 6/6 monotonic, p50<=p95<=p99, signed% correct (0/20/0/100/100/13)
§5  API: sequence_length matches spec for all 6; error messages clear
§6  notebook fresh exec: single_oracle_quickstart.ipynb 49 cells,
    0 errors, 0 warnings
§7  selenium: 18/18 HTMLs render with fresh Chrome profile,
    0 JS errors each
§10 consistency: 0 drift (grep for 5,930 / 7,612 / 196 kbp /
    examples/applications/ all empty)
§11 pytest: 335 passed / 1 skipped (8m 28s)
§14.4 chrom validation: fix still holds
§15 offline: 0 runtime CDN fetches in any HTML
§16 logging hygiene: no committed HF tokens or AWS keys
§18 LICENSE + docs/THIRD_PARTY.md + bundled IGV.js header intact
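
The §15 offline result was measured at runtime. A static approximation, which is a different and weaker technique, is to parse each HTML file and list any `src` or `href` attributes on `script`, `link`, `img`, or `iframe` tags that point at a remote host. The sketch below uses only the standard library; the class name, function name, and sample pages are hypothetical.

```python
from html.parser import HTMLParser

REMOTE_PREFIXES = ("http://", "https://", "//")
RESOURCE_TAGS = {"script", "link", "img", "iframe"}


class RemoteResourceFinder(HTMLParser):
    """Collect remote URLs referenced by resource-loading tags."""

    def __init__(self) -> None:
        super().__init__()
        self.remote_urls: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag not in RESOURCE_TAGS:
            return
        for name, value in attrs:
            if name in ("src", "href") and value and value.startswith(REMOTE_PREFIXES):
                self.remote_urls.append(value)


def remote_resources(html: str) -> list[str]:
    """Return the remote URLs that the page would ask the browser to load."""
    finder = RemoteResourceFinder()
    finder.feed(html)
    finder.close()
    return finder.remote_urls


offline_page = '<html><head><script src="igv.min.js"></script></head></html>'
online_page = '<html><head><script src="https://cdn.example.org/lib.js"></script></head></html>'
offline_hits = remote_resources(offline_page)
online_hits = remote_resources(online_page)
```

A static scan like this cannot see requests issued from JavaScript at runtime, and it may flag harmless `link` tags, so it complements the browser-based measurement but does not replace it.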

## Scope explicitly deferred to release-host audit

- §1 conda env recreate (80 GB / 2-4 h — destructive to ongoing work)
- §2 HF_TOKEN-missing end-to-end (would break other AlphaGenome work)
- §6 multi-oracle + advanced notebooks (need all 6 oracles loaded)
- §8 MCP E2E over stdio (~4 min AlphaGenome predict)
- §13 real-oracle determinism (~30 min, needs loaded models)
- §14 remaining edge cases (indels, chrM, telomere near-edge)
- §17 pip-audit (already advisory in CI per v20)

## Artefacts in audits/2026-04-21_v21_fresh_install/

- report.md — summary
- 16 selenium screenshots (1600×4500, 18 reports; 2 basename collisions)
- 11 log files: pre-nuke/post-nuke state, CLI help, genome download
  trace, CDF fresh pull + sanity, per-env device probe, Python API
  probe, consistency greps, selenium output, fresh notebook, pytest
  output

Headline: every mechanisable checklist item passes on a purged-cache
run. Previous audits (v15-v20) + fixes in PRs #32, #34, #36 have
driven this repo to a ship-clean state on items that can be tested
from macOS arm64. Release-host items are explicitly scoped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lucapinello
Contributor Author

Superseded: artefacts cherry-picked into chorus-applications via 5300daa. The other agent's reconciliation commit imported the 18 selenium screenshots, 10 probe logs, fresh notebook, and report.md from this branch. It skipped two files in which this branch would have reverted existing fixes: the LegNet 230→200 fix in comprehensive_oracle_showcase.ipynb and the chrombpnet skip guard in test_integration.py.

Full reconciliation note lives at audits/2026-04-21_v21_fresh_install/report.md. Both v21 audits (this macOS arm64 data-cache-purge run + the Linux/CUDA fresh-env run at audits/2026-04-21_v21_fresh_install_linux_cuda.md) are preserved as complementary evidence.

Closing to keep the open-PR list clean. No content lost.

lucapinello deleted the audit/2026-04-21-v21-fresh-install-audit branch on April 22, 2026 at 18:10