Skip to content

Commit 016413d

Browse files
committed
Merge remote-tracking branch 'origin/audit/2026-04-21-v16-deep-notebook-html-pass' into chorus-applications
2 parents 0052930 + bd6c4fc commit 016413d

3 files changed

Lines changed: 93 additions & 6 deletions

File tree

Lines changed: 87 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,87 @@
1+
# v16 notebook + HTML report walkthrough — 2026-04-21
2+
3+
Walked the three `examples/notebooks/*.ipynb` cell-by-cell (every cell's
4+
source **and** execution output) and screenshotted eight of the shipped
5+
`examples/walkthroughs/**/*.html` reports via headless Chrome. Goal: catch
6+
anything a new user would see that is inconsistent, wrong, or confusing.
7+
8+
## Fixed in this PR
9+
10+
1. **Broken link in both multi-oracle notebooks.** Cell 0 of
11+
`comprehensive_oracle_showcase.ipynb` and
12+
`advanced_multi_oracle_analysis.ipynb` said
13+
`[`examples/walkthroughs/`](applications/)`. The label was correct
14+
but the URL still pointed at the old `applications/` directory (renamed
15+
to `walkthroughs/` in `340f30e`). Fixed to `../walkthroughs/`.
16+
2. **AlphaGenome track count drift.** Markdown in
17+
`comprehensive_oracle_showcase.ipynb` claimed "5,930 tracks across 11
18+
modalities" in three places (overview table, section 6 preamble, and a
19+
code comment in Operation 8). The running notebook prints
20+
`Loaded 5731 AlphaGenome tracks` and `Assay types: 7`. Fixed to
21+
"5,731 tracks" / "7 assay types" everywhere, and added PRO-CAP to the
22+
assay list in the overview table so it matches what `list_assay_types()`
23+
actually returns (`ATAC, CAGE, CHIP, DNASE, PRO_CAP, RNA, SPLICE_SITES`).
24+
3. **Enformer context-window typo.** `advanced_multi_oracle_analysis.ipynb`
25+
cell 9 called Enformer "a state-of-the-art sequence-to-activity model
26+
that can predict **5.313 human genomic tracks** with a context window
27+
of 196 kbp". The library reports `oracle.sequence_length == 393216`
28+
matches the comprehensive notebook's summary table (393 kb input). Also
29+
fixed the decimal separator (`5.313``5,313`). Changed to "393 kbp".
30+
31+
## Not fixed here — flagged for a follow-up
32+
33+
**Off-by-one in `predict_variant_effect`'s ref-allele check.** Both
34+
`single_oracle_quickstart.ipynb` (cell 39) and
35+
`comprehensive_oracle_showcase.ipynb` (cell 35) ship an execution output
36+
that contains
37+
38+
WARNING - Provided reference allele 'C' does not match the genome at
39+
this position ('T'). Chorus will use the provided allele.
40+
41+
for the test variant at `chrX:48786129`. The notebook code pulls the ref
42+
base via `extract_sequence('chrX:48786129-48786129')` which returns `'C'`
43+
(matches every external reference — UCSC, dbSNP, hg38 FASTA). But inside
44+
`chorus/core/base.py:322-328`:
45+
46+
real_pos = region_interval.ref2query(var_pos, ref_global=True)
47+
if region_interval[real_pos].upper() != ref_allele.upper():
48+
logger.warning(...)
49+
50+
the internal check reads the **next** base (`'T'`) instead of `'C'`. Same
51+
shift reproduces for the rs12740374 example: `chr1:109274968` is `G` by
52+
every external reference (and by `extract_sequence`), but
53+
`region_interval[ref2query(109274968, ref_global=True)]` returns `'T'`
54+
(the base at 109274969).
55+
56+
Consequence of the bug: every variant analysis in the shipped walkthroughs
57+
logs this warning even when the user provided the correct allele. More
58+
seriously, the reference allele substitution at line 330 happens at the
59+
shifted position, so the ref/alt predictions are computed at one base off
60+
the user's intended coordinate. Predicted effect *directions* still come
61+
out right (the walkthroughs all agree with Musunuru et al. 2010 for
62+
rs12740374), which is probably why this has been shipping unnoticed — the
63+
regulatory signal is coherent across ±1 bp in every case the library
64+
tests on.
65+
66+
Not fixing in this PR because it's a code-behaviour change (not a doc
67+
drift), and because any fix needs to be audited against what happens at
68+
indels and reverse-strand variants, not just SNVs. Best handled as its own
69+
focused PR.
70+
71+
## HTML reports screenshotted
72+
73+
Rendered at 1400×3000 via Chrome --headless and eye-reviewed. Nothing
74+
report-level was wrong:
75+
76+
- `walkthroughs/validation/SORT1_rs12740374_multioracle/rs12740374_SORT1_multioracle_report.html`
77+
- `walkthroughs/variant_analysis/SORT1_rs12740374/rs12740374_SORT1_alphagenome_report.html`
78+
- `walkthroughs/causal_prioritization/SORT1_locus/rs12740374_SORT1_locus_causal_report.html`
79+
- `walkthroughs/batch_scoring/batch_sort1_locus_scoring.html`
80+
- `walkthroughs/discovery/SORT1_cell_type_screen/chr1_109274968_G_T_SORT1_alphagenome_LNCaP_clone_FGC_report.html`
81+
- `walkthroughs/sequence_engineering/region_swap/region_swap_SORT1_K562_report.html`
82+
- `walkthroughs/validation/SORT1_rs12740374_with_CEBP/rs12740374_SORT1_CEBP_validation_report.html`
83+
84+
All display the glossary, formula chips, cross-layer tables, and
85+
(client-side JS) IGV browser placeholder correctly. The embedded IGV
86+
itself doesn't render in headless file:// mode — not a bug, it's a
87+
loader limitation for local previews.

examples/notebooks/advanced_multi_oracle_analysis.ipynb

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -27,7 +27,7 @@
2727
"> uses only Enformer and runs end-to-end with a single `chorus setup` call.\n",
2828
">\n",
2929
"> For **variant analysis** use-cases (the most common end-user task), the\n",
30-
"> [`examples/walkthroughs/`](applications/) folder has ready-to-run MCP + Claude\n",
30+
"> [`examples/walkthroughs/`](../walkthroughs/) folder has ready-to-run MCP + Claude\n",
3131
"> examples that do not require any notebook at all.\n"
3232
]
3333
},
@@ -206,7 +206,7 @@
206206
"id": "f9d58537",
207207
"metadata": {},
208208
"source": [
209-
"**Oracles** are the main objects in Chorus. They manages models that can make predictions on the genome. Let's start by downloading [Enformer](https://www.nature.com/articles/s41592-021-01252-x), a state-of-the-art sequence-to-activity models that can predict **5.313 human genomic tracks** with a context window of 196 kbp."
209+
"**Oracles** are the main objects in Chorus. They manages models that can make predictions on the genome. Let's start by downloading [Enformer](https://www.nature.com/articles/s41592-021-01252-x), a state-of-the-art sequence-to-activity models that can predict **5,313 human genomic tracks** with an input window of 393 kbp."
210210
]
211211
},
212212
{

examples/notebooks/comprehensive_oracle_showcase.ipynb

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -27,7 +27,7 @@
2727
"> uses only Enformer and runs end-to-end with a single `chorus setup` call.\n",
2828
">\n",
2929
"> For **variant analysis** use-cases (the most common end-user task), the\n",
30-
"> [`examples/walkthroughs/`](applications/) folder has ready-to-run MCP + Claude\n",
30+
"> [`examples/walkthroughs/`](../walkthroughs/) folder has ready-to-run MCP + Claude\n",
3131
"> examples that do not require any notebook at all.\n"
3232
]
3333
},
@@ -47,7 +47,7 @@
4747
"| **ChromBPNet** | TensorFlow | 2 kb | 1 bp | Per-model (DNASE, ATAC) |\n",
4848
"| **Sei** | PyTorch | 4 kb | 128 bp | 21,907 (histone marks, TF binding, sequence classes) |\n",
4949
"| **LegNet** | PyTorch | 230 bp | 1 bp | 1 (LentiMPRA) |\n",
50-
"| **AlphaGenome** | JAX | 1 Mb | 1 bp | 5,930 (DNASE, CAGE, CHIP, ATAC, RNA, splice sites) |\n",
50+
"| **AlphaGenome** | JAX | 1 Mb | 1 bp | 5,731 (DNASE, CAGE, CHIP, ATAC, RNA, PRO-CAP, splice sites) |\n",
5151
"\n",
5252
"**Operations demonstrated:**\n",
5353
"1. Wild-type prediction (`predict`)\n",
@@ -1052,7 +1052,7 @@
10521052
"---\n",
10531053
"## 6. AlphaGenome — JAX, 1 Mb context, single bp resolution\n",
10541054
"\n",
1055-
"AlphaGenome (Google DeepMind, Nature 2026) is the latest and most comprehensive genomics model. It predicts 5,930 tracks across 11 modalities at single base-pair resolution from up to 1 Mb of DNA sequence."
1055+
"AlphaGenome (Google DeepMind, Nature 2026) is the latest and most comprehensive genomics model. It predicts 5,731 tracks across 7 assay types at single base-pair resolution from up to 1 Mb of DNA sequence."
10561056
]
10571057
},
10581058
{
@@ -2228,7 +2228,7 @@
22282228
],
22292229
"source": [
22302230
"# Step 1: Select biologically appropriate hepatocyte tracks from AlphaGenome\n",
2231-
"# AlphaGenome has 5,930 tracks — we need liver-relevant ones for an LDL cholesterol variant\n",
2231+
"# AlphaGenome has 5,731 tracks — we need liver-relevant ones for an LDL cholesterol variant\n",
22322232
"\n",
22332233
"ag_hepatocyte_tracks = {\n",
22342234
" # Layer 1: Chromatin accessibility\n",

0 commit comments

Comments
 (0)