|
| 1 | +# v16 notebook + HTML report walkthrough — 2026-04-21 |
| 2 | + |
| 3 | +Walked the three `examples/notebooks/*.ipynb` cell-by-cell (every cell's |
| 4 | +source **and** execution output) and screenshotted eight of the shipped |
| 5 | +`examples/walkthroughs/**/*.html` reports via headless Chrome. Goal: catch |
| 6 | +anything a new user would see that is inconsistent, wrong, or confusing. |
| 7 | + |
| 8 | +## Fixed in this PR |
| 9 | + |
| 10 | +1. **Broken link in both multi-oracle notebooks.** Cell 0 of |
| 11 | + `comprehensive_oracle_showcase.ipynb` and |
| 12 | + `advanced_multi_oracle_analysis.ipynb` said |
| 13 | + `[`examples/walkthroughs/`](applications/)`. The label was correct |
| 14 | + but the URL still pointed at the old `applications/` directory (renamed |
| 15 | + to `walkthroughs/` in `340f30e`). Fixed to `../walkthroughs/`. |
| 16 | +2. **AlphaGenome track count drift.** Markdown in |
| 17 | + `comprehensive_oracle_showcase.ipynb` claimed "5,930 tracks across 11 |
| 18 | + modalities" in three places (overview table, section 6 preamble, and a |
| 19 | + code comment in Operation 8). The running notebook prints |
| 20 | + `Loaded 5731 AlphaGenome tracks` and `Assay types: 7`. Fixed to |
| 21 | + "5,731 tracks" / "7 assay types" everywhere, and added PRO-CAP to the |
| 22 | + assay list in the overview table so it matches what `list_assay_types()` |
| 23 | + actually returns (`ATAC, CAGE, CHIP, DNASE, PRO_CAP, RNA, SPLICE_SITES`). |
| 24 | +3. **Enformer context-window typo.** `advanced_multi_oracle_analysis.ipynb` |
| 25 | + cell 9 called Enformer "a state-of-the-art sequence-to-activity model |
| 26 | + that can predict **5.313 human genomic tracks** with a context window |
| 27 | + of 196 kbp". The library reports `oracle.sequence_length == 393216` — |
| 28 | + matches the comprehensive notebook's summary table (393 kb input). Also |
| 29 | + fixed the decimal separator (`5.313` → `5,313`). Changed to "393 kbp". |
| 30 | + |
| 31 | +## Not fixed here — flagged for a follow-up |
| 32 | + |
| 33 | +**Off-by-one in `predict_variant_effect`'s ref-allele check.** Both |
| 34 | +`single_oracle_quickstart.ipynb` (cell 39) and |
| 35 | +`comprehensive_oracle_showcase.ipynb` (cell 35) ship an execution output |
| 36 | +that contains |
| 37 | + |
| 38 | + WARNING - Provided reference allele 'C' does not match the genome at |
| 39 | + this position ('T'). Chorus will use the provided allele. |
| 40 | + |
| 41 | +for the test variant at `chrX:48786129`. The notebook code pulls the ref |
| 42 | +base via `extract_sequence('chrX:48786129-48786129')` which returns `'C'` |
| 43 | +(matches every external reference — UCSC, dbSNP, hg38 FASTA). But inside |
| 44 | +`chorus/core/base.py:322-328`: |
| 45 | + |
| 46 | + real_pos = region_interval.ref2query(var_pos, ref_global=True) |
| 47 | + if region_interval[real_pos].upper() != ref_allele.upper(): |
| 48 | + logger.warning(...) |
| 49 | + |
| 50 | +the internal check reads the **next** base (`'T'`) instead of `'C'`. Same |
| 51 | +shift reproduces for the rs12740374 example: `chr1:109274968` is `G` by |
| 52 | +every external reference (and by `extract_sequence`), but |
| 53 | +`region_interval[ref2query(109274968, ref_global=True)]` returns `'T'` |
| 54 | +(the base at 109274969). |
| 55 | + |
| 56 | +Consequence of the bug: every variant analysis in the shipped walkthroughs |
| 57 | +logs this warning even when the user provided the correct allele. More |
| 58 | +seriously, the reference allele substitution at line 330 happens at the |
| 59 | +shifted position, so the ref/alt predictions are computed at one base off |
| 60 | +the user's intended coordinate. Predicted effect *directions* still come |
| 61 | +out right (the walkthroughs all agree with Musunuru et al. 2010 for |
| 62 | +rs12740374), which is probably why this has been shipping unnoticed — the |
| 63 | +regulatory signal is coherent across ±1 bp in every case the library |
| 64 | +tests on. |
| 65 | + |
| 66 | +Not fixing in this PR because it's a code-behaviour change (not a doc |
| 67 | +drift), and because any fix needs to be audited against what happens at |
| 68 | +indels and reverse-strand variants, not just SNVs. Best handled as its own |
| 69 | +focused PR. |
| 70 | + |
| 71 | +## HTML reports screenshotted |
| 72 | + |
| 73 | +Rendered at 1400×3000 via Chrome --headless and eye-reviewed. Nothing |
| 74 | +report-level was wrong: |
| 75 | + |
| 76 | +- `walkthroughs/validation/SORT1_rs12740374_multioracle/rs12740374_SORT1_multioracle_report.html` |
| 77 | +- `walkthroughs/variant_analysis/SORT1_rs12740374/rs12740374_SORT1_alphagenome_report.html` |
| 78 | +- `walkthroughs/causal_prioritization/SORT1_locus/rs12740374_SORT1_locus_causal_report.html` |
| 79 | +- `walkthroughs/batch_scoring/batch_sort1_locus_scoring.html` |
| 80 | +- `walkthroughs/discovery/SORT1_cell_type_screen/chr1_109274968_G_T_SORT1_alphagenome_LNCaP_clone_FGC_report.html` |
| 81 | +- `walkthroughs/sequence_engineering/region_swap/region_swap_SORT1_K562_report.html` |
| 82 | +- `walkthroughs/validation/SORT1_rs12740374_with_CEBP/rs12740374_SORT1_CEBP_validation_report.html` |
| 83 | + |
| 84 | +All display the glossary, formula chips, cross-layer tables, and |
| 85 | +(client-side JS) IGV browser placeholder correctly. The embedded IGV |
| 86 | +itself doesn't render in headless file:// mode — not a bug, it's a |
| 87 | +loader limitation for local previews. |
0 commit comments