MSK 30polyKO (669M reads, 3 libraries, ~38K genes + 30 gRNA features + 245,979 LARRY barcodes)
| Phase | Start | End | Duration |
|---|---|---|---|
| Genome load | 07:29:39 | 07:30:27 | 48s |
| Feature assignment | 07:30:27 | 07:50:17 | 19m 50s |
| Mapping | 07:30:27 | 07:45:09 | 14m 42s |
| Solo counting | 07:45:10 | 07:54:39 | 9m 29s |
| PfMulti merge + CRISPR calling | 07:54:39 | 07:56:38 | 1m 59s |
Note: Feature assignment and mapping run concurrently via dynamicThreadInterface.
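Because two phases overlap, the per-phase durations add up to more than the elapsed time. The sketch below recomputes both figures from the start and end times in the timing table; the variable and function names are illustrative and not part of the benchmark scripts.

```python
from datetime import datetime

# Start/end times copied from the MSK 30polyKO timing table.
PHASES = {
    "Genome load": ("07:29:39", "07:30:27"),
    "Feature assignment": ("07:30:27", "07:50:17"),
    "Mapping": ("07:30:27", "07:45:09"),
    "Solo counting": ("07:45:10", "07:54:39"),
    "PfMulti merge + CRISPR calling": ("07:54:39", "07:56:38"),
}


def seconds(start, end):
    """Elapsed seconds between two HH:MM:SS stamps on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


durations = {name: seconds(s, e) for name, (s, e) in PHASES.items()}

# Sum of per-phase durations (counts overlapping time more than once).
summed = sum(durations.values())

# Span from the earliest start to the latest end of the listed phases.
first_start = min(s for s, _ in PHASES.values())
last_end = max(e for _, e in PHASES.values())
span = seconds(first_start, last_end)

# Time during which more than one listed phase was active.
overlap = summed - span

print(f"summed={summed}s span={span}s overlap={overlap}s")
```

The span covers only the phases listed in the table, so it need not match a wall time measured around the whole invocation.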
Parity vs CellRanger 9
All parity metrics computed with scripts/report_additional_parity_metrics.py
using --gene-corr-min-counts 20 --gene-corr-min-cells-pct 0.01 per
docs/PAPER_BENCHMARK_METHODOLOGY.md. CR9 references use
refdata-gex-GRCh38-2024-A (gencode v44, mkref 8.0.0).
| Dataset | Cells (STAR / CR) | Jaccard | Gene Pearson | Cell Pearson | CRISPR match | UMI Pearson |
|---|---|---|---|---|---|---|
| A375 | 1,187 / 1,162 | 0.976 | 0.975 (15,673 genes) | 0.9995 (1,160 BCs) | 100% (1,083/1,083) | 1.000 |
| UCSF EBs2_2 | 13,721 / 13,760 | 0.976 | 0.995 (18,061 genes) | 1.000 (13,571 BCs) | 98.9% (11,902/12,038) | 0.999 |
| MSK 30polyKO | 30,567 / 32,256 | 0.942 | 0.994 (17,460 genes) | 1.000 (30,481 BCs) | 99.4% (23,210/23,341) | 1.000 |
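The Jaccard column can be cross-checked from the cell counts alone. The sketch below assumes that the barcode count shown in the Cell Pearson column is the number of barcodes common to both cell sets; this is an interpretation of the table, not something the parity script is known to guarantee. The function name is illustrative.

```python
def jaccard_from_counts(n_a, n_b, n_common):
    """Jaccard index |A ∩ B| / |A ∪ B| computed from set sizes."""
    union = n_a + n_b - n_common
    return n_common / union if union else 0.0


# (STAR cells, CR cells, common barcodes) taken from the parity table.
# Assumption: "BCs" in the Cell Pearson column = size of the intersection.
DATASETS = {
    "A375": (1187, 1162, 1160),
    "UCSF EBs2_2": (13721, 13760, 13571),
    "MSK 30polyKO": (30567, 32256, 30481),
}

jaccards = {
    name: round(jaccard_from_counts(*counts), 3)
    for name, counts in DATASETS.items()
}

for name, value in jaccards.items():
    print(f"{name}: {value:.3f}")
```

Under that assumption the recomputed values agree with the reported 0.976, 0.976, and 0.942.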
CR9 Reference Runs
| Dataset | CR9 run ID | Reference | CR9 wall time |
|---|---|---|---|
| A375 | 1k_CRISPR_5p_gemx_count_refmatch_2024a_fullraw | refdata-gex-GRCh38-2024-A | ~15 min |
| UCSF EBs2_2 | cr9_ebs2_2 (run 2026-03-18) | refdata-gex-GRCh38-2024-A | 58 min |
| MSK 30polyKO | cr9_starindex_grna (GEX+gRNA only) | refdata-gex-GRCh38-autoindex11044 | ~58+110 min |
Full parity output files: {a375,ucsf_ebs2_2,msk_30polyko}/parity_vs_cr9.txt.
Regression Check vs Previous Benchmarks (2026-03-17)
A375: No regression
| Metric | Old (Mar 17) | New (Mar 18) |
|---|---|---|
| Wall time | 245s | 241s (-2%) |
| Cells | 1,187 | 1,187 |
| Filtered MEX delta (old vs new) | n/a | 1 count / 24.4M total |
| CRISPR calls (old vs new) | n/a | 1187/1188 lines identical (1 off by 1 UMI) |
| GEX vs CR ref (filtered, common BCs) | delta 19,284 | delta 19,284 (identical) |
| CRISPR vs CR ref (filtered, common BCs) | delta 48,574 | delta 48,575 (off by 1) |
MSK 30polyKO: Major CRISPR improvement (namespace fix)
| Metric | Old (Mar 17) | New (Mar 18) |
|---|---|---|
| Wall time | 2516s (with BAM) | 1656s (no BAM) |
| Cells | 30,520 | 30,567 (+47, ED variance) |
| CRISPR: cells with 0 molecules | 28,550 (93.5%) | 233 (0.8%) |
| CRISPR: cells with 1+ call | 691 | 23,546 |
| Filtered MEX barcodes (old vs new) | 30,520 | 30,567 |
| Common barcodes | n/a | 30,499 |
| Total feature count delta (old vs new) | n/a | +5.0M (from recovered CRISPR) |
The old run suffered from the NXT namespace bug: gRNA barcodes were matched in
the wrong namespace, causing 93.5% of cells to show zero CRISPR guide molecules.
The new run, with per-library whitelist support and deterministic namespace
normalization, correctly assigns guides to 23,546 cells.
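The effect of a namespace mismatch can be illustrated with a toy example. The sketch below assumes the normalization amounts to a per-barcode lookup from the NXT namespace to the TRU namespace. The translation table, barcode sequences, and function names are all hypothetical and do not reflect the actual assignBarcodes or PfMultiMerge code.

```python
# Hypothetical translation table: NXT-namespace barcode -> TRU-namespace barcode.
# The sequences are made up for illustration only.
NXT_TO_TRU = {
    "AAACCCAT": "AAACCCAA",
    "AAACGGGT": "AAACGGGA",
    "TTTGTTGC": "TTTGTTGA",
}

# GEX cells are reported in the TRU namespace; gRNA UMIs arrive keyed by NXT barcodes.
gex_cells = {"AAACCCAA", "AAACGGGA", "TTTGTTGA"}
grna_umis = {"AAACCCAT": 12, "AAACGGGT": 7, "TTTGTTGC": 3}


def cells_with_guides(cells, umis_by_barcode, translation=None):
    """Return the cells that have at least one guide UMI after optional normalization."""
    translation = translation or {}
    merged = {}
    for barcode, n_umis in umis_by_barcode.items():
        # Map into the target namespace; unknown barcodes pass through unchanged.
        key = translation.get(barcode, barcode)
        merged[key] = merged.get(key, 0) + n_umis
    return {bc for bc in cells if merged.get(bc, 0) > 0}


without_fix = cells_with_guides(gex_cells, grna_umis)
with_fix = cells_with_guides(gex_cells, grna_umis, NXT_TO_TRU)

print(len(without_fix), "cells with guides before normalization")
print(len(with_fix), "cells with guides after normalization")
```

Without the lookup no gRNA barcode lands on a GEX cell, which mirrors the symptom of most cells showing zero guide molecules.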
UCSF EBs2_2: First run on this branch (no prior baseline)
Establishes the baseline for 2-library NXT perturb-seq with CRISPRa v2 guides.
Solo GEX Statistics
A375
| Metric | Value |
|---|---|
| Reads | 47,095,182 |
| Valid barcodes | 89.9% |
| Sequencing saturation | 21.9% |
| Uniquely mapped | 74.1% |
| Cells | 1,187 |
| Median UMI/cell | 17,885 |
| Median genes/cell | 5,562 |
UCSF EBs2_2
| Metric | Value |
|---|---|
| Reads | 444,896,731 |
| Valid barcodes | 97.6% |
| Sequencing saturation | 29.8% |
| Uniquely mapped | 96.6% |
| Cells | 13,721 |
| Median UMI/cell | 15,431 |
| Median genes/cell | 5,223 |
MSK 30polyKO
| Metric | Value |
|---|---|
| Reads | 668,705,043 |
| Valid barcodes | 95.8% |
| Sequencing saturation | 31.8% |
| Uniquely mapped | 93.3% |
| Cells | 30,567 |
| Median UMI/cell | 8,408 |
| Median genes/cell | 3,933 |
CRISPR GMM Calling Summary
A375 (11 guides, minUMI=10)
| Metric | Value |
|---|---|
| Total cells | 1,187 |
| Cells with 0 molecules | 16 |
| Cells with no call | 85 |
| Cells with 1 feature | 1,051 |
| Cells with >1 features | 35 |
UCSF EBs2_2 (548 CRISPRa guides, minUMI=3)
| Metric | Value |
|---|---|
| Total cells | 13,721 |
| Cells with 0 molecules | 435 |
| Cells with no call | 1,191 |
| Cells with 1 feature | 1,183 |
| Cells with >1 features | 10,912 |
MSK 30polyKO (30 gRNA guides, minUMI=2)
| Metric | Value |
|---|---|
| Total cells | 30,567 |
| Cells with 0 molecules | 233 |
| Cells with no call | 6,788 |
| Cells with 1 feature | 20,972 |
| Cells with >1 features | 2,574 |
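The four summary rows appear to partition the cells: in each dataset they add up to the total. A minimal sketch of such a tally follows, assuming each cell carries a total guide UMI count and a list of called features. The input layout and function name are hypothetical; the GMM calling itself is not reproduced here.

```python
from collections import Counter


def summarize_calls(cells):
    """Tally cells into the four summary categories.

    cells: dict mapping barcode -> (total guide UMIs, list of called features).
    """
    tally = Counter()
    for total_umis, calls in cells.values():
        if total_umis == 0:
            tally["Cells with 0 molecules"] += 1
        elif not calls:
            tally["Cells with no call"] += 1
        elif len(calls) == 1:
            tally["Cells with 1 feature"] += 1
        else:
            tally["Cells with >1 features"] += 1
    return tally


# Toy input, not real data.
toy = {
    "BC1": (0, []),
    "BC2": (4, []),
    "BC3": (25, ["guideA"]),
    "BC4": (40, ["guideA", "guideB"]),
}
toy_summary = summarize_calls(toy)
print(dict(toy_summary))

# Consistency check on the reported summaries: categories sum to total cells.
REPORTED = {
    "A375": (1187, [16, 85, 1051, 35]),
    "UCSF EBs2_2": (13721, [435, 1191, 1183, 10912]),
    "MSK 30polyKO": (30567, [233, 6788, 20972, 2574]),
}
consistent = {name: sum(parts) == total for name, (total, parts) in REPORTED.items()}
print(consistent)
```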
Key Code Changes Since Previous Benchmarks
1. Namespace/whitelist correctness (primary): Per-library star_whitelist in pfMultiConfig, deterministic NXT→TRU normalization in assignBarcodes and PfMultiMerge. Fixes the MSK gRNA zero-molecule bug.
2. PfMultiMerge streaming optimization (secondary): Direct gzip writes via gzwrite replacing the write-plaintext-then-recompress cycle. Vector-based O(1) barcode remapping replacing std::map. unordered_map for barcode lookups. Zero-count feature pruning.
3. assignBarcodes fastHamming hardening: pf_hamming_search_fasthamming brute-force fallback only activates for maxHammingDistance > 2, preventing performance regression on standard prehash tiers.
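The streaming idea can be sketched in Python, as an analogue rather than the C++ implementation: matrix entries are written straight into a gzip stream, and barcode indices are remapped through a list indexed by the old position instead of a tree-based map. The function name, the 0-means-dropped convention, and the input layout are assumptions made for illustration.

```python
import gzip
import io


def write_remapped_mtx_gz(out, n_features, remap, triplets):
    """Write a MatrixMarket matrix directly into a gzip stream.

    out:      binary file-like object to receive the compressed bytes
    remap:    list where remap[old_idx] is the new 1-based column, or 0 to drop
    triplets: iterable of (feature_row, old_barcode_idx, count)
    """
    # List indexing gives constant-time remapping per entry.
    kept = [(f, remap[b], c) for f, b, c in triplets if remap[b] != 0]
    n_barcodes = sum(1 for r in remap if r != 0)

    # Compress while writing; no intermediate plaintext file is produced.
    with gzip.GzipFile(fileobj=out, mode="wb") as gz:
        gz.write(b"%%MatrixMarket matrix coordinate integer general\n")
        gz.write(f"{n_features} {n_barcodes} {len(kept)}\n".encode())
        for feature, barcode, count in kept:
            gz.write(f"{feature} {barcode} {count}\n".encode())


# Toy input: barcode at old index 1 is dropped, the others are renumbered.
remap = [1, 0, 2, 3]
triplets = [(1, 0, 5), (2, 1, 3), (1, 2, 7), (3, 3, 1)]

buffer = io.BytesIO()
write_remapped_mtx_gz(buffer, n_features=3, remap=remap, triplets=triplets)
text = gzip.decompress(buffer.getvalue()).decode()
print(text)
```

The entry count is needed for the MatrixMarket size line, so this sketch collects the kept entries first; only the compression step is streamed.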
PE Bulk Benchmark (Integrated STAR-suite vs External Stepwise Pipeline)
Benchmark script: scripts/paper/run_pe_bulk_feature_benchmark.sh
Dataset: JAX PE (21033-09-01-13-01_S1_L007), full sample on /storage, 32 threads.
STAR index: /storage/autoindex_110_44/bulk_index
With Y-removal (2026-03-10)
| Step | Integrated | External |
|---|---|---|
| STAR (trim + align + Y-split + TranscriptVB) | 29.20s | n/a |
| Salmon QC | 31.51s | 31.03s |
| Decompress | n/a | 13.44s |
| Trimvalidate | n/a | 16.89s |
| STAR align | n/a | 49.94s |
| remove_y_reads | n/a | 14.03s |
| Total | 60.71s | 125.33s |
| Speedup | 2.1x | n/a |
Without Y-removal (2026-03-18)
| Step | Integrated | External |
|---|---|---|
| STAR (trim + align + TranscriptVB) | 5.69s | n/a |
| Salmon QC | 31.00s | 30.49s |
| Decompress | n/a | 12.79s |
| Trimvalidate | n/a | 14.36s |
| STAR align | n/a | 29.65s |
| Total | 36.69s | 87.29s |
| Speedup | 2.4x | n/a |
Note: The low 5.69s integrated STAR time in the no-Y run reflects a warm page cache
(genome loaded by the preceding downsampled stage). The Y-removal run's 29.20s
is a cold-cache measurement and is more representative for single-invocation use.
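The totals and speedups follow directly from the per-step timings. The short sketch below recomputes them from the two PE bulk tables; the dictionary layout and function names are illustrative.

```python
# Per-step timings in seconds, copied from the two PE bulk benchmark tables.
RUNS = {
    "with Y-removal": {
        "integrated": {"STAR": 29.20, "Salmon QC": 31.51},
        "external": {
            "Salmon QC": 31.03,
            "Decompress": 13.44,
            "Trimvalidate": 16.89,
            "STAR align": 49.94,
            "remove_y_reads": 14.03,
        },
    },
    "without Y-removal": {
        "integrated": {"STAR": 5.69, "Salmon QC": 31.00},
        "external": {
            "Salmon QC": 30.49,
            "Decompress": 12.79,
            "Trimvalidate": 14.36,
            "STAR align": 29.65,
        },
    },
}


def total(steps):
    """Sum of step timings, rounded to the precision used in the tables."""
    return round(sum(steps.values()), 2)


def speedup(run):
    """External total divided by integrated total, rounded to one decimal."""
    return round(total(run["external"]) / total(run["integrated"]), 1)


results = {
    name: (total(run["integrated"]), total(run["external"]), speedup(run))
    for name, run in RUNS.items()
}

for name, (integrated, external, ratio) in results.items():
    print(f"{name}: integrated={integrated}s external={external}s speedup={ratio}x")
```

Because the no-Y integrated STAR step ran with a warm page cache, the 2.4x figure is sensitive to that one timing.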
Each subdirectory contains a RUN_COMMAND.sh with the exact STAR invocation.
The benchmark scripts are also included for the full wrapper (FASTQ discovery,
multi-config generation, output validation).
Build: make -C core/legacy/source clean && make -C core/legacy/source -j8 STAR
Branch: multi-feature at commit used for this run.