STAR Suite updates the original STAR aligner by integrating four modules — STAR-perturb, STAR-Flex, STAR-SLAM, and TranscriptVB — to provide complete internal C/C++ pipelines for bulk RNA-seq, scRNA-seq, Perturb-seq, 10x Flex, and SLAM-seq. The integration results in substantial speedups (1.7–2.4x for bulk RNA-seq, 1.5–2.0x for scRNA-seq GEX-only Solo vs CellGENI-style STARsolo, 3.2–6.1x for Perturb-seq, 2.5–28.8x for Flex) and a simplified toolchain that can be installed through pre-compiled binaries for researchers and agents. No new external dependencies are required; the suite is built entirely with the existing STAR toolchain and vendored components. This is a drop-in replacement for the STAR aligner.
STAR Suite supports partial compilation: build only the module/tool targets you need instead of building the full suite every time.
Agent quickstart: see AGENTS.md for repo-specific guardrails, tests, and recent changes.
- Speedup: Bulk RNA-seq 1.7–2.4x faster than external stepwise pipelines; scRNA-seq GEX-only Solo 1.5–2.0x faster than the CellGENI-style STARsolo parameter surface (UCSF 14K cells 1.95x, MSK 30K cells 1.52x); Perturb-seq 3.2–6.1x faster than Cell Ranger 9; Flex 2.5x faster than Cell Ranger 9 (5.7x in no-align mode, ~12.8–28.8x vs Cell Ranger 7 with BAM) — all with near-identical parity.
- Batch Mode (`--batchMode 1`): Processes multiple FASTQs in one STAR invocation while reusing the loaded genome. Removes the need for `--genomeLoad` keep-in-memory workflows. Single-pass only (no `--twopassMode`); not supported with Solo (`--soloType`). Use `--outFileNamePrefixAuto 1` for per-sample subdirectories.
- TranscriptVB Quantification (`--quantMode TranscriptVB`): Variational Bayes and EM quantification for transcript-level abundance, with parity-oriented behavior against Salmon alignment-mode. Gene-level summarization via `--quantVBgenesMode Tximport`.
- Transcriptome Output (`--quantTranscriptomeSAMoutput`): Replaces the former `--quantTranscriptomeBan` with more explicit control (e.g., `BanSingleEnd_ExtendSoftclip`).
- Reference Automation (`--autoIndex Yes`): Automated reference download/build with `--cellrangerStyleIndex Yes` formatting and `--genomeGenerateTranscriptome Yes` for transcript-level quant workflows.
- Native Gzip FASTQ Handling: Automatic detection of `.gz` FASTQ inputs with internal zlib streaming; no `--readFilesCommand zcat` is needed for correctness. Performance tuning of the internal gzip path is currently specific to Flex, where it is faster than external `zcat`; on the other benchmarked alignment/Solo surfaces below, external `zcat` remains the faster validated path. Legacy external helper remains available via `--readFilesLegacyZcat Yes`.
- Cutadapt-Compatible Trimming (`--trimCutadapt Yes`): Native cutadapt-style trimming for bulk/PE workflows. Compatibility mode: `--trimCutadaptCompat Cutadapt3`.
- Poly-G Trimming (`--clip3pPolyG yes|no|auto`): Trims poly-G artifacts common on NovaSeq/NextSeq platforms. Default `auto` activates in CellRanger4 mode. Without this, poly-G reads can inflate specific genes (e.g., LINC00486) and degrade gene-level correlations.
- Samtools-style BAM Sorting (`--outBAMsortMethod samtools`): Spill-to-disk sort to reduce peak RAM pressure. Works with all modes including Flex.
- Y/NoY Separation (`--emitNoYBAM yes`, `--emitYNoYFastq yes`): Split BAM and FASTQ outputs by chrY alignment. Works with bulk, single-cell, and Flex.
- EmptyDrops_CR Integration: CR-compatible EmptyDrops path (including libscrna-backed behavior in scRNA/perturb flows).
- Solo Features: `sF` BAM tag for feature type, `--soloCBtype String` for arbitrary barcode strings, `--soloCellReadStats Standard` for improved cell filtering.
- CR-compat GEX (`--soloCrGexFeature auto|gene|genefull`): Controls which GEX source is merged in CR-compat mode.
- CB/UB Tag Pairing (`--soloCbUbRequireTogether yes|no`): Enforce CB/UB tag pairing for tag injection (default `yes`).
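The native gzip handling listed above detects compressed input automatically. The source does not describe how that detection works internally, so the following is only a minimal sketch of one common approach: checking for the gzip magic bytes `1f 8b` at the start of the file. The helper name `is_gzip` and the throwaway files are hypothetical and are not part of STAR Suite.

```shell
# Hypothetical helper (not a STAR Suite tool): succeeds when the file
# begins with the two gzip magic bytes 0x1f 0x8b.
is_gzip() {
  [ "$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')" = "1f8b" ]
}

# Demo on throwaway files: one plain FASTQ record and its gzip copy.
tmpdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' > "$tmpdir/plain.fq"
gzip -c "$tmpdir/plain.fq" > "$tmpdir/reads.fq.gz"

for f in "$tmpdir/plain.fq" "$tmpdir/reads.fq.gz"; do
  if is_gzip "$f"; then
    echo "$(basename "$f"): gzip"
  else
    echo "$(basename "$f"): plain"
  fi
done
```

Checking content rather than the file extension also catches compressed files that were renamed without `.gz`; whether STAR Suite relies on content, extension, or both is not stated here.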
core/
  legacy/               # Upstream STAR layout (single source of truth)
  features/             # Shared overlays and feature tooling
    process_features/   # Perturb feature extraction/calling implementation
    feature_barcodes/   # Standalone barcode tools (assignBarcodes, demux)
    libscrna/           # EmptyDrops/OrdMag/Occupancy shared library
flex/                   # Flex-specific code + tools
slam/                   # SLAM-seq code + tools
build/                  # Modular make fragments
scripts/                # Suite-level helper scripts (see scripts/README.md)
docs/                   # Suite-level docs
tests/                  # Suite-level tests (see tests/ARTIFACTS.md for artifact locations)
tools/                  # Suite-level scripts/utilities
mcp_server/             # MCP server for scripted discovery/preflight/run workflows
- STAR-core (`core/`): Legacy STAR (indexing, bulk, Solo) plus shared utilities. Build: `make core` (binary at `core/legacy/source/STAR`).
- STAR-perturb (`core/legacy/` + `core/features/process_features/`): CR-compatible perturb-seq path with integrated feature extraction/calling (`process_features` + `call_features`) and `crispr_analysis/` outputs in CR-compat mode. Primary run path: `STAR --pfMultiConfig ... --defaultCrCompat yes` (see STAR-perturb section below).
- STAR-Flex (`flex/`): FlexFilter pipeline and Flex-specific integrations. Build tools: `make flex` or `make flex-tools`.
- STAR-SLAM (`slam/`): SLAM-seq quantification, SNP masking, trimming/QC. Build tools: `make slam` or `make slam-tools`.
- Feature Barcodes (`core/features/feature_barcodes/`): Standalone barcode tools (`assignBarcodes`, `demux_bam`, `demux_fastq`) for perturb-seq testing. Build tools: `make feature-barcodes-tools`.
- Process Features (`core/features/process_features/`): Full feature extraction/calling pipeline (`assignBarcodes`, `call_features`, `demux_bam`, `demux_fastq`) and standalone tool (`star_feature_call`). Build tools: `make process-features-tools`, `make star-feature-call`.
- Shared Feature Toolchains (`core/features/`): Reusable tool layers used across modules, including `vbem` (TranscriptVB helpers), `yremove_*` (Y/noY splitting), `bamsort`, and `libscrna`. Build tools: `make vbem-tools`, `make yremove-tools`, plus in-core integrations.
- MCP Server (tooling) (`mcp_server/`): Agent automation service for dataset/test discovery and controlled execution (`list_datasets`, `list_test_suites`, `preflight`, `run_script`, `collect_outputs`). This is repo tooling, not an analysis module.
- Helper Scripts (`scripts/`): Standalone Python and Bash tools for FASTQ preflight, QC, parity benchmarking, downstream h5ad processing, and fixture management. These are not compiled into STAR; they run independently. Highlights include `preflight_library_pairing.py` (chemistry detection and library pairing for mislabeled Perturb-seq), `report_additional_parity_metrics.py` (STAR vs CR parity), and `build_gene_full_velocyto_h5ad.py` (Velocyto h5ad packaging). See `scripts/README.md` for the full catalogue.
All benchmarks run on pikachu (AMD, 32 threads, 128 GB RAM, NVMe SSD). The table below keeps the README focused on the headline results. Publication-facing wrappers live in publications/benchmarks/README.md, archived benchmark artifacts live in comparisons/paper_benchmarks_20260318/README.md, and detailed Velocyto bridge results live in docs/VELOCYTO_BENCHMARKS.md.
| Workflow | Dataset / surface | Baseline | STAR-suite result | Key parity / note |
|---|---|---|---|---|
| Bulk RNA-seq | JAX PE 6.5M | External stepwise (Trim Galore + STAR + Salmon) | 37 s without Y-removal, 61 s with Y-removal; 2.4x / 2.1x faster | Transcript Pearson 0.995, gene Pearson 0.997 vs Salmon |
| Bulk RNA-seq | PPARG PE 35.1M | External stepwise (Trim Galore + STAR + Salmon) | 9m 35s without Y-removal, 11m 58s with Y-removal; 1.7x / 2.1x faster | Same integrated trim + align + TranscriptVB path |
| scRNA-seq Solo | UCSF EBs2_2 GEX-only | Historical CellGENI-style STARsolo (7a7fb08) | 13.75 min optimized zcat vs 26.8 min historical; 1.95x faster | 13,723 cells, gene Pearson 0.994885 vs CR9; current zcat surface is the best validated UCSF run |
| scRNA-seq Solo | MSK 30polyKO GEX-only | Historical CellGENI-style STARsolo (7a7fb08) | 19.40 min modern vs 29.52 min historical; 1.52x faster | 30,562 cells, gene Pearson 0.994575 vs CR9 |
| Perturb-seq | A375 1k CRISPR 5' GemX | Cell Ranger 9 | 4.0 min; 3.8x faster | Jaccard 0.976, gene Pearson 0.975, CRISPR match 100% |
| Perturb-seq | UCSF EBs2_2 | Cell Ranger 9 | 16.4 min; 3.7x faster | Jaccard 0.976, gene Pearson 0.995, CRISPR match 98.9% |
| Perturb-seq | MSK 30polyKO | Cell Ranger 9 (separate GEX+gRNA and GEX+LARRY runs) | 25.0 min; 6.7x faster | Jaccard 0.942, gene Pearson 0.994, CRISPR match 99.4% |
| Flex | JAX SC2300771 full / no-align | Cell Ranger 9 / 7 | 23m 30s full (2.5x vs CR9), 10m 26s no-align (5.7x vs CR9) | Mean Jaccard 0.981, cell Pearson 0.99997, gene Pearson 0.99993 vs CR9 |
| SLAM-seq | NW-5-21 ARID1A compat mode | GEDI / GRAND-SLAM family | Integrated single-pass alignment + quantification; no apples-to-apples end-to-end wall-time claim reported | NTR Pearson 0.967-0.978, Spearman 0.985-0.990 vs GEDI |
Perturb-seq is the main performance result: on the A375, UCSF, and MSK surfaces in the table above, STAR-suite runs 3.7x-6.7x faster than Cell Ranger 9 while maintaining near-identical GEX/cell metrics and 98.9-100% CRISPR call agreement.
For non-Flex Solo, the README now summarizes only the historical CellGENI-style
baseline versus the current optimized surface. On this host, external zcat
remains the fastest validated read path for UCSF/MSK GEX-only and perturb runs;
native .gz input is functional but not yet the fastest on those surfaces.
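The Solo speedups quoted in the table are the ratio of the historical wall time to the current wall time. The small sketch below recomputes them from the minutes reported above; the function name `speedup` is illustrative only.

```shell
# speedup = baseline wall time / current wall time (both in minutes).
speedup() {
  LC_ALL=C awk -v base="$1" -v cur="$2" 'BEGIN { printf "%.2f\n", base / cur }'
}

speedup 26.8 13.75    # UCSF EBs2_2 GEX-only   -> 1.95
speedup 29.52 19.40   # MSK 30polyKO GEX-only  -> 1.52
```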
All perturb parity metrics above were computed with
scripts/report_additional_parity_metrics.py --gene-corr-min-counts 20 --gene-corr-min-cells-pct 0.01
per docs/PAPER_BENCHMARK_METHODOLOGY.md. CR9 references use
refdata-gex-GRCh38-2024-A unless noted otherwise in the archived benchmark
artifacts.
# Core STAR binary
make core
# Module-focused builds
make flex # core + Flex tools
make slam # core + SLAM tools
# Individual tool targets
make feature-barcodes-tools # assignBarcodes/demux (standalone)
make process-features-tools # full process_features pipeline
make star-feature-call # standalone feature caller
make vbem-tools # TranscriptVB helpers
make yremove-tools # Y/noY splitting tools
# Default build (core + common tools)
make # or: make default
# Build everything
make all

Selective filtering:

make default INCLUDE="core slam-tools"
make default EXCLUDE="flex-tools"

Run `make help` to see the full target list and descriptions.
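When selective builds are scripted, it can help to check target names before invoking make. The sketch below is a hypothetical wrapper, not part of the repository: it validates names against the targets mentioned in this README and prints the resulting command instead of running it. Whether every listed target is accepted by `INCLUDE` is an assumption; `make help` remains the authoritative list.

```shell
# Target names taken from this README; acceptance by INCLUDE is assumed.
KNOWN_TARGETS="core flex-tools slam-tools feature-barcodes-tools process-features-tools star-feature-call vbem-tools yremove-tools"

# Print (do not run) a selective build command for the given targets.
selective_build_cmd() {
  for t in "$@"; do
    case " $KNOWN_TARGETS " in
      *" $t "*) ;;  # known target, keep going
      *) echo "unknown target: $t" >&2; return 1 ;;
    esac
  done
  printf 'make default INCLUDE="%s"\n' "$*"
}

selective_build_cmd core slam-tools
```

Printing the command first keeps the wrapper safe to try outside a checkout; piping the output to `sh` from the repository root would run the build.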
# Ubuntu package from a local artifact
sudo apt install ./star-suite_<version>_<arch>.deb
# Installer tarball (auto-detects host glibc level)
tar -xzf STAR-suite-<version>-linux-<arch>-installer.tar.gz
cd STAR-suite-<version>-linux-<arch>-installer
./install.sh
# Manual compatibility tarball
tar -xzf STAR-suite-<version>-linux-<arch>-glibc234.tar.gz
cd STAR-suite-<version>-linux-<arch>-glibc234
./install.sh

Release tarballs are validated in clean Ubuntu 22.04 and 24.04 Docker containers before publication. The installer bundle auto-detects the host glibc level and chooses the right bundled binary.
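Before choosing the manual compatibility tarball, it may be useful to check the host glibc level by hand. The sketch below assumes that the `glibc234` suffix denotes a build targeting glibc 2.34; that reading, the helper name `version_ge`, and the printed advice are assumptions, and the installer's actual detection logic is not shown in this README.

```shell
# True when version $1 >= version $2 (relies on sort -V version ordering).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# "glibc 2.35" -> "2.35"; stays empty on hosts without glibc.
host_glibc=$(getconf GNU_LIBC_VERSION 2>/dev/null | awk '{print $2}')

if [ -n "$host_glibc" ] && version_ge "$host_glibc" 2.34; then
  echo "glibc $host_glibc: a glibc234 build should be loadable (assumption)"
else
  echo "glibc ${host_glibc:-unknown}: prefer the auto-detecting installer bundle"
fi
```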
Packaging/release details and artifact policy:
- docs/Star-binary-distribution.md
- docs/Github-actions.md

Compilation details (module-by-module, clean rebuilds, and clean Ubuntu 24.04 validation):
- docs/compile_instructions.md
A multi-stage Docker setup (Ubuntu 24.04) provides a clean build environment and separate runtime/test images.
- Builder stage: Compiles STAR Suite from source with no host leakage. Validates `make core`, `flex`, `slam`, `feature-barcodes-tools`, `default`, and `all`.
- Suite base runtime (`suite-base`): Minimal executable image with suite binaries (e.g. `STAR`) and no Python/test-only helpers.
- Test images (built from `suite-base`):
  - `test-tier-a`: self-contained smoke helpers.
  - `test-tier-b`: fixture-backed helper stack (e.g. `python3`, `bc`, `samtools`).
# Build suite base image (default tag: biodepot/star-suite:latest)
./scripts/docker/build_image.sh
# Override tag or parallel jobs
IMAGE_TAG=myorg/star-suite:v1 MAKE_JOBS=8 ./scripts/docker/build_image.sh
# Reproducibility check: force a clean rebuild (no cache)
docker build --no-cache --target suite-base -f docker/Dockerfile -t biodepot/star-suite:latest --build-arg MAKE_JOBS=8 .
# Run STAR from suite base image
docker run --rm biodepot/star-suite:latest
# Run Tier A smoke tests (builds/uses test-tier-a image)
./scripts/docker/run_smokes_tier_a.sh
# Run Tier B smoke tests (builds/uses test-tier-b image; requires fixtures)
./scripts/docker/run_smokes_tier_b.sh

Tier B tests require data under /storage. Mount your fixture root:

docker run --rm -v /path/to/your/data:/storage biodepot/star-suite:test-tier-b bash -c "tests/run_cbub_regression_test.sh"

By default, ./scripts/docker/run_smokes_tier_b.sh uses STORAGE=/storage. Set STORAGE=/path to override (the script uses it for the -v mount).
Expected layout: /storage/A375, /storage/flex_filtered_reference, etc. See plans/docker_plan.md for full fixture roots.
Smoke tests honor STAR_BIN to decouple from source-relative paths. Docker smoke wrappers set STAR_BIN=/usr/local/bin/STAR automatically.
See docs/docker_validation.md for the latest portability check results.
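The `STAR_BIN` and `STORAGE` conventions above follow the usual pattern of an environment override with a default. The sketch below shows that pattern only; the function names are hypothetical, and falling back to the in-tree binary path is an assumption about how the smoke tests behave when `STAR_BIN` is unset.

```shell
# Prefer an explicit STAR_BIN; otherwise assume the in-tree build path.
resolve_star_bin() {
  echo "${STAR_BIN:-core/legacy/source/STAR}"
}

# Fixture root for the -v mount; /storage is the documented default.
resolve_storage() {
  echo "${STORAGE:-/storage}"
}

echo "STAR binary:  $(resolve_star_bin)"
echo "Fixture root: $(resolve_storage)"
```

For example, `STAR_BIN=/usr/local/bin/STAR` matches what the Docker smoke wrappers set automatically.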
This section documents the key features and flags for each module. For standard STAR flags not listed here, see core/legacy/README.md. Core additions are listed above in Core Additions over STAR 2.7.11b.
See flex/README_flex.md for the full pipeline reference.
STAR-Flex uses a pseudo-chromosome alignment approach: probe sequences are embedded as pseudo-chromosomes in a hybrid reference genome, and STAR's native alignment machinery handles gene assignment. Core features (trimming, spill-to-disk sorting, Y-chromosome splitting, TranscriptVB) all work with Flex.
Key flags:
- `--flex yes`: Enable Flex pipeline.
- `--soloFlexExpectedCellsPerTag`: Expected cells per sample tag.
- `--soloSampleWhitelist`: TSV mapping sample tags to labels.
- `--soloProbeList`: Probe gene list (auto-detected from index if omitted).
- `--soloSampleProbes`: 10x probe barcode sequences file.
Features:
- Sample tag detection, 1MM pseudocount correction for CBs, clique-based UMI deduplication, and occupancy filtering.
- Y-chromosome splitting tested and validated (`tests/TEST_REPORT_Y_SPLIT_FLEX.md`).
See slam/docs/SLAM_COMPATIBILITY_MODE.md and slam/docs/SLAM_seq.md.
Integrated SLAM-seq quantification with GRAND-SLAM parity:
Key flags:
- `--slamQuantMode 1`: Enable SLAM quantification.
- `--slamGrandSlamOut 1`: Generate GRAND-SLAM compatible output.
- `--slamCompatMode gedi`: Enable GEDI compatibility (intronic classification, lenient overlap, overlap weighting).
- `--slamCompatIntronic`, `--slamCompatLenientOverlap`: Fine-grained compat control.
- `--autoTrim variance`: Variance-based detection of artifact-prone read ends.
- `--slamTrim5p`, `--slamTrim3p`: Manual trim guards.
- `--slamErrorRateFromBlank 1`: Seed error rate from a blank (e.g. no4sU) sample.
- `--outFileNamePrefixAuto 1`: Derive sample name from first FASTQ and route outputs into subdirs.
- `--slamDumpBinary 1 --slamDumpWeights 1`: Emit binary dumps for offline re-quantification with `slam_requant`.
Features:
- Full gene-level NTR estimation (Binomial/EM models).
- Auto-trimming: variance-based detection of artifact-prone read ends.
- QC: comprehensive interactive HTML reports for T->C rates and error modeling.
- Batch layout organizes outputs into `alignments/`, `counts/`, `qc/`, `y_separated/`.
- Binary dump format documented in `slam/docs/SLAM_DUMP_FORMAT.md`.
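For orientation, the T->C conversion rate that the QC report tracks can be illustrated with a toy calculation. The sketch below assumes a hypothetical three-column table (sample, T positions covered, T->C mismatches) with made-up counts. It is not STAR-SLAM's output format and it does not reproduce the Binomial/EM NTR models.

```shell
# Toy input with made-up counts: sample, T positions covered, T->C mismatches.
counts=$(mktemp)
printf 'blank\t1000000\t1000\n6h\t1000000\t12000\n' > "$counts"

# Observed conversion rate per sample = mismatches / covered T positions.
tc_rates() {
  LC_ALL=C awk -F '\t' '{ printf "%s\t%.4f\n", $1, $3 / $2 }' "$1"
}

tc_rates "$counts"
```

A blank (no-4sU) sample gives a baseline for the background mismatch rate, which is the intuition behind seeding the error rate with `--slamErrorRateFromBlank 1`.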
See docs/feature_barcodes.md and docs/CRISPR_FEATURE_CALLING_IMPLEMENTATION_SUMMARY.md.
CR-compatible Solo behavior with integrated CRISPR feature calling:
Key flags:
- `--pfMultiConfig`: Cell Ranger-style multi processing with feature libraries.
- `--defaultCrCompat yes`: Apply the CR-compat perturb defaults bundle.
- `--dynamicThreadInterface 1`: Enable STAR/PF permit coordination.
- `--dynamicThreadConstMapPermits 32`: Start with full map-side permit budget.
- `--crAssignConsumerThreads 32`: Provision PF worker pool to full host budget.
- `--crAssignSearchThreads 1`: Per-consumer search-thread mode.
- `--crMinUmi`: Minimum UMI threshold for CRISPR feature calling (default `10`; lower to `2-3` for lineage barcodes).
- `--soloCrGexFeature`: Control merged GEX source (`auto`, `gene`, `genefull`).
- `--soloCrMode CR`: Enable CR-compatible single-cell behavior.
- `--crChemistry`: Barcode chemistry (`auto`, `NXT`, `TRU`). Default `auto` enables per-library auto-detection. Mixed NXT/TRU experiments are handled automatically; per-library overrides via the `star_chemistry` column in `--pfMultiConfig`.
Recommended execution profile (32-thread host):
--runThreadN 32 --dynamicThreadInterface 1 --dynamicThreadConstMapPermits 32 \
--dynamicThreadTelemetry 1 --crAssignConsumerThreads 32 --crAssignSearchThreads 1

Standalone tool (`star_feature_call`):
- `--compat-perturb`: CR9-compatible output layout (`crispr_analysis/`).
- `--feature-ref`, `--whitelist`, `--fastq-dir`, `--output-dir`: FASTQ -> MEX -> calls.
- `--call-only --mex-dir`: call_features-only pass on existing MEX.
- `--emptydrops-use-fdr`, `--min-umi`, `--ratio-test`: calling controls.
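The recommended execution profile above is written for a 32-thread host. The sketch below assumes, based on the "full host budget" wording in the flag descriptions, that the three thread counts should track the host's thread count on other machines. That scaling rule and the helper name `perturb_thread_flags` are assumptions, not documented guidance.

```shell
# Print the thread-related flags with every "32" replaced by the given count.
perturb_thread_flags() {
  n="$1"
  printf '%s\n' "--runThreadN $n --dynamicThreadInterface 1 --dynamicThreadConstMapPermits $n --dynamicThreadTelemetry 1 --crAssignConsumerThreads $n --crAssignSearchThreads 1"
}

# Use the detected thread count, falling back to the documented 32.
perturb_thread_flags "$(nproc 2>/dev/null || echo 32)"
```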
- SLAM QC (`--slamQcReport <prefix>`): Interactive HTML report (`.html`) and JSON metrics (`.json`) for T->C conversion rates, variance analysis, and trimming overlays.
- FlexFilter QC (`flexfilter_summary.tsv`): Cell calling statistics (EmptyDrops/OrdMag), cell counts, UMI thresholds, and filtering rates per sample.
Core alignment:
core/legacy/source/STAR \
--runMode alignReads \
--genomeDir /path/to/genome_index \
--readFilesIn reads.fq.gz \
--outFileNamePrefix out/ \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes NH HI AS nM MD

Batch mode (bulk, single-pass, SE):
core/legacy/source/STAR \
--runMode alignReads \
--genomeDir /path/to/genome_index \
--readFilesIn A_R1.fq.gz,B_R1.fq.gz \
--outFileNamePrefix /path/to/out_root/ \
--outFileNamePrefixAuto 1 \
--batchMode 1 \
--outSAMtype BAM SortedByCoordinate

Batch mode (bulk, single-pass, PE):
core/legacy/source/STAR \
--runMode alignReads \
--genomeDir /path/to/genome_index \
--readFilesIn A_R1.fq.gz,B_R1.fq.gz A_R2.fq.gz,B_R2.fq.gz \
--outFileNamePrefix /path/to/out_root/ \
--outFileNamePrefixAuto 1 \
--batchMode 1 \
--outSAMtype BAM SortedByCoordinate

Flex Mode (10x Fixed RNA Profiling):
core/legacy/source/STAR \
--runMode alignReads \
--genomeDir /path/to/flex_index \
--readFilesIn reads_R2.fq.gz reads_R1.fq.gz \
--flex yes \
--soloType CB_UMI_Simple \
--soloCBwhitelist /path/to/737K-fixed-rna-profiling.txt \
--soloSampleWhitelist sample_whitelist.tsv \
--outFileNamePrefix output/

SLAM Mode (Standard):
core/legacy/source/STAR \
--runMode alignReads \
--genomeDir /path/to/genome_index \
--readFilesIn reads.fq.gz \
--outFileNamePrefix out/ \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes NH HI AS nM MD \
--slamQuantMode 1 \
--slamSnpBed /path/to/snps.bed

SLAM Mode (GEDI Compatibility):
core/legacy/source/STAR \
--runMode alignReads \
--genomeDir /path/to/genome_index \
--readFilesIn reads.fq.gz \
--slamQuantMode 1 \
--slamCompatMode gedi \
--autoTrim variance \
--outFileNamePrefix output/

SLAM Batch Mode (blank-first, SE/PE):
core/legacy/source/STAR \
--runMode alignReads \
--genomeDir /path/to/genome_index \
--readFilesIn blank_R1.fq.gz,0h_R1.fq.gz,6h_R1.fq.gz,24h_R1.fq.gz \
--outFileNamePrefix /path/to/out_root/ \
--outFileNamePrefixAuto 1 \
--slamQuantMode 1 \
--slamBatchMode 1 \
--slamErrorRateFromBlank 1 \
--slamSnpBed /path/to/snps.bed

For paired-end, pass two comma-separated mate lists:
--readFilesIn blank_R1.fq.gz,0h_R1.fq.gz,... blank_R2.fq.gz,0h_R2.fq.gz,...
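With many samples, the two mate lists are easy to get out of order when typed by hand. The sketch below builds both lists from one ordered set of sample names so that R1 and R2 stay aligned. The helper name `mate_list` and the `<sample>_R1.fq.gz` / `<sample>_R2.fq.gz` naming are assumptions taken from the examples above.

```shell
# mate_list <R1|R2> <sample>... -> comma-separated FASTQ list for that mate.
mate_list() {
  mate="$1"; shift
  list=""
  for s in "$@"; do
    list="${list:+$list,}${s}_${mate}.fq.gz"
  done
  printf '%s\n' "$list"
}

# Blank sample first, as in the SLAM batch example; same order for both mates.
read_files_in="$(mate_list R1 blank 0h 6h 24h) $(mate_list R2 blank 0h 6h 24h)"
echo "--readFilesIn $read_files_in"
```

For single-end batches, only the first list is needed.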
STAR-perturb (integrated CR-compat mode):
core/legacy/source/STAR \
--runMode alignReads \
--runThreadN 32 \
--genomeDir /path/to/index \
--pfMultiConfig /path/to/multi_config.csv \
--dynamicThreadInterface 1 \
--dynamicThreadConstMapPermits 32 \
--crAssignSearchThreads 1 \
--defaultCrCompat yes \
--outFileNamePrefix /path/to/outs/

STAR-perturb (standalone feature pipeline):
core/legacy/source/star_feature_call \
--compat-perturb \
--feature-ref /path/to/feature_reference.csv \
--whitelist /path/to/whitelist.txt \
--fastq-dir /path/to/feature_fastqs \
--filtered-barcodes /path/to/filtered_barcodes.tsv \
--output-dir /path/to/feature_out \
--emptydrops-use-fdr \
--min-umi 10

STAR Suite includes GitHub Codespaces walkthroughs for the main module entry points.
Start here:
Ready now:
Work in progress:
Helpful follow-up guides:
- Core usage: core/legacy/README.md
- Flex pipeline: flex/README_flex.md
- SLAM compatibility: slam/docs/SLAM_COMPATIBILITY_MODE.md
- SLAM methodology: slam/docs/SLAM_seq.md
- STAR-perturb feature docs: docs/feature_barcodes.md
- STAR-perturb A375 parity report: tests/crispr_feature_calling_comparison_report.md
- Cell Ranger multi smoke tool: docs/cr_multi.md
- Docker validation: docs/docker_validation.md