Fix 10 bugs affecting new VCF dataset processing, concurrent runs, and gnomAD conversion#96

Open
anncir1 wants to merge 14 commits into main from ac/vcf-fixes

Conversation

@anncir1 anncir1 commented Apr 10, 2026

Summary

This PR addresses 10 bugs discovered while running CRISPRme with new/updated VCF datasets (AoU, TOPMed, HPRC, updated 1KG, gnomAD joint VCFs).

  • BLANK_LINE_IN_SAMPLES_FILE: skip blank lines in init_summary_by_sample() (process_summaries.py)
  • CHR_FROM_FILENAME: read chromosome from VCF content instead of filename; normalize chr prefix (pool_search_indels.py, pool_post_analisi_indel.py, dictionary_creation_indels.py)
  • FILTER_PASS: add --vcf-filter-pass-values CLI flag (default "PASS,."); remove incorrect VCF-rewrite loop from submit_job.sh that recreated .tbi files and broke subsequent steps
  • GT_HAPLOID: guard against haploid GT (single allele, no | separator) in resultIntegrator.py, new_simple_analysis.py, dictionary_creation_indels.py
  • CONCURRENT_RUNS: mktemp -d unique temp dir and mkdir lockfile guards on shared build steps in submit_job.sh
  • DTYPE_WARNING: add dtype=str to chunked pd.read_csv in remove_n_and_dots.py
  • TBI_FILTER: use .endswith(".vcf.gz") instead of "vcf.gz" in f to exclude .vcf.gz.tbi tabix index files from chromosome list (pool_post_analisi_indel.py)
  • TABIX_CONCURRENT_WRITE: fcntl.flock exclusive lock + mtime check in annotation.py to prevent parallel processes from corrupting the tabix index
  • MISSING_ALLELE: guard against . (missing) alleles in resultIntegrator.py, triggered by haploid assembly samples (e.g. CHM13 in HPRC vcfwave)
  • GNOMAD_CONVERTER: add --af-threshold and --output-dir flags to convert_gnomAD_vcfs.py; fix AF_joint field lookup for joint VCFs

Testing

All fixes were validated by running crisprme.py complete-test --chrom chr22 and crisprme.py validate-test --chrom chr22, plus full-genome runs with AoU VCF datasets.

anncir1 added 14 commits April 10, 2026 15:04
…ample()

A trailing newline in the samplesID file caused an IndexError on splitted[3],
crashing post-processing after the off-target search had completed.
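The guard can be sketched as follows; the function and field layout here are illustrative, not the exact code in process_summaries.py:

```python
def parse_samples_file(lines):
    # Sketch of the blank-line guard added to init_summary_by_sample():
    # skip empty/whitespace-only lines so a trailing newline no longer
    # raises IndexError on splitted[3]. Column layout is assumed.
    samples = {}
    for line in lines:
        line = line.strip()
        if not line:  # guard: ignore blank lines
            continue
        splitted = line.split("\t")
        samples[splitted[0]] = splitted[3]  # previously crashed on ""
    return samples
```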
Filename-based chromosome extraction failed for VCFs whose names do not
embed the chromosome (e.g. whole-genome single-file VCFs, DeepVariant output).
Replace with _chrom_from_vcf() helper that reads the CHROM column from the
first data line in pool_post_analisi_indel.py and pool_search_indels.py.
In dictionary_creation_indels.py, set chr_name directly from first_line[0]
which is already read from the file.
Expose a new --vcf-filter-pass-values CLI flag on crisprme.py so users
can control which VCF FILTER values are treated as passing. The default
"PASS,." covers the VCF spec §1.6.1 "not evaluated" sentinel (".") in
addition to the standard PASS value, making it backwards-compatible.

Changes:
- crisprme.py: parse --vcf-filter-pass-values in complete_search() and
  gnomAD_converter(); pass the value to the shell script (as positional
  arg 24) and to convert_gnomAD_vcfs.py as a 7th positional argument.
- submit_job_automated_new_multiple_vcfs.sh: accept vcf_filter_pass_values
  as ${24} (default "PASS,."); insert an awk loop before crispritz.py
  add-variants that rewrites accepted non-PASS FILTER values to PASS so
  crispritz includes those records during genome enrichment.
- convert_gnomAD_vcfs.py: accept filter_pass_values as an optional 7th
  argument (parse_commandline); add itertools import; add a pre-flight
  warning if no accepted-filter records are found in the first 10k rows;
  replace the hardcoded "PASS" check with a set-intersection test against
  filter_pass_values throughout convert_vcf() and run_conversion_pipeline().
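The set-intersection test can be sketched like this (helper name hypothetical; the FILTER column may carry several ;-separated failing filters):

```python
def record_passes(filter_field, filter_pass_values):
    # Replace the hardcoded "PASS" check with a set intersection against
    # the values from --vcf-filter-pass-values (e.g. "PASS,.").
    accepted = set(filter_pass_values.split(","))
    return bool(set(filter_field.split(";")) & accepted)
```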
Two code paths assumed all GTs are diploid and split on "|", raising
IndexError when encountering hemizygous records (e.g. chrX in male samples).

- resultIntegrator.py: split into alleles list and use alleles[0] for
  allele2 when only one element is present, correctly treating hemizygous
  as homozygous for haplotype frequency counting.
- dictionary_creation_indels.py: after splitting on "|", duplicate a
  single-element list so the haploid allele is counted twice, matching
  the biological expectation for hemizygous variants.
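The duplicate-a-single-allele guard can be sketched as (function name illustrative):

```python
def gt_to_alleles(gt_field):
    # Guard for haploid GT encodings: "1" has no "|" separator, so
    # duplicate the single allele instead of indexing a missing element,
    # treating hemizygous as homozygous for frequency counting.
    alleles = gt_field.split("|")
    if len(alleles) == 1:  # haploid call, e.g. male chrX
        alleles = alleles * 2
    return alleles
```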
… steps

Two classes of race condition existed when multiple CRISPRme jobs ran
simultaneously:

1. Shared variants_genome/ temp folder: crispritz.py add-variants always
   writes to variants_genome/ relative to its working directory, which was
   fixed at Genomes/. Concurrent jobs with different VCF sets would
   overwrite each other's intermediate enriched genome.

   Fix: create a per-run temp dir with mktemp -d (run_XXXXXX under Genomes/)
   and cd into it before calling crispritz, so each run gets its own
   variants_genome/ subtree. The temp dir is removed after the mv steps.

2. Shared index/enriched-genome folders built without atomicity: two runs
   with identical parameters could both pass an "if ! [ -d folder ]" check
   simultaneously and interleave writes into the same directory, corrupting
   the output.

   Fix: wrap all four shared-folder build steps with a mkdir-based lock
   (atomic on POSIX). The winning process builds the folder and rmdir's the
   lock; losing processes wait in a 5-second poll loop until the lock
   disappears. Applied to: enriched genome, ref genome index, variant genome
   index, and indels index.
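The shell lock pattern from submit_job.sh, sketched equivalently in Python (names illustrative; the actual fix lives in the shell script):

```python
import os
import time

def build_shared_folder(folder, build_fn, poll_seconds=5):
    # mkdir-based lock (atomic on POSIX): exactly one process wins the
    # mkdir, builds the folder, then rmdir's the lock; losers poll until
    # the lock disappears, then use the finished folder.
    lock = folder + ".lock"
    try:
        os.mkdir(lock)  # atomic acquire
    except FileExistsError:
        while os.path.isdir(lock):  # loser: wait for the winner
            time.sleep(poll_seconds)
        return
    try:
        if not os.path.isdir(folder):
            build_fn(folder)
    finally:
        os.rmdir(lock)  # release so waiting processes proceed
```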
…nd_dots

Without an explicit dtype, pandas infers column types independently per
chunk. A column that is all-NaN in one chunk and has values in another gets
different inferred dtypes, triggering a DtypeWarning to stderr. Because the
pipeline treats any non-empty log_error.txt as a fatal error, this warning
causes runs to be reported as failed even when all results are correct.

Adding dtype=str suppresses the inference entirely. It is also semantically
correct: the sole purpose of this script is text replacement ("n" -> "NA",
"." -> "NA"), so there is no reason to parse numeric types at all.
Some VCF datasets (e.g. 1000G GRCh38) store chromosomes without a chr
prefix in the CHROM field (e.g. '22' instead of 'chr22'), while the
genome indices and fake FASTA files created by pool_index_indels.py use
chr-prefixed names. Add _normalize_chrom() to pool_search_indels.py and
pool_post_analisi_indel.py to ensure the derived chromosome name always
matches the index naming convention.
…stence check

The existence-only check missed stale .tbi files left behind after
complete-test re-bgzips the annotation. Use an fcntl.flock exclusive lock so
only one parallel annotation process regenerates the index, and only when the
.tbi is absent or older than the .gz.
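A sketch of the lock-plus-mtime guard, assuming a sidecar lock file and an external tabix call (not the exact code in annotation.py):

```python
import fcntl
import os
import subprocess

def ensure_tabix_index(gz_path):
    # Exclusive flock serializes parallel annotation processes; the .tbi
    # is rebuilt only when absent or older than the .gz (mtime check).
    tbi = gz_path + ".tbi"
    with open(gz_path + ".lock", "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # blocks until lock acquired
        stale = (not os.path.exists(tbi)
                 or os.path.getmtime(tbi) < os.path.getmtime(gz_path))
        if stale:
            subprocess.run(["tabix", "-p", "vcf", gz_path], check=True)
        # lock released when the handle closes
```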
The awk+bgzip+tabix loop that normalised non-PASS FILTER values to PASS
before calling crispritz was incorrect — it rewrote source VCF files on
disk (replacing symlinks with real files and creating .tbi files), which
then triggered a pre-existing crispritz modify-while-iterating bug on the
.tbi files in the VCF directory.

VCF files must never be modified by CRISPRme. The FILTER_PASS flag is
already correctly handled for gnomAD VCFs in convert_gnomAD_vcfs.py
(in-memory filtering during conversion). For VCFs passed directly to
crispritz (1000G, AoU, etc.), those files are expected to already carry
appropriate FILTER values (PASS) and no rewriting is needed.

Same IndexError as Fix #4 — sampleInfo[1].split('|')[haplo] crashes when
GT has no pipe (haploid encoding). Triggered by 1KG 2021 eagle2-phased
chrX VCF which stores male hemizygous calls as bare '0'/'1' rather than
forcing diploid '0|0'/'1|1' as shapeit2 did in the 2019 VCF.
…AF field lookup

- parse_commandline: parse optional --af-threshold and --output-dir keyword args
  from sys.argv (positional args unchanged for backward compat); create output_dir
  if specified
- convert_vcf: add af_threshold (default 0.0, no filter) and output_dir (default
  "", same dir as input) params; skip variants where AF_joint < af_threshold;
  write output to output_dir when specified; set af_field = "AF_joint" if joint
- format_variant_record: add af_field param (default "AF") so gnomAD joint VCFs
  correctly read AF_joint instead of falling back to AF=0.0 in the output INFO
- run_conversion_pipeline: pass af_threshold and output_dir through to convert_vcf
- convert_gnomad_vcfs: unpack two new values from parse_commandline; pass to pipeline
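The AF-field selection and threshold filter can be sketched as (helper name and parsed-INFO representation are assumptions):

```python
def select_af(info, joint, af_threshold=0.0):
    # For gnomAD joint VCFs read AF_joint instead of falling back to
    # AF=0.0; skip records below --af-threshold. `info` is a parsed
    # INFO dict mapping field names to string values (sketch).
    af_field = "AF_joint" if joint else "AF"
    af = float(info.get(af_field, 0.0))
    keep = af >= af_threshold
    return af, keep
```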