Fix 10 bugs affecting new VCF dataset processing, concurrent runs, and gnomAD conversion#96

Open
anncir1 wants to merge 14 commits into main from ac/vcf-fixes

Conversation

@anncir1 anncir1 commented Apr 10, 2026

Summary

This PR addresses 10 bugs discovered while running CRISPRme with new/updated VCF datasets (AoU, TOPMed, HPRC, updated 1KG, gnomAD joint VCFs).

  • BLANK_LINE_IN_SAMPLES_FILE: skip blank lines in init_summary_by_sample() (process_summaries.py)
  • CHR_FROM_FILENAME: read chromosome from VCF content instead of filename; normalize chr prefix (pool_search_indels.py, pool_post_analisi_indel.py, dictionary_creation_indels.py)
  • FILTER_PASS: add --vcf-filter-pass-values CLI flag (default "PASS,."); remove incorrect VCF-rewrite loop from submit_job.sh that recreated .tbi files and broke subsequent steps
  • GT_HAPLOID: guard against haploid GT (single allele, no | separator) in resultIntegrator.py, new_simple_analysis.py, dictionary_creation_indels.py
  • CONCURRENT_RUNS: mktemp -d unique temp dir and mkdir lockfile guards on shared build steps in submit_job.sh
  • DTYPE_WARNING: add dtype=str to chunked pd.read_csv in remove_n_and_dots.py
  • TBI_FILTER: use .endswith(".vcf.gz") instead of "vcf.gz" in f to exclude .vcf.gz.tbi tabix index files from chromosome list (pool_post_analisi_indel.py)
  • TABIX_CONCURRENT_WRITE: fcntl.flock exclusive lock + mtime check in annotation.py to prevent parallel processes from corrupting the tabix index
  • MISSING_ALLELE: guard against . (missing) alleles in resultIntegrator.py, triggered by haploid assembly samples (e.g. CHM13 in HPRC vcfwave)
  • GNOMAD_CONVERTER: add --af-threshold and --output-dir flags to convert_gnomAD_vcfs.py; fix AF_joint field lookup for joint VCFs

Testing

All fixes were validated by running crisprme.py complete-test --chrom chr22 and crisprme.py validate-test --chrom chr22, plus full-genome runs with AoU VCF datasets.

anncir1 added 14 commits April 10, 2026 15:04
…ample()

A trailing newline in the samplesID file caused an IndexError on splitted[3],
crashing post-processing after the off-target search had completed.
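The guard can be sketched as follows; the function and field layout here are illustrative, not the exact code in process_summaries.py:

```python
def parse_samples_file(lines):
    # Sketch of the blank-line guard added to init_summary_by_sample():
    # skip empty/whitespace-only lines so a trailing newline no longer
    # raises IndexError on splitted[3]. Column layout is assumed.
    samples = {}
    for line in lines:
        line = line.strip()
        if not line:  # guard: ignore blank lines
            continue
        splitted = line.split("\t")
        samples[splitted[0]] = splitted[3]  # previously crashed on ""
    return samples
```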
Filename-based chromosome extraction failed for VCFs whose names do not
embed the chromosome (e.g. whole-genome single-file VCFs, DeepVariant output).
Replace with _chrom_from_vcf() helper that reads the CHROM column from the
first data line in pool_post_analisi_indel.py and pool_search_indels.py.
In dictionary_creation_indels.py, set chr_name directly from first_line[0]
which is already read from the file.
Expose a new --vcf-filter-pass-values CLI flag on crisprme.py so users
can control which VCF FILTER values are treated as passing. The default
"PASS,." covers the VCF spec §1.6.1 "not evaluated" sentinel (".") in
addition to the standard PASS value, making it backwards-compatible.

Changes:
- crisprme.py: parse --vcf-filter-pass-values in complete_search() and
  gnomAD_converter(); pass the value to the shell script (as positional
  arg 24) and to convert_gnomAD_vcfs.py as a 7th positional argument.
- submit_job_automated_new_multiple_vcfs.sh: accept vcf_filter_pass_values
  as ${24} (default "PASS,."); insert an awk loop before crispritz.py
  add-variants that rewrites accepted non-PASS FILTER values to PASS so
  crispritz includes those records during genome enrichment.
- convert_gnomAD_vcfs.py: accept filter_pass_values as an optional 7th
  argument (parse_commandline); add itertools import; add a pre-flight
  warning if no accepted-filter records are found in the first 10k rows;
  replace the hardcoded "PASS" check with a set-intersection test against
  filter_pass_values throughout convert_vcf() and run_conversion_pipeline().
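The set-intersection test can be sketched like this (helper name hypothetical; the FILTER column may carry several ;-separated failing filters):

```python
def record_passes(filter_field, filter_pass_values):
    # Replace the hardcoded "PASS" check with a set intersection against
    # the values from --vcf-filter-pass-values (e.g. "PASS,.").
    accepted = set(filter_pass_values.split(","))
    return bool(set(filter_field.split(";")) & accepted)
```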
Two code paths assumed all GTs are diploid and split on "|", raising
IndexError when encountering hemizygous records (e.g. chrX in male samples).

- resultIntegrator.py: split into alleles list and use alleles[0] for
  allele2 when only one element is present, correctly treating hemizygous
  as homozygous for haplotype frequency counting.
- dictionary_creation_indels.py: after splitting on "|", duplicate a
  single-element list so the haploid allele is counted twice, matching
  the biological expectation for hemizygous variants.
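The duplicate-a-single-allele guard can be sketched as (function name illustrative):

```python
def gt_to_alleles(gt_field):
    # Guard for haploid GT encodings: "1" has no "|" separator, so
    # duplicate the single allele instead of indexing a missing element,
    # treating hemizygous as homozygous for frequency counting.
    alleles = gt_field.split("|")
    if len(alleles) == 1:  # haploid call, e.g. male chrX
        alleles = alleles * 2
    return alleles
```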
… steps

Two classes of race condition existed when multiple CRISPRme jobs ran
simultaneously:

1. Shared variants_genome/ temp folder: crispritz.py add-variants always
   writes to variants_genome/ relative to its working directory, which was
   fixed at Genomes/. Concurrent jobs with different VCF sets would
   overwrite each other's intermediate enriched genome.

   Fix: create a per-run temp dir with mktemp -d (run_XXXXXX under Genomes/)
   and cd into it before calling crispritz, so each run gets its own
   variants_genome/ subtree. The temp dir is removed after the mv steps.

2. Shared index/enriched-genome folders built without atomicity: two runs
   with identical parameters could both pass an "if ! [ -d folder ]" check
   simultaneously and interleave writes into the same directory, corrupting
   the output.

   Fix: wrap all four shared-folder build steps with a mkdir-based lock
   (atomic on POSIX). The winning process builds the folder and rmdir's the
   lock; losing processes wait in a 5-second poll loop until the lock
   disappears. Applied to: enriched genome, ref genome index, variant genome
   index, and indels index.
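The shell lock pattern from submit_job.sh, sketched equivalently in Python (names illustrative; the actual fix lives in the shell script):

```python
import os
import time

def build_shared_folder(folder, build_fn, poll_seconds=5):
    # mkdir-based lock (atomic on POSIX): exactly one process wins the
    # mkdir, builds the folder, then rmdir's the lock; losers poll until
    # the lock disappears, then use the finished folder.
    lock = folder + ".lock"
    try:
        os.mkdir(lock)  # atomic acquire
    except FileExistsError:
        while os.path.isdir(lock):  # loser: wait for the winner
            time.sleep(poll_seconds)
        return
    try:
        if not os.path.isdir(folder):
            build_fn(folder)
    finally:
        os.rmdir(lock)  # release so waiting processes proceed
```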
…nd_dots

Without an explicit dtype, pandas infers column types independently per
chunk. A column that is all-NaN in one chunk and has values in another gets
different inferred dtypes, triggering a DtypeWarning to stderr. Because the
pipeline treats any non-empty log_error.txt as a fatal error, this warning
causes runs to be reported as failed even when all results are correct.

Adding dtype=str suppresses the inference entirely. It is also semantically
correct: the sole purpose of this script is text replacement ("n" -> "NA",
"." -> "NA"), so there is no reason to parse numeric types at all.
Some VCF datasets (e.g. 1000G GRCh38) store chromosomes without a chr
prefix in the CHROM field (e.g. '22' instead of 'chr22'), while the
genome indices and fake FASTA files created by pool_index_indels.py use
chr-prefixed names. Add _normalize_chrom() to pool_search_indels.py and
pool_post_analisi_indel.py to ensure the derived chromosome name always
matches the index naming convention.
…stence check

The existence-only check missed stale .tbi files left behind after
complete-test re-bgzips the annotation. Use an fcntl.flock exclusive lock so
only one parallel annotation process regenerates the index, and only when the
.tbi is absent or older than the .gz.
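A sketch of the lock-plus-mtime guard, assuming a sidecar lock file and an external tabix call (not the exact code in annotation.py):

```python
import fcntl
import os
import subprocess

def ensure_tabix_index(gz_path):
    # Exclusive flock serializes parallel annotation processes; the .tbi
    # is rebuilt only when absent or older than the .gz (mtime check).
    tbi = gz_path + ".tbi"
    with open(gz_path + ".lock", "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # blocks until lock acquired
        stale = (not os.path.exists(tbi)
                 or os.path.getmtime(tbi) < os.path.getmtime(gz_path))
        if stale:
            subprocess.run(["tabix", "-p", "vcf", gz_path], check=True)
        # lock released when the handle closes
```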
The awk+bgzip+tabix loop that normalised non-PASS FILTER values to PASS
before calling crispritz was incorrect — it rewrote source VCF files on
disk (replacing symlinks with real files and creating .tbi files), which
then triggered a pre-existing crispritz modify-while-iterating bug on the
.tbi files in the VCF directory.

VCF files must never be modified by CRISPRme. The FILTER_PASS flag is
already correctly handled for gnomAD VCFs in convert_gnomAD_vcfs.py
(in-memory filtering during conversion). For VCFs passed directly to
crispritz (1000G, AoU, etc.), those files are expected to already carry
appropriate FILTER values (PASS) and no rewriting is needed.

Same IndexError as Fix #4 — sampleInfo[1].split('|')[haplo] crashes when
GT has no pipe (haploid encoding). Triggered by 1KG 2021 eagle2-phased
chrX VCF which stores male hemizygous calls as bare '0'/'1' rather than
forcing diploid '0|0'/'1|1' as shapeit2 did in the 2019 VCF.
…AF field lookup

- parse_commandline: parse optional --af-threshold and --output-dir keyword args
  from sys.argv (positional args unchanged for backward compat); create output_dir
  if specified
- convert_vcf: add af_threshold (default 0.0, no filter) and output_dir (default
  "", same dir as input) params; skip variants where AF_joint < af_threshold;
  write output to output_dir when specified; set af_field = "AF_joint" if joint
- format_variant_record: add af_field param (default "AF") so gnomAD joint VCFs
  correctly read AF_joint instead of falling back to AF=0.0 in the output INFO
- run_conversion_pipeline: pass af_threshold and output_dir through to convert_vcf
- convert_gnomad_vcfs: unpack two new values from parse_commandline; pass to pipeline
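The AF-field selection and threshold filter can be sketched as (helper name and parsed-INFO representation are assumptions):

```python
def select_af(info, joint, af_threshold=0.0):
    # For gnomAD joint VCFs read AF_joint instead of falling back to
    # AF=0.0; skip records below --af-threshold. `info` is a parsed
    # INFO dict mapping field names to string values (sketch).
    af_field = "AF_joint" if joint else "AF"
    af = float(info.get(af_field, 0.0))
    keep = af >= af_threshold
    return af, keep
```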