Fix 10 bugs affecting new VCF dataset processing, concurrent runs, and gnomAD conversion #96
…ample(): A trailing newline in the samplesID file caused an IndexError on splitted[3], crashing post-processing after the entire off-target search had completed.
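A minimal sketch of the defensive parsing pattern described above; the function name and column count are illustrative, not the actual CRISPRme code:

```python
# Hypothetical sketch: parse a samplesID-style tab-separated file while
# skipping blank lines, so a trailing newline cannot raise IndexError
# when a downstream consumer accesses splitted[3].
def read_sample_rows(path):
    rows = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:  # empty line produced by a trailing newline
                continue
            splitted = line.split("\t")
            if len(splitted) < 4:  # defensive: skip malformed rows too
                continue
            rows.append(splitted)
    return rows
```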
Filename-based chromosome extraction failed for VCFs whose names do not embed the chromosome (e.g. whole-genome single-file VCFs, DeepVariant output). Replace it with a _chrom_from_vcf() helper that reads the CHROM column from the first data line in pool_post_analisi_indel.py and pool_search_indels.py. In dictionary_creation_indels.py, set chr_name directly from first_line[0], which is already read from the file.
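A sketch of a helper like the _chrom_from_vcf() described above; the exact signature in the PR may differ:

```python
import gzip

# Read the CHROM column from the first data line of a (possibly gzipped)
# VCF, instead of guessing the chromosome from the filename.
def chrom_from_vcf(vcf_path):
    opener = gzip.open if vcf_path.endswith(".gz") else open
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):  # skip meta and header lines
                continue
            return line.split("\t", 1)[0]  # CHROM is the first column
    return None  # no data lines found
```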
Expose a new --vcf-filter-pass-values CLI flag on crisprme.py so users
can control which VCF FILTER values are treated as passing. The default
"PASS,." covers the VCF spec §1.6.1 "not evaluated" sentinel (".") in
addition to the standard PASS value, making it backwards-compatible.
Changes:
- crisprme.py: parse --vcf-filter-pass-values in complete_search() and
gnomAD_converter(); pass the value to the shell script (as positional
arg 24) and to convert_gnomAD_vcfs.py as a 7th positional argument.
- submit_job_automated_new_multiple_vcfs.sh: accept vcf_filter_pass_values
as ${24} (default "PASS,."); insert an awk loop before crispritz.py
add-variants that rewrites accepted non-PASS FILTER values to PASS so
crispritz includes those records during genome enrichment.
- convert_gnomAD_vcfs.py: accept filter_pass_values as an optional 7th
argument (parse_commandline); add itertools import; add a pre-flight
warning if no accepted-filter records are found in the first 10k rows;
replace the hardcoded "PASS" check with a set-intersection test against
filter_pass_values throughout convert_vcf() and run_conversion_pipeline().
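The set-intersection test mentioned above can be sketched as follows (a minimal illustration, not the exact convert_gnomAD_vcfs.py code): a VCF FILTER field may hold several semicolon-separated codes, and the record is accepted if any of them appears in the user-supplied pass set.

```python
# Accept a record if any of its FILTER codes is in the configured pass set.
# The default matches the PR's "PASS,." default: standard PASS plus the
# VCF "not evaluated" sentinel ".".
def record_passes(filter_field, pass_values=frozenset({"PASS", "."})):
    return bool(set(filter_field.split(";")) & pass_values)
```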
Two code paths assumed all GTs are diploid and split on "|", raising IndexError when encountering hemizygous records (e.g. chrX in male samples).
- resultIntegrator.py: split into an alleles list and use alleles[0] for allele2 when only one element is present, correctly treating hemizygous as homozygous for haplotype frequency counting.
- dictionary_creation_indels.py: after splitting on "|", duplicate a single-element list so the haploid allele is counted twice, matching the biological expectation for hemizygous variants.
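The duplication strategy for haploid calls can be sketched like this (an illustrative helper, not the actual code in either file):

```python
# Split a GT string into alleles. A diploid "0|1" yields two alleles; a
# haploid call like "1" (e.g. chrX in male samples) yields one element,
# which is duplicated so hemizygous is counted as homozygous.
def split_genotype(gt):
    alleles = gt.split("|")
    if len(alleles) == 1:
        alleles = alleles * 2  # haploid allele counted twice
    return alleles
```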
… steps Two classes of race condition existed when multiple CRISPRme jobs ran simultaneously:
1. Shared variants_genome/ temp folder: crispritz.py add-variants always writes to variants_genome/ relative to its working directory, which was fixed at Genomes/. Concurrent jobs with different VCF sets would overwrite each other's intermediate enriched genome. Fix: create a per-run temp dir with mktemp -d (run_XXXXXX under Genomes/) and cd into it before calling crispritz, so each run gets its own variants_genome/ subtree. The temp dir is removed after the mv steps.
2. Shared index/enriched-genome folders built without atomicity: two runs with identical parameters could both pass an "if ! [ -d folder ]" check simultaneously and interleave writes into the same directory, corrupting the output. Fix: wrap all four shared-folder build steps with a mkdir-based lock (atomic on POSIX). The winning process builds the folder and rmdir's the lock; losing processes wait in a 5-second poll loop until the lock disappears. Applied to: the enriched genome, the ref genome index, the variant genome index, and the indels index.
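The mkdir-based locking pattern in point 2 lives in the shell script, but the same technique can be sketched in Python (os.mkdir is atomic on POSIX, so exactly one process wins; the names here are illustrative):

```python
import os
import time

# Build a shared folder exactly once across concurrent processes.
# The winner creates the lock dir atomically, builds, then releases;
# losers poll until the lock disappears and the folder exists.
def build_shared_folder(folder, lock, build):
    while not os.path.isdir(folder):
        try:
            os.mkdir(lock)        # atomic: only one process succeeds
        except FileExistsError:
            time.sleep(5)         # another process is building; re-check
            continue
        try:
            if not os.path.isdir(folder):
                build(folder)     # winner builds the shared folder
        finally:
            os.rmdir(lock)        # release so waiting processes proceed
```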
…nd_dots
Without an explicit dtype, pandas infers column types independently per
chunk. A column that is all-NaN in one chunk and has values in another gets
different inferred dtypes, triggering a DtypeWarning to stderr. Because the
pipeline treats any non-empty log_error.txt as a fatal error, this warning
causes runs to be reported as failed even when all results are correct.
Adding dtype=str suppresses the inference entirely. It is also semantically
correct: the sole purpose of this script is text replacement ("n" -> "NA",
"." -> "NA"), so there is no reason to parse numeric types at all.
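A small sketch of the chunked read with an explicit dtype, as described above; the column names and chunk size are illustrative only:

```python
import io
import pandas as pd

# Read a TSV in chunks with dtype=str, so pandas never infers per-chunk
# column types (avoiding the DtypeWarning from mixed all-NaN/valued
# columns), then apply the script's text replacements.
def read_report_chunks(handle, chunksize=2):
    for chunk in pd.read_csv(handle, sep="\t", chunksize=chunksize, dtype=str):
        yield chunk.replace({"n": "NA", ".": "NA"})
```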
Some VCF datasets (e.g. 1000G GRCh38) store chromosomes without a chr prefix in the CHROM field (e.g. '22' instead of 'chr22'), while the genome indices and fake FASTA files created by pool_index_indels.py use chr-prefixed names. Add _normalize_chrom() to pool_search_indels.py and pool_post_analisi_indel.py to ensure the derived chromosome name always matches the index naming convention.
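A sketch of a normaliser like the _normalize_chrom() described above (the real helper may differ):

```python
# Make a CHROM value match the chr-prefixed naming used by the genome
# indices and fake FASTA files: '22' -> 'chr22', 'chrX' stays 'chrX'.
def normalize_chrom(chrom):
    return chrom if chrom.startswith("chr") else "chr" + chrom
```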
…l_post_analisi_indel
…nt parallel write race
…stence check: An existence-only check missed the stale .tbi left after complete-test re-bgzips the annotation. Use an fcntl.flock exclusive lock so that only one parallel annotation process regenerates the index, and only when the .tbi is absent or older than the .gz.
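The lock-plus-mtime pattern can be sketched as follows (Unix-only, since fcntl is POSIX; the function and file names are illustrative, not the actual annotation.py code):

```python
import fcntl
import os

# Hold an exclusive flock on a sidecar lock file, then rebuild the .tbi
# only if it is missing or older than the .gz, so parallel processes
# neither race on the index nor redo finished work.
def ensure_tabix_index(gz_path, build_index):
    tbi = gz_path + ".tbi"
    with open(gz_path + ".lock", "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # blocks until we hold the lock
        stale = (not os.path.exists(tbi)
                 or os.path.getmtime(tbi) < os.path.getmtime(gz_path))
        if stale:
            build_index(gz_path)  # e.g. invoke tabix on gz_path
        fcntl.flock(lock, fcntl.LOCK_UN)
```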
The awk+bgzip+tabix loop that normalised non-PASS FILTER values to PASS before calling crispritz was incorrect: it rewrote source VCF files on disk (replacing symlinks with real files and creating .tbi files), which then triggered a pre-existing crispritz modify-while-iterating bug on the .tbi files in the VCF directory. VCF files must never be modified by CRISPRme. The FILTER_PASS flag is already handled correctly for gnomAD VCFs in convert_gnomAD_vcfs.py (in-memory filtering during conversion). For VCFs passed directly to crispritz (1000G, AoU, etc.), the files are expected to already carry appropriate FILTER values (PASS), so no rewriting is needed.
Same IndexError as Fix #4: sampleInfo[1].split('|')[haplo] crashes when the GT has no pipe (haploid encoding). Triggered by the 1KG 2021 eagle2-phased chrX VCF, which stores male hemizygous calls as bare '0'/'1' rather than forcing diploid '0|0'/'1|1' as shapeit2 did in the 2019 VCF.
…AF field lookup
- parse_commandline: parse optional --af-threshold and --output-dir keyword args from sys.argv (positional args unchanged for backward compat); create output_dir if specified
- convert_vcf: add af_threshold (default 0.0, no filter) and output_dir (default "", same dir as input) params; skip variants where AF_joint < af_threshold; write output to output_dir when specified; set af_field = "AF_joint" if joint
- format_variant_record: add af_field param (default "AF") so gnomAD joint VCFs correctly read AF_joint instead of falling back to AF=0.0 in the output INFO
- run_conversion_pipeline: pass af_threshold and output_dir through to convert_vcf
- convert_gnomad_vcfs: unpack the two new values from parse_commandline; pass them to the pipeline
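The AF-field selection above can be sketched like this (an illustrative parser, not the exact convert_gnomAD_vcfs.py code): gnomAD joint VCFs store the allele frequency under AF_joint in INFO, so the field name is chosen by dataset type instead of always reading AF.

```python
# Look up the allele frequency in a semicolon-separated INFO field,
# choosing AF_joint for gnomAD joint VCFs and AF otherwise.
def lookup_af(info_field, joint=False):
    af_field = "AF_joint" if joint else "AF"
    for entry in info_field.split(";"):
        key, _, value = entry.partition("=")
        if key == af_field:
            return float(value)
    return 0.0  # field absent: fall back to 0.0
```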
Summary
This PR addresses 11 bugs discovered while running CRISPRme with new/updated VCF datasets (AoU, TOPMed, HPRC, updated 1KG, gnomAD joint VCFs).
- Trailing-newline crash in init_summary_by_sample() (process_summaries.py)
- Chromosome extraction and chr prefix handling (pool_search_indels.py, pool_post_analisi_indel.py, dictionary_creation_indels.py)
- New --vcf-filter-pass-values CLI flag (default "PASS,."); remove the incorrect VCF-rewrite loop from submit_job.sh that recreated .tbi files and broke subsequent steps
- Hemizygous genotypes (haploid GTs with no | separator) in resultIntegrator.py, new_simple_analysis.py, dictionary_creation_indels.py
- mktemp -d unique temp dir and mkdir lockfile guards on shared build steps in submit_job.sh
- dtype=str for chunked pd.read_csv in remove_n_and_dots.py
- .endswith(".vcf.gz") instead of "vcf.gz" in f to exclude .vcf.gz.tbi tabix index files from the chromosome list (pool_post_analisi_indel.py)
- fcntl.flock exclusive lock + mtime check in annotation.py to prevent parallel processes from corrupting the tabix index
- "." (missing) alleles in resultIntegrator.py, triggered by haploid assembly samples (e.g. CHM13 in HPRC vcfwave)
- --af-threshold and --output-dir flags for convert_gnomAD_vcfs.py; fix AF_joint field lookup for joint VCFs

Testing
All fixes were validated by running crisprme.py complete-test --chrom chr22 and crisprme.py validate-test --chrom chr22, plus full-genome runs with AoU VCF datasets.