
rktmm/data_submission_scripts

 
 


data_submission_scripts

Scripts to aid with completing metadata forms and uploading data to online repositories, namely NCBI's SRA.

For further detail, see the Confluence page associated with these scripts.

process_filenames.sh

This script:

✔ Identifies paired vs single-end reads

✔ Extracts sample names and creates library IDs from listed filenames and organism

✔ Logs unprocessed files
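The classification step can be sketched roughly as follows. This is only an illustration, not the script's actual logic: it assumes paired reads follow a `<sample>_R1` / `<sample>_R2` naming convention with a `.fastq` or `.fq` extension (optionally gzipped), and the function name `classify_reads` is hypothetical.

```shell
# Sketch only: the _R1/_R2 convention and the function name are assumptions.
classify_reads() {
  # Reads FASTQ filenames on stdin (one per line, no spaces in names).
  # Strips the extension and the read suffix, then counts files per sample:
  #   2 files -> paired, 1 file -> single, anything else -> unprocessed.
  sed -E -e 's/\.(fastq|fq)(\.gz)?$//' -e 's/_R[12]$//' |
    sort | uniq -c |
    awk '{
      if ($1 == 2) label = "paired"
      else if ($1 == 1) label = "single"
      else label = "unprocessed"
      print label "," $2
    }'
}
```

For example, `printf '%s\n' a_R1.fastq.gz a_R2.fastq.gz b.fastq.gz | classify_reads` labels sample `a` as paired and sample `b` as single.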

Step 1: Process Filenames for the Metadata File

The input file is a completed BioSample packages CSV file.

process_filenames.sh "input.csv" "processed_paired.csv"

Outputs include:

  • "processed_paired.csv" - contains paired samples
  • "unpaired.csv" - contains single or unpaired samples
  • "unprocessed.csv" - contains samples that could not be classified as either paired or unpaired and need to be dealt with manually
  • "temp_fastq_files.txt" - intermediate list of FASTQ files

Note: the libraryID is created by default from organism + sample_name from the packages file.
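One way the default libraryID could be built is sketched below. The exact format is an assumption: here the two values are joined with an underscore and spaces are replaced with underscores. The function name `make_library_ids` is hypothetical, and the sketch assumes a simple CSV with a header row and no quoted commas.

```shell
# Sketch only: the join character and space handling are assumptions,
# not the documented libraryID format.
make_library_ids() {
  # $1: CSV whose header row contains "organism" and "sample_name" columns.
  awk -F',' '
    NR == 1 {
      # Map each header name to its column index.
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    {
      id = $(col["organism"]) "_" $(col["sample_name"])
      gsub(/ /, "_", id)
      print id
    }' "$1"
}
```

Looking columns up by header name rather than by position keeps the sketch independent of the column order in the packages file.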

copy_fastq_gcp.sh

This script:

✔ Extracts filenames from the input CSV

✔ Removes any empty lines

✔ Checks which files are already copied

✔ Transfers missing files to the destination bucket in Google Cloud Storage

✔ Logs failures in uncopied.txt

Step 1: Review or Prepare Metadata File for Input

The metadata file must contain at least one column for FASTQ files: filename. Reads can be single-, paired- or triple-ended, etc., so additional columns (filename2, filename3) may be present. The remaining fields must be completed as described.
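Collecting the FASTQ names from those columns might look like the sketch below. It is an illustration rather than the script's own code: the function name `list_fastq_files` is hypothetical, and it assumes a simple CSV with a header row, no quoted commas, and filename columns named `filename`, `filename2`, `filename3`, and so on.

```shell
# Sketch only: assumes unquoted CSV fields and "filename<N>" header names.
list_fastq_files() {
  # $1: metadata CSV. Prints every non-empty value found in a column
  # whose header is "filename" optionally followed by digits.
  awk -F',' '
    NR == 1 {
      for (i = 1; i <= NF; i++)
        if ($i ~ /^filename[0-9]*$/) keep[i] = 1
      next
    }
    {
      for (i = 1; i <= NF; i++)
        if ((i in keep) && $i != "") print $i
    }' "$1"
}
```

Running `list_fastq_files projectID/SRA_metadata.csv > file_list.txt` would then give one filename per line with empty cells skipped.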

Step 2: Copy Only Required FASTQ Files

Run the script copy_fastq_gcp.sh to copy only the files in the metadata to the destination bucket.

copy_fastq_gcp.sh \
  projectID/SRA_metadata.csv \
  "gs://source-bucket" "gs://destination-bucket/"

Ensure source_bucket does not end in /, as a trailing slash interferes with the recursive search for FASTQ files.
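If the bucket path comes from a variable or another tool, the trailing slash can be stripped before calling the script. The helper below is a small sketch; `normalize_bucket` is a hypothetical name and is not part of the repository.

```shell
# Sketch only: hypothetical helper, not part of copy_fastq_gcp.sh.
normalize_bucket() {
  # Remove any trailing slashes from a bucket path.
  path="$1"
  while [ "${path%/}" != "$path" ]; do
    path="${path%/}"
  done
  printf '%s\n' "$path"
}
```

For example, `normalize_bucket "gs://source-bucket/"` prints the path without its trailing slash, and a path that has no trailing slash is printed unchanged.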

Outputs include:

  • "uncopied.txt" - contains a list of files that were not successfully copied to the destination bucket and need to be reviewed
  • "to_copy.txt" - intermediate file listing files that are not currently present in the destination bucket
  • "file_list.txt" - intermediate file that contains the list of FASTQ files to be transferred
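The relationship between these three files can be sketched as a set difference followed by a copy loop that logs failures. This is a rough illustration under stated assumptions, not the script itself: the function names and the `COPY_CMD` override are hypothetical, the copy loop expects full source URLs (how names are resolved to URLs in the source bucket is not described here and is left out), and `gsutil cp` is assumed to be the copy tool.

```shell
# Sketch only: function names and COPY_CMD are hypothetical.
missing_files() {
  # $1: wanted entries (e.g. file_list.txt)
  # $2: entries already present in the destination
  # Prints the wanted entries that are not yet present (e.g. to_copy.txt).
  awk -v have="$2" '
    BEGIN { while ((getline line < have) > 0) seen[line] = 1 }
    !($0 in seen)' "$1"
}

copy_listed() {
  # $1: list of full source URLs, $2: destination bucket, $3: failure log
  # COPY_CMD can be overridden (e.g. with "echo") for a dry run.
  while IFS= read -r url; do
    [ -n "$url" ] || continue
    ${COPY_CMD:-gsutil cp} "$url" "$2" < /dev/null ||
      printf '%s\n' "$url" >> "$3"
  done < "$1"
}
```

Each entry whose copy command fails is appended to the log given as the third argument, which plays the role of "uncopied.txt".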

Step 3: Upload Data from the Destination Bucket to the Online Repository
