-
Notifications
You must be signed in to change notification settings - Fork 0
Home
Welcome to the Wiki page of Phoenix-Novel-Species. Please find the several code snippets used for characterising the 53 novel strains isolated from the NASA cleanroom Payload Hazardous Servicing Facility (PHSF) located in Kennedy Space Center (KSC), Florida, during the Phoenix mission spacecraft assembly.
We conducted quality checks of the raw reads (single-end) using FastQC v.0.12.0 (https://github.com/s-andrews/FastQC)
fastqc --threads 16 $r1 -o qscores
Further, Canu v.2.2 (https://github.com/marbl/canu), Flye v.2.9.1 (https://github.com/fenderglass/Flye), and Unicycler v.0.5.0 (https://github.com/rrwick/Unicycler) were utilised on the filtered reads for de novo assembly of the genomes
# Canu
canu -d ${assemblyDir} -p ${prefix} genomeSize=5m -nanopore ${lRead} useGrid=true minReadLength=5000 gridOptions="-t 72:00:00"
# Flye
flye --nano-corr ${lRead} -g 5m --asm-coverage 50 -o ${assemblyDir} --scaffold -t 8 -i 2
# Unicycler
unicycler -t 8 -l ${lR} -o ${assemblyDir}
Genome dereplication was performed using dRep v. 3.4.5 (https://github.com/MrOlm/drep)
dRep dereplicate ${drepDir} -g ${genomeDir}/*.fasta -p 16
Completion and contamination of the assembled genomes were identified using CheckM v.1.2.2 (https://github.com/Ecogenomics/CheckM)
checkm lineage_wf ${genomeDir}/ checkm/ -x .fasta -t 32
Further, the presence of plasmids was identified using MOB-suite v.3.1.8 (https://github.com/phac-nml/mob-suite)
mob_typer --infile ${genomeDir}/${sample}.fasta --out_file Plasmid/${sample}_mobtyper.txt
The final genomes were considered for the calculation of related indices with other type strains.
We collected all validly published representative genomes of the identified genera from NCBI using the command line tool bit (https://github.com/AstrobioMike/bit) and NCBI Entrez-direct
# installation
conda install -c bioconda entrez-direct
conda install -c conda-forge -c bioconda -c defaults -c astrobiomike bit
# esearch to get the list of representative species accession IDs
esearch -query '<Agrococcus>[ORGN] AND "representative genome"[filter] AND all[filter] NOT anomalous[filter]' -db assembly | esummary | xtract -pattern DocumentSummary -def "NA" -element AssemblyAccession > <Agrococcus>_repr.txt
# bit to fetch the genomes
bit-dl-ncbi-assemblies -w <Agrococcus>_repr.txt -f fasta -j 100
# unzipping
gunzip *.gz
We calculated the Average Nucleotide Identity of the novel species with the other type strains using FastANI v.1.34 (https://github.com/ParBLiSS/FastANI)
# get all the paths of the genomes
find ${genome_dir} -iname *.fasta > Novel_Genomes_Paths.txt
# perform all vs all ANI with the genomes
fastANI --rl Novel_Genomes_Paths.txt --ql Novel_Genomes_Paths.txt --threads 50 --matrix -o Novel_Genomes_ANI.tsv
Further, Average Amino acid Identity (AAI) values were computed using aai.rb function from the Enveomics collection toolbox (https://github.com/lmrodriguezr/enveomics)
aai.rb --seq1 <genome> --seq2 <ref_genome> --threads 50 --res AAI_output.txt
We annotated 53 novel strains using the widely-used command-line annotation tool Prokka (https://github.com/tseemann/prokka)
# prokka command
# From the directory of the .fasta files
for file in *.fasta; do prokka --outdir ../Prokka_Annotation/"$file" --prefix "$file" --cpus 40 "$file"; done
From the annotated genomes, 16S rRNA and DNA gyrase subunit-B (gyrB) genes were extracted for the novel species and the closest type strain, and sequence identity was estimated using Blast v.2.13.0
# 16S rRNA sequence extraction
pattern="16S ribosomal RNA"
IFS=$'\n'
for i in $(find $INPUT_FOLDER -iname "*.ffn")
do
filename=$(basename $i | sed -e 's/ /_/g' | cut -d . -f 1-2)
echo "Getting 16S sequence for: $i"
# Use awk to find and print the longest sequence between headers containing the pattern
awk -v pattern="$pattern" '
BEGIN { RS=">"; FS="\n" }
tolower($1) ~ tolower(pattern) {
sequence = $0
gsub(/\n/, "", sequence) # Remove newline characters
length_seq = length(sequence)
# Update max_length and max_sequence if the current sequence is longer
if (length_seq > max_length) {
max_length = length_seq
max_sequence = ">" $0
}
}
END { printf "%s", max_sequence }' "$i" > "$OUTPUT_FOLDER/${filename}_16S.fasta"
done
# gyrB sequence extraction
pattern_gyrb="DNA gyrase subunit B"
echo "Getting gyrB sequences"
IFS=$'\n'
for i in $(find $INPUT_FOLDER -iname "*.ffn")
do
filename=$(basename $i | sed -e 's/ /_/g' | cut -d . -f 1-2)
echo "Getting 16S sequence for: $i"
# Use awk to find and print the longest sequence between headers containing the pattern
awk -v pattern="$pattern_gyrb" '
BEGIN { RS=">"; FS="\n" }
tolower($1) ~ tolower(pattern) {
sequence = $0
gsub(/\n/, "", sequence) # Remove newline characters
length_seq = length(sequence)
# Update max_length and max_sequence if the current sequence is longer
if (length_seq > max_length) {
max_length = length_seq
max_sequence = ">" $0
}
}
END { printf "%s", max_sequence }' "$i" > "$OUTPUT_FOLDER/${filename}_gyrB.fasta"
done
# BLAST
blastn -db 16S_ribosomal_RNA -query $i -task blastn -dust no -max_target_seqs 1 -outfmt "10 delim=, qacc sacc evalue bitscore qcovus pident staxid ssciname" -num_threads 40
We utilised GToTree v.1.8.2 (https://github.com/AstrobioMike/GToTree) for preparing the phylogenetic trees using single-copy core genes, followed by IQTREE-2 v.2.2.0.3 (https://github.com/iqtree/iqtree2) to perform 1000 ultrafast bootstrap replicates
# GToTree
GToTree -a ${genus}_non_novel_accession.txt -f ${genus}_paths.txt -H $scg_set -t -L Species,Strain -T IQ-TREE -j 50 -m GToTree_Mapping_Files/${genus}_GToTree_Mapping.tsv -o $genus
# IQTREE-2
iqtree2 -s $(genus}/Aligned_SCGs_mod_names.faa -spp ${genus}/run_files/Partitions.txt -m txt -m MFP -bb 1000 -nt 40 -pre ${genus}_IQ_Tree
We identified the Cluster of Orthologous Genes (COGs) using the Prokka annotated genomes. We used python-package cogclassifier https://pypi.org/project/cogclassifier/)
# cogclassfier command
# Once Prokka is over, utilise .faa files for COG identification
for file in ./*/*.faa; do cp "$file" ./../COGs/; done
# From the directory of the .faa files
for files in *.faa; do COGclassifier -i "$files" -o ./COG_Output/"$files"; done
We then predicted, annotated and analysed the secondary metabolite biosynthesis gene clusters in these novel strains using command-line antiSMASH 7.0 (https://antismash.secondarymetabolites.org/#!/about)
# antismash command
# From the directory of the .fasta files
for file in *.fasta; do antismash "$file" --taxon bacteria --cpus 20 --genefinding-tool prodigal --cb-knownclusters --output-dir ./../antiSMASH/"$file" --cc-mibig --fullhmmer; done
Further, we employed the command-line tool Resistance Gene Identifier (RGI) (https://github.com/arpcard/rgi) to predict antibiotic resistome(s) in these novel genomes. We searched these antibiotic resistance genes (ARGs) against the Comprehensive Antibiotic Resistance Database (CARD)(https://card.mcmaster.ca/)
# rgi command
# From the directory of the .fasta files
for file in *.fasta; do rgi main -i "$file" -o ./RGI_CARD/"$file" -t contig -a BLAST -n 50; done
We used fastp v.0.22.0 (https://github.com/OpenGene/fastp) to filter the raw shotgun sequences obtained from NASA cleanrooms
fastp.0.23.4 --in1 ${metagenomeFolder}/$read1 --out1 ${read1}_output.fastq --in2 ${metagenomeFolder}/{read2} --out2 ${read2}_output.fastq -w 16 --json ${metagenomeFolder}_fastp.json --html ${metagenomeFolder}_fastp.html
Then we utilised MetaCompass v.2.0 (https://github.com/marbl/MetaCompass) to align the filtered reads to newly identified genomes and determine their abundance in the NASA cleanrooms based on mapped reads
python3 ../../MetaCompass/go_metacompass.py -r $genome -1 "$read1" -2 "$read2" -l 150 -t 70 -y 50 -o Metagenomic_Mapping/${name}_${i}