This repository contains the codes used for processing the data and producing the figures in the paper: The mutation profile of SARS-CoV-2 is primarily shaped by the host antiviral defense
-
This workflow is prepared using Snakemake workflow management system and conda
-
To run the workflow, you should have conda installed for environment management. All the other packages including Snakemake and their dependencies can be obtained automatically through environments prepared for each step of the workflow. You can follow the installation steps from the link.
-
Initially, you should clone the repository and navigate into the directory:
git clone https://github.com/CompGenomeLab/SARS-CoV-2_Mutational_Profile.git cd SARS-CoV-2_Mutational_Profile/
-
Next, you should create a conda environment with the defined packages. We propose 2 way to create the environment:
-
One is installing mamba and creating the environment using mamba:
conda install -c conda-forge mamba mamba create -c bioconda -c conda-forge -c r -n covid19 snakemake python=3.8 conda activate covid19 mamba install -c conda-forge -c bioconda augur=9.0.0
-
Or the environment can be directly created from our environment file:
conda env create -f workflow/envs/env.yaml conda activate covid19
-
Note: The steps described here apply to the Linux operating system. There may be slight differences in codes between operating systems. |
---|
This workflow is prepared according to the structure recommended by Snakemake:
config/
: contains the configuration file.resources/
: contains the input files. Although some of the files are provided in the repository, the raw genomes, their pre-computed multiple sequence alignment, and their metadata file, which are retrieved from GISAID database [1] are not provided. This 3 data must be obtained and stored in theresources/
directory before running the workflow and their names and directories must be properly added to the configuration file.results/
: contains the generated files and figures. Inside this directory, we provided a folder calledprocessed/
, which contains the generated results of the paper.logs/
: contains the log files of each step. This folder will automatically appear when you run the workflow.workflow/
: contains the mainSnakefile
,envs/
where the environments are stored,rules/
where the Snakemake rules are stored, andscripts/
where the scripts used inside the rules are stored.
Before running the workflow, you should edit the configuration file.
output_name
: all the outputs will be stored inresults/
directory inside this folder. In addition, the <output_name> will be added to the names of the generated files.fasta_file
: name of the input fasta file. Ex:"2021-02-09"
msa
: name of the pre-computed multiple sequence alignment.fasta_folder
: directory where fasta file and pre-computed multiple sequence alignment are stored. Ex:"resources/genomes/2021-02-09"
.info_file
: name of the input metadata file.info folder
: directory where input metadata file is stored.subset
: # of remaining sequences after subsampling step.time_window
: range of genome dates that will be used in the analyses (Any genome with a date value that is out of this range will be discarded from the analyses.). Ex:"2019-01-01:2022-01-01"
redo
: repeating the building the tree (true or false). It is recommeded to set it as true in case a corrupted tree file is produced and needed to be reproduced.reference_fa
: name of the reference fasta file. The directory isresources/reference_genome/
. If you want to change the reference, you should store it in the given directory and set the name of the file here.reference_gff
: name of the reference gff file. The directory isresources/reference_genome/
. If you want to change the reference, you should store it in the given directory and set the name of the file here.cdhit
: sequence identity treshold set for the cd-hit algorithm [2]. You can provide multiple tresholds in a list, which will result in generation of downstream outputs for each treshold value of cd-hit, seperately.
After adjusting the configuration file, you can run the workflow from SARS-CoV-2_Mutational_Profile/
directory:
snakemake -pr --use-conda --cores 64 --debug-dag
Figure 1: A directed acyclic graph (DAG) of the analysis.
[1] Shu, Y., & McCauley, J. (2017). GISAID: Global initiative on sharing all influenza data -from vision to reality. Euro Surveill, 22(13). doi:10.2807/1560-7917.ES.2017.22.13.30494
[2] Fu, L., Niu, B., Zhu, Z., Wu, S., & Li, W. (2012). CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics, 28(23), 3150–3152. doi:10.1093/bioinformatics/bts565