3a. RepGenR stand‐alone examples
This example shows a full workflow of RepGenR.
- Downloading metadata, selecting Francisella tularensis:
repgenr metadata --workdir tularensis --release 214.0 --version bac120 --dataset all --level species --target_genus francisella --target_species tularensis
- Download genomes (almost 900 genomes):
repgenr genome --workdir tularensis
- Dereplicate the ~900 genomes using 99% ANI, divided into 3 batches to increase performance:
repgenr derep --workdir tularensis --secondary_ani 0.99 --process_size 300 --num_processes 3 --threads 70
- Compute phylogeny of dereplicated genomes using the accurate method:
repgenr phylo --workdir tularensis --mode accurate
Figure showing phylogeny of dereplicated Francisella tularensis datasets.
- Output the parent-child relations-file:
repgenr tree2tax --workdir tularensis
Content of parent-child relations-file (branch-nodes named as the hash of their descendants):
child parent
Francisellaceae_Francisella_sp002095075_GCA_002095075.2 root
Francisellaceae_Francisella_tularensis_GCF_001870885.1 0f7a3c69c7a74e54e710b4bcaad38c34
0f7a3c69c7a74e54e710b4bcaad38c34 root
Francisellaceae_Francisella_tularensis_GCF_000742085.1 d5e86a1f8635b57f94389d046c00ea0c
d5e86a1f8635b57f94389d046c00ea0c 0f7a3c69c7a74e54e710b4bcaad38c34
Francisellaceae_Francisella_tularensis_GCF_000168775.2 b75d4f08baae50cddf3df8a1395b76dc
b75d4f08baae50cddf3df8a1395b76dc 7b8d9247995e31ea51977a27d2720532
7b8d9247995e31ea51977a27d2720532 d5e86a1f8635b57f94389d046c00ea0c
Francisellaceae_Francisella_tularensis_GCF_016604535.1 f7a6b750e06844ffe53f0003cca9c32f
f7a6b750e06844ffe53f0003cca9c32f 7b8d9247995e31ea51977a27d2720532
Francisellaceae_Francisella_tularensis_GCF_000014645.1 898615f9a2bedbd8a8e468c1be231cb0
898615f9a2bedbd8a8e468c1be231cb0 b75d4f08baae50cddf3df8a1395b76dc
Francisellaceae_Francisella_tularensis_GCF_002952075.1 898615f9a2bedbd8a8e468c1be231cb0
Francisellaceae_Francisella_tularensis_GCF_001865695.1 c15e5c8f34dbf0e7e2ba05c7d7322e15
c15e5c8f34dbf0e7e2ba05c7d7322e15 f7a6b750e06844ffe53f0003cca9c32f
Francisellaceae_Francisella_tularensis_GCF_000195535.1 c2b295fdb439928d438e3466599181d0
c2b295fdb439928d438e3466599181d0 c15e5c8f34dbf0e7e2ba05c7d7322e15
Francisellaceae_Francisella_tularensis_GCF_000833355.1 c2b295fdb439928d438e3466599181d0
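The relations file can be queried with standard tools. The following is a minimal sketch, assuming the file is whitespace-separated with a single header row as shown above. The helper names (children_of, leaves) are made up for illustration and are not part of RepGenR, and the sample rows are the first five rows of the listing written to a temporary file.

```shell
# Write a small sample (first rows of the listing above) to a temporary file.
relations=$(mktemp)
cat > "$relations" <<'EOF'
child parent
Francisellaceae_Francisella_sp002095075_GCA_002095075.2 root
Francisellaceae_Francisella_tularensis_GCF_001870885.1 0f7a3c69c7a74e54e710b4bcaad38c34
0f7a3c69c7a74e54e710b4bcaad38c34 root
Francisellaceae_Francisella_tularensis_GCF_000742085.1 d5e86a1f8635b57f94389d046c00ea0c
d5e86a1f8635b57f94389d046c00ea0c 0f7a3c69c7a74e54e710b4bcaad38c34
EOF

# Direct children of a node: rows whose second column equals the node name.
children_of() {
  awk -v node="$1" 'NR > 1 && $2 == node { print $1 }' "$2"
}

# Leaves: names in the child column that never occur in the parent column.
leaves() {
  awk 'NR > 1 { child[$1]; parent[$2] }
       END { for (c in child) if (!(c in parent)) print c }' "$1" | sort
}

root_children=$(children_of root "$relations")
leaf_names=$(leaves "$relations")
echo "$root_children"
echo "$leaf_names"
rm -f "$relations"
```

In the full file, the leaves are the genome entries and every other name is a hashed branch node, so the same two helpers can be pointed at the real output of tree2tax instead of the temporary sample.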
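This example computes the phylogeny of all Francisella tularensis genomes, skipping the dereplication step.
- Downloading metadata, selecting Francisella tularensis: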
repgenr metadata --workdir tularensis_full --release 214.0 --version bac120 --dataset all --level species --target_genus francisella --target_species tularensis
- Download genomes:
repgenr genome --workdir tularensis_full
- Compute phylogeny of all genomes using the accurate method:
repgenr phylo --workdir tularensis_full --mode accurate --all_genomes
- Output the parent-child relations-file:
repgenr tree2tax --workdir tularensis_full --all_genomes
In this example, RepGenR is used to obtain conveniently named files for multiple families (Francisellaceae, Burkholderiaceae, and Bacillaceae) and for selected species (tularensis, mallei, and anthracis) within those families. For families, the "representative" genomes are selected; for species, "all" genomes are selected. See the GTDB taxonomy browser (https://gtdb.ecogenomic.org/) to identify taxa.
- Make a work-directory for RepGenR:
mkdir repgenr_download
- Download Francisellaceae representative-genome metadata:
repgenr metadata --workdir repgenr_download/francisellaceae --release 214.0 --version bac120 --dataset rep --level family --target_family francisellaceae
- Download Burkholderiaceae and Bacillaceae representative-genome metadata (re-using the GTDB metadata file downloaded by the command above, by specifying --metadata_path):
repgenr metadata --workdir repgenr_download/burkholderiaceae --release 214.0 --version bac120 --dataset rep --level family --target_family burkholderiaceae --metadata_path repgenr_download/francisellaceae/bac120_metadata_r214.tar.gz
repgenr metadata --workdir repgenr_download/bacillaceae --release 214.0 --version bac120 --dataset rep --level family --target_family bacillaceae --metadata_path repgenr_download/francisellaceae/bac120_metadata_r214.tar.gz
- Download all tularensis, mallei, and anthracis genome metadata (re-using the previously downloaded metadata-file):
repgenr metadata --workdir repgenr_download/tularensis --release 214.0 --version bac120 --dataset all --level species --target_genus francisella --target_species tularensis --metadata_path repgenr_download/francisellaceae/bac120_metadata_r214.tar.gz
repgenr metadata --workdir repgenr_download/mallei --release 214.0 --version bac120 --dataset all --level species --target_genus burkholderia --target_species mallei --metadata_path repgenr_download/francisellaceae/bac120_metadata_r214.tar.gz
repgenr metadata --workdir repgenr_download/anthracis --release 214.0 --version bac120 --dataset all --level species --target_genus bacillus_A --target_species anthracis --metadata_path repgenr_download/francisellaceae/bac120_metadata_r214.tar.gz
- Download genome sequences:
repgenr genome --workdir repgenr_download/francisellaceae
repgenr genome --workdir repgenr_download/burkholderiaceae
repgenr genome --workdir repgenr_download/bacillaceae
repgenr genome --workdir repgenr_download/tularensis
repgenr genome --workdir repgenr_download/mallei
repgenr genome --workdir repgenr_download/anthracis
Alternative command, which loops over all folders inside the repgenr_download folder:
find repgenr_download -mindepth 1 -maxdepth 1 -type d -exec repgenr genome --workdir {} \;
- Make a folder and move the downloaded genomes into it (the repgenr_download folder can then be removed):
mkdir genomes_downloaded
find repgenr_download -name "*.fasta" -exec mv {} genomes_downloaded/ \;
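The representative set of a family and the full set of one of its species can overlap, so the same file name may occur in more than one work directory, and mv then silently replaces one copy with the other. The sketch below lists FASTA base names that occur more than once, so they can be inspected before the move. It runs on a throwaway directory with made-up file names (genome_A.fasta and so on). Whether RepGenR gives a genome the same file name in every work directory is an assumption, and duplicate_fasta_names is a hypothetical helper, not a RepGenR command.

```shell
# Build a throwaway tree that mimics repgenr_download (file names are made up).
demo=$(mktemp -d)
mkdir -p "$demo/francisellaceae" "$demo/tularensis"
touch "$demo/francisellaceae/genome_A.fasta" "$demo/francisellaceae/genome_B.fasta"
touch "$demo/tularensis/genome_B.fasta" "$demo/tularensis/genome_C.fasta"

# Print every FASTA base name that occurs in more than one location.
duplicate_fasta_names() {
  find "$1" -name "*.fasta" | sed 's|.*/||' | sort | uniq -d
}

duplicates=$(duplicate_fasta_names "$demo")
echo "$duplicates"
rm -rf "$demo"
```

Against the real download, the call would be duplicate_fasta_names repgenr_download; an empty result means no file is overwritten by the move.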
This example demonstrates how the derep-module can be executed multiple times with different ANI-settings, saving the result of each run in between.
- Download Francisella tularensis genomes:
repgenr metadata --workdir tularensis --release 214.0 --version bac120 --dataset all --level species --target_genus francisella --target_species tularensis
repgenr genome --workdir tularensis
- Dereplicate the ~900 genomes using 99% ANI, divided into 3 batches to increase performance:
repgenr derep --workdir tularensis --secondary_ani 0.99 --process_size 300 --num_processes 3 --threads 70
Suppose this dereplication was too stringent and that a more lenient run needs to be computed.
- Before proceeding, save the 99% ANI run:
repgenr derep_stocker --save --name ANI_099
- Dereplicate the ~900 genomes using 95% ANI, divided into 3 batches to increase performance:
repgenr derep --workdir tularensis --secondary_ani 0.95 --process_size 300 --num_processes 3 --threads 70
- Save the 95% ANI run:
repgenr derep_stocker --save --name ANI_095
- Both runs now exist in the saved stock:
repgenr derep_stocker --list
- The 99% ANI run is loaded by:
repgenr derep_stocker --load --name ANI_099
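When several thresholds are to be tried, the derep and save steps above can be generated in a loop. The following is a sketch only: it prints the commands rather than running them, re-uses the flags exactly as they appear in the steps above, and derives the stock name from the ANI value in the same pattern as ANI_099 and ANI_095. The helpers stock_name and plan_runs are hypothetical and not part of RepGenR.

```shell
# Turn an ANI value into a stock name by dropping the decimal point (0.99 -> ANI_099).
stock_name() {
  printf 'ANI_%s\n' "$(printf '%s' "$1" | tr -d '.')"
}

# Print one derep command and one save command per ANI value (dry run, nothing is executed).
plan_runs() {
  for ani in "$@"; do
    echo "repgenr derep --workdir tularensis --secondary_ani $ani --process_size 300 --num_processes 3 --threads 70"
    echo "repgenr derep_stocker --save --name $(stock_name "$ani")"
  done
}

plan_runs 0.99 0.95
```

Once the printed commands look right, the output can be piped to sh to execute them in order.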