3a. RepGenR stand‐alone examples

jaclew edited this page Nov 21, 2023 · 2 revisions

Full workflow: A dereplicated and raw Francisella tularensis dataset

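This example runs the full RepGenR workflow on Francisella tularensis twice: once with dereplication and once on the raw (non-dereplicated) genome set.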

Dereplicated

  • Download metadata and select Francisella tularensis:
repgenr metadata --workdir tularensis --release 214.0 --version bac120 --dataset all --level species --target_genus francisella --target_species tularensis
  • Download genomes (almost 900 genomes):
repgenr genome --workdir tularensis
  • Dereplicate the ~900 genomes using 99% ANI, divided into 3 batches to increase performance:
repgenr derep --workdir tularensis --secondary_ani 0.99 --process_size 300 --num_processes 3 --threads 70
  • Compute phylogeny of dereplicated genomes using the accurate method:
repgenr phylo --workdir tularensis --mode accurate

Figure: Phylogeny of the dereplicated Francisella tularensis dataset.

  • Output the parent-child relations-file:
repgenr tree2tax --workdir tularensis

Content of parent-child relations-file (branch-nodes named as the hash of their descendants):

child                                                    parent
Francisellaceae_Francisella_sp002095075_GCA_002095075.2  root
Francisellaceae_Francisella_tularensis_GCF_001870885.1   0f7a3c69c7a74e54e710b4bcaad38c34
0f7a3c69c7a74e54e710b4bcaad38c34                         root
Francisellaceae_Francisella_tularensis_GCF_000742085.1   d5e86a1f8635b57f94389d046c00ea0c
d5e86a1f8635b57f94389d046c00ea0c                         0f7a3c69c7a74e54e710b4bcaad38c34
Francisellaceae_Francisella_tularensis_GCF_000168775.2   b75d4f08baae50cddf3df8a1395b76dc
b75d4f08baae50cddf3df8a1395b76dc                         7b8d9247995e31ea51977a27d2720532
7b8d9247995e31ea51977a27d2720532                         d5e86a1f8635b57f94389d046c00ea0c
Francisellaceae_Francisella_tularensis_GCF_016604535.1   f7a6b750e06844ffe53f0003cca9c32f
f7a6b750e06844ffe53f0003cca9c32f                         7b8d9247995e31ea51977a27d2720532
Francisellaceae_Francisella_tularensis_GCF_000014645.1   898615f9a2bedbd8a8e468c1be231cb0
898615f9a2bedbd8a8e468c1be231cb0                         b75d4f08baae50cddf3df8a1395b76dc
Francisellaceae_Francisella_tularensis_GCF_002952075.1   898615f9a2bedbd8a8e468c1be231cb0
Francisellaceae_Francisella_tularensis_GCF_001865695.1   c15e5c8f34dbf0e7e2ba05c7d7322e15
c15e5c8f34dbf0e7e2ba05c7d7322e15                         f7a6b750e06844ffe53f0003cca9c32f
Francisellaceae_Francisella_tularensis_GCF_000195535.1   c2b295fdb439928d438e3466599181d0
c2b295fdb439928d438e3466599181d0                         c15e5c8f34dbf0e7e2ba05c7d7322e15
Francisellaceae_Francisella_tularensis_GCF_000833355.1   c2b295fdb439928d438e3466599181d0
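To trace a genome back to the root, the relations-file can be walked from child to parent. The following is a minimal sketch, assuming the file has a single header row and two whitespace- or tab-separated columns as shown above. The function name `lineage` and the file path passed to it are placeholders for illustration, not part of RepGenR.

```shell
# lineage FILE NODE
# Print NODE followed by each of its ancestors, one per line, ending at "root".
# Assumes FILE has a header row and two whitespace-separated columns (child, parent).
lineage() {
    awk -v node="$2" '
        NR > 1 { parent[$1] = $2 }      # skip the header; map child -> parent
        END {
            # Follow parent links; stop at root, when no parent is known, or on a cycle.
            while (node != "" && !(node in seen)) {
                print node
                seen[node] = 1
                if (node == "root") break
                node = parent[node]
            }
        }
    ' "$1"
}
```

For the table above, calling `lineage` on Francisellaceae_Francisella_tularensis_GCF_000742085.1 would pass through d5e86a1f8635b57f94389d046c00ea0c and 0f7a3c69c7a74e54e710b4bcaad38c34 before reaching root.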

Raw (no dereplication)

  • Download metadata and select Francisella tularensis:
repgenr metadata --workdir tularensis_full --release 214.0 --version bac120 --dataset all --level species --target_genus francisella --target_species tularensis
  • Download genomes:
repgenr genome --workdir tularensis_full
  • Compute phylogeny of all genomes using the accurate method:
repgenr phylo --workdir tularensis_full --mode accurate --all_genomes
  • Output the parent-child relations-file:
repgenr tree2tax --workdir tularensis_full --all_genomes
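A quick way to compare the dereplicated and raw runs is to count the leaves (genomes) in each parent-child relations-file. This is a sketch only, assuming both files use the two-column layout with a header row shown in the dereplicated example. The function name `count_leaves` is hypothetical, and the location of the relations-file inside each work-directory is not stated here, so the path is left as an argument.

```shell
# count_leaves FILE
# Count nodes that appear in the child column but never in the parent column.
# Assumes FILE has a header row and two whitespace-separated columns (child, parent).
count_leaves() {
    awk '
        NR > 1 { child[$1] = 1; is_parent[$2] = 1 }   # skip the header row
        END {
            n = 0
            for (c in child) if (!(c in is_parent)) n++
            print n
        }
    ' "$1"
}
```

Running it once on the relations-file from the tularensis work-directory and once on the one from tularensis_full shows how many genomes dereplication removed.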

Partial workflow: multiple families and species

In this example, RepGenR is used to obtain conveniently named genome files for multiple families (Francisellaceae, Burkholderiaceae, and Bacillaceae) and for selected species within those families (tularensis, mallei, and anthracis). For the families, the "representative" genomes are selected; for the species, "all" genomes are selected. See the GTDB taxonomy browser (https://gtdb.ecogenomic.org/) to identify taxa.

  • Make a work-directory for RepGenR:
mkdir repgenr_download
  • Download Francisellaceae representative-genome metadata (this also downloads the metadata-file from GTDB):
repgenr metadata --workdir repgenr_download/francisellaceae --release 214.0 --version bac120 --dataset rep --level family --target_family francisellaceae
  • Download Burkholderiaceae and Bacillaceae representative-genome metadata (re-using the previously downloaded metadata-file from GTDB in the above command, by specifying --metadata_path):
repgenr metadata --workdir repgenr_download/burkholderiaceae --release 214.0 --version bac120 --dataset rep --level family --target_family burkholderiaceae --metadata_path repgenr_download/francisellaceae/bac120_metadata_r214.tar.gz
repgenr metadata --workdir repgenr_download/bacillaceae --release 214.0 --version bac120 --dataset rep --level family --target_family bacillaceae --metadata_path repgenr_download/francisellaceae/bac120_metadata_r214.tar.gz
  • Download all tularensis, mallei, and anthracis genome metadata (re-using the previously downloaded metadata-file):
repgenr metadata --workdir repgenr_download/tularensis --release 214.0 --version bac120 --dataset all --level species --target_genus francisella --target_species tularensis --metadata_path repgenr_download/francisellaceae/bac120_metadata_r214.tar.gz
repgenr metadata --workdir repgenr_download/mallei --release 214.0 --version bac120 --dataset all --level species --target_genus burkholderia --target_species mallei --metadata_path repgenr_download/francisellaceae/bac120_metadata_r214.tar.gz
repgenr metadata --workdir repgenr_download/anthracis --release 214.0 --version bac120 --dataset all --level species --target_genus bacillus_A --target_species anthracis --metadata_path repgenr_download/francisellaceae/bac120_metadata_r214.tar.gz
  • Download genome sequences:
repgenr genome --workdir repgenr_download/francisellaceae
repgenr genome --workdir repgenr_download/burkholderiaceae
repgenr genome --workdir repgenr_download/bacillaceae
repgenr genome --workdir repgenr_download/tularensis
repgenr genome --workdir repgenr_download/mallei
repgenr genome --workdir repgenr_download/anthracis

Alternative command which will loop over all folders inside the repgenr_download-folder:
find repgenr_download -mindepth 1 -maxdepth 1 -type d -exec repgenr genome --workdir {} \;

  • Make a folder and move the downloaded genomes into it (the repgenr_download-folder can then be removed):
mkdir genomes_downloaded
find repgenr_download -name "*.fasta" -exec mv {} genomes_downloaded/ \;
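The repeated metadata calls in this example differ only in the taxon names, so they can be wrapped in two small helper functions. The following is a sketch under stated assumptions: the function names `family_metadata` and `species_metadata` and the `REPGENR_CMD` variable are hypothetical helpers (not RepGenR features), and every option is copied from the commands above. The first Francisellaceae call must still be run without --metadata_path so that the metadata-file exists.

```shell
# REPGENR_CMD is a hypothetical override (not a RepGenR feature); set it to "echo" for a dry run.
REPGENR_CMD="${REPGENR_CMD:-repgenr}"
# Metadata archive downloaded by the first (Francisellaceae) metadata call.
META="repgenr_download/francisellaceae/bac120_metadata_r214.tar.gz"

# family_metadata FAMILY: select the representative genomes of one family.
family_metadata() {
    "$REPGENR_CMD" metadata --workdir "repgenr_download/$1" \
        --release 214.0 --version bac120 --dataset rep --level family \
        --target_family "$1" --metadata_path "$META"
}

# species_metadata GENUS SPECIES: select all genomes of one species.
species_metadata() {
    "$REPGENR_CMD" metadata --workdir "repgenr_download/$2" \
        --release 214.0 --version bac120 --dataset all --level species \
        --target_genus "$1" --target_species "$2" --metadata_path "$META"
}
```

With these defined, `for f in burkholderiaceae bacillaceae; do family_metadata "$f"; done` covers the family step, and calls such as `species_metadata bacillus_A anthracis` cover the species step.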

Multiple dereplication runs with saving

This example demonstrates how the derep-module can be executed multiple times with different ANI-settings, saving each run before the next one is started.

  • Download Francisella tularensis genomes:
repgenr metadata --workdir tularensis --release 214.0 --version bac120 --dataset all --level species --target_genus francisella --target_species tularensis
repgenr genome --workdir tularensis
  • Dereplicate the ~900 genomes using 99% ANI, divided into 3 batches to increase performance:
repgenr derep --workdir tularensis --secondary_ani 0.99 --process_size 300 --num_processes 3 --threads 70

Suppose this dereplication retained too many genomes and that a more lenient run, with a lower ANI threshold, needs to be computed.

  • Before proceeding, save the 99% ANI run:
repgenr derep_stocker --save --name ANI_099
  • Dereplicate the ~900 genomes using 95% ANI, divided into 3 batches to increase performance:
repgenr derep --workdir tularensis --secondary_ani 0.95 --process_size 300 --num_processes 3 --threads 70
  • Save the 95% ANI run:
repgenr derep_stocker --save --name ANI_095
  • Both runs now exist in the saved stock and can be listed with:
repgenr derep_stocker --list
  • The 99% ANI run is loaded by:
repgenr derep_stocker --load --name ANI_099
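When several thresholds are to be tried, the dereplicate-then-save pair can be wrapped in a function and called once per ANI value. This is a minimal sketch, not an official RepGenR feature: `ani_label`, `derep_and_save`, and `REPGENR_CMD` are hypothetical names, the ANI_099-style naming is copied from the example above, and the derep_stocker options are reproduced exactly as shown on this page.

```shell
# REPGENR_CMD is a hypothetical override (not a RepGenR feature); set it to "echo" for a dry run.
REPGENR_CMD="${REPGENR_CMD:-repgenr}"

# ani_label ANI: build a save name from an ANI value, e.g. 0.99 -> ANI_099.
ani_label() {
    printf 'ANI_%s\n' "$(printf '%s' "$1" | tr -d '.')"
}

# derep_and_save WORKDIR ANI: dereplicate at the given ANI, then save the run.
# The save step only runs if the dereplication succeeded.
derep_and_save() {
    "$REPGENR_CMD" derep --workdir "$1" --secondary_ani "$2" \
        --process_size 300 --num_processes 3 --threads 70 &&
    "$REPGENR_CMD" derep_stocker --save --name "$(ani_label "$2")"
}
```

A loop such as `for ani in 0.99 0.95; do derep_and_save tularensis "$ani" || break; done` would then reproduce the two runs above and stop at the first failure.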