From reads to assembly: working with the INNUca pipeline

Note 1: replace whatever is between <> with the proper value. For example, in "Get HTS (High-throughput sequencing) data", replace <IDs_separated_by_space> with the IDs you select (something like SRR494564 SRR497008 SRR628716).
Note 2: if the VM has 16 CPUs, use 16 for the CPUs/threads options instead of 8.
Note 3: do the steps below for the bacterial species of your choice. Streptococcus agalactiae is used as an example.
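
For Note 2, the number of CPUs in the VM can be checked before choosing a value for the threads options. This is a minimal sketch, assuming a Linux VM with GNU coreutils (which provides nproc); the variable name threads is only illustrative.

```shell
# Number of CPUs available on this machine (nproc is part of GNU coreutils)
threads=$(nproc)
echo "Use ${threads} for the CPUs/threads options"
```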

Get genomic data

Organize data for scheme creation

In the VM

Create a folder where the data for scheme creation will be stored.

mkdir ~/scheme_creation_data
mkdir ~/scheme_creation_data/complete_genomes
mkdir ~/scheme_creation_data/hts

Get complete genomes

In your computer

In NCBI website:

  1. Select "Genome" in the dropdown menu and search for "Streptococcus agalactiae"
  2. In the top box, below the "All XX genomes for species" section, click on "Browse the list"
  3. In the "Levels" options, select only "Complete"
  4. Take note of the average genome size using the "Size (Mb)" column. For Streptococcus agalactiae, 2.1 Mb will be used.
  5. Choose 4 to 6 complete genomes to download:
    • For the selected genome, click on the green diamond under the "FTP" column
    • Copy the link location of the link ending with "_genomic.fna.gz" (ignore the one ending with "_rna_from_genomic.fna.gz")
    • In the VM:
# Change to directory where the data will be stored
# Only required to do once

cd ~/scheme_creation_data/complete_genomes

wget <copied_link_location>

# wget
# wget
# wget
# wget
# wget
# wget

In the VM

Uncompress the downloaded complete genomes:

cd ~/scheme_creation_data/complete_genomes
gunzip *
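
As an optional sanity check, the size of each uncompressed genome can be compared with the genome size noted earlier (2.1 Mb for Streptococcus agalactiae). The sketch below assumes the uncompressed files end in .fna, as the downloaded "_genomic.fna.gz" names suggest; the helper name fasta_size is hypothetical. Sizes include every sequence in the file, so plasmids (if present) are counted too.

```shell
# Print the total number of bases in a FASTA file (header lines start with ">")
fasta_size() {
  awk '!/^>/ { total += length($0) } END { print total + 0 }' "$1"
}

# Report the size of every downloaded genome in Mb
for genome in ~/scheme_creation_data/complete_genomes/*.fna; do
  [ -e "$genome" ] || continue   # skip when no file matches the pattern
  size=$(fasta_size "$genome")
  size_mb=$(awk -v s="$size" 'BEGIN { printf "%.2f", s / 1000000 }')
  echo "$(basename "$genome"): ${size_mb} Mb"
done
```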

Get HTS (High-throughput sequencing) data

In your computer

In ENA (European Nucleotide Archive) website:

  1. Search for "Streptococcus agalactiae"
  2. In the left list, below the "Read" section, click on "Run"
  3. Choose 4 to 6 run accession IDs:
    • Select only Illumina paired-end data, but produced with different sequencer models (try HiSeq, MiSeq, NextSeq, Genome Analyzer II)
    • Try IDs from different pages
    • Select sequencing data produced with "WGS" (under "Library Strategy" information) and "GENOMIC" (under "Library Source" information)
    • Select samples with a maximum estimated depth of coverage of 200x
      • To estimate it, divide the number of sequenced nucleotides (under "Base Count" information) by the previously determined genome size in bp (for Streptococcus agalactiae, 2.1 Mb * 1000000)
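
The depth of coverage estimate described above can be sketched as a small shell function. The function name estimate_coverage and the base count used in the example are hypothetical; only the formula (base count divided by genome size in bp) comes from the step above.

```shell
# Estimated depth of coverage = base count / genome size in bp
estimate_coverage() {
  base_count=$1       # "Base Count" value shown in ENA
  genome_size_mb=$2   # genome size in Mb (2.1 for Streptococcus agalactiae)
  awk -v b="$base_count" -v g="$genome_size_mb" \
      'BEGIN { printf "%.0f\n", b / (g * 1000000) }'
}

# Hypothetical run with 315,000,000 sequenced nucleotides
estimate_coverage 315000000 2.1   # prints 150
```

A run with a result above 200 would be skipped under the criteria listed above.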

In the VM

Get the data

# Create a file with the list of IDs to download
# -f avoids an error when the file does not exist yet
rm -f ~/scheme_creation_data/hts/ids.txt

for id in <IDs_separated_by_space>; do
  echo "$id" >> ~/scheme_creation_data/hts/ids.txt
done

# for id in SRR494564 SRR497008 SRR628716 SRR755352 SRR3320580 SRR4414149; do echo $id >> ~/scheme_creation_data/hts/ids.txt; done

# Download data using getSeqENA
getSeqENA.py --listENAids ~/scheme_creation_data/hts/ids.txt \
             --outdir ~/scheme_creation_data/hts/ \
             --asperaKey ~/NGStools/aspera/connect/etc/asperaweb_id_dsa.openssh \
             --downloadLibrariesType PAIRED \
             --downloadInstrumentPlatform ILLUMINA \
             --threads 8
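
After the download finishes, it may be worth confirming that every sample has two read files, since only paired-end data was requested. The sketch below rests on an assumption not stated in this guide: that each run is written to its own subdirectory of the output folder as gzipped FASTQ files (names ending in .fastq.gz or .fq.gz). The helper name check_pairs is hypothetical; adjust the patterns if the layout differs.

```shell
# Count gzipped FASTQ files per sample directory and flag samples without exactly two
check_pairs() {
  hts_dir=$1
  for sample_dir in "$hts_dir"/*/; do
    [ -d "$sample_dir" ] || continue   # skip when no subdirectory exists
    n=$(find "$sample_dir" -maxdepth 1 -type f \
          \( -name '*.fastq.gz' -o -name '*.fq.gz' \) | wc -l)
    n=$((n))                           # strip any padding added by wc
    if [ "$n" -eq 2 ]; then status=OK; else status=CHECK; fi
    echo "$(basename "$sample_dir") $n $status"
  done
}

check_pairs ~/scheme_creation_data/hts
```

Samples reported as CHECK can be inspected by hand or downloaded again before running the assembly.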

Assemble HTS data

Assemble HTS data using INNUca

In the VM

  • Change the following options according to the species chosen:
    • --speciesExpected "Streptococcus agalactiae"
    • --genomeSizeExpectedMb 2.1
# Run inside a screen
screen -S create_scheme

# INNUca
docker run --rm -u $(id -u):$(id -g) -it -v ~/scheme_creation_data:/data/ ummidock/innuca:3.1 \
       INNUca.py --inputDirectory /data/hts/ \
                 --speciesExpected "Streptococcus agalactiae" \
                 --genomeSizeExpectedMb 2.1 \
                 --outdir /data/innuca/ \
                 --threads 8

# Detach the screen
# Press Ctrl + A (release) and then D

# With 8 CPUs and Streptococcus example
# Runtime :0.0h:34.0m:45.3s