Note 1: replace whatever is between <> with the proper value. For example, in "Get HTS (High-throughput sequencing) data", replace <IDs_separated_by_space> with the IDs you selected (something like SRR494564 SRR497008 SRR628716).
Note 2: if the VM has 16 CPUs, use 16 instead of 8 in the CPUs/threads options.
Note 3: follow the steps below for the bacterial species of your choice. Streptococcus agalactiae is used as an example.
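For Note 2, the number of CPUs in the VM can be checked before choosing a value for the threads options. This is a minimal sketch, assuming the VM runs Linux with GNU coreutils installed (which provides nproc); the variable name threads is only illustrative.

```shell
# Number of CPUs available on this machine (nproc is part of GNU coreutils)
threads=$(nproc)

# Use this value wherever the tutorial sets the number of threads
echo "Use --threads ${threads}"
```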
In the VM
Create a folder where the data for scheme creation will be stored.
mkdir ~/scheme_creation_data
mkdir ~/scheme_creation_data/complete_genomes
mkdir ~/scheme_creation_data/hts
On your computer
On the NCBI website:
- Select "Genome" in the dropdown menu and search "Streptococcus agalactiae"
- On the top box, below the "All XX genomes for species" section, click on "Browse the list"
- On "Levels" options, only select "Complete"
- Take note of the average genome size using "Size (Mb)" column. For Streptococcus agalactiae, 2.1 Mb will be used.
- Choose between 4-6 complete genomes to download:
* For the selected genome, click on the green diamond under the "FTP" column
* Copy Link Location of the link ending with "_genomic.fna.gz" (ignore the one ending with "_rna_from_genomic.fna.gz")
* In the VM:
# Change to the directory where the data will be stored
# (only needs to be done once)
cd ~/scheme_creation_data/complete_genomes
wget <>
# wget
# wget
# wget
# wget
# wget
# wget
In the VM
Uncompress the downloaded complete genomes:
cd ~/scheme_creation_data/complete_genomes
gunzip *
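Once the genomes are uncompressed, their sizes can be compared with the "Size (Mb)" value noted from NCBI. The following is a rough sketch, assuming the files are plain FASTA; the function name fasta_size_bp is hypothetical and not part of any tool used in this tutorial.

```shell
# Print the total sequence length (in bp) of a FASTA file.
# Header lines (starting with ">") are skipped; line endings are not counted.
fasta_size_bp() {
  awk '!/^>/ { sub(/\r$/, ""); total += length($0) } END { printf "%d\n", total }' "$1"
}

# Report the size in Mb (1 Mb = 1,000,000 bp) of every downloaded genome
for f in ~/scheme_creation_data/complete_genomes/*; do
  if [ -f "$f" ]; then
    size_bp=$(fasta_size_bp "$f")
    awk -v bp="$size_bp" -v name="$(basename "$f")" \
      'BEGIN { printf "%s: %.2f Mb\n", name, bp / 1000000 }'
  fi
done
```

Values far from the expected genome size may indicate a wrong download or a file that is still compressed.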
On your computer
On the ENA (European Nucleotide Archive) website:
- Search "Streptococcus agalactiae"
- On the left list, below the "Read" section, click on "Run"
- Choose between 4-6 run accession IDs
* Select only Illumina paired-end data, but produced with different sequencer models (try HiSeq, MiSeq, NextSeq, Genome Analyzer II)
* Try IDs from different pages
* Select "WGS" (under "Library Strategy" information) and "GENOMIC" (under "Library Source" information) produced sequencing data
* Select samples with a maximum estimated depth of coverage of 200x
- To estimate the depth of coverage, divide the number of sequenced nucleotides (under "Base Count" information) by the previously determined genome size in bp (for Streptococcus agalactiae, 2.1 Mb * 1000000 = 2100000 bp)
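The depth of coverage estimate described above can be computed in the shell. This is a small sketch; the function name estimate_coverage and the base count in the example are hypothetical and only illustrate the arithmetic.

```shell
# Estimated depth of coverage = sequenced bases / genome size in bp
# $1: "Base Count" value from ENA, $2: genome size in Mb
estimate_coverage() {
  awk -v bases="$1" -v size_mb="$2" \
    'BEGIN { printf "%.1f\n", bases / (size_mb * 1000000) }'
}

# Hypothetical run with 315,000,000 sequenced bases and a 2.1 Mb genome
estimate_coverage 315000000 2.1   # prints 150.0
```

A run like this one, at 150x, would be below the 200x maximum and could be selected.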
In the VM
Get the data
# Create a file with the list of IDs to download
# Remove any previous list (-f avoids an error if the file does not exist yet)
rm -f ~/scheme_creation_data/hts/ids.txt
for id in <IDs_separated_by_space>; do
echo "$id" >> ~/scheme_creation_data/hts/ids.txt
done
# Example:
# for id in SRR494564 SRR497008 SRR628716 SRR755352 SRR3320580 SRR4414149; do echo $id >> ~/scheme_creation_data/hts/ids.txt; done
# Download data using getSeqENA
getSeqENA.py --listENAids ~/scheme_creation_data/hts/ids.txt \
--outdir ~/scheme_creation_data/hts/ \
--asperaKey ~/NGStools/aspera/connect/etc/asperaweb_id_dsa.openssh \
--downloadLibrariesType PAIRED \
--downloadInstrumentPlatform ILLUMINA \
--threads 8
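Before launching a long download, it may help to confirm that the IDs file contains only run accessions. The sketch below assumes the usual ENA run accession pattern (SRR, ERR or DRR followed by at least six digits); the function name check_ids is hypothetical.

```shell
# Print every line (with its line number) of an IDs file that does not look
# like a run accession; return non-zero if any such line is found.
check_ids() {
  if grep -Evn '^[EDS]RR[0-9]{6,}$' "$1"; then
    return 1
  fi
  return 0
}

ids_file=~/scheme_creation_data/hts/ids.txt
if [ -f "$ids_file" ]; then
  check_ids "$ids_file" && echo "All IDs look valid"
fi
```

Any line printed by the check, including empty ones, should be corrected in ids.txt before running the download.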
Assemble HTS data using INNUca
In the VM
- Change the following options according to the species chosen:
* --speciesExpected "Streptococcus agalactiae"
* --genomeSizeExpectedMb 2.1
# Run inside a screen
screen -S create_scheme
# INNUca
docker run --rm -u $(id -u):$(id -g) -it -v ~/scheme_creation_data:/data/ ummidock/innuca:3.1 \
--inputDirectory /data/hts/ \
--speciesExpected "Streptococcus agalactiae" \
--genomeSizeExpectedMb 2.1 \
--outdir /data/innuca/ \
--threads 8
# Detach the screen
# Press Ctrl + A (release) and then D
# With 8 CPUs and Streptococcus example
# Runtime :0.0h:34.0m:45.3s