Pipeline to assemble paired-end sequencing reads, annotate the resulting contigs, compare the genome content across sequences and determine the variants (SNPs).

judithbergada/Pipeline_GenomeAnalysis

Genome Analysis pipeline

by Judith Bergadà Pijuan

This pipeline assembles and annotates paired-end sequencing reads and compares the genome content of the given DNA sequences. It also performs variant calling to detect SNPs across sequences, and it determines the spa type. Given multiple paired-end sequencing reads (FASTQ files), it produces a table comparing the genome content and one or more tables listing the SNPs detected across strains. These outputs have the same format as those produced by Roary and Snippy. The pipeline also provides the de novo assembly of the sequencing reads and their annotation.

Installation

To use this pipeline, you need to install the following dependencies:

  • SPAdes
  • Prokka
  • Roary
  • Snippy
  • SpaTyper

Then download the tool:

cd $HOME
git clone https://github.com/judithbergada/Pipeline_GenomeAnalysis

Usage

The pipeline expects you to have the following folder:

  • FASTQ folder: a folder containing only your sequencing reads (FASTQ files). All your FASTQ files must be in it, and the two files of each pair must share the same prefix in their names.
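Before starting a run, it can help to confirm that every forward file has a reverse mate. The sketch below is only an illustration: the README requires a shared prefix but does not prescribe suffixes, so the `_R1.fastq.gz` / `_R2.fastq.gz` naming and the `check_pairs` helper are assumptions that you should adapt to your own files.

```shell
# Assumption: mates are named <prefix>_R1.fastq.gz and <prefix>_R2.fastq.gz.
# check_pairs is a hypothetical helper, not part of the pipeline.
check_pairs() {
    dir="$1"
    status=0
    for fwd in "$dir"/*_R1.fastq.gz; do
        # An unmatched glob stays literal, so this catches an empty folder.
        if [ ! -e "$fwd" ]; then
            echo "no forward reads found in $dir"
            return 1
        fi
        # Swap the forward suffix for the reverse one, keeping the prefix.
        rev="${fwd%_R1.fastq.gz}_R2.fastq.gz"
        if [ ! -e "$rev" ]; then
            echo "missing mate for $(basename "$fwd")"
            status=1
        fi
    done
    return "$status"
}
```

Calling `check_pairs /path/to/fastq` prints one line per forward file without a mate and returns a non-zero status if any pair is incomplete.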

To get information about the usage, please try:

./genomeanalysis.sh -h

The Genome Analysis tool can be used with these parameters:

Usage: genomeanalysis.sh    [-h or --help]
                            [-f or --fastqfolder]
                            [-o or --outname]
                            [-t or --threads]
                            [-r or --referencefolder]

Optional arguments:
    -h, --help:
                Show this help message and exit.
    -o, --outname:
                Name of your analysis.
                It will be used to name the output files.
                Default: mygenomes.
    -t, --threads:
                Number of threads that will be used.
                It must be an integer.
                Default: 8.
    -r, --referencefolder:
                Path to the folder that contains an external REFERENCE genome.
                The following files are needed: FASTA, GFF, GenBank.
Required arguments:
    -f, --fastqfolder:
                Path to the folder that contains ALL your FASTQ files.
                Only FASTQ files should be placed in it.
                You need forward and reverse paired-end reads.

Enjoy using the tool!
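As an illustration of how the documented flags fit together, here is a minimal sketch of a wrapper function. Only the flags (`-f`, `-o`, `-t`, `-r`) and their defaults come from the help text; the function name `run_genomeanalysis` and any paths you pass to it are hypothetical.

```shell
# Hypothetical wrapper: assembles a genomeanalysis.sh call from the documented flags.
run_genomeanalysis() {
    fastq_dir="$1"
    out_name="${2:-mygenomes}"   # default taken from the help text
    threads="${3:-8}"            # default taken from the help text
    ref_dir="${4:-}"
    set -- -f "$fastq_dir" -o "$out_name" -t "$threads"
    # -r is optional: pass it only when a reference folder was given.
    if [ -n "$ref_dir" ]; then
        set -- "$@" -r "$ref_dir"
    fi
    if [ -x ./genomeanalysis.sh ]; then
        ./genomeanalysis.sh "$@"
    else
        # Dry run when the script is not in the current directory.
        echo "./genomeanalysis.sh $*"
    fi
}
```

Run from inside the cloned repository, a call such as `run_genomeanalysis "$HOME/data/fastq" myrun 16` would start the pipeline; anywhere else it only prints the command it would have run.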

Genes Analysis option

This new command is part of the pipeline and determines whether the mutations identified by the tool are synonymous or non-synonymous. In addition, users can choose one or more genes of interest, and the pipeline provides the full amino acid sequences of these genes. This is useful, for example, for comparing the amino acid sequences with external tools such as Clustal-Omega.

Important: all outputs will be saved in the same folder as the outputs from the Genome Analysis pipeline, specifically into the subfolder named allSNPs.

Installation

To use this command, you don't need to install any dependencies.

However, please make sure you have downloaded the latest version of the tool:

cd $HOME
git clone https://github.com/judithbergada/Pipeline_GenomeAnalysis

Usage

The new command expects you to have the following folder:

  • Outputs folder: this is a folder containing the outputs from the genomeanalysis.sh pipeline. You must have all your output files here, and it is important not to delete or modify anything until all the analyses have been completed.

Furthermore, you need to provide the names of your genes of interest:

  • Specific genes: you can add one or more gene names, all of them within quotes and separated with a space. The gene names must match the gene names in your annotation files. For this reason, it is recommended to take the gene names from your "gene_presence_absence.csv" file (the 3rd column of the file, named annotation). This file is located in the GenomeContent folder.
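To browse a column of "gene_presence_absence.csv" from the command line, something like the following sketch may be enough. It assumes that every field in the file is wrapped in double quotes, so that the sequence `","` can serve as the field separator; the helper name `csv_column` is hypothetical, and a proper CSV parser is safer if your file is formatted differently.

```shell
# Hypothetical helper: print one column of a fully double-quoted CSV, skipping the header.
# Assumption: every field is quoted, so '","' separates fields even when a value contains commas.
csv_column() {
    awk -F'","' -v col="$2" 'NR > 1 {
        v = $col
        gsub(/^"|"$/, "", v)   # strip the quote left on the first or last field
        print v
    }' "$1"
}
```

For example, `csv_column GenomeContent/gene_presence_absence.csv 3` lists the values of the 3rd column mentioned above, and passing `1` instead lists the first column.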

To get information about the usage, please try:

./genesanalysis.sh -h

The new command (Genes Analysis) can be used with these parameters:

Usage: genesanalysis.sh   [-h or --help]
                          [-s or --specific_genes]
                          [-o or --outputs_folder]
                          [-t or --threads]

Optional arguments:
    -h, --help:
                Show this help message and exit.
    -t, --threads:
                Number of threads that will be used.
                It must be an integer.
                Default: 8.
Required arguments:
    -s, --specific_genes:
                Names of the genes that you want to compare.
                Write all of them within quotes and separated with a space.
                E.g.: 'murF fabZ rplL'.
    -o, --outputs_folder:
                Path to the folder that contains ALL your outputs from
                the genomeanalysis.sh pipeline.

Enjoy using the new command!
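The help text asks for the gene list as a single quoted argument and for an integer thread count. A small wrapper can check both before calling the script. This is a sketch under assumptions: the function name `run_genesanalysis` and the example paths are hypothetical, and only the flags `-s`, `-o` and `-t` come from the help text.

```shell
# Hypothetical wrapper around genesanalysis.sh with basic argument checks.
run_genesanalysis() {
    genes="$1"
    outputs_dir="$2"
    threads="${3:-8}"   # default taken from the help text
    # -t must be an integer according to the help text.
    case "$threads" in
        ''|*[!0-9]*) echo "threads must be an integer: $threads" >&2; return 2 ;;
    esac
    if [ -z "$genes" ]; then
        echo "no genes given" >&2
        return 2
    fi
    if [ -x ./genesanalysis.sh ]; then
        # The gene list is passed as ONE quoted argument, e.g. 'murF fabZ rplL'.
        ./genesanalysis.sh -s "$genes" -o "$outputs_dir" -t "$threads"
    else
        # Dry run when the script is not in the current directory.
        echo "./genesanalysis.sh -s '$genes' -o $outputs_dir -t $threads"
    fi
}
```

Inside the cloned repository, `run_genesanalysis 'murF fabZ rplL' "$HOME/results/myrun" 4` would run the analysis on those three genes; elsewhere it only prints the command.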
