Configuration parameters

Minimal required information

fasta_path path to genomic FASTA file
gff_path path to GFF file
output_path path to output directory
taxon_id NCBI taxonomy ID of the query species
database_path path to DIAMOND-formatted NCBI NR protein database (set this up according to the instructions in Installation)
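A configuration containing only the required parameters could be checked as sketched below. The dictionary layout, the helper name `missing_required` and all example values are assumptions made for this illustration; only the five parameter names come from the list above.

```python
# Hypothetical helper: report which required parameters are absent or empty.
REQUIRED_KEYS = ("fasta_path", "gff_path", "output_path", "taxon_id", "database_path")

def missing_required(config):
    """Return the required keys that are missing or empty in `config`."""
    return [key for key in REQUIRED_KEYS if config.get(key) in (None, "")]

# Example values are placeholders, not paths shipped with the tool.
example_config = {
    "fasta_path": "data/assembly.fasta",
    "gff_path": "data/annotation.gff",
    "output_path": "results/",
    "taxon_id": 7227,
}

# 'database_path' is not set in the example, so it is reported as missing.
print(missing_required(example_config))
```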

General options

threads: X/"auto" number of threads to be used by Bowtie2 and DIAMOND
- default: 'auto' → auto-detection of all available cores by DIAMOND (Bowtie2 uses one thread)
- X → X threads are used by DIAMOND and Bowtie2
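The thread rule above can be summarised in a few lines. This is a sketch only: the function name `resolve_threads` is hypothetical, and `os.cpu_count()` stands in for the core detection that DIAMOND performs itself.

```python
import os

def resolve_threads(threads="auto"):
    """Return (diamond_threads, bowtie2_threads) for a 'threads' setting."""
    if threads == "auto":
        # 'auto': all available cores for DIAMOND, a single thread for Bowtie2.
        return (os.cpu_count() or 1, 1)
    count = int(threads)
    if count < 1:
        raise ValueError("threads must be a positive integer or 'auto'")
    # Explicit number: the same count is used for both tools.
    return (count, count)
```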

Coverage options

include_coverage: TRUE/FALSE whether to include coverage information in the analysis
- default: inferred from the existence of any of the files at 'pbc_path', 'bam_path' or 'read_paths'
compute_coverage: TRUE/FALSE compute the per base coverage file ('pbc_path') from the files provided at 'read_paths' or 'bam_path'
- default: inferred from the existence of any of the files at 'pbc_path', 'bam_path' or 'read_paths'
pbc_path_X path to file specifying the per base coverage (PBC) for coverage set X
- can be either specified by user or default is set if coverage information X is available
- default: 'output_path/pbc_X.txt'
bam_path_X path to BAM file for coverage set X
- can be either specified by user or default is set if coverage information X is available
- default: 'output_path/mapping_sorted_X.bam'
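The default locations for a coverage set can be derived from 'output_path' as sketched here. The helper name `default_coverage_paths` is hypothetical, and treating X as a numeric index is an assumption.

```python
import posixpath

def default_coverage_paths(output_path, set_index):
    """Build the default PBC and BAM paths for coverage set `set_index`."""
    return {
        f"pbc_path_{set_index}": posixpath.join(output_path, f"pbc_{set_index}.txt"),
        f"bam_path_{set_index}": posixpath.join(
            output_path, f"mapping_sorted_{set_index}.bam"
        ),
    }

# 'results' is a placeholder output directory.
print(default_coverage_paths("results", 1))
```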
read_paths_X path to read file(s) in FASTA format for coverage set X
- paired-end reads → state paths as comma-separated list in square brackets
min_insert_X minimum insert size for paired-end reads (size including reads)
- default: 0
max_insert_X maximum insert size for paired-end reads (size including reads)
- default: 500
read_orientation_X: "fr"/"rf"/"ff" orientation of the read pairs
- default: "fr" → read 1 has forward orientation, read 2 is reverse orientated (Illumina)
- "rf" → read 1 has reverse orientation, read 2 is forward orientated (PacBio)
- "ff" → read 1 and 2 have forward orientation
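The insert size and orientation settings correspond closely to Bowtie2's paired-end options `-I` (minimum fragment length), `-X` (maximum fragment length) and `--fr`/`--rf`/`--ff`. The sketch below builds such an argument list; whether the pipeline passes the values to Bowtie2 in exactly this form is an assumption, and the function name is hypothetical.

```python
def bowtie2_pairing_args(min_insert=0, max_insert=500, read_orientation="fr"):
    """Translate the paired-end settings into Bowtie2 command-line arguments."""
    if read_orientation not in ("fr", "rf", "ff"):
        raise ValueError("read_orientation must be 'fr', 'rf' or 'ff'")
    if min_insert > max_insert:
        raise ValueError("min_insert must not exceed max_insert")
    return ["-I", str(min_insert), "-X", str(max_insert), "--" + read_orientation]

# With the documented defaults:
print(bowtie2_pairing_args())
```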

Taxonomic assignment options

compute_tax_assignment: TRUE/FALSE run the DIAMOND sequence similarity search using the longest representative protein of each gene
- default: inferred from existence of file(s) at 'tax_assignment_path'
extract_proteins: TRUE/FALSE automatic generation of protein FASTA file based on genomic FASTA and GFF; saved to 'proteins_path'
- currently only possible with default GFF gene and CDS features!
- default: inferred from existence of file at 'proteins_path'
proteins_path path to FASTA file containing the protein sequences
- generated automatically if the file does not exist or if 'extract_proteins' is TRUE
- can be either specified by user or default is set
- default: 'output_path/proteins.faa'
tax_assignment_path hit file(s) of sequence similarity search in database
- when 'assignment_mode' == 'quick' and only one path is provided, the suffixes '_1' and '_2' are added; to state both files explicitly, give them as a comma-separated list in square brackets
- can be either specified by user or default is set
- default: 'output_path/taxonomic_hits.txt' / ['output_path/taxonomic_hits_1.txt', 'output_path/taxonomic_hits_2.txt']
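The suffix rule for 'quick' mode can be illustrated as follows. Placing the suffix before the file extension is inferred from the default names listed above; the helper name `quick_mode_hit_paths` is hypothetical.

```python
import os.path

def quick_mode_hit_paths(path):
    """Derive the two hit-file paths used in 'quick' mode from a single path."""
    root, ext = os.path.splitext(path)
    return [f"{root}_1{ext}", f"{root}_2{ext}"]

print(quick_mode_hit_paths("output_path/taxonomic_hits.txt"))
```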
taxon_exclude: TRUE/FALSE whether to exclude the query taxon from the similarity search, i.e. whether self-hits are discarded (TRUE) or allowed (FALSE)
- default: TRUE
exclusion_rank: <rank> taxonomic rank at which hits are excluded in taxonomic assignment (based on the query species)
- taxa which are in the same <exclusion_rank> as the query species are discarded from taxonomic assignment
- default: 'species'
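The effect of 'exclusion_rank' can be pictured with a small filter. The data layout (each hit carrying a rank-to-taxon-ID lineage), the function name and all taxon IDs are invented for this sketch and do not reflect the tool's internal representation.

```python
def discard_excluded_hits(hits, query_lineage, exclusion_rank="species"):
    """Drop hits whose taxon at `exclusion_rank` equals that of the query species."""
    query_taxon = query_lineage.get(exclusion_rank)
    if query_taxon is None:
        # The query lineage has no entry at this rank: nothing can be excluded.
        return list(hits)
    return [hit for hit in hits if hit["lineage"].get(exclusion_rank) != query_taxon]

# Placeholder taxon IDs, invented for this example.
query_lineage = {"species": 101, "genus": 10}
hits = [
    {"id": "hitA", "lineage": {"species": 101, "genus": 10}},  # same species
    {"id": "hitB", "lineage": {"species": 102, "genus": 10}},  # same genus only
    {"id": "hitC", "lineage": {"species": 201, "genus": 20}},  # unrelated
]

kept = discard_excluded_hits(hits, query_lineage, "genus")
print([hit["id"] for hit in kept])
```

Raising the rank from 'species' to 'genus' removes more hits, because every taxon sharing the query's genus is discarded.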
assignment_mode: "exhaustive"/"quick" mode in which to perform similarity search
- "exhaustive" → default mode
- "quick" → speeds up the similarity search: genes most likely originating from the query species are identified by an initial search in a small subset of the database; all other genes are then forwarded to a search in the whole database
- default: 'exhaustive'
quick_mode_search_rank taxonomic rank at which to create the subset of the database for the initial filtering search
- either a taxonomic rank such as phylum or order (resolved relative to the query species) or an NCBI taxon ID
- default: 'kingdom'
quick_mode_match_rank taxonomic rank that the taxonomic assignment of a gene has to reach in the first search to be accepted, i.e. to be identified as belonging to the query species
- either a taxonomic rank such as phylum or order (resolved relative to the query species) or an NCBI taxon ID
- default: 'order'
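The two-stage logic of 'quick' mode can be sketched as below. The three callables are hypothetical stand-ins: `search_subset` and `search_full` represent the searches against the database subset and the whole database, and `is_query_like` represents the check that an assignment reaches 'quick_mode_match_rank'.

```python
def quick_assignment(genes, search_subset, search_full, is_query_like):
    """Two-stage search: accept query-like genes early, forward the rest."""
    assignments = {}
    forwarded = []
    for gene in genes:
        hit = search_subset(gene)  # stage 1: small subset of the database
        if hit is not None and is_query_like(hit):
            assignments[gene] = hit
        else:
            forwarded.append(gene)
    for gene in forwarded:
        assignments[gene] = search_full(gene)  # stage 2: whole database
    return assignments

# Toy stand-ins for the searches, for illustration only.
subset_hits = {"g1": "query_like", "g2": "other"}
result = quick_assignment(
    ["g1", "g2", "g3"],
    search_subset=subset_hits.get,
    search_full=lambda gene: "full:" + gene,
    is_query_like=lambda hit: hit == "query_like",
)
print(result)
```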

Plot output options

update_plots: TRUE/FALSE update plots only
- default: FALSE
num_groups_plot: x/"all" number of distinct taxonomic groups to display in the plots
- x → only x labels are displayed; taxonomic assignments are iteratively merged to higher ranks until number is exhausted
- "all" → every taxonomic assignment is displayed
- default: 25
merging_labels: <NCBI IDs>/<rank>/<rank>-all merging of taxonomic assignments can be manually influenced
- NCBI IDs → comma-separated list of NCBI taxon IDs; taxonomic assignments are merged at each of these IDs (please make sure the IDs are not within the same lineage)
- <rank> → a taxonomic rank; taxon to merge taxonomic assignments at will be inferred from rank for the query species
- <rank>-all → a taxonomic rank with suffix '-all'; all taxonomic assignments will be generalized to this rank
- default: None
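The three accepted forms of 'merging_labels' can be told apart as sketched here. The function name and the returned tuples are assumptions for illustration; the taxon IDs in the usage line are example values only.

```python
def parse_merging_labels(value):
    """Classify a 'merging_labels' value as IDs, a rank, or a '<rank>-all' setting."""
    if value is None:
        return ("none", None)
    text = str(value).strip()
    parts = [part.strip() for part in text.split(",") if part.strip()]
    if parts and all(part.isdigit() for part in parts):
        # Comma-separated list of NCBI taxon IDs.
        return ("ids", [int(part) for part in parts])
    if text.endswith("-all"):
        # Generalize all assignments to this rank.
        return ("rank_all", text[: -len("-all")])
    # Plain rank: the merge taxon is inferred from the query species.
    return ("rank", text)

print(parse_merging_labels("33208,4751"))
print(parse_merging_labels("phylum-all"))
```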
output_pdf: TRUE/FALSE save plots in PDF format
- default: TRUE
output_png: TRUE/FALSE save plots in PNG format
- default: FALSE

Gene info options

include_pseudogenes: TRUE/FALSE include pseudogenes in the analysis
- default: FALSE
gff_source: "default"/"maker"/"augustus_masked"/<path> select the source from which gene and protein information is taken in the GFF; this determines how the FASTA headers of the proteins are matched to the gene IDs
- "default" → information about genes and proteins is retrieved from all gene, mRNA and CDS features
- "maker" → source type "maker" is used
- "augustus_masked" → source type "augustus_masked" is used
- <path> → create own rule on how to parse gene and protein information from GFF and match gene ID and FASTA header; see Additional information for details

PCA options

input_variables variables to be used for the PCA
- comma-separated list of variables, no spaces, whole list put in quotes ('" "')
- default: "c_name,c_num_of_genes,c_len,c_genelenm,c_genelensd,g_len,g_lendev_c,g_abspos,g_terminal,c_cov,c_covsd,g_cov,g_covsd,g_covdev_c,c_pearson_r,g_pearson_r_o,g_pearson_r_c"
- see Additional information for details on options
perform_parallel_analysis: TRUE/FALSE perform parallel analysis to determine the best number of principal components to be retained (overwrites 'num_pcs')
- default: FALSE
num_pcs number of principal components to be retained for clustering and plotting (overridden when 'perform_parallel_analysis' is TRUE)
- set the number to at least 3 to generate the interactive 3D plot
- default: 3
coverage_cutoff_mode: "default"/"contamination"/"transposons" coverage cutoff modes to exclude genes with abnormal coverage
- "default" → all genes are used in PCA
- "contamination" → only genes with coverage below the median are used for PCA
- "transposons" → only genes with coverage above the median are used in PCA
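The three cutoff modes amount to a median filter, as sketched below. Taking the median over all genes, using strict comparisons and the function name `apply_coverage_cutoff` are assumptions made for this example.

```python
from statistics import median

def apply_coverage_cutoff(gene_coverage, mode="default"):
    """Select the genes that enter the PCA according to the cutoff mode."""
    if mode == "default":
        return dict(gene_coverage)
    cutoff = median(gene_coverage.values())
    if mode == "contamination":
        # Keep only genes with coverage below the median.
        return {g: c for g, c in gene_coverage.items() if c < cutoff}
    if mode == "transposons":
        # Keep only genes with coverage above the median.
        return {g: c for g, c in gene_coverage.items() if c > cutoff}
    raise ValueError("unknown coverage_cutoff_mode: " + repr(mode))

# Toy coverage values, for illustration only.
coverage = {"g1": 5.0, "g2": 10.0, "g3": 20.0}
print(apply_coverage_cutoff(coverage, "contamination"))
```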

Clustering options

perform_kmeans: TRUE/FALSE perform k-means clustering
- default: FALSE
kmeans_k: k/"default" number of clusters to generate with k-means clustering
- "default" → clustering reports for k = 2, 3 and 4
- k → custom number of clusters, i.e. k clusters
perform_hclust: TRUE/FALSE perform hierarchical clustering
- default: FALSE
hclust_k: k/"default" number of clusters to generate with hierarchical clustering
- "default" → clustering reports for 2, 3 and 4 groups
- k → custom number of clusters, i.e. k clusters
perform_mclust: TRUE/FALSE perform model-based clustering
- default: FALSE
mclust_k: k/"default"/"BIC" number of clusters to generate with model-based clustering
- "default" → clustering reports for 2, 3 and 4 groups
- "BIC" → optimal number of clusters based on Bayesian Information Criterion will be automatically chosen
- k → custom number of clusters, i.e. k clusters
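The shared convention for 'kmeans_k', 'hclust_k' and 'mclust_k' can be expressed as one small resolver. This is a sketch under assumptions: the function name is hypothetical, and returning None for "BIC" is merely one way of signalling that the number of clusters is chosen automatically.

```python
def resolve_cluster_counts(k="default", allow_bic=False):
    """Return the list of cluster counts to run, or None for BIC-based selection."""
    if k == "default":
        return [2, 3, 4]
    if k == "BIC":
        if not allow_bic:
            raise ValueError("'BIC' is only valid for model-based clustering")
        return None  # the optimal number is chosen via BIC
    count = int(k)
    if count < 1:
        raise ValueError("k must be a positive integer")
    return [count]

print(resolve_cluster_counts())        # default: reports for 2, 3 and 4 clusters
print(resolve_cluster_counts(5))       # custom k
```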
perform_dbscan: TRUE/FALSE perform DBSCAN clustering
- default: FALSE
dbscan_groups: "default"/"custom" number of clusters to generate with DBSCAN clustering
- "default" → six runs of DBSCAN with different parameter combinations of epsilon and minimum number of points
- "custom" → customize epsilon and minimum number of points; specify 'custom_eps' and 'custom_minPts', e.g.
custom_eps custom value for epsilon in DBSCAN clustering
custom_minPts custom minimum number of points in DBSCAN clustering
