Configuration parameters
Freya Arthen edited this page Mar 3, 2022
·
23 revisions
-
fasta_path
path to genomic FASTA file
-
gff_path
path to GFF file
-
output_path
path to output directory
-
taxon_id
NCBI taxonomy ID of the query species
-
database_path
path to DIAMOND-formatted NCBI NR protein database (set this up according to the instructions in Installation)
-
threads: X/"auto"
number of threads to be used by Bowtie2 and DIAMOND
- default: 'auto' → DIAMOND auto-detects and uses all available cores (Bowtie2 uses one thread)
- X → X threads are used by DIAMOND and Bowtie2
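The two cases above can be sketched in Python as follows. This is only an illustration of the documented behaviour, not the tool's own code: `resolve_threads` is a hypothetical helper, and `os.cpu_count()` merely stands in for DIAMOND's own core detection.

```python
import os


def resolve_threads(threads="auto"):
    """Return the thread count per tool for a 'threads' setting (hypothetical helper)."""
    if threads == "auto":
        # Assumption: os.cpu_count() approximates what DIAMOND would detect itself.
        return {"diamond": os.cpu_count() or 1, "bowtie2": 1}
    count = int(threads)
    if count < 1:
        raise ValueError("threads must be a positive integer or 'auto'")
    # An explicit number X applies to both tools.
    return {"diamond": count, "bowtie2": count}
```

Note that with 'auto' only DIAMOND benefits from multiple cores, so an explicit number may be preferable when read mapping is part of the run.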
-
include_coverage: TRUE/FALSE
explicitly include coverage information in the analysis or not
- default: inferred from the existence of any of the files at 'pbc_path', 'bam_path' or 'read_paths'
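As a rough sketch of how such a default could be inferred, the following Python function checks whether any of the three inputs exists on disk. The function name and signature are hypothetical and only illustrate the rule stated above.

```python
import os


def infer_include_coverage(pbc_path=None, bam_path=None, read_paths=()):
    """Return True if any coverage-related input file exists (hypothetical helper)."""
    candidates = [pbc_path, bam_path, *read_paths]
    # Unset paths (None) are skipped; a single existing file is enough.
    return any(p is not None and os.path.isfile(p) for p in candidates)
```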
-
compute_coverage: TRUE/FALSE
compute the per-base coverage file ('pbc_path') from the files provided in 'read_paths' or 'bam_path'
- default: inferred from the existence of any of the files at 'pbc_path', 'bam_path' or 'read_paths'
-
pbc_path_X
path to file specifying the per-base coverage (PBC) for coverage set X
- can either be specified by the user or is set to the default if coverage information X is available
- default: 'output_path/pbc_X.txt'
-
bam_path_X
path to BAM file for coverage set X
- can either be specified by the user or is set to the default if coverage information X is available
- default: 'output_path/mapping_sorted_X.bam'
-
read_paths_X
path to read file(s) in FASTA format for coverage set X
- paired-end reads → state paths as comma-separated list in square brackets
-
min_insert_X
minimum insert size for paired-end reads (size including reads)
- default: 0
-
max_insert_X
maximum insert size for paired-end reads (size including reads)
- default: 500
-
read_orientation_X: "fr"/"rf"/"ff"
orientation of the read pairs
- default: "fr" → read 1 has forward orientation, read 2 is reverse orientated (Illumina)
- "rf" → read 1 has reverse orientation, read 2 is forward orientated (PacBio)
- "ff" → read 1 and 2 have forward orientation
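The defaults of 'min_insert_X', 'max_insert_X' and 'read_orientation_X' coincide with Bowtie2's own defaults for `-I`/`--minins`, `-X`/`--maxins` and `--fr`. The sketch below shows how these three settings could be turned into Bowtie2 command-line options. That the values are passed through in exactly this way is an assumption, and `bowtie2_pair_options` is a hypothetical helper.

```python
def bowtie2_pair_options(min_insert=0, max_insert=500, read_orientation="fr"):
    """Build Bowtie2 paired-end options from the three settings (hypothetical helper)."""
    if read_orientation not in ("fr", "rf", "ff"):
        raise ValueError("read_orientation must be 'fr', 'rf' or 'ff'")
    if min_insert > max_insert:
        raise ValueError("min_insert must not exceed max_insert")
    # -I / -X bound the fragment length; --fr / --rf / --ff set the mate orientation.
    return ["-I", str(min_insert), "-X", str(max_insert), f"--{read_orientation}"]
```

For example, `bowtie2_pair_options(100, 800, "rf")` yields the option list `-I 100 -X 800 --rf`.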
-
compute_tax_assignment: TRUE/FALSE
run the sequence similarity search with DIAMOND, using the longest representative protein of each gene
- default: inferred from existence of file(s) at 'tax_assignment_path'
-
extract_proteins: TRUE/FALSE
automatic generation of protein FASTA file based on genomic FASTA and GFF; saved to 'proteins_path'
- currently only possible with default GFF gene and CDS features!
- default: inferred from existence of file at 'proteins_path'
-
proteins_path
path to FASTA file containing the protein sequences
- is generated automatically if it does not exist (or if 'extract_proteins' == TRUE)
- can either be specified by the user or is set to the default
- default: 'output_path/proteins.faa'
-
tax_assignment_path
hit file(s) of the sequence similarity search in the database
- when 'assignment_mode' == 'quick' and only one path is provided, the suffixes '_1' and '_2' are added; to state both files explicitly, give them as a comma-separated list in brackets
- can either be specified by the user or is set to the default
- default: 'output_path/taxonomic_hits.txt' / ['output_path/taxonomic_hits_1.txt', 'output_path/taxonomic_hits_2.txt']
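The suffix rule can be illustrated with a short Python sketch. It assumes the suffix is placed before the file extension, which is what the default names listed above suggest; `resolve_tax_assignment_paths` is a hypothetical helper, not part of the tool.

```python
import os


def resolve_tax_assignment_paths(paths, assignment_mode="exhaustive"):
    """Return the list of hit files for the given mode (hypothetical helper)."""
    if isinstance(paths, str):
        paths = [paths]
    paths = list(paths)
    if assignment_mode == "quick" and len(paths) == 1:
        # Assumption: '_1' / '_2' go between the file stem and its extension.
        stem, ext = os.path.splitext(paths[0])
        return [f"{stem}_1{ext}", f"{stem}_2{ext}"]
    # Exhaustive mode, or both files already stated explicitly.
    return paths
```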
-
taxon_exclude: TRUE/FALSE
controls whether self-hits are allowed in the similarity search, i.e. whether the query taxon is included or excluded
- default: TRUE
-
exclusion_rank: <rank>
taxonomic rank at which hits are excluded in taxonomic assignment (based on the query species)
- taxa which are in the same <exclusion_rank> as the query species are discarded from taxonomic assignment
- default: 'species'
-
assignment_mode: "exhaustive"/"quick"
mode in which to perform the similarity search
- "exhaustive" → default mode
- "quick" → speeds up the similarity search: genes that most likely originate from the query species are identified by an initial search in a small subset of the database; the remaining genes are then searched against the whole database
- default: 'exhaustive'
-
quick_mode_search_rank
taxonomic rank at which to create the subset of the database for the initial filtering search
- can either be a taxonomic rank like phylum or order, which is then based on the query species, or an NCBI taxon ID
- default: 'kingdom'
-
quick_mode_match_rank
taxonomic rank that the taxonomic assignment of a gene has to reach to be accepted in the first search, i.e. to be identified as belonging to the query species
- can either be a taxonomic rank like phylum or order, which is then based on the query species, or an NCBI taxon ID
- default: 'order'
-
update_plots: TRUE/FALSE
update plots only
- default: FALSE
-
num_groups_plot: x/"all"
number of distinct taxonomic groups to display in the plots
- x → only x labels are displayed; taxonomic assignments are iteratively merged to higher ranks until at most x labels remain
- "all" → every taxonomic assignment is displayed
- default: 25
-
merging_labels: <NCBI IDs>/<rank>/<rank>-all
merging of taxonomic assignments can be manually influenced
- NCBI IDs → comma-separated list of NCBI taxon IDs; taxonomic assignments are merged at each of these IDs (please make sure the IDs are not within the same lineage)
- <rank> → a taxonomic rank; taxon to merge taxonomic assignments at will be inferred from rank for the query species
- <rank>-all → a taxonomic rank with suffix '-all'; all taxonomic assignments will be generalized to this rank
- default: None
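To make the three accepted forms concrete, here is a minimal Python sketch that classifies a 'merging_labels' value. The function and the returned tags ("ids", "rank", "rank_all", "none") are hypothetical and serve only as illustration of the options listed above.

```python
def parse_merging_labels(value):
    """Classify a 'merging_labels' value into (kind, payload) (hypothetical helper)."""
    if value is None:
        return ("none", None)
    text = str(value).strip()
    if text.endswith("-all"):
        # '<rank>-all': generalize every assignment to this rank.
        return ("rank_all", text[: -len("-all")])
    parts = [p.strip() for p in text.split(",")]
    if all(p.isdigit() for p in parts):
        # Comma-separated NCBI taxon IDs.
        return ("ids", [int(p) for p in parts])
    # A plain rank; the taxon is inferred from the query species.
    return ("rank", text)
```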
-
output_pdf: TRUE/FALSE
save plots in PDF format
- default: TRUE
-
output_png: TRUE/FALSE
save plots in PNG format
- default: FALSE
-
include_pseudogenes: TRUE/FALSE
include pseudogenes in the analysis
- default: FALSE
-
gff_source: "default"/"maker"/"augustus_masked"/<path>
select the source from which gene and protein information is taken in the GFF; this determines how the FASTA headers of the proteins are matched to the gene IDs
- "default" → information about genes and proteins is retrieved from all gene, mRNA and CDS features
- "maker" → source type "maker" is used
- "augustus_masked" → source type "augustus_masked" is used
- <path> → create own rule on how to parse gene and protein information from GFF and match gene ID and FASTA header; see Additional information for details
-
input_variables
variables to be used for the PCA
- comma-separated list of variables, no spaces, whole list put in quotes ('" "')
- default: "c_name,c_num_of_genes,c_len,c_genelenm,c_genelensd,g_len,g_lendev_c,g_abspos,g_terminal,c_cov,c_covsd,g_cov,g_covsd,g_covdev_c,c_pearson_r,g_pearson_r_o,g_pearson_r_c"
- see Additional information for details on options
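The expected format (comma-separated, no spaces) can be checked with a few lines of Python. This is a sketch only: `parse_input_variables` is a hypothetical helper, and it does not verify that the names are valid variables.

```python
def parse_input_variables(value):
    """Split an 'input_variables' string into a list of names (hypothetical helper)."""
    if " " in value:
        raise ValueError("input_variables must not contain spaces")
    names = value.split(",")
    if any(not name for name in names):
        # Catches leading, trailing or doubled commas.
        raise ValueError("input_variables contains an empty entry")
    return names
```

For instance, `parse_input_variables("c_len,g_len,g_cov")` returns the three names as a list, while `"c_len, g_len"` is rejected because of the space.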
-
perform_parallel_analysis: TRUE/FALSE
perform parallel analysis to determine the best number of principal components to be retained (overwrites 'num_pcs')
- default: FALSE
-
num_pcs
number of principal components to be retained for clustering and plotting (overwritten when 'perform_parallel_analysis' = TRUE)
- set number to at least 3 to generate the interactive 3D plot
- default: 3
-
coverage_cutoff_mode: "default"/"contamination"/"transposons"
coverage cutoff modes to exclude genes with abnormal coverage
- "default" → all genes are used in PCA
- "contamination" → only genes with coverage below the median are used in PCA
- "transposons" → only genes with coverage above the median are used in PCA
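The three modes amount to a filter on gene coverage relative to the median. The following Python sketch assumes the median is taken over all genes in the data set and that coverage is available as a mapping from gene ID to a number; both the function and the input layout are hypothetical.

```python
from statistics import median


def apply_coverage_cutoff(gene_coverage, mode="default"):
    """Select genes for the PCA by coverage cutoff mode (hypothetical helper).

    gene_coverage maps gene ID -> coverage value.
    """
    if mode == "default":
        return dict(gene_coverage)
    # Assumption: the cutoff is the median coverage across all genes.
    cutoff = median(gene_coverage.values())
    if mode == "contamination":
        return {g: c for g, c in gene_coverage.items() if c < cutoff}
    if mode == "transposons":
        return {g: c for g, c in gene_coverage.items() if c > cutoff}
    raise ValueError("unknown coverage_cutoff_mode: " + repr(mode))
```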
-
perform_kmeans: TRUE/FALSE
perform k-means clustering
- default: FALSE
-
kmeans_k: k/"default"
number of clusters to generate with k-means clustering
- "default" → clustering reports for k = 2, 3 and 4
- k → custom number of clusters, i.e. k clusters
-
perform_hclust: TRUE/FALSE
perform hierarchical clustering
- default: FALSE
-
hclust_k: k/"default"
number of clusters to generate with hierarchical clustering
- "default" → clustering reports for 2, 3 and 4 groups
- k → custom number of clusters, i.e. k clusters
-
perform_mclust: TRUE/FALSE
perform model-based clustering
- default: FALSE
-
mclust_k: k/"default"/"BIC"
number of clusters to generate with model-based clustering
- "default" → clustering reports for 2, 3 and 4 groups
- "BIC" → the optimal number of clusters is chosen automatically based on the Bayesian Information Criterion
- k → custom number of clusters, i.e. k clusters
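The '*_k' settings of k-means, hierarchical and model-based clustering share the same pattern, which the sketch below summarizes in Python. `cluster_counts` is a hypothetical helper; returning the string "BIC" as a marker for automatic selection is an illustrative choice, not the tool's behaviour.

```python
def cluster_counts(k="default", allow_bic=False):
    """Translate a '*_k' setting into the cluster counts to report (hypothetical helper)."""
    if k == "default":
        # One clustering report each for 2, 3 and 4 groups.
        return [2, 3, 4]
    if k == "BIC":
        if not allow_bic:
            raise ValueError("'BIC' is only valid for model-based clustering")
        # Marker: the number of clusters is selected later via BIC.
        return "BIC"
    count = int(k)
    if count < 1:
        raise ValueError("k must be a positive integer")
    return [count]
```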
-
perform_dbscan: TRUE/FALSE
perform DBSCAN clustering
- default: FALSE
-
dbscan_groups: "default"/"custom"
number of clusters to generate with DBSCAN clustering
- "default" → six runs of DBSCAN with different parameter combinations of epsilon and minimum number of points
- "custom" → customize epsilon and minimum number of points; specify 'custom_eps' and 'custom_minPts', e.g.
-
custom_eps
custom value for epsilon in DBSCAN clustering
-
custom_minPts
custom minimum number of points in DBSCAN clustering