
Configuration parameters

Freya Arthen edited this page Mar 3, 2022 · 23 revisions

Minimal required information

  • fasta_path path to genomic FASTA file
  • gff_path path to GFF file
  • output_path path to output directory
  • taxon_id NCBI taxonomy ID of the query species
  • database_path path to diamond-formatted NCBI NR protein database (set this up according to the instructions in Installation)
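
As a quick illustration of the list above, the following minimal sketch checks that a configuration (represented here as a plain Python dict) contains all five required parameters. The helper name `missing_required` and all example values are hypothetical; only the parameter names come from this page.

```python
# Parameter names listed under "Minimal required information".
REQUIRED_KEYS = ("fasta_path", "gff_path", "output_path", "taxon_id", "database_path")


def missing_required(config: dict) -> list:
    """Return the required parameter names that are absent or empty."""
    return [key for key in REQUIRED_KEYS if config.get(key) in (None, "")]


# Hypothetical example values; database_path is left out on purpose.
example_config = {
    "fasta_path": "assembly.fasta",
    "gff_path": "annotation.gff",
    "output_path": "results/",
    "taxon_id": 7227,
}

print(missing_required(example_config))
```

An empty list means every required parameter has been provided.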

General options

  • threads: X/"auto" number of threads to be used by Bowtie2 and DIAMOND
    • default: 'auto' → auto-detection of all available cores by DIAMOND (Bowtie2 uses one thread)
    • X → X threads are used by DIAMOND and Bowtie2
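
The two cases of 'threads' can be summarised in a small sketch. The function `resolve_threads` is hypothetical and not part of the tool. It returns `None` for DIAMOND in the 'auto' case to stand for "let DIAMOND detect the cores itself".

```python
def resolve_threads(threads="auto"):
    """Return (diamond_threads, bowtie2_threads) for a 'threads' setting.

    None for DIAMOND means: leave core detection to DIAMOND itself.
    """
    if threads == "auto":
        # 'auto': DIAMOND auto-detects all cores, Bowtie2 uses one thread.
        return None, 1
    count = int(threads)
    if count < 1:
        raise ValueError("threads must be a positive integer or 'auto'")
    # X: both tools use X threads.
    return count, count


print(resolve_threads("auto"))
print(resolve_threads(8))
```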

Coverage options

  • include_coverage: TRUE/FALSE explicitly include coverage information in the analysis or not
    • default: inferred from existence of either of the files at 'pbc_path', 'bam_path' or 'read_paths'
  • compute_coverage: TRUE/FALSE compute the per base coverage file ('pbc_path') from the files provided in 'read_paths' or 'bam_path'
    • default: inferred from existence of either of the files at 'pbc_path', 'bam_path' or 'read_paths'
  • pbc_path_X path to file specifying the per base coverage (PBC) for coverage set X
    • can be either specified by user or default is set if coverage information X is available
    • default: 'output_path/pbc_X.txt'
  • bam_path_X path to BAM file for coverage set X
    • can be either specified by user or default is set if coverage information X is available
    • default: 'output_path/mapping_sorted_X.bam'
  • read_paths_X path to read file(s) in FASTA format for coverage set X
    • paired-end reads → state paths as comma-separated list in square brackets
  • min_insert_X minimum insert size for paired-end reads (size including reads)
    • default: 0
  • max_insert_X maximum insert size for paired-end reads (size including reads)
    • default: 500
  • read_orientation_X: "fr"/"rf"/"ff" orientation of the read pairs
    • default: "fr" → read 1 has forward orientation, read 2 is reverse orientated (Illumina)
    • "rf" → read 1 has reverse orientation, read 2 is forward orientated (PacBio)
    • "ff" → read 1 and 2 have forward orientation
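
The default paths and the "inferred from existence" rule described above could look roughly like the following sketch. Both helper functions are hypothetical, and the way the tool itself resolves these values may differ. Only the default file names ('pbc_X.txt', 'mapping_sorted_X.bam') are taken from this page.

```python
from pathlib import Path


def coverage_defaults(output_path, set_id):
    """Build the documented default PBC and BAM paths for coverage set X."""
    out = Path(output_path)
    return {
        f"pbc_path_{set_id}": out / f"pbc_{set_id}.txt",
        f"bam_path_{set_id}": out / f"mapping_sorted_{set_id}.bam",
    }


def infer_include_coverage(config, set_id):
    """Return True if any PBC, BAM or read file for the set exists on disk."""
    candidates = []
    for key in (f"pbc_path_{set_id}", f"bam_path_{set_id}"):
        if config.get(key):
            candidates.append(config[key])
    reads = config.get(f"read_paths_{set_id}") or []
    if isinstance(reads, (str, Path)):
        reads = [reads]  # a single read file given as a plain path
    candidates.extend(reads)
    return any(Path(path).is_file() for path in candidates)


print(coverage_defaults("results", 0))
```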

Taxonomic assignment options

  • compute_tax_assignment: TRUE/FALSE run the sequence similarity search with DIAMOND, using the longest representative protein of each gene
    • default: inferred from existence of file(s) at 'tax_assignment_path'
  • extract_proteins: TRUE/FALSE automatic generation of protein FASTA file based on genomic FASTA and GFF; saved to 'proteins_path'
    • currently only possible with default GFF gene and CDS features!
    • default: inferred from existence of file at 'proteins_path'
  • proteins_path path to FASTA file containing the protein sequences
    • will be generated automatically if the file does not exist (or if 'extract_proteins' == TRUE)
    • can be either specified by user or default is set
    • default: 'output_path/proteins.faa'
  • tax_assignment_path hit file(s) of sequence similarity search in database
    • when 'assignment_mode' == 'quick' and only one path is provided, the suffixes '_1' and '_2' are added; to state both files explicitly, give them as a comma-separated list in square brackets
    • can be either specified by user or default is set
    • default: 'output_path/taxonomic_hits.txt' / ['output_path/taxonomic_hits_1.txt', 'output_path/taxonomic_hits_2.txt']
  • taxon_exclude: TRUE/FALSE exclude hits to the query taxon (self-hits) from the similarity search; TRUE excludes them, FALSE allows them
    • default: TRUE
  • exclusion_rank: <rank> taxonomic rank at which hits are excluded in taxonomic assignment (based on the query species)
    • taxa which are in the same <exclusion_rank> as the query species are discarded from taxonomic assignment
    • default: 'species'
  • assignment_mode: "exhaustive"/"quick" mode in which to perform similarity search
    • "exhaustive" → default mode
    • "quick" → speeds up the similarity search: genes that most likely originate from the query species are identified by an initial search in a small subset of the database; the remaining genes are then searched against the whole database
    • default: 'exhaustive'
  • quick_mode_search_rank taxonomic rank at which to create the subset of the database for the initial filtering search
    • either a taxonomic rank such as phylum or order (resolved relative to the query species) or an NCBI taxon ID
    • default: 'kingdom'
  • quick_mode_match_rank taxonomic rank which taxonomic assignment of genes has to reach to be accepted in first search, i.e. be identified as belonging to the query species
    • either a taxonomic rank such as phylum or order (resolved relative to the query species) or an NCBI taxon ID
    • default: 'order'
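
The suffix rule for 'tax_assignment_path' in quick mode can be sketched as follows. The function `resolve_hit_paths` is hypothetical. Placing the suffix before the file extension is an assumption inferred from the default file names listed above.

```python
from pathlib import PurePosixPath


def resolve_hit_paths(tax_assignment_path, assignment_mode="exhaustive"):
    """Return the list of hit files implied by the two settings."""
    if isinstance(tax_assignment_path, (list, tuple)):
        paths = [str(path) for path in tax_assignment_path]
    else:
        paths = [str(tax_assignment_path)]
    if assignment_mode == "quick" and len(paths) == 1:
        # Assumption: '_1' and '_2' go between file stem and extension.
        base = PurePosixPath(paths[0])
        return [str(base.with_name(f"{base.stem}_{i}{base.suffix}")) for i in (1, 2)]
    return paths


print(resolve_hit_paths("output_path/taxonomic_hits.txt", "quick"))
```

When two paths are stated explicitly, or when the mode is "exhaustive", the sketch returns the input unchanged.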

Plot output options

  • update_plots: TRUE/FALSE update plots only
    • default: FALSE
  • num_groups_plot: x/"all" number of distinct taxonomic groups to display in the plots
    • x → only x labels are displayed; taxonomic assignments are iteratively merged to higher ranks until number is exhausted
    • "all" → every taxonomic assignment is displayed
    • default: 25
  • merging_labels: <NCBI IDs>/<rank>/<rank>-all merging of taxonomic assignments can be manually influenced
    • NCBI IDs → comma-separated list of NCBI taxon IDs; taxonomic assignments are merged at each of these IDs (please make sure the IDs are not within the same lineage)
    • <rank> → a taxonomic rank; taxon to merge taxonomic assignments at will be inferred from rank for the query species
    • <rank>-all → a taxonomic rank with suffix '-all'; all taxonomic assignments will be generalized to this rank
    • default: None
  • output_pdf: TRUE/FALSE save plots in PDF format
    • default: TRUE
  • output_png: TRUE/FALSE save plots in PNG format
    • default: FALSE
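
To make the three forms of 'merging_labels' concrete, here is a minimal parsing sketch. The function `parse_merging_labels` and its return labels are hypothetical, and the taxon IDs in the usage lines are only example values.

```python
def parse_merging_labels(value):
    """Classify a 'merging_labels' value as ids, rank, rank_all or none."""
    if value is None or str(value).strip() in ("", "None"):
        return ("none", None)
    text = str(value).strip()
    parts = [part.strip() for part in text.split(",") if part.strip()]
    if parts and all(part.isdigit() for part in parts):
        # Comma-separated list of NCBI taxon IDs.
        return ("ids", [int(part) for part in parts])
    if text.endswith("-all"):
        # Rank with suffix '-all': generalize all assignments to this rank.
        return ("rank_all", text[: -len("-all")])
    # Plain rank: the merge taxon is inferred for the query species.
    return ("rank", text)


print(parse_merging_labels("33208,4751"))
print(parse_merging_labels("phylum-all"))
print(parse_merging_labels("order"))
```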

Gene info options

  • include_pseudogenes: TRUE/FALSE include pseudogenes in the analysis
    • default: FALSE
  • gff_source: "default"/"maker"/"augustus_masked"/<path> select the source from which gene and protein information is taken in the GFF; this determines how the FASTA headers of the proteins are matched to the gene IDs
    • "default" → information about genes and proteins is retrieved from all gene, mRNA and CDS features
    • "maker" → source type "maker" is used
    • "augustus_masked" → source type "augustus_masked" is used
    • <path> → create own rule on how to parse gene and protein information from GFF and match gene ID and FASTA header; see Additional information for details

PCA options

  • input_variables variables to be used for the PCA
    • comma-separated list of variables, no spaces, whole list put in quotes ('" "')
    • default: "c_name,c_num_of_genes,c_len,c_genelenm,c_genelensd,g_len,g_lendev_c,g_abspos,g_terminal,c_cov,c_covsd,g_cov,g_covsd,g_covdev_c,c_pearson_r,g_pearson_r_o,g_pearson_r_c"
    • see Additional information for details on options
  • perform_parallel_analysis: TRUE/FALSE perform parallel analysis to determine the best number of principal components to be retained (overwrites 'num_pcs')
    • default: FALSE
  • num_pcs number of principal components to be retained for clustering and plotting (overwritten when 'perform_parallel_analysis' == TRUE)
    • set number to at least 3 to generate the interactive 3D plot
    • default: 3
  • coverage_cutoff_mode: "default"/"contamination"/"transposons" coverage cutoff modes to exclude genes with abnormal coverage
    • "default" → all genes are used in PCA
    • "contamination" → only genes with coverage below the median are used for PCA
    • "transposons" → only genes with coverage above the median are used in PCA
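
The three coverage cutoff modes can be illustrated with a short sketch. The function `select_genes` is hypothetical. How genes with coverage exactly equal to the median are handled is not stated on this page, so the strict comparisons below are an assumption.

```python
from statistics import median


def select_genes(coverage, mode="default"):
    """Return the gene IDs kept for the PCA under a coverage cutoff mode.

    coverage maps gene ID -> coverage value.
    """
    if mode == "default":
        return sorted(coverage)  # all genes are used
    cutoff = median(coverage.values())
    if mode == "contamination":
        # Assumption: strictly below the median.
        return sorted(gene for gene, cov in coverage.items() if cov < cutoff)
    if mode == "transposons":
        # Assumption: strictly above the median.
        return sorted(gene for gene, cov in coverage.items() if cov > cutoff)
    raise ValueError(f"unknown coverage_cutoff_mode: {mode!r}")


# Made-up coverage values for three genes; the median is 10.
example = {"g1": 5.0, "g2": 10.0, "g3": 50.0}
print(select_genes(example, "contamination"))
print(select_genes(example, "transposons"))
```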

Clustering options

  • perform_kmeans: TRUE/FALSE perform k-means clustering
    • default: FALSE
  • kmeans_k: k/"default" number of clusters to generate with k-means clustering
    • "default" → clustering reports for k = 2, 3 and 4
    • k → custom number of clusters, i.e. k clusters
  • perform_hclust: TRUE/FALSE perform hierarchical clustering
    • default: FALSE
  • hclust_k: k/"default" number of clusters to generate with hierarchical clustering
    • "default" → clustering reports for 2, 3 and 4 groups
    • k → custom number of clusters, i.e. k clusters
  • perform_mclust: TRUE/FALSE perform model-based clustering
    • default: FALSE
  • mclust_k: k/"default"/"BIC" number of clusters to generate with model-based clustering
    • "default" → clustering reports for 2, 3 and 4 groups
    • "BIC" → optimal number of clusters based on Bayesian Information Criterion will be automatically chosen
    • k → custom number of clusters, i.e. k clusters
  • perform_dbscan: TRUE/FALSE perform DBSCAN clustering
    • default: FALSE
  • dbscan_groups: "default"/"custom" number of clusters to generate with DBSCAN clustering
    • "default" → six runs of DBSCAN with different parameter combinations of epsilon and minimum number of points
    • "custom" → customize epsilon and minimum number of points; specify 'custom_eps' and 'custom_minPts', e.g.
  • custom_eps custom value for epsilon in DBSCAN clustering
  • custom_minPts custom minimum number of points in DBSCAN clustering
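
The meaning of the 'kmeans_k', 'hclust_k' and 'mclust_k' values can be summarised in a small sketch. The function `resolve_cluster_counts` is hypothetical. It returns `None` for "BIC" to stand for "let the model-based clustering choose the number of clusters".

```python
def resolve_cluster_counts(k="default", allow_bic=False):
    """Translate a *_k setting into the list of cluster counts to report.

    Returns None when the count is chosen by BIC (mclust_k only).
    """
    if k == "default":
        return [2, 3, 4]  # reports for 2, 3 and 4 clusters
    if k == "BIC":
        if not allow_bic:
            raise ValueError("'BIC' is only valid for mclust_k")
        return None
    count = int(k)
    if count < 1:
        raise ValueError("k must be a positive integer")
    return [count]


print(resolve_cluster_counts("default"))
print(resolve_cluster_counts(5))
print(resolve_cluster_counts("BIC", allow_bic=True))
```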