Freya Arthen edited this page Mar 3, 2022 · 6 revisions

gene_info

  • summary.txt Lists the main assembly metrics (i.e. numbers as well as mean, SD, min and max of length, GC content and coverage) at contig and gene level
  • raw_gene_table.txt|.csv Contains all variables with their values as computed on the input data set (for a detailed description of each variable, see this table)
  • imputed_gene_table.txt|.csv The same as raw_gene_table.txt|.csv, except that the variables 'c_genecovsd', 'c_genelensd', 'g_covdev_c', 'g_gcdev_c' and 'g_lendev_c' are rescaled to a range of 0 to 1. NaN (= missing) values are imputed with the mean of the respective variable
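The rescaling and imputation described for imputed_gene_table can be illustrated with a minimal Python sketch. This is not taXaminer's own code: the function names are hypothetical, the example values are made up, and the order of the two steps is an assumption (for mean imputation combined with min-max rescaling, either order gives the same result).

```python
import math


def min_max_rescale(values):
    """Rescale the observed (non-NaN) values to the range 0..1; NaNs stay NaN."""
    observed = [v for v in values if not math.isnan(v)]
    if not observed:
        return list(values)
    lo, hi = min(observed), max(observed)
    span = hi - lo
    # A constant column has no spread; map it to 0.0 (an assumption of this sketch).
    return [v if math.isnan(v) else (0.0 if span == 0 else (v - lo) / span)
            for v in values]


def impute_mean(values):
    """Replace NaNs with the mean of the observed values of the same variable."""
    observed = [v for v in values if not math.isnan(v)]
    if not observed:
        return list(values)
    mean = sum(observed) / len(observed)
    return [mean if math.isnan(v) else v for v in values]


# Made-up values for one variable, e.g. 'g_covdev_c'
column = [2.0, float("nan"), 4.0, 6.0]
print(impute_mean(min_max_rescale(column)))
```

Variables other than the five listed above would only pass through the imputation step.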

taxonomic_assignment

The label that represents the query species in the plots is determined automatically and is always coloured dark grey

  • 3D_plot.html Interactive 3D scatterplot to examine genes and their taxonomic assignments
    • with single-clicks on labels you can hide individual groups
    • double-clicks hide every group except for the one that was clicked
    • hovering over data points shows additional information
    • the subdirectory “3D_plot_files” holds additional files for this plot and is required to display the plot (important when working with MobaXterm for example)
  • density_x|y|z.png|.pdf Show the density of the data points of the dot plot along the x, y and z axis, respectively.
    • Note: the density of a single dot cannot be computed. Thus, groups that consist of a single gene cannot be displayed in the 1D density plots
  • density_2d.png|.pdf Like a 2D scatterplot, but the taxonomic group of the query species is represented as a 2D density
  • gene_table_taxon_assignment.csv raw_gene_table with PCA coordinates for each gene and their taxonomic assignment appended
    • this is a tabular representation of all information that is displayed in the 3D plot
    • see Additional information for details on the contained information

PCA_and_clustering

  • variables_excluded_from_PCA_and_clustering.txt Lists all variables that were excluded from the PCA because they contain more than 30% NaN values
  • genes_excluded_from_PCA_and_clustering.csv Lists the genes that still contain NaN values after variables were dropped and that were therefore excluded from the analysis
  • gene_table_coords.csv 'raw_gene_table' with PCA coordinates of genes appended (required for 'plotting.R')
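The two exclusion steps described above (first drop variables with too many NaN values, then drop genes that still contain NaNs) can be sketched as follows. This is a minimal illustration, not the tool's implementation: the function name, the nested-dict table layout and the example values are assumptions; only the 30% threshold and the variable names are taken from this page.

```python
import math


def filter_for_pca(rows, max_nan_fraction=0.3):
    """rows maps gene name -> {variable name -> value}.

    Returns (kept_rows, dropped_variables, dropped_genes).
    """
    variables = list(next(iter(rows.values())).keys()) if rows else []
    n_genes = len(rows)
    # Step 1: drop variables in which more than 30% of the values are NaN.
    dropped_variables = [
        var for var in variables
        if sum(math.isnan(r[var]) for r in rows.values()) / n_genes > max_nan_fraction
    ]
    kept_variables = [v for v in variables if v not in dropped_variables]
    # Step 2: drop genes that still contain a NaN in any remaining variable.
    dropped_genes = [
        gene for gene, r in rows.items()
        if any(math.isnan(r[v]) for v in kept_variables)
    ]
    kept_rows = {
        gene: {v: r[v] for v in kept_variables}
        for gene, r in rows.items() if gene not in dropped_genes
    }
    return kept_rows, dropped_variables, dropped_genes


nan = float("nan")
table = {  # made-up example values
    "g1": {"g_lendev_c": 0.1, "g_covdev_c": nan, "g_gcdev_c": 0.3},
    "g2": {"g_lendev_c": 0.2, "g_covdev_c": nan, "g_gcdev_c": 0.1},
    "g3": {"g_lendev_c": nan, "g_covdev_c": 0.5, "g_gcdev_c": 0.2},
    "g4": {"g_lendev_c": 0.4, "g_covdev_c": 0.6, "g_gcdev_c": 0.4},
}
kept, dropped_variables, dropped_genes = filter_for_pca(table)
print(dropped_variables, dropped_genes, sorted(kept))
```

In this example 'g_covdev_c' is NaN for half of the genes and is dropped as a variable, whereas 'g_lendev_c' is NaN for only a quarter of them and is kept, so gene g3 is excluded instead.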

PCA results

  • contribution_of_variables.png|.pdf Figure illustrating how much each variable contributes to the first two principal components
  • genes_and_variables.png|.pdf Biplot of variables (vectors) and genes (points) in the new coordinate system defined by the first two principal components. Transparency represents the amount of contribution to the principal components
  • pca_loadings.csv Table listing the loadings of the original variables (rows) on the computed principal components (columns)
  • pca_summary.csv Table listing standard deviation, proportion of explained variance and cumulative proportion of explained variance in the original data for each of the principal components
  • scree_plot.png|.pdf Scree plot visualising the amount of variance in the original data that is explained by each of the principal components (here: dimensions)
  • parallel_analysis.png|.pdf Only available if parallel analysis was performed on the principal components. Shows the results of Horn’s parallel analysis: the random eigenvalues for the given number of PCs are plotted together with the adjusted and unadjusted eigenvalues, indicating which ones were retained for the subsequent PCA
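The three quantities listed in pca_summary.csv are related in a simple way: the variance of a principal component is the square of its standard deviation, and the proportions are taken relative to the total variance. A minimal sketch, assuming the standard deviations of all components are given (the function name, the dictionary keys and the input values are hypothetical):

```python
def pca_summary(sdevs):
    """Derive proportion and cumulative proportion of explained variance
    from the standard deviations of all principal components."""
    variances = [s * s for s in sdevs]
    total = sum(variances)
    summary = []
    cumulative = 0.0
    for i, (sdev, variance) in enumerate(zip(sdevs, variances), start=1):
        proportion = variance / total
        cumulative += proportion
        summary.append({
            "pc": f"PC{i}",
            "sdev": sdev,
            "proportion": proportion,
            "cumulative": cumulative,
        })
    return summary


# Made-up standard deviations for three components
for row in pca_summary([2.0, 1.0, 1.0]):
    print(row)
```

The same proportions are what a scree plot such as scree_plot.png|.pdf displays per component.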

Clustering

In addition to the PCA results, the script will output one directory for each clustering approach:

DBSCAN_clustering
hierarchical_clustering
k-means_clustering
model-based_clustering

For each of these runs, the following files will be output:

  • *.png|.pdf Genes plotted in the new coordinate system defined by the first two principal components. Colours indicate the cluster to which each gene is assigned.
  • *.taXaminer.csv A table holding, for each gene, the raw_gene_table variables together with additional columns for its new coordinates and a column for its cluster assignment. This is the taXaminer report that should be provided with each annotated assembly.

The directory genes_by_cluster will, for each run, hold as many files as there are clusters. Each file lists the names of all genes that have been assigned to the respective cluster. There is an option to incorporate .taXaminer.csv into the original GFF file to create annotations.with_taXaminer.<my_clustering>.gff, which holds both the annotation data and the taXaminer report (see section Additional Scripts).
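The relation between a *.taXaminer.csv report and the per-cluster gene lists can be reproduced with a short Python sketch. The column names 'g_name' and 'cluster', the coordinate columns and the example rows are assumptions made for illustration; check the header of your own report for the actual names.

```python
import csv
import io
from collections import defaultdict


def genes_by_cluster(csv_text, gene_col="g_name", cluster_col="cluster"):
    """Group gene names by their cluster assignment.

    The column names are hypothetical defaults, not confirmed report headers.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row[cluster_col]].append(row[gene_col])
    return dict(groups)


# Made-up excerpt of a report
example = (
    "g_name,PC1,PC2,cluster\n"
    "geneA,0.1,0.2,1\n"
    "geneB,-1.3,0.4,2\n"
    "geneC,0.2,0.1,1\n"
)
clusters = genes_by_cluster(example)
for cluster, genes in sorted(clusters.items()):
    print(cluster, genes)
```

For a real report, the file content can be read with open(path, newline="").read() and passed to the function.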