KINC generates several custom "data objects" stored as binary files. These files represent the data types that KINC consumes and produces. Additionally, KINC can convert these objects to and from more conventional plain-text formats. Each data object and plain-text format is described here.
The GEM is the starting point for all network construction. It is a tab-delimited text file containing an n x m matrix of gene expression values where rows (or lines in the file) are genes, columns are samples and elements are the expression values for a gene in a given sample.
The GEM may have a header line. But the header line must only contain the sample names. Each subsequent line will always start with the gene name, followed by the expression values. Therefore, the header line, if present, should always have one less element then every other row.
Note
The header line, if present, should always have one less element then every other row. Missing (NAN) values are often represented by a special token such as "NA" or "0.00".
A very simple example is shown below:
Sample1 Sample2 Sample3 Sample4 Gene1 0.523 0.991 0.421 NA Gene2 NA 7.673 3.333 9.103 Gene3 4.444 5.551 NA 0.013
Note
The GEM is imported into KINC and stored in binary format as an Expression Matrix file (described below). KINC can also export the expression matrix back into its text-based GEM format.
The annotation matrix is a tab-delimited or comma-separated file that contains metadata about the RNA-seq samples used to construct the network. The first column should list the sample names (the same as in the GEM) and each additional column should contain information about the sample such as experimental condtions (e.g. Treatment, Tissue, Developmental Stage, Sampling Time, Genotype, etc.) which are usually categorical data. The file can also contain phenotype information that may have been collected from the individuals from which the samples were taken.
Note
The annotation matrix is only needed if condition-specific subgraphs are wanted.
The Expression Matrix (EMX) file is a data matrix, with genes as rows and samples as columns. It is constructed from the tab-delimited plain-text input Gene Expression Matrix (GEM) file. It contains the matrix data in binary format as well as the gene names and sample names. EMX files have the .emx
extension.
The Correlation Matrix (CMX) file is created during the similarity
step and is used to represent a similarity matrix, which is a pairwise matrix of similarity (e.g. correlation) scores for each gene pair using data from an expression matrix (EMX). For traditional network construction, the similarity matrix will contain a single similarity score. For the GMM approach, the similarity matrix will contain multiple scores, one for each cluster identifed using GMMs. The CMX uses a sparse matrix format--it only stores the gene pairs which have correlation data. Otherwise, the file would grow too large for most file systems. CMX files have the .cmx
extension.
The Cluster Composition Matrix (CCM) file is created during the similarity
step only when using the GMM approach to network construction. This maintains a matrix identical to the CMX file but instead of similarity scores for each clsuter, it stores a "sample composition string". This string is a series of numbers, strung together indicating which samples belong to the cluster. If a 1 appears in the first position, it indicates the the first sample in the EMX file belongs to the cluster. If a 0 appears it indicates the sample does not. Several other numbers are present to indicate missing or outlier samples. The same is true for the 2nd, 3rd and up to the nth samples. The following indicates the meaning of each number:
0
: The sample is not present in the cluster.1
: The sample is present in the cluster.6
: The sample was removed because the expression level in one gene was below the allowed expression threshold.7
: The sample was removed as an outlier prior to GMMs.8
: The sample was removed as an outlier after the GMM cluster was identified.9
: The sample was ignored because one of the genes was missing expression in the sample.
The CCM also uses a sparse matrix format and has a .ccm
extension.
The Condition Specific Matrix (CSM) file is created when the cond-test
thresholding function is used. This thresholding approach can only be used if the GMM approach was used. It requires an “Annotation Matrix” as input, which is a tab-delimited file where the rows are samples and each column contains some feature about the sample, such as experimental conditions or phenotypic values. The file will contain a matrix identical to the CMX matrix but instead of similarity scores, it stores p-values for the association of selected feature from the annotation matrix. The CSM also uses a sparse matrix format and has a .csm
extension.
There are two types of plain-text output files. First is the full format: a tab-delimited file where each line contains information about a single edge in the network. The file contains the following columns:
Source
: The first gene in the edgeTarget
: Thes second gene in the edgeSimilarity_Score
: The similarity score value.Interaction
: The interaction type. This always isco
for co-expression.Cluster_Index
: The unique cluster index (starting from zero)Cluster_Size
: The total number of samples in the cluster.Samples
: The sample composition string.
The following is a sample line from a network file:
Source Target Similarity_Score Interaction Cluster_Index Cluster_Size Samples
Gene1 Gene2 0.979 co 0 30 1199991911111161111111611161111111111770080000000
Additionally, if the cond-test
function was performed, a series of additional columns will be present containing the p-values for each test performed. A `Tidy<https://en.wikipedia.org/wiki/Tidy_data>`_ format for the cond-test
results can also be exported. The Tidy format is the recommended format.
The second format is the minimal format, which does not contain the sample string or summary statistics. This format is useful for inspecting large networks quickly. The following is a sample line of a minimal network file:
Source Target sc Cluster Num_Clusters
Gene1 Gene2 0.979 0 1
The second major network file format is the GraphML format. This is a common XML format used for representing networks. The following is an example snippet of a GraphML file generated by KINC:
<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
<graph id="G" edgedefault="undirected">
<node id="Gene1"/>
<node id="Gene2"/>
<edge source="Gene1" target="Gene2" samples="1199991911111161111111611161111111111770080000000"/>
</graph>
</graphml>
A plain-text correlation matrix is a representation of a sparse matrix where each line is a correlation. It includes the pairwise index, correlation value, sample composition string, and several other summary statistics. The following is a sample line from the correlation matrix file:
0 1 0 1 30 5 2 1 3 0.979 1199991911111161111111611161111111111770080000000