Diff of `vignettes/RawDataConversion.Rmd` (+39 −25)
---
title: "Converting raw assay data/tables into format compatible with netDx algorithm"
author: "Shraddha Pai & Indy Ng"
package: netDx
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc_float: true
vignette: >
  %\VignetteIndexEntry{02. Running netDx with data in table format}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
The fetch command automatically brings in a `MultiAssayExperiment` object.

```{r, eval=TRUE}
summary(brca)
```

## Prepare Data

This next code block prepares the TCGA data. In practice you would do this once and save the data before running netDx, but we run it here to show an end-to-end example.
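The preparation code itself falls outside this excerpt. As a minimal sketch of the kind of step involved (the `intersectColumns()` call and the `expr` name are illustrative assumptions, not necessarily the vignette's actual code), one might restrict the object to patients measured in all assays:

```{r, eval=FALSE}
# Hypothetical sketch: keep only patients present in every assay of the
# MultiAssayExperiment, then pull out the list of assay matrices.
brca <- MultiAssayExperiment::intersectColumns(brca) # complete cases across assays
expr <- MultiAssayExperiment::assays(brca)           # list of data layers
```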
To build the predictor using the netDx algorithm, we call the `buildPredictor()` function, which takes patient data and variable groupings and returns a set of patient similarity networks (PSNs) as output. The user can customize which datatypes are used, how they are grouped, and what defines patient similarity for a given datatype. Specifically, this is done by telling the model how to:

* **group** different types of data, and
* **define similarity** for each of these groups (e.g. Pearson correlation, normalized difference, etc.).

The relevant input parameters are:

* `groupList`: sets of input data that each correspond to an individual network (e.g. genes grouped into pathways)
* `sims`: a list specifying the similarity metric for each data layer

## `groupList`: Grouping variables to define features

The `groupList` object tells the predictor how to group units when constructing a network. For example, genes may be grouped into a network representing a pathway. This object is a list; its names match those of `dataList`, and each value is itself a list reflecting a potential network.
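As an illustrative sketch of this structure (the layer names, pathway names, and gene sets below are hypothetical, not taken from the vignette's data), a `groupList` might look like:

```{r, eval=FALSE}
# Hypothetical sketch of a groupList: top-level names match those of dataList,
# and each entry is itself a named list of groupings, one per candidate network.
groupList <- list(
    BRCA_mRNA = list(                          # expression layer
        CELL_CYCLE = c("CCND1","CDK4","CDK6"), # genes grouped into a pathway
        APOPTOSIS  = c("TP53","BAX","BCL2")
    ),
    clinical = list(
        age = "patient_age"                    # a single-variable network
    )
)
```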
```{r, eval=TRUE}
for (k in 1:length(expr)) { # loop over all layers
  # (loop body omitted in this excerpt)
}
```

## `sims`: Define patient similarity for each network

**What is this:** `sims` is used to define the similarity metric for each data layer. This is done by providing a single list, here `sims`, that specifies the choice of similarity metric to use for each layer. The `names()` of this list must match those of `groupList`. Each value can either be a character string, naming a built-in similarity function, or a function, used when specifying a custom similarity metric.

The currently available built-in similarity measures are:

* `pearsonCorr`: Pearson correlation (for sets of >5 measures)
* `normDiff`: normalized difference (for a single measure, such as age)
* `avgNormDiff`: average normalized difference (for a small number of measures)
* `sim.pearscale`: Pearson correlation followed by exponential scaling
* `sim.eucscale`: Euclidean distance followed by exponential scaling

In this example, we choose Pearson correlation similarity for all data layers.

```{r, eval=TRUE}
sims <- list(a="pearsonCorr", b="pearsonCorr")
```
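Since a value in `sims` may also be a function, a custom metric can be supplied. The sketch below is a hedged illustration only: the exact signature netDx expects of a custom similarity function should be checked against the package documentation. Here, similarity between two patients is a normalized difference, `1 - |x_i - x_j| / (max - min)`:

```{r, eval=FALSE}
# Hypothetical custom similarity: takes a named numeric vector (one value per
# patient) and returns a patient-by-patient similarity matrix in [0,1].
normDiffSketch <- function(x) {
    rng <- max(x, na.rm=TRUE) - min(x, na.rm=TRUE)
    sim <- 1 - abs(outer(x, x, "-")) / rng
    rownames(sim) <- colnames(sim) <- names(x)
    sim
}
```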
We can then proceed with the rest of the netDx workflow.

# Build predictor

Now we're ready to train our model. netDx uses parallel processing to speed up compute time; here we use 75% of the cores available on the machine. netDx also throws an error if the provided output directory already has content, so we clean that up as well.

```{r, eval=TRUE}
nco <- round(parallel::detectCores()*0.75) # use 75% of available cores
message(sprintf("Using %i of %i cores", nco, parallel::detectCores()))

outDir <- paste(tempdir(),"pred_output",sep=getFileSep()) # use absolute path
if (file.exists(outDir)) unlink(outDir,recursive=TRUE)
numSplits <- 2L
```

Finally we call the function that runs the netDx predictor. We provide:

* patient data (`dataList`)
* threshold to call feature-selected networks for each train/test split (`featSelCutoff`); only features scoring this value or higher will be used to classify test patients
* number of cores to use for parallel processing (`numCores`)

The call below runs two train/test splits. Within each split, it:

* splits data into train and test sets using the default 80:20 split (`trainProp=0.8`),
* scores networks between 0 and 2 (i.e. `featScoreMax=2L`), and
* uses networks that score >=1 out of 2 (`featSelCutoff=1L`) to classify test samples for that split.

In practice a good starting point is `featScoreMax=10`, `featSelCutoff=9` and `numSplits=10L`, but these parameters depend on the sample sizes in the dataset and the heterogeneity of the samples.
```{r, eval=TRUE}
t0 <- Sys.time()
set.seed(42) # make results reproducible
```
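The excerpt ends before the predictor call itself. Based solely on the parameters described above, a hedged sketch of the call might look as follows; the exact argument names and defaults should be checked against `?buildPredictor` rather than taken from this sketch:

```{r, eval=FALSE}
# Hypothetical sketch of the buildPredictor() call, assembled from the
# parameters named in the text above; verify against the netDx documentation.
model <- buildPredictor(
    dataList=brca,        # patient data
    groupList=groupList,  # variable groupings (e.g. pathways)
    sims=sims,            # per-layer similarity metrics
    outDir=outDir,        # absolute path for output
    trainProp=0.8,        # default 80:20 train/test split
    numSplits=numSplits,  # number of train/test splits (2L here)
    featScoreMax=2L,      # score networks between 0 and 2
    featSelCutoff=1L,     # keep networks scoring >= 1
    numCores=nco)         # cores for parallel processing
```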