Commit e546a90

Merge pull request RealPaiLab#16 from RealPaiLab/vignetteFix
ThreeWayClassifier fix
2 parents 0c8a3e1 + 4b3afa6 commit e546a90

3 files changed, +50 -28 lines changed

R/helper.R

+6 -2
@@ -267,19 +267,23 @@ topPath <- gsub("_cont.txt","",topPath)
 
 ## create groupList limited to top features
 g2 <- list();
+s2 <- list();
 for (nm in names(groupList)) {
     cur <- groupList[[nm]]
     idx <- which(names(cur) %in% topPath)
     message(sprintf("%s: %i features", nm, length(idx)))
-    if (length(idx)>0) g2[[nm]] <- cur[idx]
+    if (length(idx)>0) {
+        g2[[nm]] <- cur[idx]
+        s2[[nm]] <- sims[[nm]]
+    }
 }
 
 message("* Making integrated PSN")
 psn <-
     plotIntegratedPatientNetwork(
         dataList=dat,
         groupList=g2, makeNetFunc=makeNetFunc,
-        sims=sims,
+        sims=s2,
         aggFun=aggFun,
         prune_pctX=prune_pctX,
         prune_useTop=prune_useTop,
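
The hunk above filters `sims` alongside `groupList`, so the two lists keep the same names after layers with no top features are dropped. The sketch below reproduces that filtering pattern in base R on toy inputs. The layer names, feature names, and `topPath` values are hypothetical and are not taken from netDx data. It only illustrates the name alignment; it does not call any netDx function.

```r
# Toy inputs (hypothetical names and values, for illustration only)
groupList <- list(
  rna      = list(pathA = c("g1", "g2"), pathB = c("g3", "g4")),
  prot     = list(allProt = c("p1", "p2")),
  clinical = list(age = "age")
)
sims <- list(rna = "pearsonCorr", prot = "pearsonCorr", clinical = "normDiff")
topPath <- c("pathA", "age")  # features assumed to have passed selection

g2 <- list()  # groupList restricted to top features
s2 <- list()  # similarity metrics for the layers that remain
for (nm in names(groupList)) {
  cur <- groupList[[nm]]
  idx <- which(names(cur) %in% topPath)
  if (length(idx) > 0) {
    g2[[nm]] <- cur[idx]
    s2[[nm]] <- sims[[nm]]  # keep the metric only for retained layers
  }
}

# "prot" has no top features, so it is absent from both lists
stopifnot(identical(names(g2), names(s2)))
print(names(s2))  # "rna" "clinical"
```

Passing the unfiltered `sims` in this toy case would leave a `prot` entry with no matching layer in `g2`, which is the kind of mismatch the change avoids.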

vignettes/RawDataConversion.Rmd

+39 -25
@@ -1,13 +1,13 @@
 ---
-title: "Building a binary classifier from assay data using pathway level features"
+title: "Converting raw assay data/tables into format compatible with netDx algorithm"
 author: "Shraddha Pai & Indy Ng"
 package: netDx
 date: "`r Sys.Date()`"
 output:
   BiocStyle::html_document:
     toc_float: true
 vignette: >
-    %\VignetteIndexEntry{01. Build binary predictor and view performance, top features and integrated Patient Similarity Network}.
+    %\VignetteIndexEntry{02. Running netDx with data in table format}.
     %\VignetteEngine{knitr::rmarkdown}
     %\VignetteEncoding{UTF-8}
 ---
@@ -57,6 +57,7 @@ The fetch command automatically brings in a `MultiAssayExperiment` object.
 ```{r, eval = TRUE}
 summary(brca)
 ```
+## Prepare Data
 
 This next code block prepares the TCGA data. In practice you would do this once, and save the data before running netDx, but we run it here to see an end-to-end example.
 
@@ -74,9 +75,17 @@ colData(brca)$ID <- pID
 
 # Create feature design rules (patient similarity networks)
 
-To build the predictor using the netDx algorithm, we call the `buildPredictor()` function which takes patient data and variable groupings, and returns a set of patient similarity networks (PSN) as an output. The user can customize what datatypes are used, how they are grouped, and what defines patient similarity for a given datatype.
+To build the predictor using the netDx algorithm, we call the `buildPredictor()` function which takes patient data and variable groupings, and returns a set of patient similarity networks (PSN) as an output. The user can customize what datatypes are used, how they are grouped, and what defines patient similarity for a given datatype. This is done specifically by telling the model how to:
 
-## groupList object
+* **group** different types of data and
+* **define similarity** for each of these (e.g. Pearson correlation, normalized difference, etc.,).
+
+The relevant input parameters are:
+
+* `groupList`: sets of input data that would correspond to individual networks (e.g. genes grouped into pathways)
+* `sims`: a list specifying similarity metrics for each data layer
+
+## `groupList`: Grouping variables to define features
 
 The `groupList` object tells the predictor how to group units when constructing a network. For examples, genes may be grouped into a network representing a pathway. This object is a list; the names match those of `dataList` while each value is itself a list and reflects a potential network.
 
@@ -97,17 +106,20 @@ for (k in 1:length(expr)) { # loop over all layers
 }
 ```
 
-## Define patient similarity for each network
+## `sims`: Define patient similarity for each network
+
+**What is this:** `sims` is used to define similarity metrics for each layer.
+This is done by providing a single list - here, `sims` - that specifies the choice of similarity metric to use for each data layer. The `names()` for this list must match those in `groupList`. The corresponding value can either be a character if specifying a built-in similarity function, or a function. The latter is used if the user wishes to specify a custom similarity function.
 
-`sims` is a list that specifies the choice of similarity metric to use for each grouping we're passing to the netDx algorithm. You can choose between several built-in similarity functions provided in the `netDx` package:
+The current available options for built-in similarity measures are:
 
-* `normDiff` (normalized difference)
-* `avgNormDiff` (average normalized difference)
-* `sim.pearscale` (Pearson correlation followed by exponential scaling)
-* `sim.eucscale` (Euclidean distance followed by exponential scaling) or
-* `pearsonCorr` (Pearson correlation)
+* `pearsonCorr`: Pearson correlation (n>5 measures in set)
+* `normDiff`: normalized difference (single measure such as age)
+* `avgNormDiff`: average normalized difference (small number of measures)
+* `sim.pearscale`: Pearson correlation followed by exponential scaling
+* `sim.eucscale`: Euclidean distance followed by exponential scaling
 
-You may also define custom similarity functions in this block of code and pass those to `makePSN_NamedMatrix()`, using the `customFunc` parameter.
+In this example, we choose Pearson correlation similarity for all data layers.
 
 ```{r,eval=TRUE}
 sims <- list(a="pearsonCorr", b="pearsonCorr")
@@ -144,6 +156,17 @@ We can then proceed with the rest of the netDx workflow.
 
 # Build predictor
 
+Now we're ready to train our model. netDx uses parallel processing to speed up compute time. Let's use 75% available cores on the machine for this example. netDx also throws an error if provided an output directory that already has content, so let's clean that up as well.
+
+```{r,eval=TRUE}
+nco <- round(parallel::detectCores()*0.75) # use 75% available cores
+message(sprintf("Using %i of %i cores", nco, parallel::detectCores()))
+
+outDir <- paste(tempdir(),"pred_output",sep=getFileSep()) # use absolute path
+if (file.exists(outDir)) unlink(outDir,recursive=TRUE)
+numSplits <- 2L
+```
+
 Finally we call the function that runs the netDx predictor. We provide:
 
 * patient data (`dataList`)
@@ -154,26 +177,17 @@ Finally we call the function that runs the netDx predictor. We provide:
 * threshold to call feature-selected networks for each train/test split (`featSelCutoff`); only features scoring this value or higher will be used to classify test patients,
 * number of cores to use for parallel processing (`numCores`).
 
-The call below runs 10 train/test splits.
-Within each split, it:
+The call below runs two train/test splits. Within each split, it:
 
 * splits data into train/test using the default split of 80:20 (`trainProp=0.8`)
-* score networks between 0 to 10 (i.e. `featScoreMax=10L`)
-* uses networks that score >=9 out of 10 (`featSelCutoff=9L`) to classify test samples for that split.
+* score networks between 0 to 2 (i.e. `featScoreMax=2L`)
+* uses networks that score >=9 out of 10 (`featSelCutoff=1L`) to classify test samples for that split.
 
 In practice a good starting point is `featScoreMax=10`, `featSelCutoff=9` and `numSplits=10L`, but these parameters depend on the sample sizes in the dataset and heterogeneity of the samples.
 
-This step can take a few hours based on the current parameters, so we comment this out for the tutorial and will simply load the results.
-
-```{r lab1-buildpredictor ,eval=TRUE}
-nco <- round(parallel::detectCores()*0.75) # use 75% available cores
-message(sprintf("Using %i of %i cores", nco, parallel::detectCores()))
-
+```{r,eval=TRUE}
 t0 <- Sys.time()
 set.seed(42) # make results reproducible
-outDir <- paste(tempdir(),randAlphanumString(),
-    "pred_output",sep=getFileSep())
-if (file.exists(outDir)) unlink(outDir,recursive=TRUE)
 model <- suppressMessages(
     buildPredictor(
         dataList=brca, ## your data

vignettes/ThreeWayClassifier.Rmd

+5 -1
@@ -224,9 +224,13 @@ groupList[["clinical"]] <- list(
 )
 ```
 
-For methylation and proteomic data we create one feature each, where each feature contains all measures for that data type.
+For miRNA sequencing, methylation, and proteomic data we create one feature each, where each feature contains all measures for that data type.
 
 ```{r,eval=TRUE}
+tmp <- list(rownames(experiments(brca)[[1]]));
+names(tmp) <- names(brca)[1]
+groupList[[names(brca)[[1]]]] <- tmp
+
 tmp <- list(rownames(experiments(brca)[[2]]));
 names(tmp) <- names(brca)[2]
 groupList[[names(brca)[2]]] <- tmp
