---
title: "Data Modeling with Caret"
author: "Kyle Joecken"
date: "November 14, 2015"
output: pdf_document
---
# Part 1: Machine Learning in R Without `caret`
## A simple machine learning example that does not take advantage of `caret`
First, we load the required libraries and read our data into data frames. The following data (and most of the code) come from Kaggle's introductory "Digit Recognizer" optical character recognition problem.
```{r, message=FALSE, warning=FALSE}
library(randomForest)
trainData <- read.csv("data/digits/train.csv")
trainData$label <- as.factor(trainData$label)
testData <- read.csv("data/digits/test.csv")
```
Let's verify that the data look properly loaded, and inspect the format.
```{r}
dim(trainData); dim(testData); str(trainData[, 1:6])
```
We see that the first column (`label`) is the actual digit represented by an image, and the remaining 784 columns are greyscale values for each pixel in the 28x28 images.
After setting the seed to keep our results repeatable, we select a random subsample of our data and separate the predictors from the outcome.
```{r}
set.seed(0)
numTrain <- 10000
rows <- sample(1:nrow(trainData), numTrain)
labels <- as.factor(trainData[rows,1])
subtrain <- trainData[rows,-1]
```
Finally, we build a random forest model on the subset of the `train` data set, computing the predicted outcomes of the test set along the way. The actual [`randomForest`](http://www.inside-r.org/packages/cran/randomforest/docs/randomforest) function is called with four arguments, though only the first is necessary.
`randomForest(x, y=NULL, xtest=NULL, ntree=500)`
There are over a dozen additional potential parameters to pass, including `mtry`, the number of predictors to randomly sample at each split.
```{r}
numTrees <- 25
rf <- randomForest(subtrain, labels, xtest=testData, ntree=numTrees)
predictions <- data.frame(
  ImageId = 1:nrow(testData),
  Label = levels(labels)[rf$test$predicted]
)
head(predictions)
```
This went rather smoothly. But:
- What if I want to reserve my own data set for validation before predicting on the test set?
- What if I want further details on factor selection done by the model?
- What if I simply want to try a different model?
`caret` helps with all of these things and more.
# Part 2: Data and Model Exploration in `caret`
## Variable importance and parameter tuning
Let's improve upon Kaggle's example model by applying some of `caret`'s functionality. We begin by loading the `caret` package. We will simultaneously load the parallel processing package `doMC` and tell it how many cores we're rocking (the Mac on which I wrote this has four cores with two threads each). For those packages that implement some form of parallelization, `caret` does not interfere; `randomForest` is definitely one of those packages.
See the [`caret` documentation](http://topepo.github.io/caret/parallel.html) for additional information.
```{r, message=FALSE, warning=FALSE}
library(caret)
library(doMC)
registerDoMC(8)
```
## `createDataPartition`
The first function we will want to learn is `caret`'s data partitioning function. Here is the function call from the [documentation](http://www.inside-r.org/node/87010):
    createDataPartition(
      y,
      times = 1,
      p = 0.5,
      list = TRUE,
      groups = min(5, length(y))
    )
This function takes `times` samples from your data vector `y` of proportion `p`. If your data are discrete, `createDataPartition` will automatically take a representative sample from each level as best as it can; otherwise, you can use `groups` to help `caret` partition a continuous variable.
The values returned are the indices of the chosen elements of `y`.
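For instance, here is a minimal sketch (not evaluated here) of partitioning a hypothetical continuous outcome; the vector `y` below is made up purely for illustration, and `groups` bins its values so that each partition is representative across the range of `y`.
```{r, eval=FALSE}
# hypothetical continuous outcome, for illustration only
y <- rnorm(1000)

# two independent 75% samples; groups = 4 bins y into quartiles
# so each sample spans the range of y
parts <- createDataPartition(y, times = 2, p = 0.75, groups = 4, list = TRUE)

str(parts)                    # a list of two integer vectors of indices
firstSample <- y[parts[[1]]]  # subset with the first partition
```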
## `train`
This function trains your model. Again from the slimmed down [docs](http://www.inside-r.org/packages/cran/caret/docs/train):
    train(
      x,
      y,
      method = "rf",
      ...
    )
This call returns a `train` object, which is basically a list. The model contained is built applying the given `method` (in this case `"rf"` means random forest) to the predictors in the data frame `x` and with associated vector of outcomes `y`. As `caret` is really just a wrapper for the underlying packages that deploy various methods, we can pass additional arguments through the ellipses as needed.
Let's have a look at an example. These lines are nearly identical to those from Kaggle's "benchmark" code. A few things are different:
- I want to plot soon, so I reduced from a sample of 10,000 to one of about 1,000
- I asked `randomForest` to keep track of variable importance, which it does not do by default
You can see that we pass `list=FALSE` to `createDataPartition`; as we only have one sample, we'd like to have our row numbers in a vector so that we can easily subset our data with it. We also used the formula interface of the `train` function rather than slicing the data frame via `train(naiveData[, -1], naiveData$label, ...)`.
```{r}
set.seed(0)
inTrain <- createDataPartition(trainData$label, p=1/40, list=FALSE)
naiveData <- trainData[inTrain, ]
naiveModel <- train(
  label ~ .,
  data = naiveData,
  method = "rf",
  ntree = 25,
  importance = TRUE
)
```
## `varImp`
Since we've asked `randomForest` to keep track of importance, let's have a look at it. The `varImp` function computes importance on a scale from 0 to 100 by default; set `scale=FALSE` to return the raw score instead.
```{r}
varImp(naiveModel)
```
## `featurePlot`
A wrapper for various `lattice` plots. Once more, the call string from [documentation](http://www.inside-r.org/packages/cran/caret/docs/featurePlot):
    featurePlot(
      x,
      y,
      plot = if(is.factor(y)) "strip" else "scatter",
      ...
    )
As before, `x` holds the predictor data frame and `y` holds the outcome vector. `plot` is a string corresponding to the type of plot you want (e.g., `"pairs"`). `...` implies that you can add additional arguments to be passed down to the `lattice` plot call.
```{r}
featurePlot(
  x = naiveData[, c(320, 380, 432, 543, 600, 1)],
  y = naiveData$label,
  plot = "pairs",
  alpha = 1/20,
  auto.key = list(columns = 10)
)
```
## `train(tuneGrid)`
As an optional argument to pass to `train`, `tuneGrid` allows you to pass in various combinations of hyperparameters to your model in an effort to optimize them. The [`caret` documentation](http://topepo.github.io/caret/training.html#grids) has a nice example that demonstrates how to build a simple grid of hyperparameter combinations, save it as a named object, and pass that in as the `tuneGrid` argument.
If you want to know what hyperparameters a particular method takes, simply call the `modelLookup` function (e.g., `modelLookup("rf")`). It returns a listing of each hyperparameter by name, along with a description and some indicators of its intended use. For additional details, you'll need to check the documentation of the underlying package.
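For example (a quick sketch, not evaluated here), the lookup for our random forest method is a one-liner:
```{r, eval=FALSE}
# list the tunable hyperparameter(s) for the random forest method
modelLookup("rf")

# with no argument, modelLookup lists every method caret knows about
modelLookup()
```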
**Note:** You must save your tuning grid as a named object! `caret` will get angry if you try to pass in a bare call to `expand.grid`.
For `randomForest` (`method="rf"`), there is only one hyperparameter: `mtry`. This tells `randomForest` how many of the predictors to try to split on at each node. By default, `randomForest` considers a random sample of size equal to the square root of the total number of predictors. In our case, 28 x 28 = 784 pixels means the default is 28 pixels considered at each split. But what if that isn't best?
```{r}
set.seed(12345)
inTrain <- createDataPartition(trainData$label, p=0.5, list=FALSE)
fitGrid <- expand.grid(
  mtry = (1:8) * 10 - 2
)
rfModel <- train(
  label ~ .,
  data = trainData[inTrain, ],
  method = "rf",
  tuneGrid = fitGrid,
  ntree = 25
)
```
If you ask `R` to print the `train` object, it outputs a nice summary that includes (within reason) a list of the parameter combinations and the resulting 'quality' metrics (these can be changed).
```{r}
print(rfModel)
```
Similarly, if you plot a `train` object, you get a graph of your metric against your hyperparameter(s).
```{r}
plot(rfModel)
```
Do you like `ggplot2`? So does `caret`!
```{r}
ggplot(rfModel)
```
# Part 3: Model Validation
## Tuning and performance
In this final section, we'd like to look at a few tools that can help validate and analyze your models. For a classification task, the most obvious such tool is the [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix). Perhaps unsurprisingly, `caret` has an aptly-named helper function.
## `confusionMatrix.train`
`caret`'s `confusionMatrix` function has two versions, and the first applies to a `train` object. Assuming that the outcomes of the method call were explicitly discrete, calling `confusionMatrix(myModel)` will return a simple table that shows how frequently each level was guessed correctly or confused for a different level. This is simply a finer level of detail on the accuracy score we've already seen by printing the `train` object directly.
```{r}
confusionMatrix(rfModel)
```
## `predict`
What if we want to know how the model performs on data it wasn't trained on? We'll need to apply it to other data we've been holding back (the point of `createDataPartition`), and compare that to truth values for those data. Once again to the [docs](http://www.inside-r.org/packages/cran/caret/docs/extractPrediction):
    predict(
      object,
      newdata = NULL,
      ...
    )
Here, `object` is the `train` object we're using to predict, and `newdata` is a data frame containing the withheld data. As with other `caret` functions, this is essentially a wrapper for the prediction functions of the various packages, so additional arguments are occasionally necessary and can be passed through the ellipsis.
```{r}
rfValidData <- predict(rfModel, trainData[-inTrain, ])
```
## `confusionMatrix`
Now that we have used our model to predict the outcomes for new data, we'll want to compare that to the known truth values. This is the other (perhaps more useful) version of `confusionMatrix`. As usual, the [docs](http://www.inside-r.org/node/86995):
    confusionMatrix(
      data,
      reference,
      ...
    )
Here, `data` is a vector of newly predicted values and `reference` is the vector of corresponding truth values.
```{r}
confusionMatrix(rfValidData, trainData[-inTrain, "label"])
```
## `trainControl`
This is going rather well, but we have not yet considered how the model is validating itself as it trains. By default, `caret` uses bootstrap resampling; this can be changed, however. Much like `tuneGrid`, there is a `trControl` argument to the `train` function that takes the output of the `trainControl` function (as documented [here](http://www.inside-r.org/packages/cran/caret/docs/trainControl)):
    trainControl(
      method = "boot",
      number = ifelse(method %in% c("cv", "repeatedcv"), 10, 25),
      repeats = ifelse(method %in% c("cv", "repeatedcv"), 1, number),
      ...
    )
You can set `method` to be a string like `"repeatedcv"` to change the resampling method, and pass additional parameters that suit your method. I've mentioned the ones that have to do with repeated cross-validation, but there are many others in the docs if you are interested.
```{r}
set.seed(2967)
inTrain <- createDataPartition(trainData$label, p = 0.5, list = FALSE)
fitControl <- trainControl(
  method = "repeatedcv",
  number = 5,
  repeats = 3
)
# hyperparameters must be passed through the tuneGrid argument, even if constant
fitGrid <- expand.grid(mtry = 58)
finalModel <- train(
  label ~ .,
  data = trainData[inTrain, ],
  method = "rf",
  trControl = fitControl,
  tuneGrid = fitGrid
)
```
How did we do this time?
```{r}
print(finalModel)
```
```{r}
confusionMatrix(finalModel)
```
Is that just overfitting? How about on the other, reserved half of the data?
```{r}
validData <- predict(finalModel, trainData[-inTrain, ])
confusionMatrix(validData, trainData[-inTrain, "label"])
```
# Part 4: Other Features
## Additional Models
- Is a random forest not appropriate for your modeling task? There are over 200 other [models `caret` can handle](http://topepo.github.io/caret/modelList.html).
- Don't see what you want? Well, you'll get no help from me, but `caret` is capable of handling [custom models](http://topepo.github.io/caret/custom_models.html).
## Additional additions
- Instead of `createDataPartition` using the outcome, you can [split on the predictors](http://topepo.github.io/caret/splitting.html#predictors) using (for example) maximum dissimilarity.
- You can also address class imbalance by having `caret` [up- or down-sample](http://topepo.github.io/caret/sampling.html) so that underrepresented classes carry more weight in model training.
- `caret` can help [pre-process your data](http://topepo.github.io/caret/preprocess.html), often from right inside the `train` function.
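For instance, here is a minimal sketch (not evaluated) of that last point: `train` accepts a `preProcess` argument, so centering and scaling can be folded directly into model fitting. The objects reused below (`trainData`, `inTrain`, `fitControl`, `fitGrid`) are the ones defined earlier; a random forest doesn't actually care about scaling, so this is purely to show the mechanism.
```{r, eval=FALSE}
# illustrative only: center and scale the predictors inside train()
ppModel <- train(
  label ~ .,
  data = trainData[inTrain, ],
  method = "rf",
  preProcess = c("center", "scale"),
  trControl = fitControl,
  tuneGrid = fitGrid
)
```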