-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathPractical_Machine_Learning_Course_Project.rmd
223 lines (171 loc) · 7.67 KB
/
Practical_Machine_Learning_Course_Project.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
---
title: "Practical Machine Learning - Course Project"
author: "Dr. Robert Bauer"
date: "April 10, 2020"
output: html_document
---
## Background
One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants.
Each participant was asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions:
* exactly according to the specification (Class A),
* throwing the elbows to the front (Class B),
* lifting the dumbbell only halfway (Class C),
* lowering the dumbbell only halfway (Class D) and
* throwing the hips to the front (Class E).
Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes. Participants were supervised by an experienced weight lifter to make sure the execution complied to the manner they were supposed to simulate. The exercises were performed by six male participants aged between 20-28 years, with little weight lifting experience. We made sure that all participants could easily simulate the mistakes in a safe and controlled manner by using a relatively light dumbbell (1.25kg).
Read more: http://groupware.les.inf.puc-rio.br/har#ixzz6NA7beuhc
## Objective
The goal of this project is to predict the manner in which they did the exercise, represented by the "classe" variable in the training set. All other variables can be used as prediction variables.
This report describes the set up and cross validation of the modelling, quantifies the expected out of sample error. Finally, the model is being used to predict 20 different test cases.
## Getting started
First we need to load the required packages and get the training and test data.
```{r collapse=TRUE}
rm(list = ls())
library(caret)
# load data
train_file <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
training <- read.csv(train_file)
test_file <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
testing <- read.csv(test_file)
```
## Preparing the data
As a first checkup, we need to verify that both data frames have the same variables.
```{r collapse=TRUE}
dim(training)
dim(testing)
```
It appears that this is quite a long list of column names (variables). Let's check it with some R-code:
```{r collapse=TRUE}
all(names(training) %in% names(testing))
names(training)[which(!(names(training) %in% names(testing)))]
```
classe-variable is the only missing variable in the testing dataframe.
Now let's check on the training data frame:
```{r collapse=TRUE}
all(names(testing) %in% names(training))
names(testing)[which(!(names(testing) %in% names(training)))]
```
problem-id is the only missing variable in the training dataframe
```{r}
testing[['problem_id']]
```
While we obviously need to keep the classe-column in the training dataset as our response-variable, we can delete the problem_id-variable from the testing data set, since it's just returning the row lines.
```{r}
testing[['problem_id']] <- c()
```
randomly split the full training data (training) into a smaller training set (Train_Set) and a cross validation set (Val_Set):
```{r}
inTrain <- createDataPartition(training$classe, p=0.7, list=FALSE)
Train_Set <- training[inTrain, ]
Val_Set <- training[-inTrain, ]
```
drop variables that are mostly NA or have low variance
```{r}
NA_variables <- which(sapply(Train_Set, function(x) length(which(is.na(x)))/length(x)) > 0.95)
near0_variables <- nearZeroVar(Train_Set)
dropvars <- unique(c(NA_variables, near0_variables))
Train_Set <- Train_Set[, -dropvars]
Val_Set <- Val_Set[, -dropvars]
```
check for colinear variables
```{r}
library(data.table)
cor_matrix <- cor(Train_Set[sapply(Train_Set, is.numeric)])
ctable <- data.table::melt(cor_matrix)
qplot(x=Var1, y=Var2, data=ctable, fill=value, geom="tile") +
scale_fill_gradient2(limits=c(-1, 1)) +
theme(axis.text.x = element_text(angle=-90, vjust=0.5, hjust=0))
```
Search through the correlation matrix and return the columns to remove to reduce pair-wise correlations.
```{r}
colinvars <-findCorrelation(cor_matrix, cutoff = .90)
Train_Set <- Train_Set[, -colinvars]
Val_Set <- Val_Set[, -colinvars]
```
Subsample Training Set to work with (to speed up the modelling)
```{r}
TSet <- Train_Set[sample(1:nrow(Train_Set),size = 1500,replace = F),]
```
This leaves the following models for fitting:
```{r}
names(TSet)[names(TSet) != "classe"]
```
***
## Model building
Set the seed to 2616 and predict "classe" with all the other variables using the following three models:
1. a random forest decision trees ("rf"),
2. decision trees with CART (rpart),
3. gradient boosting trees ("gbm") and
4. linear discriminant analysis ("lda") model.
### random forest descision trees
Let's start training and predicting with the random forest decision trees:
```{r}
set.seed(2616)
trControl <- trainControl(method="cv", number=5) # include cross validation as train control method.
modFit_rf <- train(classe ~ .,method="rf", data=TSet, trControl=trControl,verbose=FALSE) #, ntree=100)
cfm_rf <- confusionMatrix(Val_Set$classe, predict(modFit_rf, newdata = Val_Set))
print(cfm_rf)
```
## decision trees with CART (rpart)
```{r}
set.seed(2616)
trControl <- trainControl(method="cv", number=5) # include cross validation as train control method.
modFit_rpart <- train(classe ~ .,method="rpart", data=TSet, trControl=trControl)
cfm_rpart <- confusionMatrix(Val_Set$classe, predict(modFit_rpart, newdata = Val_Set))
print(cfm_rpart)
```
optional:
```{r}
library(rattle)
fancyRpartPlot(modFit_rpart$finalModel)
```
## gradient boosting trees ("gbm")
```{r}
set.seed(2616)
trControl <- trainControl(method="cv", number=5) # include cross validation as train control method.
modFit_gbm <- train(classe ~ .,method="gbm", data=TSet, trControl=trControl, verbose=FALSE)
cfm_gbm <- confusionMatrix(Val_Set$classe, predict(modFit_gbm, newdata = Val_Set))
print(cfm_gbm)
```
## linear discriminant analysis ("lda")
```{r, message=FALSE}
trControl <- trainControl(method="cv", number=5) # include cross validation as train control method.
modFit_lda <- train(classe~., method = 'sparseLDA',data=TSet, trControl=trControl, verbose=F)
cfm_lda <- confusionMatrix(Val_Set$classe, predict(modFit_lda, newdata = Val_Set))
print(cfm_lda)
```
Get accuracy estimates from the confusion matrices of each model:
```{r}
models <- c("modFit_rf","modFit_rpart", "modFit_gbm","modFit_lda")
n <- length(models)
accuracy <- rep(NA,n)
for(i in 1:n){
cfm <- get(paste0("cfm_",gsub("modFit_","",models[i])))
accuracy[i] <- cfm$overall[1]
}
comp <- data.frame(model=gsub("modFit_","",models),accuracy)
print(comp)
```
Our test reveals that the random forest and gbm models are clearly performing best, with only slight differences between them.
## Variable importance
Let's check for the five most important variables in the rf and gbm models and their relative importance values:
```{r}
## random forest
vi <- varImp(modFit_rf)$importance
vi[head(order(unlist(vi), decreasing = TRUE), 5L), , drop = FALSE]
## gbm
library(gbm)
vi <- varImp(modFit_gbm)$importance
vi[head(order(unlist(vi), decreasing = TRUE), 5L), , drop = FALSE]
```
Both models apparently rely mainly on the X-variable.
# Predicting testing data
```{r collapse=TRUE}
predicted_classe_rf <- predict(modFit_rf, newdata = testing)
cat("model preidction by the random forest model\n")
print(predicted_classe_rf)
predicted_classe_gbm <- predict(modFit_gbm, newdata = testing)
cat("model preidction by the gbm model\n")
print(predicted_classe_gbm)
```
Both models also predict the same classe-values on the testing variable.