---
title: "K-Nearest Neighbor - Predicting Bug Covering by Majority Voting"
author: "Christian Medeiros Adriano"
date: "August 7, 2017"
output: html_document
---
```{r setup, include=FALSE}
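# Load the aggregation script and build the per-question summary table.
source("C://Users//chris//OneDrive//Documentos//GitHub//ML_VotingAggregation//aggregateAnswerOptionsPerQuestion.R")
summaryTable <- runMain()
library(class)    # knn.cv
library(gmodels)  # CrossTable
library(ggplot2)  # plotting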
```
## The goal of the study
<p>
Evaluate a prediction based on the difference between the number of YES votes and the number of NO votes. For a more detailed explanation, please see the
<a href="http://rpubs.com/christian_adriano/knn_cv_ranking_buggy_codefragments">previous analysis</a>.</p>
<p>The study has two goals:</p>
<ul>
<li>
Train a machine learning algorithm that predicts whether a code fragment is related to a failure or not. For that, I originally devised different metrics. The metric explored in this study is the majority vote, that is, the difference between the number of YES and NO answers.
</li>
</ul>
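
<p>As a minimal sketch of how such a metric can be computed, the example below counts YES and NO answers per question and takes their difference. The data frame <code>answers</code> and its columns <code>questionID</code> and <code>answer</code> are hypothetical; the actual aggregation is done by the sourced script and may differ.</p>

```r
# Hypothetical worker answers: one row per answer to a question.
answers <- data.frame(
  questionID = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
  answer     = c("YES", "YES", "NO", "YES", "NO", "NO", "YES", "NO", "NO", "NO")
)

# Count YES and NO answers per question.
yesCount <- tapply(answers$answer == "YES", answers$questionID, sum)
noCount  <- tapply(answers$answer == "NO",  answers$questionID, sum)

# Majority vote = number of YES votes minus number of NO votes.
majorityVote <- yesCount - noCount
print(majorityVote)
```

<p>A positive value means more YES than NO answers; a negative value means the opposite.</p>
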
```{r dataprep, include=FALSE}
## Data preparation
# I need to guarantee that some examples (i.e., failing methods)
# do not dominate the training or testing sets. To do that, I need
# close to equal proportions of examples in both sets. I do that by
# shuffling the rows.
set.seed(9850)
g <- runif(nrow(summaryTable))            # one uniform random number per row
summaryTable <- summaryTable[order(g), ]  # reorder the rows by the random index
# Convert the majorityVote column to numeric
summaryTable[, "majorityVote"] <- as.numeric(unlist(summaryTable[, "majorityVote"]))
# Keep the label (bugCovering) and the single feature (majorityVote)
trainingData <- summaryTable[, c("bugCovering", "majorityVote")]
```
## Building the model
<p>I chose knn.cv (cross-validation) to minimize the risk of a lucky split between training and testing sets.</p>
<p>Cross-validation was performed using the leave-one-out method.</p>
```{r knn.cv.class}
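# Build the model (leave-one-out). Note: trainingData still includes the bugCovering column,
# so the label is also passed as a feature; trainingData[, "majorityVote", drop = FALSE] would use the vote alone.
fitModel.cv <- knn.cv(train = trainingData, cl = trainingData$bugCovering, k = 3, l = 0, prob = FALSE, use.all = TRUE)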
```
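
<p>The following is a sketch of leave-one-out cross-validation with <code>knn.cv</code>, using the majority vote as the only feature and repeating the run for several values of k. The data frame <code>toy</code> is made up for illustration and is not the study's data, so the accuracies it prints say nothing about the real results.</p>

```r
library(class)

# Illustrative toy data (not the study's data).
toy <- data.frame(
  bugCovering  = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE,
                   FALSE, FALSE, FALSE, FALSE, FALSE, FALSE),
  majorityVote = c(6, 4, 3, 1, 0, -1, -2, -3, -4, -5, -6, -7)
)

# Pass only the feature column as the training set; the label goes in cl.
features <- toy[, "majorityVote", drop = FALSE]
labels   <- factor(toy$bugCovering)

set.seed(9850)  # knn.cv breaks voting ties at random
kValues <- c(3, 5, 7, 9)

# Leave-one-out accuracy for each k.
accuracies <- sapply(kValues, function(k) {
  predicted <- knn.cv(train = features, cl = labels, k = k)
  mean(as.character(predicted) == as.character(labels))
})
names(accuracies) <- paste0("k=", kValues)
print(round(accuracies, 3))
```

<p>Keeping the label out of the <code>train</code> argument ensures that the prediction for each held-out row depends only on the votes of its neighbors.</p>
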
<p>I also ran the model with different values of k (3, 5, 7, and 9), which produced similar results.</p>

## Testing the model
```{r test.knn.cv.class, echo=FALSE}
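# Cross-tabulate the actual labels against the cross-validated predictions
fitModel.cv.df <- data.frame(fitModel.cv)
CrossTable(x = trainingData$bugCovering, y = fitModel.cv.df[, 1], prop.chisq = FALSE, dnn = c("actual", "predicted"))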
```
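
<p>The cross table can also be summarized as accuracy, precision, and recall. The sketch below does this with base R on hypothetical label vectors (<code>actual</code> and <code>predicted</code> are illustrative, not the study's output).</p>

```r
# Hypothetical actual and predicted labels (illustrative only).
actual    <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)
predicted <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)

# Fix the factor levels so the table always has both rows and both columns.
confusion <- table(
  actual    = factor(actual,    levels = c(FALSE, TRUE)),
  predicted = factor(predicted, levels = c(FALSE, TRUE))
)

truePositives  <- confusion["TRUE", "TRUE"]
falsePositives <- confusion["FALSE", "TRUE"]
falseNegatives <- confusion["TRUE", "FALSE"]

accuracy  <- sum(diag(confusion)) / sum(confusion)
precision <- truePositives / (truePositives + falsePositives)
recall    <- truePositives / (truePositives + falseNegatives)

print(confusion)
cat("accuracy:", accuracy, "precision:", precision, "recall:", recall, "\n")
```
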
## Estimating the metric
<p>Find the smallest majority vote value among the questions that the model classified as bug-covering.</p>
```{r knn.cv.metric, echo=FALSE}
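trainingData$bugCovering <- as.factor(trainingData$bugCovering)
# Keep the rows that the model classified as bug-covering
predictedBugCoveringList <- trainingData[fitModel.cv.df[, 1] == TRUE, ]
# Extract their majority vote values as a numeric vector
predictedList <- as.numeric(unlist(predictedBugCoveringList[, "majorityVote"]))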
```
<p>Mean majority vote of the questions categorized as bug covering:</p>
```{r knn.cv.mean, echo=FALSE}
mean(predictedList) #mean vote
```
<p>Minimum majority vote of the questions categorized as bug covering:</p>
```{r knn.cv.minimal, echo=FALSE}
min(predictedList) #minimal vote
```
## Plot metric distribution
```{r plot, echo=FALSE}
predictedList.df <- data.frame(votes = predictedList)
ggplot(data = predictedList.df, aes(x = votes)) +
  geom_histogram(binwidth = 1, alpha = .5, position = "identity") +
  # Dashed red line at the mean vote, ignoring NA values
  geom_vline(xintercept = mean(predictedList.df$votes, na.rm = TRUE),
             color = "red", linetype = "dashed", size = 1) +
  ggtitle("Distribution of votes for the questions categorized as bug-covering") +
  labs(x = "Majority vote values of questions categorized as bug-covering. Minimum vote=-2, mean=3.36",
       y = "Frequency")
```
<p></p>
<p>The distribution of majority vote values shows that a question must have a majority vote greater than or equal to -2 (minus two) to be predicted as bug-covering.</p>
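
<p>One way to read this result is as a simple decision rule on the vote value. The sketch below applies the -2 threshold to a hypothetical data frame <code>questions</code> (illustrative values only) and compares the outcome with the known labels.</p>

```r
# Hypothetical questions (illustrative only, not the study's data).
questions <- data.frame(
  bugCovering  = c(TRUE, TRUE, FALSE, FALSE, FALSE),
  majorityVote = c(5, -2, -1, -3, -6)
)

voteThreshold <- -2  # minimum vote among questions predicted as bug-covering

# Rule: flag a question as bug-covering when its vote reaches the threshold.
questions$flagged <- questions$majorityVote >= voteThreshold

# Compare the rule's output against the known labels.
print(table(actual = questions$bugCovering, flagged = questions$flagged))
```

<p>Note that the minimum is only a lower bound: a question with a vote at or above the threshold can still turn out not to be bug-covering, as the third row of the toy data shows.</p>
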
<br><br>