edits

deangladish · May 22, 2018 · 2e580a0 · 2e580a0
1 parent 96f19c8
commit 2e580a0
Show file tree

Hide file tree

Showing 2 changed files with 33 additions and 26 deletions.
diff --git a/Case-Study-3-CodeSup.rmd b/Case-Study-3-CodeSup.rmd
@@ -28,16 +28,17 @@ library(GGally)
 library(effects)
 ```
 
-After loading the necessary libraries, we also need to load the dataset of interest:  
+After loading the necessary libraries, we first need to take a look at the data: 
 
 ```{r, message = F, warning = F}
 nes <- read.csv("http://aloy.rbind.io/data/NES.csv")
 head(nes)
 summary(nes)
+sum(is.na(nes))
 
 ```
 
-The following explains our derivation of a model:  
+After inspecting the data and getting a bit of an idea about what our data set looked like (which features have numeric values, which features are binary, ect) we began to select our model. The following explains our derivation of a model:  
 
 ```{r, message = F, warning = F}
 # The following code generates our baseline model for dem.  
@@ -47,36 +48,40 @@ summary(glm.base)
 # Regressing on a constant allows us to hold everything except for dem (party 
 # identification) constant.  
 glm.basic <- glm(dem ~ 1, data = nes, family = binomial)
+```
+
+After fitting a couple of initial models, used BIC to proform stepwise selection of a final model. We started off trying to find the best model using only the untransformed variables, but we also checked if adding square or interaction terms would produce a better model, but none of these more complex terms ended up in the final model. Interestingly, region was not included in our final model.
 
+```{r, message = F, warning = F}
+stpFwd <- stepAIC(glm.basic, scope = list(lower = ~1, upper = ~ year + region + union + income + educ + gender + race + age), direction = "both", k = log(nrow(nes)))
 
-stpFwd <- stepAIC(glm.basic, scope = list(lower = ~1, upper = ~ year + region + union + income + educ + gender + race + age), direction = "both")
 summary(stpFwd)
-stpBk <- stepAIC(glm.base, scope = list(lower = ~1, upper = ~ year + region + union + income + educ + gender + race + age), direction = "both")
+
+stpBk <- stepAIC(glm.base, scope = list(lower = ~1, upper = ~ year + region + union + income + educ + gender + race + age), direction = "both", k = log(nrow(nes)))
+
 summary(stpBk)
 
 glm.square <- glm(dem ~ year + region + union + income + educ + gender + race + age + 
                     I(year)^2 + I(region)^2 + I(union)^2 + I(income)^2 + I(educ)^2 + I(gender)^2 + I(race)^2 + I(age)^2, 
-                  data = nes, family = binomial)
+                            data = nes, family = binomial, k = log(nrow(nes)))
 
 glm.inter <- glm(dem ~ year + region + union + income + educ + gender + race + age + 
-                   age * year + age * region + age * union + age * income + age * educ + age * gender + age * race, data = nes, family = binomial)
+                   age * year + age * region + age * union + age * income + age * educ + age * gender + age * race, data = nes,                               family = binomial, k = log(nrow(nes)))
 summary(glm.inter)
+
 stp.inter <- stepAIC(glm.inter, scope = list(lower = ~1, upper = ~ year + region + union + income + educ + gender + race + age + 
-                                               age * year + age * region + age * union + age * income + age * educ + age * gender + age * race),
-                     direction = "both")
+                                               age * year + age * region + age * union + age * income + age * educ + age * gender +                                                   age * race), direction = "both", k = log(nrow(nes)))
 
 glm.inter <- glm(dem ~ year + region + union + income + educ + gender + race + age + 
-                  + age * union + age * income + age * educ + age * race, data = nes, family = binomial)
+                  + age * union + age * income + age * educ + age * race, data = nes, family = binomial, k = log(nrow(nes)))
 
 stp.inter <- stepAIC(glm.inter, scope = list(lower = ~1, upper = ~ year + region + union + income + educ + gender + race + age + 
-                                               age * year + age * region + age * union + age * income + age * educ + age * gender + age * race),
-                     direction = "both", k = log(nrow(nes)))
+                                               age * year + age * region + age * union + age * income + age * educ + age * gender +                                                    age * race), direction = "both", k = log(nrow(nes)))
 
 summary(stp.inter)
 
-stp.inter.fwd <- stepAIC(glm.basic, scope = list(lower = ~1, upper = ~ year + region + union + income + educ + gender + race + age + 
-                                                   age * year + age * region + age * union + age * income + age * educ + age * gender + age * race),
-                         direction = "both", k = log(nrow(nes)))
+stp.inter.fwd <- stepAIC(glm.basic, scope = list(lower = ~1, upper = ~ year + region + union + income + educ + gender + race + age +                                    age * year + age * region + age * union + age * income + age * educ + age * gender + age * race),
+                                         direction = "both", k = log(nrow(nes)))
 
 summary(stp.inter.fwd)
 
@@ -97,21 +102,22 @@ plot(allEffects(stp.inter.fwd), rows = 2, cols = 3, type = "link",
 
 ```
 
-As we can see from the plots, males have lower odds of supporting the Democratic party.  
-
-Although people in North Carolina and the Southern region have lower odds of supporting the Democratic party, our model has been simplified such that these do not matter.  
+We can see from the plots that males, white people, older people, and the wealthy have lower odds of supporting the Democratic party.  
 
-Those who are not in unions also have lower odds of supporting the Democratic party.  
+Union members have higher odds of supporting the Democratic party.  
 
 For further analysis of the probability that any given individual supports the Democrats, we can use the following code:  
 
 ```{r, message = F, warning = F}
 plot(Effect(c("gender", "union"), stp.inter.fwd), multiline = TRUE, type = "response", ylab = "Probability(Democrat)")
+plot(Effect(c("income", "gender"), stp.inter.fwd), multiline = TRUE, type = "response", ylab = "Probability(Democrat)")
+plot(Effect(c("race", "gender"), stp.inter.fwd), multiline = TRUE, type = "response", ylab = "Probability(Democrat)")
+plot(Effect(c("race", "age"), stp.inter.fwd), multiline = TRUE, type = "response", ylab = "Probability(Democrat)")
 ```
 
-This code allows us to more clearly see that Support of the Democratic Party tends to come from people who are in unions and who are female.  
+This graphs clarify the connection between gender, union membership, income, age, race, and support of the Democratic Party.
 
-More specifically, Unionization seems to have the largest effect on support, followed by Gender and then Region.  
+They show us that baed on our model having a higher income, being white, being male, being of older age, and not being a member of a union increase the chance that a person is not a democrat, and if a person is the oppisite of these things then they are likely a democrat.
 
 NOW, we need to assess the significance of these effects regardless of time.  
 
@@ -137,7 +143,7 @@ anova(gender_only, glm.base, test = "Chisq")
 shows that we can reject the notion that the other coefficients are not necessary.  
 
 ```{r, message = F, warning = F}
-plot(glm.base$residuals)
+plot(stp.inter.fwd$residuals)
 ```
 
 The residuals plot shows that our model generally fits the data.  
@@ -148,9 +154,9 @@ anova(stp.inter.fwd, test = "Chisq")
 qqnorm(stp.inter.fwd$residuals)
 ```
 
-Our p-values indicate that none of our coefficients are non-significant at the 0.05 level.  
+Our p-values indicate that all of our variables are significant.
 
-Despite the qqplot indicating a potential lack of normality, this is typical for multiple regression and our model does fit the data as indicated by the Loess smoother on our Deviance residuals plots.  
+Despite the qqplot indicating a potential lack of normality, this is typical for multiple regression and our model does fit the data as indicated by the Loess smoother on our Deviance residuals plots below:
 
 ```{r, message = F, warning = F}
 residualPlots(stp.inter.fwd, tests = false, type = "response")

diff --git a/Case-Study-3-WriteUp.rmd b/Case-Study-3-WriteUp.rmd
@@ -11,11 +11,11 @@ knitr::opts_chunk$set(echo = TRUE)
 
 ## Introduction  
 
-\ \ \ \ \ \ \ In this modern world, party preference is typically thought to be associated with the demographics and geography of a populace.  It is of interest to politicians and the media alike to determine the extent of such correlation in order to understand which groups are most likely to vote for the party.  Our case study, which uses data collected from U.S. adults from the 1980 and 2000 elections respectively as part of the National Election Studies project, is an investigation into the matter that allows us to model party preference using the logistic regression model.  Specifically, we aim to address whether gender, regional, and union differences play a part in party preference over time.  
+\ \ \ \ \ \ \ Political party preference is typically thought to be associated with the demographics and geography of a populace.  It is of interest to politicians, political scientists and the media alike to determine the extent of such correlation in order to understand which groups are most likely to vote for the party.  Our case study, which uses data collected from U.S. adults from the 1980 and 2000 elections respectively as part of the National Election Studies project, is an investigation into the matter that allows us to model party preference using the logistic regression model.  Specifically, we aim to address whether gender, regional, and union differences play a part in party preference over time.  
 
 ## Data  
 
-\ \ \ \ \ \ \ The dataset that we analyzed consists of a binary indicator variable indicating Democratic Party preference as well as numerous other categorical variables corresponding to factors such as year, age, gender, race, region, income, unionized and educational status.  The three explanatory variables that we focus on are *gender*, *region*, and *union*.  
+\ \ \ \ \ \ \ The dataset that we analyzed consists of a binary indicator variable indicating Democratic Party preference as well as numerous other categorical variables corresponding to factors such as year, age, gender, race, region, income, unionized and educational status.  The explanatory variables that we focus on are *gender*, *race*, *income*, *age* and *union*.  
 
 
 The following table gives our estimates of the important aspects (coefficients, etc.) of our model:  
@@ -66,7 +66,7 @@ p1 <- crPlot(stp.inter.fwd, variable = "gender")
 
 ## Results:  
 
-\ \ \ \ \ \ \ Using the AIC criterion we obtained the following model:  
+\ \ \ \ \ \ \ Using the BIC criterion we obtained the following model:  
 
 \begin{center} $\widehat Y_i \{dem\} = \beta_0 + \beta_1 raceother + \beta_2 racewhite + \beta_3 unionyes + \beta_4 incomemiddle \ 1/3 + \beta_5 incomeupper \ 1/3 + \beta_6 age + \beta_7 gendermale$ \end{center}
 
@@ -75,6 +75,7 @@ The model shown above can be interpreted as follows:
 
 ## Discussion:  
 
+\ \ \ \ \ \ \