chapter3.Rmd

---
title: "Chapter3"
author: "Nate"
date: "12/17/2018"
output: html_document
---

```{r setup, include=TRUE, message=F}
knitr::opts_chunk$set(comment = NA, message=F, warning=F)
library(faraway)
library(tidyverse)
```

#### 1. For the `prostate` data, fit a model with `lpsa` as the response and the other variables as predictors:

a. Compute 90 and 95% CIs for the parameter associated with `age`. Using just these intervals what could we have deduced about the p-value for age in the regression summary?

```{r prostate_a}
data(prostate) # from library(faraway)

m <- lm(lpsa ~ ., data = prostate)
summary(m)

confint(m, c("age"), .95)
confint(m, c("age"), .90)
```
Based on the confidence intervals `age` is on the border for being considered significant. Because the 90% CI doesn't include 0, `age` is signficant at that level, but the upper bound of the more stringent 95% CI stretches just over zero to 0.0025. The regression sumamry confirms this by returning a p-value of 0.08, close but above the common 0.05 threshold for significance.
   
b. Compute and display a 95% joint confidence region for the parameters associated with `age` and `lbph`. Plot the origin on this display. The location of the origin on the display tells us the outcome for a certain hypothesis test. State that test and its outcome.

```{r prostate_b}
library(ellipse)
plot(ellipse(m, c('age', 'lbph')), type = "l")
points(0, 0, pch = 1)
abline(v= confint(m)['age',], lty = 2)
abline(h= confint(m)['lbph',], lty = 2)
```

The joint null hypothesis `age = lbph = 0`, can not be rejected because the origin lies inside of the confidence region ellipse. Similarly the null hypothesis `age = 0` can not be rejected becasue 0 lies with the 95% confidence bounds and the same is true for the null hypothesis `lbph = 0`.
   
c. In the test, we made a permutation test corresponding to the F-test for the significant of all the predictors. Execute the permutation test corresponding to the t-test for age in the model. (Hint: `summary(g)$coef[4,3]` gets your the t-statistic you need if the model is called `g`)

```{r prostate_c}
t_value <- summary(m) %>% coef() %>% .['age', 't value']

# function to permutate n-times
permute_tmod <- function(nsims) {
    map_dbl(1:nsims,
            ~ lm(sample(lpsa) ~ ., data = prostate) %>%
            summary() %>%
            coef() %>%
            .['age', 't value'])
}

mean(abs(permute_tmod(100)) > abs(t_value))
mean(abs(permute_tmod(1000)) > abs(t_value))
mean(abs(permute_tmod(10000)) > abs(t_value))
```

From section a above we not know the p-value for `age` is 0.08229, and we can see the return value from the permutation getting closer to that number as the number of simulations increases.


d. Remove all the predictors that are not significant at the 5% level. Test this model against the original model. Which model is preffered?

```{r prostate_d}
m2 <- update(m, . ~ lcavol + lweight + svi)

anova(m, m2)
```

The reduced model is not significantly better than the full model so we would choose `m` over `m2`.

#### 2. Thirty samples of c