Commit 22abc9c

Update 04-ch4.Rmd

1 parent 048be68

File tree: 1 file changed (+36, -37 lines)

04-ch4.Rmd
@@ -83,14 +83,13 @@ In order to account for these differences between observed data and the systemat
 
 Which other factors are plausible in our example? For one thing, the test scores might be driven by the teachers' quality and the background of the students. It is also possible that in some classes, the students were lucky on the test days and thus achieved higher scores. For now, we will summarize such influences by an additive component:
 
-$$ TestScore = \beta_0 + \beta_1 \times STR + \text{other factors} $$
+$$ TestScore = \beta_0 + \beta_1 \times STR + \text{other factors}. $$
 
-Of course this idea is very general as it can be easily extended to other situations that can be described with a linear model. The basic linear regression model we will work with hence is
+Of course this idea is very general as it can be easily extended to other situations that can be described with a linear model. Hence, the basic linear regression model we will work with is
 
 $$ Y_i = \beta_0 + \beta_1 X_i + u_i. $$
 
-Key Concept 4.1 summarizes the terminology of the simple linear regression model.
-
+Key Concept 4.1 summarizes the linear regression model and its terminology.
 ```{r, eval = my_output == "html", results='asis', echo=F, purl=F}
 cat('
 <div class = "keyconcept" id="KC4.1">
@@ -99,16 +98,16 @@ cat('
 
 <p> The linear regression model is
 
-$$Y_i = \\beta_0 + \\beta_1 X_i + u_i$$
+$$Y_i = \\beta_0 + \\beta_1 X_i + u_i,$$
 
 where
 
-- the index $i$ runs over the observations, $i=1,\\dots,n$
-- $Y_i$ is the *dependent variable*, the *regressand*, or simply the *left-hand variable*
-- $X_i$ is the *independent variable*, the *regressor*, or simply the *right-hand variable*
-- $Y = \\beta_0 + \\beta_1 X$ is the *population regression line* also called the *population regression function*
-- $\\beta_0$ is the *intercept* of the population regression line
-- $\\beta_1$ is the *slope* of the population regression line
+- the index $i$ runs over the observations, $i=1,\\dots,n$;
+- $Y_i$ is the *dependent variable*, the *regressand*, or simply the *left-hand variable*;
+- $X_i$ is the *independent variable*, the *regressor*, or simply the *right-hand variable*;
+- $Y = \\beta_0 + \\beta_1 X$ is the *population regression line*, also called the *population regression function*;
+- $\\beta_0$ is the *intercept* of the population regression line;
+- $\\beta_1$ is the *slope* of the population regression line;
 - $u_i$ is the *error term*.
 </p>
 </div>
@@ -117,15 +116,15 @@ where
 
 ```{r, eval = my_output == "latex", results='asis', echo=F, purl=F}
 cat('\\begin{keyconcepts}[Terminology for the Linear Regression Model with a Single Regressor]{4.1}
-The linear regression model is $$Y_i = \\beta_0 + \\beta_1 X_1 + u_i$$
+The linear regression model is $$Y_i = \\beta_0 + \\beta_1 X_i + u_i,$$
 where
 \\begin{itemize}
-\\item the index $i$ runs over the observations, $i=1,\\dots,n$
-\\item $Y_i$ is the \\textit{dependent variable}, the \\textit{regressand}, or simply the \\textit{left-hand variable}
-\\item $X_i$ is the \\textit{independent variable}, the \\textit{regressor}, or simply the \\textit{right-hand variable}
-\\item $Y = \\beta_0 + \\beta_1 X$ is the \\textit{population regression line} also called the \\textit{population regression function}
-\\item $\\beta_0$ is the \\textit{intercept} of the population regression line
-\\item $\\beta_1$ is the \\textit{slope} of the population regression line
+\\item the index $i$ runs over the observations, $i=1,\\dots,n$;
+\\item $Y_i$ is the \\textit{dependent variable}, the \\textit{regressand}, or simply the \\textit{left-hand variable};
+\\item $X_i$ is the \\textit{independent variable}, the \\textit{regressor}, or simply the \\textit{right-hand variable};
+\\item $Y = \\beta_0 + \\beta_1 X$ is the \\textit{population regression line}, also called the \\textit{population regression function};
+\\item $\\beta_0$ is the \\textit{intercept} of the population regression line;
+\\item $\\beta_1$ is the \\textit{slope} of the population regression line;
 \\item $u_i$ is the \\textit{error term}.
 \\end{itemize}
 \\end{keyconcepts}
@@ -221,7 +220,7 @@ DistributionSummary
 
 As for the sample data, we use `r ttcode("plot()")`. This allows us to detect characteristics of our data, such as outliers which are harder to discover by looking at mere numbers. This time we add some additional arguments to the call of `r ttcode("plot()")`.
 
-The first argument in our call of `r ttcode("plot()")`, `r ttcode("score ~ STR")`, is again a formula that states variables on the y- and the x-axis. However, this time the two variables are not saved in separate vectors but are columns of `r ttcode("CASchools")`. Therefore, `r ttcode("R")` would not find them without the argument `r ttcode("data")` being correctly specified. `r ttcode("data")` must be in accordance with the name of the `r ttcode("data.frame")` to which the variables belong to, in this case `r ttcode("CASchools")`. Further arguments are used to change the appearance of the plot: while `r ttcode("main")` adds a title, `r ttcode("xlab")` and `r ttcode("ylab")` add custom labels to both axes.
+The first argument in our call of `r ttcode("plot()")`, `r ttcode("score ~ STR")`, is again a formula that states the variables on the y- and the x-axis. However, this time the two variables are not saved in separate vectors but are columns of `r ttcode("CASchools")`. Therefore, `r ttcode("R")` would not find them without the argument `r ttcode("data")` being correctly specified. `r ttcode("data")` must match the name of the `r ttcode("data.frame")` to which the variables belong, in this case `r ttcode("CASchools")`. Further arguments are used to change the appearance of the plot: `r ttcode("main")` adds a title, and `r ttcode("xlab")` and `r ttcode("ylab")` add custom labels to both axes.
 
 
 ```{r, fig.align='center'}
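# Sketch of the plotting call described in the paragraph above; the title and
# axis labels are assumptions and may differ from the chunk's actual code.
plot(score ~ STR,
     data = CASchools,
     main = "Scatterplot of Test Score and STR",
     xlab = "STR (X)",
     ylab = "Test Score (Y)")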
@@ -243,15 +242,15 @@ cor(CASchools$STR, CASchools$score)
 
 As the scatterplot already suggests, the correlation is negative but rather weak.
 
-The task we are now facing is to find a line which best fits the data. Of course we could simply stick with graphical inspection and correlation analysis and then select the best fitting line by eyeballing. However, this would be rather subjective: different observers would draw different regression lines. On this account, we are interested in techniques that are less arbitrary. Such a technique is given by ordinary least squares (OLS) estimation.
+The task we are currently facing is to find a line that best fits the data. We could opt for graphical inspection and correlation analysis and then select the best fitting line by eyeballing. However, this would be rather subjective: different observers would draw different regression lines. On this account, we are interested in techniques that are less arbitrary. Such a technique is given by ordinary least squares (OLS) estimation.
 
 ### The Ordinary Least Squares Estimator {-}
 
 The OLS estimator chooses the regression coefficients such that the estimated regression line is as "close" as possible to the observed data points. Here, closeness is measured by the sum of the squared mistakes made in predicting $Y$ given $X$. Let $b_0$ and $b_1$ be some estimators of $\beta_0$ and $\beta_1$. Then the sum of squared estimation mistakes can be expressed as
 
 $$ \sum^n_{i = 1} (Y_i - b_0 - b_1 X_i)^2. $$
 
-The OLS estimator in the simple regression model is the pair of estimators for intercept and slope which minimizes the expression above. The derivation of the OLS estimators for both parameters are presented in Appendix 4.1 of the book. The results are summarized in Key Concept 4.2.
+The OLS estimator in the simple regression model is the pair of estimators for intercept and slope that minimizes the expression above. The derivation of the OLS estimators for both parameters is presented in Appendix 4.1 of the book. The results are summarized in Key Concept 4.2.
 
 ```{r, eval = my_output == "html", results='asis', echo=F, purl=F}
 cat('
@@ -271,7 +270,7 @@ The OLS predicted values $\\widehat{Y}_i$ and residuals $\\hat{u}_i$ are
 \\hat{u}_i & = Y_i - \\widehat{Y}_i.
 \\end{align}
 
-The estimated intercept $\\hat{\\beta}_0$, the slope parameter $\\hat{\\beta}_1$ and the residuals $\\left(\\hat{u}_i\\right)$ are computed from a sample of $n$ observations of $X_i$ and $Y_i$, $i$, $...$, $n$. These are *estimates* of the unknown population intercept $\\left(\\beta_0 \\right)$, slope $\\left(\\beta_1\\right)$, and error term $(u_i)$.
+The estimated intercept $\\hat{\\beta}_0$, the slope parameter $\\hat{\\beta}_1$ and the residuals $\\left(\\hat{u}_i\\right)$ are computed from a sample of $n$ observations of $X_i$ and $Y_i$, $i=1,\\dots,n$. These are *estimates* of the unknown population intercept $\\left(\\beta_0 \\right)$, slope $\\left(\\beta_1\\right)$, and error term $(u_i)$.
 </p>
 
 The formulas presented above may not be very intuitive at first glance. The following interactive application aims to help you understand the mechanics of OLS. You can add observations by clicking into the coordinate system where the data are represented by points. Once two or more observations are available, the application computes a regression line using OLS and some statistics which are displayed in the right panel. The results are updated as you add further observations to the left panel. A double-click resets the application, i.e., all data are removed.
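As a quick check of the formulas in Key Concept 4.2, the following minimal sketch computes $\hat{\beta}_0$, $\hat{\beta}_1$, the predicted values and the residuals by hand and compares the estimates with those returned by `lm()`; the data are simulated purely for illustration.

```r
set.seed(1)

# simulated sample (illustrative only)
X <- runif(100, min = 10, max = 30)
Y <- 700 - 2 * X + rnorm(100, sd = 10)

# OLS estimates based on the closed-form expressions of Key Concept 4.2
beta_1_hat <- sum((X - mean(X)) * (Y - mean(Y))) / sum((X - mean(X))^2)
beta_0_hat <- mean(Y) - beta_1_hat * mean(X)

# predicted values and residuals
Y_hat <- beta_0_hat + beta_1_hat * X
u_hat <- Y - Y_hat

# the manual estimates coincide with the coefficients computed by lm()
rbind(manual = c(beta_0_hat, beta_1_hat),
      lm     = coef(lm(Y ~ X)))
```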
@@ -292,7 +291,7 @@ The OLS estimators of the slope $\\beta_1$ and the intercept $\\beta_0$ in the s
 \\hat{u}_i & = Y_i - \\widehat{Y}_i.
 \\end{align*}
 
-The estimated intercept $\\hat{\\beta}_0$, the slope parameter $\\hat{\\beta}_1$ and the residuals $\\left(\\hat{u}_i\\right)$ are computed from a sample of $n$ observations of $X_i$ and $Y_i$, $i$, $...$, $n$. These are \\textit{estimates} of the unknown true population intercept $\\left(\\beta_0 \\right)$, slope $\\left(\\beta_1\\right)$, and error term $(u_i)$.
+The estimated intercept $\\hat{\\beta}_0$, the slope parameter $\\hat{\\beta}_1$ and the residuals $\\left(\\hat{u}_i\\right)$ are computed from a sample of $n$ observations of $X_i$ and $Y_i$, $i=1,\\dots,n$. These are \\textit{estimates} of the unknown true population intercept $\\left(\\beta_0 \\right)$, slope $\\left(\\beta_1\\right)$, and error term $(u_i)$.
 \\end{keyconcepts}
 ')
 ```
@@ -372,9 +371,9 @@ $R^2$, the *coefficient of determination*, is the fraction of the sample varianc
 
 Since $TSS = ESS + SSR$ we can also write
 
-$$ R^2 = 1- \frac{SSR}{TSS} $$
+$$ R^2 = 1- \frac{SSR}{TSS}, $$
 
-where $SSR$ is the sum of squared residuals, a measure for the errors made when predicting the $Y$ by $X$. The $SSR$ is defined as
+where $SSR$ is the sum of squared residuals, a measure for the errors made when predicting $Y$ by $X$. The $SSR$ is defined as
 
 $$ SSR = \sum_{i=1}^n \hat{u}_i^2. $$
 
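A minimal sketch of the decomposition discussed here, using simulated data rather than the chapter's `CASchools` example; it verifies that $ESS/TSS$ and $1 - SSR/TSS$ agree with the $R^2$ reported by `summary()`.

```r
set.seed(1)

# simulated data and a simple regression fit (illustrative only)
X <- runif(100, min = 10, max = 30)
Y <- 700 - 2 * X + rnorm(100, sd = 10)
fit <- lm(Y ~ X)

TSS <- sum((Y - mean(Y))^2)             # total sum of squares
ESS <- sum((fitted(fit) - mean(Y))^2)   # explained sum of squares
SSR <- sum(residuals(fit)^2)            # sum of squared residuals

# both expressions for R^2 match the value reported by summary()
c(ESS / TSS, 1 - SSR / TSS, summary(fit)$r.squared)
```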
@@ -426,7 +425,7 @@ We find that the results coincide. Note that the values provided by `r ttcode("s
 
 ## The Least Squares Assumptions {#tlsa}
 
-OLS performs well under a quite broad variety of different circumstances. However, there are some assumptions which need to be satisfied in order to ensure that the estimates are normally distributed in large samples (we discuss this in Chapter \@ref(tsdotoe).).
+OLS performs well under a quite broad variety of different circumstances. However, there are some assumptions which need to be satisfied in order to ensure that the estimates are normally distributed in large samples (we discuss this in Chapter \@ref(tsdotoe)).
 
 ```{r, eval = my_output == "html", results='asis', echo=F, purl=F}
 cat('
@@ -467,12 +466,12 @@ $X$ the error tends to be negative. We can use R to construct such an example. T
 
 We will use the following functions:
 
-* `r ttcode("runif()")` - generates uniformly distributed random numbers
-* `r ttcode("rnorm()")` - generates normally distributed random numbers
-* `r ttcode("predict()")` - does predictions based on the results of model fitting functions like `r ttcode("lm()")`
-* `r ttcode("lines()")` - adds line segments to an existing plot
+* `r ttcode("runif()")` - generates uniformly distributed random numbers.
+* `r ttcode("rnorm()")` - generates normally distributed random numbers.
+* `r ttcode("predict()")` - makes predictions based on the results of model fitting functions like `r ttcode("lm()")`.
+* `r ttcode("lines()")` - adds line segments to an existing plot.
 
-We start by creating a vector containing values that are uniformly distributed on the interval $[-5,5]$. This can be done with the function `r ttcode("runif()")`. We also need to simulate the error term. For this we generate normally distributed random numbers with a mean equal to $0$ and a variance of $1$ using `r ttcode("rnorm()")`. The $Y$ values are obtained as a quadratic function of the $X$ values and the error.
+We start by creating a vector containing values that are uniformly distributed on the interval $[-5,5]$. This can be done with the function `r ttcode("runif()")`. We also need to simulate the error term. For this we generate normally distributed random numbers with a mean of $0$ and a variance of $1$ using `r ttcode("rnorm()")`. The $Y$ values are obtained as a quadratic function of the $X$ values and the error.
 
 After generating the data we estimate both a simple regression model and a quadratic model that also includes the regressor $X^2$ (this is a multiple regression model, see Chapter \@ref(rmwmr)). Finally, we plot the simulated data and add the estimated regression line of a simple regression model as well as the predictions made with a quadratic model to compare the fit graphically.
 
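The chunk that implements this simulation is not part of the hunks shown here; a minimal sketch along the lines described above could look as follows (sample size, the coefficients of the quadratic function, and the plotting details are assumptions).

```r
set.seed(321)

# X uniform on [-5, 5], standard normal errors, Y a quadratic function of X
X <- runif(100, min = -5, max = 5)
u <- rnorm(100, mean = 0, sd = 1)
Y <- X^2 + 2 * X + u

# simple linear model and quadratic model (the latter adds the regressor X^2)
mod_linear    <- lm(Y ~ X)
mod_quadratic <- lm(Y ~ X + I(X^2))

# plot the simulated data, the fitted regression line and the quadratic fit
plot(X, Y, pch = 20, col = "steelblue")
abline(mod_linear, col = "red", lwd = 2)
ord <- order(X)
lines(X[ord], predict(mod_quadratic)[ord], col = "black", lwd = 2)
```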
@@ -511,19 +510,19 @@ legend("topleft",
 ```
 
 
-The plot shows what is meant by $E(u_i|X_i) = 0$ and why it does not hold for the linear model:
+The plot above shows what is meant by $E(u_i|X_i) = 0$ and why it does not hold for the linear model:
 
 Using the quadratic model (represented by the black curve) we see that there are no systematic deviations of the observation from the predicted relation. It is credible that the assumption is not violated when such a model is employed. However, using a simple linear regression model we see that the assumption is probably violated as $E(u_i|X_i)$ varies with the $X_i$.
 
 ### Assumption 2: Independently and Identically Distributed Data {-}
 
-Most sampling schemes used when collecting data from populations produce i.i.d.-samples. For example, we could use `r ttcode("R")`'s random number generator to randomly select student IDs from a university's enrollment list and record age $X$ and earnings $Y$ of the corresponding students. This is a typical example of simple random sampling and ensures that all the $(X_i, Y_i)$ are drawn randomly from the same population.
+Most sampling schemes used when collecting data from populations produce i.i.d.-samples. For example, we can use `r ttcode("R")`'s random number generator to randomly select student IDs from a university's enrollment list and record age $X$ and earnings $Y$ of the corresponding students. This is a typical example of simple random sampling and ensures that all the $(X_i, Y_i)$ are drawn randomly from the same population.
 
 A prominent example where the i.i.d. assumption is not fulfilled is time series data where we have observations on the same unit over time. For example, take $X$ as the number of workers in a production company over time. Due to business transformations, the company cuts jobs periodically by a specific share but there are also some non-deterministic influences that relate to economics, politics etc. Using `r ttcode("R")` we can easily simulate such a process and plot it.
 
 We start the series with a total of 5000 workers and simulate the reduction of employment with an autoregressive process that exhibits a downward movement in the long-run and has normally distributed errors:^[See Chapter \@ref(ittsraf) for more on autoregressive processes and time series analysis in general.]
 
-$$ employment_t = -50 + 0.98 \cdot employment_{t-1} + u_t $$
+$$ employment_t = -50 + 0.98 \cdot employment_{t-1} + u_t. $$
 
 ```{r, fig.align="center"}
 # set seed
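# Sketch of the AR(1) simulation described above; the series length, the error
# standard deviation and the seed are assumptions, not the chunk's actual code.
set.seed(123)

# initialize the employment series with 5000 workers
employment <- numeric(100)
employment[1] <- 5000

# employment_t = -50 + 0.98 * employment_{t-1} + u_t
for (t in 2:100) {
  employment[t] <- -50 + 0.98 * employment[t - 1] + rnorm(1, sd = 100)
}

plot(employment, type = "l", col = "steelblue",
     xlab = "Time", ylab = "Workers")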
@@ -563,7 +562,7 @@ Common cases where we want to exclude or (if possible) correct such outliers is
 
 What does this mean? One can show that extreme observations receive heavy weighting in the estimation of the unknown regression coefficients when using OLS. Therefore, outliers can lead to strongly distorted estimates of regression coefficients. To get a better impression of this issue, consider the following application where we have placed some sample data on $X$ and $Y$ which are highly correlated. The relation between $X$ and $Y$ seems to be explained pretty well by the plotted regression line: all of the white data points lie close to the red regression line and we have $R^2=0.92$.
 
-Now go ahead and add a further observation at, say, $(18,2)$. This observation clearly is an outlier. The result is quite striking: the estimated regression line differs greatly from the one we adjudged to fit the data well. The slope is heavily downward biased and $R^2$ decreased to a mere $29\%$! <br>
+Now go ahead and add a further observation at, say, $(18, 2)$. This observation clearly is an outlier. The result is quite striking: the estimated regression line differs greatly from the one we adjudged to fit the data well. The slope is heavily downward biased and $R^2$ decreases to a mere $29\%$! <br>
 Double-click inside the coordinate system to reset the app. Feel free to experiment. Choose different coordinates for the outlier or add additional ones.
 
 <iframe height="410" width="900" frameborder="0" scrolling="no" src="Outlier.html"></iframe>
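The effect described above can also be reproduced outside the interactive application. A small sketch with simulated data (the sample itself is made up for illustration; only the outlier at $(18, 2)$ is taken from the text):

```r
set.seed(42)

# highly correlated sample data (illustrative, not the app's actual data)
X <- runif(20, min = 0, max = 10)
Y <- 1 + 0.8 * X + rnorm(20, sd = 0.5)
fit_clean <- lm(Y ~ X)

# add a single outlying observation at (18, 2) and re-estimate
X_out <- c(X, 18)
Y_out <- c(Y, 2)
fit_outlier <- lm(Y_out ~ X_out)

# slope estimate and R^2 without and with the outlier
c(coef(fit_clean)[2], summary(fit_clean)$r.squared)
c(coef(fit_outlier)[2], summary(fit_outlier)$r.squared)
```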
@@ -767,7 +766,7 @@ Our variance estimates support the statements made in Key Concept 4.4, coming cl
 ### Simulation Study 2 {-}
 
 A further result implied by Key Concept 4.4 is that both estimators are consistent, i.e., they converge in probability to the true parameters we are interested in. This is because they are asymptotically unbiased and their variances converge to $0$ as $n$ increases. We can check this by repeating the simulation above for a sequence of increasing sample sizes. This means we no longer assign the sample size but a *vector* of sample sizes: `r ttcode("n <- c(...)")`. <br>
-Let us look at the distributions of $\beta_1$. The idea here is to add an additional call of `r ttcode("for()")` to the code. This is done in order to loop over the vector of sample sizes `r ttcode("n")`. For each of the sample sizes we carry out the same simulation as before but plot a density estimate for the outcomes of each iteration over `r ttcode("n")`. Notice that we have to change `r ttcode("n")` to `r ttcode("n[j]")` in the inner loop to ensure that the `r ttcode("j")`$^{th}$ element of `r ttcode("n")` is used. In the simulation, we use sample sizes of $100, 250, 1000$ and $3000$. Consequently we have a total of four distinct simulations using different sample sizes.
+Let us look at the distributions of $\beta_1$. The idea here is to add an additional `r ttcode("for()")` loop to the code in order to iterate over the vector of sample sizes `r ttcode("n")`. For each of the sample sizes we carry out the same simulation as before but plot a density estimate for the outcomes of each iteration over `r ttcode("n")`. Notice that we have to change `r ttcode("n")` to `r ttcode("n[j]")` in the inner loop to ensure that the `r ttcode("j")`$^{th}$ element of `r ttcode("n")` is used. In the simulation, we use sample sizes of $100, 250, 1000$ and $3000$. Consequently, we have a total of four distinct simulations using different sample sizes.
 
 
 ```{r, fig.align='center', cache=T,fig.width=8, fig.height=8}
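# Sketch of the nested-loop simulation described above; the data-generating
# process, the number of repetitions and the plot settings are assumptions
# and may differ from the chunk's actual code.
set.seed(1)

# vector of sample sizes and number of repetitions per sample size
n    <- c(100, 250, 1000, 3000)
reps <- 1000

# one density estimate of the simulated slope estimates per sample size
par(mfrow = c(2, 2))

for (j in 1:length(n)) {

  b1 <- numeric(reps)

  # same simulation as before, but with sample size n[j]
  for (i in 1:reps) {
    X <- runif(n[j], min = 0, max = 20)
    Y <- 3 + 3.5 * X + rnorm(n[j], sd = 4)
    b1[i] <- coef(lm(Y ~ X))[2]
  }

  plot(density(b1),
       main = paste("n =", n[j]),
       xlab = expression(hat(beta)[1]))
}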
@@ -815,7 +814,7 @@ Furthermore, (4.1) reveals that the variance of the OLS estimator for $\beta_1$
 We can visualize this by reproducing Figure 4.6 from the book. To do this, we sample observations $(X_i,Y_i)$, $i=1,\dots,100$ from a bivariate normal distribution with
 
 $$E(X)=E(Y)=5,$$
-$$Var(X)=Var(Y)=5$$
+$$Var(X)=Var(Y)=5,$$
 and
 $$Cov(X,Y)=4.$$
 
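One way to draw such a sample is `mvrnorm()` from the `MASS` package; a minimal sketch using the moments stated above (the seed is arbitrary, and the book's own chunk may proceed differently):

```r
library(MASS)  # provides mvrnorm()

set.seed(4)

# mean vector and covariance matrix implied by the moments above
mu    <- c(5, 5)
Sigma <- matrix(c(5, 4,
                  4, 5), ncol = 2)

# draw 100 observations (X_i, Y_i) from the bivariate normal distribution
bvndata <- mvrnorm(n = 100, mu = mu, Sigma = Sigma)
colnames(bvndata) <- c("X", "Y")

head(bvndata)
cov(bvndata)   # the sample covariance matrix is close to Sigma
```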