- Assume that the regression function $E(Y|X)$ is linear in the inputs $X_1,...,X_p$.
- Linear models are simple and often provide an adequate and interpretable description of how the inputs affect the output.
- They can outperform fancier nonlinear models with small numbers of training cases, low signal-to-noise ratio, or sparse data.
- In Chapter 5, basis-function methods will be discussed.
Suppose that the regression function
The variables
- quantitative inputs and their transformations (log, square-root or square);
- basis expansions (e.g. $X_2=X_1^2, X_3 = X_1^3$, a polynomial representation);
- numeric or "dummy" coding of the levels of qualitative inputs;
- interactions between variables (e.g. $X_3=X_1X_2$).
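A minimal NumPy sketch of assembling these sources of variables into one design matrix (all data and column choices below are hypothetical, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.uniform(1.0, 10.0, n)           # a quantitative input
x2 = rng.integers(0, 3, n)               # a qualitative input with 3 levels

X = np.column_stack([
    np.ones(n),                          # intercept column
    x1,                                  # raw quantitative input
    np.log(x1),                          # transformation
    x1**2, x1**3,                        # polynomial basis expansion
    (x2 == 1).astype(float),             # dummy coding of the qualitative
    (x2 == 2).astype(float),             #   input (level 0 is the baseline)
    x1 * (x2 == 1),                      # interaction between variables
])
```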
Suppose we have a set of training data
Statistical point of view
- this criterion is reasonable if the
$(x_i, y_i)$ are independent random draws from their population; - or if the
$y_i$'s are conditionally independent given the inputs $x_i$; - the criterion measures the average lack of fit.
The columns of
The predicted values at an input vector
If the columns of $\mathbf{X}$ are not linearly independent (e.g. two inputs are perfectly correlated), then $\mathbf{X}$ is not of full rank:
-
$\mathbf{X}^T\mathbf{X}$ is singular and $\hat{\beta}$ is not uniquely defined. -
$\hat{\mathbf{y}}=\mathbf{X}\hat{\beta}$ is still the projection of $\mathbf{y}$ onto the column space of $\mathbf{X}$; there is just more than one way to express that projection. - The non-full-rank case occurs most often when one or more qualitative inputs are coded in a redundant fashion.
- There is usually a natural way to resolve the non-unique representation, by recoding and/or dropping redundant columns in
$\mathbf{X}$ (most regression software packages detect and remove them automatically). - In image or signal analysis, sometimes
$p > N$ (rank deficiency); the features can then be reduced by filtering, or the fit controlled by regularization (Section 5.2.3 and Chapter 18).
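As a concrete illustration of the estimator and of the rank-deficient case, a sketch on simulated data (not from the book):

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 50, 3
X = np.column_stack([np.ones(N), rng.standard_normal((N, p))])
beta_true = np.array([1.0, 2.0, 0.0, -1.0])
y = X @ beta_true + 0.5 * rng.standard_normal(N)

# Full-rank case: solve the normal equations X^T X beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Fitted values are the projection of y onto the column space of X.
H = X @ np.linalg.solve(X.T @ X, X.T)     # hat matrix
y_hat = H @ y

# Rank-deficient case: duplicate a column; lstsq still returns a
# (minimum-norm) solution, and the fitted values are unchanged.
X_dup = np.column_stack([X, X[:, 1]])
beta_dup, *_ = np.linalg.lstsq(X_dup, y, rcond=None)
assert np.allclose(X_dup @ beta_dup, y_hat)
```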
The sampling properties of
We now assume that the observations
Typically one estimates the variance
We now assume that (3.1) is the correct model, and the deviations of
This is a multivariate normal distribution with mean vector and variance–covariance matrix as shown. Also
proof of (3.6).
First note that
$$\sum_{i=1}^N(y_i-\hat{y}_i)^2 =\|\mathbf{y}-\mathbf{X}\hat{\beta}\|^2.$$
We have
$$ \begin{aligned} \mathbf{y}-\mathbf{X}\hat{\beta} &= \mathbf{X}\beta+\varepsilon- \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{X}\beta+\varepsilon)\\ &=(I_N-H)\varepsilon. \end{aligned} $$
Hence,
$$ E(\|\mathbf{y}-\mathbf{X}\hat{\beta}\|^2) = E(\varepsilon^T(I_N-H)^T(I_N-H)\varepsilon)= E(\varepsilon^T(I_N-H)\varepsilon). $$
Note that
$$ \varepsilon^T(I_N-H)\varepsilon = \sum_{i,j}\varepsilon_i\varepsilon_j(\delta_{ij}-H_{ij}). $$
Thus,
$$ \begin{aligned} E(\|\mathbf{y}-\mathbf{X}\hat{\beta}\|^2)&= E(\varepsilon^T(I_N-H)\varepsilon)\\ &=\sum_{i=1}^N\sigma^2(\delta_{ii}-H_{ii})\\ &= \sigma^2(N-\text{trace}(H))\\ &= \sigma^2(N-p-1). \end{aligned} $$
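A quick Monte Carlo check of this identity on simulated data (all constants below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
N, p, sigma = 40, 5, 1.5
X = np.column_stack([np.ones(N), rng.standard_normal((N, p))])
beta = rng.standard_normal(p + 1)

rss = []
for _ in range(20000):
    y = X @ beta + sigma * rng.standard_normal(N)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    rss.append(np.sum((y - X @ beta_hat) ** 2))

print(np.mean(rss))   # approximately sigma^2 * (N - p - 1) = 2.25 * 34 = 76.5
```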
proof of (3.8).
Since $\varepsilon\sim N(0, \sigma^2 I_N)$, $\hat{\beta}$ is normally distributed, with
$$\begin{aligned} E(\hat{\beta}) &= E((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \mathbf{y})\\ &= E((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T (\mathbf{X}\beta+\varepsilon))\\ &=\beta\end{aligned}$$
and
$$ Var(\hat{\beta}) = Var((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}) = (\mathbf{X}^T\mathbf{X})^{-1}\sigma^2. $$
proof of (3.9).
Since $\varepsilon\sim N(0, \sigma^2 I_N)$, from the proof of (3.6) we have $$ \mathbf{y}-\hat{\mathbf{y}}=\mathbf{y}-\mathbf{X}\hat{\beta} = (I_N-H)\varepsilon. $$ Then each element is normally distributed. By the linearity of expectation, we obtain $$ E(\mathbf{y}-\hat{\mathbf{y}})=E(\mathbf{y}-\mathbf{X}\hat{\beta})= 0. $$ From the proof of (3.6), we also have $$ E(\|\mathbf{y}-\hat{\mathbf{y}}\|^2)= E(\|\mathbf{y}-\mathbf{X}\hat{\beta}\|^2) = \sigma^2(N-p-1). $$ Thus $(N-p-1)\hat{\sigma}^2\sim \sigma^2 \chi^2_{N-p-1}$.
In addition
$z_j\sim t_{N-p-1}$ - a large (absolute) value of
$z_j$ will lead to rejection of this null hypothesis. - If
$\hat{\sigma}$ is replaced by a known value $\sigma$, then $z_j\sim N(0,1)$. The differences between the tail quantiles of a $t$-distribution and a standard normal become negligible as the sample size increases, so we can use the normal quantiles to approximate them. - we can isolate
$\beta_j$ to obtain a $1-2\alpha$ confidence interval for $\beta_j$: $$\tag{3.11} (\hat{\beta}_j-z^{(1-\alpha)}v_j^{\frac{1}{2}}\hat{\sigma}, \hat{\beta}_j+z^{(1-\alpha)}v_j^{\frac{1}{2}}\hat{\sigma}), \quad \text{where $z^{(1-\alpha)}$ is the $1-\alpha$ percentile of the normal distribution.} $$
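A sketch that computes the standard errors, Z-scores and the interval (3.11) for a simulated fit, using the normal quantile in place of the $t$ quantile as discussed above (the data and coefficient values are hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
N, p = 100, 4
X = np.column_stack([np.ones(N), rng.standard_normal((N, p))])
y = X @ np.array([1.0, 2.0, 0.0, -1.0, 0.5]) + rng.standard_normal(N)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma_hat = np.sqrt(resid @ resid / (N - p - 1))

v = np.diag(XtX_inv)                      # the v_j appearing in (3.11)
se = sigma_hat * np.sqrt(v)
z = beta_hat / se                         # Z-scores, ~ t_{N-p-1} under H0: beta_j = 0

alpha = 0.025
z_q = stats.norm.ppf(1 - alpha)           # 1 - alpha normal quantile
ci = np.column_stack([beta_hat - z_q * se, beta_hat + z_q * se])   # interval (3.11)
```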
When we need to test for the significance of groups of coefficients simultaneously, we can use the
(Exercise 3.1)
Proof of Exercise 3.1.
We can obtain an approximate confidence set for the entire parameter vector
proof of (3.13). From (3.8), it is easy to show that $$ (\mathbf{X}^T\mathbf{X})^{1/2}(\hat{\beta}-\beta) \sim N(0, \sigma^2 I_{p+1}). $$ Then $$ (\hat{\beta}-\beta)^T\mathbf{X}^T\mathbf{X}(\hat{\beta}-\beta) \sim \sigma^2\chi^2_{p+1}. $$ Combining this with (3.9) gives $$ \frac{(\hat{\beta}-\beta)^T\mathbf{X}^T\mathbf{X}(\hat{\beta}-\beta)}{(p+1)\hat{\sigma}^2}\sim F_{p+1, N-p-1}. $$ On the other hand, one can show that if
$S\sim F_{m,n}$, then $mS \to \chi^2_m$ in distribution as $n\to\infty$, by directly computing the limit of the pdf of $mS$, with the help of the relation between the gamma and beta functions and Stirling's formula. With this claim, we have $$ \frac{(\hat{\beta}-\beta)^T\mathbf{X}^T\mathbf{X}(\hat{\beta}-\beta)}{\hat{\sigma}^2}\sim \chi^2_{p+1}\quad (N\to \infty). $$
Example: Prostate Cancer (Page 50).
The least squares estimates of the parameters
We focus on estimation of any linear combination of the parameters
Considering
(Exercise 3.3) The Gauss-Markov theorem states that if we have any other linear estimator
Proof of Exercise 3.3
Consider the mean squared error of an estimator
Mean squared error is intimately related to prediction accuracy, as discussed in Chapter 2. Consider the prediction of the new response at input
The linear model with
In convenient vector notation, let
Suppose that the inputs
Orthogonal inputs occur most often with balanced, designed experiments (where orthogonality is enforced), but almost never with observational data. Hence we will have to orthogonalize the inputs in order to carry this idea further.
Gram-Schmidt Procedure
Algorithm 3.1 Regression by Successive Orthogonalization.
-
Initialize
$\mathbf{z}_0=\mathbf{x}_0=\mathbf{1}$ . -
For
$j=1,2,...,p$: regress $\mathbf{x}_j$ on $\mathbf{z}_0, \mathbf{z}_1, ..., \mathbf{z}_{j-1}$ to produce coefficients $\hat{\gamma}_{\ell j}= \langle \mathbf{z}_{\ell}, \mathbf{x}_j\rangle / \langle \mathbf{z}_{\ell}, \mathbf{z}_{\ell} \rangle,\ \ell = 0,...,j-1$, and residual vector $\mathbf{z}_j = \mathbf{x}_j-\sum_{k=0}^{j-1}\hat{\gamma}_{kj}\mathbf{z}_k$.
-
Regress
$\mathbf{y}$ on the residual$\mathbf{z}_p$ to give the estimate$\hat{\beta}_p$ .
-
The result of this algorithm is
$$\tag{3.20} \hat{\beta}_p=\frac{\langle \mathbf{z}_p, \mathbf{y} \rangle}{\langle \mathbf{z}_p, \mathbf{z}_p \rangle}. $$ -
The inputs $\mathbf{z}_0, \mathbf{z}_1,..., \mathbf{z}_{j-1}$ in step 2 are orthogonal.
-
Since
$\mathbf{z}_p$ alone involves$\mathbf{x}_p$ (with coefficient 1), the coefficient (3.20) is indeed the coefficient of$\mathbf{y}$ on$\mathbf{x}_p$ . -
By rearranging the
$\mathbf{x}_j$, any one of them could be in the last position, and a similar result holds. -
(Exercise 3.4) shows how the vector of least squares coefficients can be obtained from a single pass of the Gram-Schmidt procedure (Algorithm 3.1).
Proof of Exercise 3.4.
-
From (3.20) we also obtain an alternate formula for the variance estimates (3.5),
$$\tag{3.21} Var(\hat{\beta}_p) = \frac{\sigma^2}{\langle \mathbf{z}_p, \mathbf{z}_p\rangle} = \frac{\sigma^2}{\|\mathbf{z}_p\|^2}. $$ The precision with which we can estimate
$\hat{\beta}_p$ depends on the length of the residual vector $\mathbf{z}_p$; this represents how much of $\mathbf{x}_p$ is unexplained by the other $\mathbf{x}_k$'s.
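A minimal NumPy sketch of Algorithm 3.1; `successive_orthogonalization` is a hypothetical helper name, and only the last coefficient is read off via (3.20):

```python
import numpy as np

def successive_orthogonalization(X, y):
    """Algorithm 3.1: return the residual vectors z_j and beta_hat_p of (3.20)."""
    N, cols = X.shape
    Z = np.zeros_like(X, dtype=float)
    Z[:, 0] = X[:, 0]                      # z_0 = x_0 = 1
    for j in range(1, cols):
        z_j = X[:, j].astype(float).copy()
        for l in range(j):
            gamma = (Z[:, l] @ X[:, j]) / (Z[:, l] @ Z[:, l])
            z_j -= gamma * Z[:, l]
        Z[:, j] = z_j
    beta_p = (Z[:, -1] @ y) / (Z[:, -1] @ Z[:, -1])   # (3.20)
    return Z, beta_p

rng = np.random.default_rng(4)
N = 60
X = np.column_stack([np.ones(N), rng.standard_normal((N, 3))])
y = X @ np.array([1.0, 0.5, -2.0, 3.0]) + rng.standard_normal(N)

Z, beta_p = successive_orthogonalization(X, y)
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)
assert np.isclose(beta_p, beta_ls[-1])     # (3.20) matches the last LS coefficient
```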
Suppose we have multiple outputs
With
If the errors
arises naturally from multivariate Gaussian theory. Here
(Exercise 3.11) The solution is given by (3.25);
Proof of Exercise 3.11.
(Section 3.7 pursues the multiple outcome problem.)
Motivation: We are often not satisfied with the least squares estimates, because of prediction accuracy (low bias but large variance) and interpretation (we would like to determine a smaller subset of variables that exhibit the strongest effects).
Subset selection and the variance-controlling (shrinkage) methods all fall under the general heading of model selection. With subset selection we retain only a subset of the variables, and eliminate the rest from the model.
-
Best subset regression finds for each
$k\in\{0, 1, 2, \dots, p\}$ the subset of size $k$ that gives the smallest residual sum of squares (the leaps and bounds procedure makes this feasible for $p$ as large as 30 or 40; a brute-force sketch appears after this list). - The best-subset curve is necessarily decreasing, so it cannot be used to select the subset size
$k$. - There are a number of criteria that one may use; typically we choose the smallest model that minimizes an estimate of the expected prediction error.
- AIC criterion and cross-validation (Chapter 7)
- Forward-stepwise selection starts with the intercept, and then sequentially adds into the model the predictor that most improves the fit.
- Exploit the QR decomposition for the current fit to rapidly establish the next candidate (Exercise 3.9).
Exercise 3.9
- Sub-optimal, greedy algorithm
- Advantages: computational (for large $p$), statistical (lower variance, but more bias).
- Backward-stepwise selection starts with the full model, and sequentially deletes the predictor that has the least impact on the fit.
- The candidate for dropping is the variable with the smallest Z-score (Exercise 3.10).
Exercise 3.10
- Backward selection can only be used when
$N>p$, while forward stepwise can always be used.
Software packages implement hybrid stepwise-selection strategies that consider both forward and backward moves at each step, and select the "best" of the two (e.g. by AIC score). Some traditional packages base the selection on
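A brute-force sketch of best subset selection for small $p$ (the leaps and bounds procedure is far more efficient; `best_subsets` is a hypothetical helper, shown only to make the definition concrete):

```python
import numpy as np
from itertools import combinations

def best_subsets(X, y):
    """For each size k, return the subset of columns of X with smallest RSS.
    An intercept is always included. Feasible only for small p."""
    N, p = X.shape
    ones = np.ones((N, 1))
    best = {}
    for k in range(p + 1):
        best_rss, best_idx = np.inf, ()
        for idx in combinations(range(p), k):
            Xk = np.hstack([ones, X[:, list(idx)]])
            beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
            rss = np.sum((y - Xk @ beta) ** 2)
            if rss < best_rss:
                best_rss, best_idx = rss, idx
        best[k] = (best_idx, best_rss)
    return best
```

Since the best-subset RSS necessarily decreases with $k$, a criterion such as AIC or cross-validation (Chapter 7) is still needed to choose $k$.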
- More constrained than forward-stepwise regression
- It starts like forward-stepwise regression, with an intercept equal to
$\bar{y}$, and centered predictors with coefficients initially all $0$. At each step the algorithm identifies the variable most correlated with the current residual. It then computes the simple linear regression coefficient of the residual on this chosen variable, and adds it to the current coefficient for that variable. This is continued until none of the variables have correlation with the residuals, i.e. the least-squares fit when $N > p$. - Forward stagewise can take many more than
$p$ steps to reach the least squares fit, and historically has been dismissed as being inefficient. - Unlike forward-stepwise regression, none of the other variables are adjusted when a term is added.
- "Slow fitting" in high-dimensional problems (Section 3.8.1).
Example: Prostate Cancer (Continued) (Page 61).
Subset selection produces a model that is interpretable and has possibly lower prediction error than the full model. However, it is a discrete process (variables are either retained or discarded), which often results in high variance. Shrinkage methods are more continuous, and don't suffer as much from high variability.
Ridge regression shrinks the regression coefficients by imposing a penalty on their size,
-
There is a one-to-one correspondence between the parameters
$\lambda$ and$t$ . -
When there are many correlated variables in a linear regression model, their coefficients can become poorly determined and exhibit high variance.
-
The ridge solutions are not equivariant under scaling of the inputs, and so one normally standardizes the inputs before solving (3.27).
-
(Exercise 3.5) The solution to (3.27) can be separated into two parts, after reparametrization using centered inputs. We estimate
$\beta_0$ by $\bar{y}$. The remaining coefficients get estimated by a ridge regression without intercept, using the centered $x_{ij}$. Henceforth, once the centering has been done, the input matrix $\mathbf{X}$ has $p$ columns. Writing the criterion in (3.27) in matrix form,
$$\tag{3.29} \text{RSS}(\lambda) = (\mathbf{y}-\mathbf{X}\beta)^T(\mathbf{y}-\mathbf{X}\beta) + \lambda \beta^T\beta, $$ the ridge regression solutions are easily seen to be $$\tag{3.30} \hat{\beta}^{\text{ridge}} = (\mathbf{X}^T\mathbf{X}+\lambda I_p)^{-1}\mathbf{X}^T\mathbf{y}. $$ The problem is nonsingular even if $\mathbf{X}^T\mathbf{X}$ is not of full rank, since a positive constant is added to its diagonal (a NumPy sketch of (3.30) appears after this list). Exercise 3.5
-
(Exercise 3.6) Ridge regression can also be derived as the mean or mode of a posterior distribution, with a suitably chosen prior distribution.
Exercise 3.6 (Note: for the normal distribution, the mode of the posterior is also the posterior mean.)
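A sketch of the ridge solution (3.30) on centered inputs, with the intercept handled as described above (`ridge` is a hypothetical helper name):

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge regression (3.30) after centering; returns the intercept on the
    original input scale and the coefficient vector."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean                       # centered inputs
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ (y - y_mean))
    beta0 = y_mean - x_mean @ beta        # predictions are beta0 + X @ beta
    return beta0, beta
```

With `lam=0` this reduces to ordinary least squares whenever the centered Gram matrix is nonsingular.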
SVD Insights
The SVD of the
Exercise 3.8
The ridge solutions are
The columns of
The effective degrees of freedom of the ridge regression fit is
$$\tag{3.35} \begin{aligned} \text{df}(\lambda) &= \text{trace}(\mathbf{X}(\mathbf{X}^T\mathbf{X}+\lambda I_p)^{-1}\mathbf{X}^T)\\ &= \text{trace}(\mathbf{H}_{\lambda})\\ &= \sum_{j=1}^p\frac{d_j^2}{d_j^2+\lambda}. \end{aligned} $$
Of course there is always an additional one degree of freedom for the intercept, which was removed a priori.
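A sketch that checks (3.35) numerically against the trace of $\mathbf{H}_{\lambda}$, using the singular values of a centered input matrix (simulated data; `ridge_df` is a hypothetical helper):

```python
import numpy as np

def ridge_df(X, lam):
    """Effective degrees of freedom (3.35) of the ridge fit, via singular values."""
    d = np.linalg.svd(X, compute_uv=False)
    return np.sum(d**2 / (d**2 + lam))

rng = np.random.default_rng(5)
X = rng.standard_normal((50, 6))
X -= X.mean(axis=0)

lam = 10.0
H_lam = X @ np.linalg.solve(X.T @ X + lam * np.eye(6), X.T)
assert np.isclose(np.trace(H_lam), ridge_df(X, lam))
```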
The lasso (also known as basis pursuit) is a shrinkage method like ridge, with subtle but important differences.
Exercise 3.5
In Section 3.4.4 we will see that efficient algorithms are available for computing the entire path of solutions as
- Ridge regression does a proportional shrinkage. Lasso translates each coefficient by a constant factor
$\lambda$, truncating at zero. This is called "soft thresholding," and is used in the context of wavelet-based smoothing in Section 5.9. Best-subset selection drops all variables with coefficients smaller than the $M$th largest; this is a form of "hard thresholding." (A small sketch of the three transformations, for orthonormal inputs, appears after this list.)
-
We can generalize ridge regression and the lasso, and view them as Bayes estimates. Consider the criterion
$$\tag{3.38} \tilde{\beta} = \arg\min_{\beta} \bigg\{\sum_{i=1}^N\Big(y_i-\beta_0-\sum_{j=1}^px_{ij}\beta_j\Big)^2+\lambda \sum_{j=1}^p|\beta_j|^q\bigg\}. $$ -
$q=0$: subset selection, which counts the number of nonzero parameters. -
$q=1$ : the prior is an independent double exponential (or Laplace) distribution for each input. -
$q<1$: the prior is not uniform in direction, but concentrates more mass in the coordinate directions. -
$q=2$ : the prior is an independent normal distribution.
In this view, the lasso, ridge regression and best subset selection are Bayes estimates with different priors.
-
Elastic-net penalty (Section 18.4)
$$\tag{3.39} \lambda \sum_{j=1}^p(\alpha \beta^2_j+(1-\alpha)|\beta_j|). $$
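As a small illustration of the comparison above: in the special case of an orthonormal input matrix $\mathbf{X}$, best subset, ridge and lasso each act on the least squares estimates by an explicit transformation. A sketch (the values of `lam` and `M` below are arbitrary, for illustration only):

```python
import numpy as np

beta_ls = np.array([4.0, 2.0, -1.5, 0.5, -0.2])   # least squares estimates (orthonormal X)
lam, M = 1.0, 2

# Best subset (size M): keep the M largest coefficients in absolute value ("hard thresholding").
keep = np.argsort(-np.abs(beta_ls))[:M]
beta_subset = np.where(np.isin(np.arange(beta_ls.size), keep), beta_ls, 0.0)

# Ridge: proportional shrinkage.
beta_ridge = beta_ls / (1.0 + lam)

# Lasso: translate toward zero by lam and truncate at zero ("soft thresholding").
beta_lasso = np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam, 0.0)
```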
Algorithm 3.2 Least Angle Regression.
-
Standardize the predictors to have mean zero and unit norm. Start with the residual
$\mathbf{r} = \mathbf{y} - \bar{\mathbf{y}}, \beta_1,\beta_2, ..., \beta_p =0$ . -
Find the predictor
$\mathbf{x}_j$ most correlated with$\mathbf{r}$ . -
Move
$\beta_j$ from $0$ towards its least-squares coefficient $\langle \mathbf{x}_j, \mathbf{r}\rangle$, until some other competitor $\mathbf{x}_k$ has as much correlation with the current residual as does $\mathbf{x}_j$. -
Move
$\beta_j$ and $\beta_k$ in the direction defined by their joint least squares coefficient of the current residual on $(\mathbf{x}_j, \mathbf{x}_k)$, until some other competitor $\mathbf{x}_l$ has as much correlation with the current residual.
4a. (Lasso Modification) If a non-zero coefficient hits zero, drop its variable from the active set of variables and recompute the current joint least squares direction. (This gives the entire lasso path.)
4b. (FS$_0$ Modification) Find the new direction by solving the constrained least squares problem $$ \min_b\|\mathbf{r}-\mathbf{X}_{\mathcal{A}}b\|_2^2 \quad \text{subject to } b_js_j\geq 0,\ j\in \mathcal{A}, $$ where
$s_j$ is the sign of$\langle \mathbf{x}_j, \mathbf{r} \rangle$ . -
Continue in this way until all
$p$ predictors have been entered. After$\min(N-1,p)$ steps, we arrive at the full least-squares solution.
- LAR is intimately connected with the lasso, and in fact provides an extremely efficient algorithm for computing the entire lasso path.
- Least angle regression uses a similar strategy to forward stepwise regression, but only enters "as much" of a predictor as it deserves.
- At the first step it identifies the variable most correlated with the response. Rather than fit this variable completely, LAR moves the coefficient of this variable continuously toward its least-squares value (causing its correlation with the evolving residual to decrease in absolute value).
Suppose $\mathcal{A}_k$ is the active set of variables at the beginning of the $k$th step, and let $\beta_{\mathcal{A}_k}$ be the coefficient vector for these variables at this step; there will be
If $\mathbf{r}_k = \mathbf{y}-\mathbf{X}_{\mathcal{A}_k}\beta_{\mathcal{A}_k}$ is the current residual, then the direction for this step is $$\tag{3.40} \delta_k=(\mathbf{X}_{\mathcal{A}_k}^T\mathbf{X}_{\mathcal{A}_k})^{-1}\mathbf{X}_{\mathcal{A}_k}^T\mathbf{r}_k. $$
The coefficient profile then evolves as $\beta_{\mathcal{A}_k}(\alpha) = \beta_{\mathcal{A}_k} + \alpha\cdot \delta_k.$
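A sketch of one step-direction computation, equation (3.40); `lar_direction` is a hypothetical helper, and `active` is assumed to index the current active set of a column-standardized `X`:

```python
import numpy as np

def lar_direction(X, r, active):
    """Joint least squares direction delta_k in (3.40) for the current active set."""
    XA = X[:, active]
    return np.linalg.solve(XA.T @ XA, XA.T @ r)

# One LAR step (schematically): move the active coefficients along delta_k,
#   beta_A(alpha) = beta_A + alpha * delta_k,
# which keeps the active correlations tied and decreasing (Exercise 3.23),
# until a new predictor reaches the same absolute correlation with the residual.
```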
(Exercise 3.23) verifies that the directions chosen in this fashion do what is claimed: keep the correlations tied and decreasing.
Exercise 3.23
If
(Exercise 3.24)
Exercise 3.24
Note that we do not need to take small steps and recheck the correlations in step 3; using knowledge of the covariance of the predictors and the piecewise linearity of the algorithm, we can work out the exact step length at the beginning of each step (Exercise 3.25).
Exercise 3.25
Least angle regression always takes
Why lasso and LAR are so similar:
Suppose
(Exercise 3.27) shows that these equations imply a piecewise-linear coefficient profile as $\lambda$ decreases.
Exercise 3.27
Degrees-of-Freedom Formula for LAR and Lasso
In classical statistics, the number of linearly independent parameters is what is meant by "degrees of freedom". How many parameters, or "degrees of freedom" have we used?
-
For a subset of
$k$ features, if this subset is prespecified in advance without reference to the training data, then the degrees of freedom is $k$. However, if we use best subset selection to determine the optimal set of $k$ predictors, then in some sense the degrees of freedom is more than $k$. -
The definition of the degrees of freedom of the fitted vector $\hat{\mathbf{y}}=(\hat{y}_1, \hat{y}_2, ..., \hat{y}_N)$ is $$\tag{3.44} \text{df}(\hat{\mathbf{y}}) = \frac{1}{\sigma^2} \sum_{i=1}^N \text{Cov}(\hat{y}_i, y_i). $$
$\text{Cov}(\hat{y}_i, y_i)$ refers to the covariance between the predicted value and its corresponding output value. This makes intuitive sense: the harder we fit to the data, the larger this covariance and hence $\text{df}(\hat{\mathbf{y}})$. (This definition is motivated and discussed further in Sections 7.4-7.6.) Note that in Section 3.2 we assumed that the observations
$y_i$ are uncorrelated and have constant variance $\sigma^2$, and that the $x_i$ are fixed (non-random). -
For a linear regression with
$k$ fixed predictors, it is easy to show that $\text{df}(\hat{\mathbf{y}})=k$. Proof.
$$\begin{aligned} \text{df}(\hat{\mathbf{y}}) &= \frac{1}{\sigma^2} \sum_{i=1}^N \text{Cov}(\hat{y}_i, y_i)\\ &= \frac{1}{\sigma^2}\sum_{i=1}^N E\big((\hat{y}_i-E\hat{y}_i)(y_i-Ey_i)\big)\\ &= \frac{1}{\sigma^2}E\bigg( \sum_{i=1}^N(\hat{y}_i-E\hat{y}_i)(y_i-Ey_i)\bigg) \\ &= \frac{1}{\sigma^2}E\Big( (\hat{\mathbf{y}}-E\hat{\mathbf{y}})^T(\mathbf{y}-E\mathbf{y})\Big) \\ &= \frac{1}{\sigma^2} E\Big((\mathbf{y}-E\mathbf{y})^TH^T (\mathbf{y}-E\mathbf{y}) \Big)\\ &= \frac{1}{\sigma^2} \sum_{i=1}^N \sigma^2 H_{ii} \\ &= \text{trace}(H) \\ &= k. \end{aligned}$$ -
For ridge regression, this definition leads to the closed-form expression (3.35):
$\text{df}(\hat{\mathbf{y}})=\text{trace}(\mathbf{S}_{\lambda})$. Proof.
Recall the expression (3.35),
$$ \begin{aligned} \text{df}(\lambda) &= \text{trace}(\mathbf{X}(\mathbf{X}^T\mathbf{X}+\lambda I_p)^{-1}\mathbf{X}^T)\\ &= \text{trace}(\mathbf{H}_{\lambda})\\ &= \sum_{j=1}^p\frac{d_j^2}{d_j^2+\lambda}. \end{aligned} $$
For ridge regression, the linear smoother matrix is $\mathbf{S}_{\lambda} = \mathbf{H}_{\lambda}$, so
$$\begin{aligned} \text{df}(\hat{\mathbf{y}}) &= \frac{1}{\sigma^2} \sum_{i=1}^N \text{Cov}(\hat{y}_i, y_i)\\ &= \frac{1}{\sigma^2}\sum_{i=1}^N E\big((\hat{y}_i-E\hat{y}_i)(y_i-Ey_i)\big)\\ &= \frac{1}{\sigma^2}E\bigg( \sum_{i=1}^N(\hat{y}_i-E\hat{y}_i)(y_i-Ey_i)\bigg) \\ &= \frac{1}{\sigma^2}E\Big( (\hat{\mathbf{y}}-E\hat{\mathbf{y}})^T(\mathbf{y}-E\mathbf{y})\Big) \\ &= \frac{1}{\sigma^2} E\Big((\mathbf{y}-E\mathbf{y})^T\mathbf{H}_{\lambda}^T (\mathbf{y}-E\mathbf{y}) \Big)\\ &= \text{trace}(\mathbf{H}_{\lambda}) \\ &= \text{trace}(\mathbf{S}_{\lambda}). \end{aligned}$$ -
For best subset selection, there is no closed-form expression; the degrees of freedom can be estimated by simulation.
-
For LAR, it can be shown that after the $k$th step of the LAR procedure, the effective degrees of freedom of the fit vector is approximately
$k$. Theorem 3 from Efron et al. (2004):
Least angle regression. Annals of Statistics, 32, 407–451. -
For lasso and (modified) LAR, it often takes more than
$p$ steps, since predictors can drop out. Hence the definition is a little different. A detailed study may be found in Zou et al. (2007):
On the degrees of freedom of the lasso, Annals of Statistics 35(5): 2173-2192.
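Definition (3.44) suggests a direct Monte Carlo estimate of the degrees of freedom of any fitting procedure. A sketch for the plain linear regression case on simulated data, where the estimate should be close to $k = p+1$ (counting the intercept):

```python
import numpy as np

rng = np.random.default_rng(6)
N, p, sigma = 30, 4, 1.0
X = np.column_stack([np.ones(N), rng.standard_normal((N, p))])
beta = rng.standard_normal(p + 1)
mu = X @ beta                              # fixed (non-random) mean

B = 5000
Y = mu + sigma * rng.standard_normal((B, N))          # B simulated response vectors
H = X @ np.linalg.solve(X.T @ X, X.T)                 # hat matrix
Yhat = Y @ H                                          # fitted values for each draw

# Sample version of (3.44): (1/sigma^2) * sum_i Cov(yhat_i, y_i)
cov = ((Yhat - Yhat.mean(0)) * (Y - Y.mean(0))).sum(0) / (B - 1)
print(cov.sum() / sigma**2)                # approximately p + 1 = 5
```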
In many situations we have a large number of inputs, often very correlated. The methods in this section produce a small number of linear combinations
Principal component regression forms the derived input columns
Exercise 3.13.
- As with ridge regression, principal components depend on the scaling of the inputs, so typically we first standardize them.
- If
$M=p$, we would just get back the usual least squares estimates, since the columns of $\mathbf{Z}=\mathbf{U}\mathbf{D}$ span the column space of $\mathbf{X}$. - If
$M<p$ , we get a reduced regression. - Ridge regression shrinks the regression coefficients of the principal components, using shrinkage factors
$d^2_j /(d^2_j + \lambda)$ . Principal component regression truncates them. - It can be shown (Exercise 3.15) that principal components regression seeks directions that have high variance, in particular, the $m$th principal component direction
$v_m$ solves: $$\tag{3.47} \max_{\alpha} Var(\mathbf{X}\alpha) \quad \text{subject to } \|\alpha\|=1,\ \alpha^T\mathbf{S}v_{\ell}=0,\ \ell=1,...,m-1, $$ where $\mathbf{S}$ is the sample covariance matrix of the $\mathbf{x}_j$. Exercise 3.15
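A sketch of principal components regression via the SVD of the standardized inputs; `pcr` is a hypothetical helper, and $M$ is the number of retained components:

```python
import numpy as np

def pcr(X, y, M):
    """Principal components regression on standardized X, keeping M components."""
    x_mean, x_std, y_mean = X.mean(0), X.std(0), y.mean()
    Xs = (X - x_mean) / x_std
    U, d, Vt = np.linalg.svd(Xs, full_matrices=False)
    Z = U[:, :M] * d[:M]                        # derived inputs z_m = X v_m
    theta = Z.T @ (y - y_mean) / (d[:M] ** 2)   # univariate regressions on orthogonal z_m
    beta = Vt[:M].T @ theta                     # coefficients on the standardized scale
    beta_orig = beta / x_std                    # back to the original input scale
    beta0 = y_mean - x_mean @ beta_orig
    return beta0, beta_orig
```

With `M` equal to the number of columns of `X`, this reproduces the usual least squares fit.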
This technique constructs linear combinations of the inputs for regression by using
Algorithm 3.3 Partial Least Squares.
- Standardize each
$\mathbf{x}_j$ to have mean zero and variance one. Set$\hat{\mathbf{y}}^{(0)}=\bar{y}\mathbf{1},$ and$\mathbf{x}_j^{(0)}=\mathbf{x}_j, j=1,...,p$ . - For
$m=1,2,...,p$
(a) $\mathbf{z}_m=\sum_{j=1}^p\hat{\phi}_{mj}\mathbf{x}_j^{(m-1)}$, where $\hat{\phi}_{mj}=\langle \mathbf{x}_j^{(m-1)}, \mathbf{y} \rangle$.
(b)$\hat{\theta}_m = \langle \mathbf{z}_m, \mathbf{y}\rangle/\langle \mathbf{z}_m, \mathbf{z}_m \rangle$ .
(c)$\hat{\mathbf{y}}^{(m)}=\hat{\mathbf{y}}^{(m-1)}+\hat{\theta}_m\mathbf{z}_m$ .
(d) Orthogonalize each $\mathbf{x}_j^{(m-1)}$ with respect to $\mathbf{z}_m$: $\mathbf{x}^{(m)}_j=\mathbf{x}_j^{(m-1)}-[\langle \mathbf{z}_m, \mathbf{x}_j^{(m-1)}\rangle/\langle \mathbf{z}_m, \mathbf{z}_m \rangle]\mathbf{z}_m,\ j=1,2,...,p$. - Output the sequence of fitted vectors $\{\hat{\mathbf{y}}^{(m)}\}_1^p$. Since the $\{\mathbf{z}_{\ell}\}_1^m$ are linear in the original
$\mathbf{x}_j$ , so is$\hat{\mathbf{y}}^{(m)}=\mathbf{X} \hat{\beta}^{\text{pls}}(m)$ . These linear coefficients can be recovered from the sequence of PLS transformations.
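A near-literal transcription of Algorithm 3.3 as a sketch (the function name `pls` and the choice to return only the fitted vector are made here, not part of the algorithm statement):

```python
import numpy as np

def pls(X, y, M):
    """Algorithm 3.3: partial least squares with M directions.
    Returns the fitted vector yhat^{(M)} based on standardized inputs."""
    N, p = X.shape
    Xm = (X - X.mean(0)) / X.std(0)               # step 1: standardize
    y_hat = np.full(N, y.mean())
    for m in range(M):
        phi = Xm.T @ y                            # phi_mj = <x_j^{(m-1)}, y>
        z = Xm @ phi                              # z_m = sum_j phi_mj x_j^{(m-1)}
        theta = (z @ y) / (z @ z)                 # step 2(b)
        y_hat = y_hat + theta * z                 # step 2(c)
        Xm = Xm - np.outer(z, z @ Xm) / (z @ z)   # step 2(d): orthogonalize
    return y_hat
```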
-
How to recover the linear coefficients
$\hat{\beta}^{\text{pls}}(m)$ ?(ZH): The best solution is that
$\mathbf{X}$ is full column rank. Most regression software packages detect these redundancies and automatically implement some strategy for removing them. -
Not scale invariant
-
If
$M=p$ , we would get back a solution equivalent to the usual least squares estimates. -
If
$M<p$ , it produces a reduced regression. -
The first directions
$\hat{\phi}_{1j}$ are the univariate regression coefficients (up to an irrelevant constant); this is not the case for subsequent directions. -
It can be shown (Exercise 3.15) that partial least squares seeks directions that have both high variance and high correlation with the response; in particular, the $m$th PLS direction $\hat{\phi}_m$ solves: $$\tag{3.48} \max_{\alpha} Corr^2(\mathbf{y},\mathbf{X}\alpha)Var(\mathbf{X}\alpha) \quad \text{subject to } \|\alpha\|=1,\ \alpha^T\mathbf{S}\hat{\phi}_{\ell}=0,\ \ell=1,...,m-1, $$ where
$\mathbf{S}$ is the sample covariance matrix of the $\mathbf{x}_j$. Further analysis reveals that the variance aspect tends to dominate, and so partial least squares behaves much like ridge regression and principal components regression. Exercise 3.15
-
(Exercise 3.14) If the input matrix
$\mathbf{X}$ is orthogonal, then partial least squares finds the least squares estimates after $m=1$ steps. Subsequent steps have no effect since the $\hat{\phi}_{mj}$ are zero for $m>1$. Exercise 3.14.
-
(Exercise 3.18) the sequence of PLS coefficients for
$m =1,2,...,p$ represents the conjugate gradient sequence for computing the least squares solutions.Exercise 3.18.
-
The tuning parameters for ridge and lasso vary over a continuous range, while best subset, PLS and PCR take just two discrete steps to the least squares solution.
-
In the top panel, starting at the origin, ridge regression shrinks the coefficients together until it finally converges to least squares. PLS and PCR show similar behavior to ridge, although are discrete and more extreme. Best subset overshoots the solution and then backtracks. The behavior of the lasso is intermediate to the other methods.
-
When the correlation is negative (lower panel), again PLS and PCR roughly track the ridge path, while all of the methods are more similar to one another.
-
Ridge regression shrinks all directions, but shrinks low-variance directions more.
-
Principal components regression leaves
$M$ high-variance directions alone, and discards the rest. -
Partial least squares also tends to shrink the low-variance directions, but can actually inflate some of the higher variance directions. This can make PLS a little unstable, and cause it to have slightly higher prediction error compared to ridge regression.
Frank and Friedman (1993)
Frank, I. and Friedman, J. (1993). A statistical view of some chemometrics regression tools (with discussion), Technometrics 35(2): 109–148.
These authors conclude that for minimizing prediction error, ridge regression is generally preferable to variable subset selection, principal components regression and partial least squares. However the improvement over the latter two methods was only slight.
To summarize, PLS, PCR and ridge regression tend to behave similarly. Ridge regression may be preferred because it shrinks smoothly, rather than in discrete steps. Lasso falls somewhere between ridge regression and best subset regression, and enjoys some of the properties of each.
Here we present another LAR-like algorithm, this time focused on forward stagewise regression. Interestingly, efforts to understand a flexible nonlinear regression procedure (boosting) led to a new algorithm for linear models (LAR).
Algorithm 3.4 Incremental Forward Stagewise Regression ($FS_{\epsilon}$).
- Start with the residual
$\mathbf{r}$ equal to$\mathbf{y}$ and$\beta_1,\beta_2,...,\beta_p=0$ . All the predictors are standardized to have mean zero and unit norm. - Find the predictor
$\mathbf{x}_j$ most correlated with$\mathbf{r}$ . - Update
$\beta_j \leftarrow \beta_j+\delta_j$, where $\delta_j=\epsilon\cdot \text{sign}[\langle \mathbf{x}_j, \mathbf{r}\rangle]$ and $\epsilon >0$ is a small step size, and set $\mathbf{r} \leftarrow \mathbf{r} -\delta_j \mathbf{x}_j$. - Repeat steps 2 and 3 many times, until the residuals are uncorrelated with all the predictors.
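A sketch of Algorithm 3.4; the function name, stopping tolerance, and maximum step count are arbitrary choices here:

```python
import numpy as np

def incremental_forward_stagewise(X, y, eps=0.01, n_steps=10000):
    """Algorithm 3.4 (FS_eps) on standardized predictors."""
    Xc = X - X.mean(0)
    Xs = Xc / np.linalg.norm(Xc, axis=0)    # mean zero, unit norm
    r = y - y.mean()                        # with mean-zero columns this matches r = y
    beta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        c = Xs.T @ r                        # inner products (proportional to correlations)
        j = np.argmax(np.abs(c))
        if np.abs(c[j]) < 1e-8:             # residual essentially uncorrelated with all predictors
            break
        delta = eps * np.sign(c[j])
        beta[j] += delta                    # step 3: tiny move in the sign of the correlation
        r = r - delta * Xs[:, j]
    return beta
```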
Related algorithms and relationship:
- Section 3.4.4 LAR
- Chapter 16 Algorithm 16.1
- In LAR, the least-squares fit amongst the tied predictors can result in coefficients moving in the opposite direction to their correlation, which cannot happen in Algorithm 3.4. (FS$_0$ Modification)
- As a consequence of these results, if the LAR profiles are monotone non-increasing or non-decreasing, as they are in Figure 3.19, then all three methods—LAR, lasso, and FS$_0$—give identical profiles. If the profiles are not monotone but do not cross the zero axis, then LAR and lasso are identical.
Suppose we solve
-
$R$ is quadratic or piecewise-quadratic as a function of$\beta$ , and -
$J$ is piecewise linear in$\beta$ .
This also implies (in principle) that the solution path can be efficiently computed. Another example is the “hinge loss” function used in the support vector machine. There the loss is piecewise linear, and the penalty is quadratic.
Candes and Tao (2007) proposed the following criterion:
$$\tag{3.51}
\min_{\beta}\|\beta\|_1 \text{ subject to } \|\mathbf{X}^T(\mathbf{y}-\mathbf{X}\beta)\|_{\infty}\leq s.
$$
It can be written equivalently as
Candes and Tao (2007) show that the solution to DS is a linear programming problem; hence the name Dantzig selector.
- the solution to DS is a linear programming problem;
- The Dantzig selector instead tries to minimize the maximum inner product of the current residual with all the predictors. Hence it can achieve a smaller maximum than the lasso, but in the process a curious phenomenon can occur. If the size of the active set is
$m$ , there will be$m$ variables tied with maximum correlation. However, these need not coincide with the active set! Hence it can include a variable in the model that has smaller correlation with the current residual than some of the excluded variables. - DS can yield extremely erratic coefficient paths as the regularization parameter
$s$ is varied.
In some problems, the predictors belong to pre-defined groups. Suppose that the
Some methods for
An alternate approach to the LARS algorithm for computing the lasso solution is simple coordinate descent. The idea is to fix the penalty parameter
Suppose the predictors are all standardized to have mean zero and unit norm. Denote by $\tilde{\beta}_k(\lambda)$ the current estimate for $\beta_k$ at penalty parameter $\lambda$. We can rearrange (3.36) to isolate $\beta_j$,
$$\tag{3.54}
R(\tilde{\beta}(\lambda),\beta_j) = \frac{1}{2}\sum_{i=1}^N\bigg(y_i-\sum_{k\neq j} x_{ik}\tilde{\beta}_k(\lambda)-x_{ij}\beta_j\bigg)^2+\lambda \sum_{k\neq j}|\tilde{\beta}_k(\lambda)| + \lambda|\beta_j|,
$$
where we have suppressed the intercept and introduced a factor $\frac{1}{2}$ for convenience. This can be viewed as a univariate lasso problem with response variable the partial residual $y_i-\tilde{y}_i^{(j)} = y_i-\sum_{k\neq j}x_{ik}\tilde{\beta}_k(\lambda)$. This has an explicit solution, resulting in the update
$$\tag{3.55}
\tilde{\beta}_j \leftarrow S\bigg(\sum_{i=1}^N x_{ij}(y_i-\tilde{y}_i^{(j)}),\ \lambda\bigg).
$$
Here $S(t,\lambda)=\text{sign}(t)(|t|-\lambda)_+$ is the soft-thresholding operator. Repeated iteration of (3.55), cycling through each variable in turn until convergence, yields the lasso estimate $\hat{\beta}(\lambda)$.
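A sketch of this cyclic update; `coordinate_descent_lasso` is a hypothetical name, and the inputs are assumed standardized with the intercept removed, as in the derivation above:

```python
import numpy as np

def soft_threshold(t, lam):
    """S(t, lambda) = sign(t) * (|t| - lambda)_+"""
    return np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)

def coordinate_descent_lasso(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for the lasso criterion (3.54), update (3.55).
    Assumes the columns of X are standardized (mean zero, unit norm)."""
    N, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]      # partial residual y - y_tilde^{(j)}
            beta[j] = soft_threshold(X[:, j] @ r_j, lam)
    return beta
```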
We can also use this simple algorithm to efficiently compute the lasso solutions at a grid of values of
A key to its speed is the fact that the quantities in (3.55) can be updated quickly as
The same kind of algorithm can be applied to the elastic net, the grouped lasso and many other models in which the penalty is a sum of functions of the individual parameters (Friedman et al., 2010). It can also be applied, with some substantial modifications, to the fused lasso (Section 18.4.2); details are in Friedman et al. (2007).
Cholesky decomposition of the matrix