
4 Linear Methods for Classification

4.1 Introduction

This chapter focuses on linear methods for classification, i.e. methods whose decision boundaries are linear. Suppose the predictor $G(x)$ takes values in a discrete set $\mathcal{G}$; we can always divide the input space into a collection of regions labeled according to the classification.

Suppose there are $K$ classes, for convenience labeled $1, 2, \ldots, K$. There are several different ways in which linear decision boundaries can be found.

One approach models a discriminant function $\delta_k(x)$ for each class, and classifies $x$ to the class with the largest value of its discriminant function. The regression approach and the methods that model the posterior probabilities $\Pr(G=k|X=x)$ are members of this class of methods: as long as some monotone transformation of $\delta_k(x)$ or $\Pr(G=k|X=x)$ is linear in $x$, the decision boundaries are linear.

  • The regression approach: Suppose the fitted linear model for the $k$th indicator response variable is $\hat{f}_k(x)=\hat{\beta}_{k0}+\hat{\beta}_k^Tx$. The decision boundary between class $k$ and $\ell$ is the set of points for which $\hat{f}_k(x)=\hat{f}_{\ell}(x)$, that is, the set $\{x:(\hat{\beta}_{k0}-\hat{\beta}_{\ell 0})+(\hat{\beta}_k-\hat{\beta}_{\ell})^Tx = 0\}$, an affine set or hyperplane.
  • The posterior probabilities: A commonly used model is $$\tag{4.1} \begin{aligned} \Pr(G=k|X=x) &= \frac{\exp(\beta_{k0}+\beta_k^Tx)}{1+\sum_{\ell=1}^{K-1}\exp(\beta_{\ell 0}+\beta_{\ell}^Tx)}, \quad k = 1,\ldots, K-1,\\ \Pr(G=K|X=x) &= \frac{1}{1+\sum_{\ell=1}^{K-1}\exp(\beta_{\ell 0}+\beta_{\ell}^Tx)}. \end{aligned}$$ Here the monotone transformation is the logit transformation, and in fact we see that $$\log\frac{\Pr(G=k|X=x)}{\Pr(G=\ell|X=x)} = (\beta_{k0}-\beta_{\ell 0})+(\beta_k-\beta_{\ell})^Tx.$$ The decision boundary is the set of points for which the log-odds are zero. We discuss two very popular but different methods that result in linear log-odds or logits: linear discriminant analysis and linear logistic regression. Although they differ in their derivation, the essential difference between them is in the way the linear function is fit to the training data.
  • Explicitly model the boundaries between the classes as linear: the first method of this kind is the well-known perceptron, and the second is the optimal separating hyperplane. We treat the separable case here, and defer treatment of the nonseparable case to Chapter 12.

While this entire chapter is devoted to linear decision boundaries, there is considerable scope for generalization. For example, we can expand the variable set by including squares and cross-products of the variables. This approach can be used with any basis transformation $h(X)$, where $h: \mathbb{R}^p \mapsto \mathbb{R}^q$ with $q > p$.
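As a concrete illustration, the sketch below (a hypothetical helper, not from the book) builds such an expanded feature set with squares and pairwise cross-products; any of the linear classifiers in this chapter can then be fit to the expanded inputs.

```python
import numpy as np
from itertools import combinations

def expand_quadratic(X):
    """Basis expansion h(X): augment X (N x p) with squares and pairwise
    cross-products, mapping R^p into R^q with q > p."""
    cols = [X, X ** 2]
    cols += [X[:, [i]] * X[:, [j]] for i, j in combinations(range(X.shape[1]), 2)]
    return np.hstack(cols)

X = np.random.randn(100, 2)
H = expand_quadratic(X)   # 100 x 5: x1, x2, x1^2, x2^2, x1*x2
```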

4.2 Linear Regression of an Indicator Matrix

Here each of the response categories is coded via an indicator variable. If $G$ has $K$ classes, there will be $K$ such indicators $Y_k$, $k = 1,\ldots,K$, with $Y_k = 1$ if $G = k$ and $0$ otherwise. These are collected together in a vector $Y = (Y_1,\ldots,Y_K)$, and the $N$ training instances of these form an $N\times K$ indicator response matrix $\mathbf{Y}$ (a 0-1 matrix, each row having a single 1). We fit a linear regression model to each of the columns of $\mathbf{Y}$ simultaneously, and the fit is given by $\hat{\mathbf{Y}} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$. Note that

  • there is a coefficient vector for each response column $\mathbf{y}_k$;
  • the matrix $\hat{\mathbf{B}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$ has $(p+1)\times K$ coefficients;
  • we compute the fitted output $\hat{f}(x)^T = (1, x^T)\hat{\mathbf{B}}$, a $K$-vector;
  • we identify the largest component and classify accordingly: $\hat{G}(x) = \arg\max_{k\in\mathcal{G}} \hat{f}_k(x)$ (see the sketch after this list).
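A minimal NumPy sketch of the steps just listed, assuming the class labels are coded $0,\ldots,K-1$; the function names are illustrative only.

```python
import numpy as np

def fit_indicator_regression(X, g, K):
    """Linear regression of an indicator response matrix.
    X: N x p inputs; g: length-N integer labels in {0, ..., K-1}."""
    N = X.shape[0]
    Y = np.zeros((N, K))
    Y[np.arange(N), g] = 1.0                        # N x K indicator matrix
    X1 = np.hstack([np.ones((N, 1)), X])            # prepend intercept column
    B_hat = np.linalg.lstsq(X1, Y, rcond=None)[0]   # (p+1) x K coefficient matrix
    return B_hat

def predict_indicator_regression(B_hat, X):
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    F_hat = X1 @ B_hat                              # fitted K-vector f_hat(x) per row
    return np.argmax(F_hat, axis=1)                 # classify to the largest component
```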

What is the rationale for this approach? We know that regression estimates the conditional expectation. So for the random variable $Y_k$, $E(Y_k|X=x) = \Pr(G=k|X=x)$, and conditional expectation of each of the $Y_k$ seems a sensible goal. However, the $\hat{f}_k(x)$ need not be reasonable estimates of the posterior probabilities $\Pr(G=k|X=x)$ (they can be negative or greater than 1). In fact, on many problems this approach gives similar results to more standard linear methods for classification. If we allow linear regression onto basis expansions $h(X)$ of the inputs, it can lead to consistent estimates of the probabilities. As the size $N$ of the training set grows, we adaptively include more basis elements so that linear regression onto these basis functions approaches conditional expectation. We discuss such approaches in Chapter 5.

A more simplistic viewpoint is to construct targets $t_k$ for each class, where $t_k$ is the $k$th column of the $K\times K$ identity matrix. With the same coding as before, the response vector $y_i$ (the $i$th row of $\mathbf{Y}$) for observation $i$ has the value $y_i = t_k$ if $g_i = k$. We might then fit the linear model by least squares: $$\min_{\mathbf{B}}\sum_{i=1}^N \|y_i - [(1, x_i^T)\mathbf{B}]^T\|^2.$$ The criterion is a sum of squared Euclidean distances of the fitted vectors from their targets. A new observation is classified by computing its fitted vector $\hat{f}(x)$ and classifying to the closest target: $\hat{G}(x) = \arg\min_k \|\hat{f}(x) - t_k\|^2$. This is exactly the same as the previous approach:

  • The closest target classification rule (4.6) is easily seen to be exactly the same as the maximum fitted component criterion (4.4).

There is a serious problem with the regression approach when the number of classes $K \geq 3$, especially prevalent when $K$ is large. Because of the rigid nature of the regression model, classes can be masked by others. See Figures 4.2 and 4.3.

For the cases in Figure 4.3, if the $K \geq 3$ classes are lined up, polynomial terms up to degree $K-1$ might be needed to resolve them. So in a $p$-dimensional input space, one would need general polynomial terms and cross-products of total degree $K-1$, $O(p^{K-1})$ terms in all, to resolve such worst-case scenarios. The example is extreme, but for large $K$ and small $p$ such maskings naturally occur. As a more realistic illustration, Figure 4.4 is a projection of the training data for a vowel recognition problem onto an informative two-dimensional subspace.

4.3 Linear Discriminant Analysis

Decision theory for classification (Section 2.4) tells us that we need to know the class posteriors $\Pr(G|X)$ for optimal classification. Suppose $f_k(x)$ is the class-conditional density of $X$ in class $G = k$, and let $\pi_k$ be the prior probability of class $k$, with $\sum_{k=1}^K \pi_k = 1$. A simple application of Bayes' theorem gives us $$\Pr(G=k|X=x) = \frac{f_k(x)\pi_k}{\sum_{\ell=1}^K f_\ell(x)\pi_\ell}.$$ We see that in terms of ability to classify, having the $f_k(x)$ is almost equivalent to having the quantities $\Pr(G=k|X=x)$.

Many techniques are based on models for the class densities:

  • linear and quadratic discriminant analysis use Gaussian densities;
  • more flexible mixtures of Gaussians allow for nonlinear decision boundaries (Section 6.8);
  • general nonparametric density estimates for each class density allow the most flexibility (Section 6.6.2);
  • Naive Bayes models are a variant of the previous case, and assume that each of the class densities are products of marginal densities; that is, they assume that the inputs are conditionally independent in each class (Section 6.6.3).

Suppose that we model each class density as multivariate Gaussian $$f_k(x) = \frac{1}{(2\pi)^{p/2}|\Sigma_k|^{1/2}} \exp\Big(-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)\Big).$$

Linear discriminant analysis (LDA) arises in the special case when we assume that the classes have a common covariance matrix $\Sigma_k = \Sigma$ for all $k$. In comparing two classes $k$ and $\ell$, it is sufficient to look at the log-ratio, and we see that $$\tag{4.9} \begin{aligned} \log \frac{\Pr(G=k|X=x)}{\Pr(G=\ell|X=x)} &= \log\frac{f_k(x)}{f_{\ell}(x)}+\log\frac{\pi_k}{\pi_{\ell}}\\ &=\log\frac{\pi_k}{\pi_{\ell}}-\frac{1}{2}(\mu_k+\mu_{\ell})^T\Sigma^{-1}(\mu_k-\mu_{\ell})+x^T\Sigma^{-1}(\mu_k-\mu_{\ell}), \end{aligned}$$ an equation linear in $x$. For any pair of classes $k, \ell$, the decision boundary, the set where $\Pr(G=k|X=x) = \Pr(G=\ell|X=x)$, is linear in $x$; in $p$ dimensions it is a hyperplane. Figure 4.5 shows an example with three classes from three Gaussian distributions with a common covariance matrix and $p = 2$.

Notice that the decision boundaries are not the perpendicular bisectors of the line segments joining the centroids. This (perpendicular) would be the case if the covariance $\Sigma$ were spherical, $\sigma^2 I$, and the class priors were equal. From (4.9) we see that the linear discriminant functions

$$\delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log\pi_k$$ are an equivalent description of the decision rule, with $G(x) = \arg\max_k \delta_k(x)$.

In practice we do not know the parameters of the Gaussian distributions, and will need to estimate them using our training data:

  • $\hat{\pi}_k = N_k/N$, where $N_k$ is the number of class-$k$ observations;
  • $\hat{\mu}_k=\sum_{g_i=k}x_i/N_k$;
  • $\hat{\Sigma} = \sum_{k=1}^K\sum_{g_i=k}(x_i-\hat{\mu}_k)(x_i-\hat{\mu}_k)^T/(N-K)$ (see the sketch after this list).
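A minimal NumPy sketch of these estimates and of the resulting discriminant functions (labels again assumed coded $0,\ldots,K-1$; names are illustrative, not the book's code):

```python
import numpy as np

def fit_lda(X, g, K):
    """Estimate the priors, class means and pooled covariance used by LDA."""
    N, p = X.shape
    pi_hat = np.array([np.mean(g == k) for k in range(K)])         # N_k / N
    mu_hat = np.array([X[g == k].mean(axis=0) for k in range(K)])  # K x p class means
    Sigma_hat = np.zeros((p, p))
    for k in range(K):
        R = X[g == k] - mu_hat[k]
        Sigma_hat += R.T @ R
    Sigma_hat /= N - K                                             # pooled covariance
    return pi_hat, mu_hat, Sigma_hat

def lda_discriminants(X, pi_hat, mu_hat, Sigma_hat):
    """delta_k(x) = x^T S^{-1} mu_k - 0.5 mu_k^T S^{-1} mu_k + log pi_k, for all k."""
    Sinv_mu = np.linalg.solve(Sigma_hat, mu_hat.T)                 # p x K
    return X @ Sinv_mu - 0.5 * np.sum(mu_hat.T * Sinv_mu, axis=0) + np.log(pi_hat)
```

Classification is then `np.argmax(lda_discriminants(X, pi_hat, mu_hat, Sigma_hat), axis=1)`.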

With two classes, the LDA rule classifies to class 2 if $$x^T\hat{\Sigma}^{-1}(\hat{\mu}_2-\hat{\mu}_1) > \frac{1}{2}(\hat{\mu}_2+\hat{\mu}_1)^T\hat{\Sigma}^{-1}(\hat{\mu}_2-\hat{\mu}_1) - \log(N_2/N_1).$$ Suppose we code the targets in the two classes as $+1$ and $-1$, respectively. It is easy to show that the coefficient vector from least squares is proportional to the LDA direction given in (4.11) (Exercise 4.2). [In fact, this correspondence occurs for any (distinct) coding of the targets; see Exercise 4.2.] However, unless $N_1 = N_2$ the intercepts are different and hence the resulting decision rules are different.

Exercise 4.2.

Since this derivation of the LDA direction via least squares does not use a Gaussian assumption for the features, its applicability extends beyond the realm of Gaussian data. However the derivation of the particular intercept or cut-point given in (4.11) does require Gaussian data. Thus it makes sense to instead choose the cut-point that empirically minimizes training error for a given dataset. This is something we have found to work well in practice, but have not seen it mentioned in the literature.

With more than two classes, LDA is not the same as linear regression of the class indicator matrix, and it avoids the masking problems associated with that approach (Hastie et al., 1994). A correspondence between regression and LDA can be established through the notion of optimal scoring, discussed in Section 12.5.

Getting back to the general discriminant problem (4.8), if the $\Sigma_k$ are not assumed to be equal, then the convenient cancellations in (4.9) do not occur, and we get quadratic discriminant functions (QDA), $$\delta_k(x) = -\frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) + \log\pi_k.$$ The decision boundary between each pair of classes $k$ and $\ell$ is described by a quadratic equation $\{x: \delta_k(x) = \delta_\ell(x)\}$.
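A minimal sketch of this quadratic discriminant function, assuming per-class means, covariances and priors have been estimated separately (as in the LDA sketch above but without pooling); illustrative only.

```python
import numpy as np

def qda_discriminant(x, mu_k, Sigma_k, pi_k):
    """delta_k(x) = -0.5 log|Sigma_k| - 0.5 (x-mu_k)^T Sigma_k^{-1} (x-mu_k) + log pi_k."""
    d = x - mu_k
    _, logdet = np.linalg.slogdet(Sigma_k)          # numerically stable log-determinant
    maha = d @ np.linalg.solve(Sigma_k, d)          # quadratic (Mahalanobis) term
    return -0.5 * logdet - 0.5 * maha + np.log(pi_k)
```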

Figure 4.6 shows an example (from Figure 4.1 on page 103) where the three classes are Gaussian mixtures (Section 6.8) and the decision boundaries are approximated by quadratic equations in x .

For this figure and many similar figures in the book we compute the decision boundaries by an exhaustive contouring method. We compute the decision rule on a fine lattice of points, and then use contouring algorithms to compute the boundaries.

When $p$ is large this can mean a dramatic increase in parameters. For LDA, it seems there are $(K-1)\times(p+1)$ parameters, while for QDA there will be $(K-1)\times\{p(p+1)/2+p+1\}$ parameters. Both LDA and QDA perform well on an amazingly large and diverse set of classification tasks. The question arises why LDA and QDA have such a good track record. The reason is not likely to be that the data are approximately Gaussian, nor, in addition for LDA, that the covariances are approximately equal. More likely, the data can only support simple decision boundaries such as linear or quadratic, and the estimates provided via the Gaussian models are stable. This is a bias-variance tradeoff: we can put up with the bias of a linear decision boundary because it can be estimated with much lower variance than more exotic alternatives. This argument is less believable for QDA, since it can have many parameters itself, although perhaps fewer than the nonparametric alternatives.

4.3.1 Regularized Discriminant Analysis

Friedman (1989) proposed a compromise between LDA and QDA, which allows one to shrink the separate covariances of QDA toward a common covariance as in LDA. These methods are very similar in flavor to ridge regression. The regularized covariance matrices have the form $$\tag{4.13} \hat{\Sigma}_k(\alpha) = \alpha\hat{\Sigma}_k + (1-\alpha)\hat{\Sigma},$$ where $\alpha\in[0,1]$ and $\hat{\Sigma}$ is the pooled covariance matrix as used in LDA. In practice $\alpha$ can be chosen based on the performance of the model on validation data, or by cross-validation.

Figure 4.7 shows the results of RDA applied to the vowel data.

Similar modifications allow $\hat{\Sigma}$ itself to be shrunk toward the scalar covariance,

$$\tag{4.14} \hat{\Sigma}(\gamma) = \gamma\hat{\Sigma} + (1-\gamma)\hat{\sigma}^2 I$$

for $\gamma\in[0,1]$. Replacing $\hat{\Sigma}$ in (4.13) by $\hat{\Sigma}(\gamma)$ leads to a more general family of covariances $\hat{\Sigma}(\alpha,\gamma)$ indexed by a pair of parameters.
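This two-parameter family can be written down directly; in the sketch below the choice of $\hat{\sigma}^2$ as the average diagonal element of the pooled covariance is an assumption of mine, since the text only writes $\hat{\sigma}^2$.

```python
import numpy as np

def regularized_covariance(Sigma_k, Sigma_pooled, alpha, gamma):
    """Sigma_k(alpha, gamma): shrink Sigma_k toward Sigma(gamma) as in (4.13)-(4.14).
    sigma2 is taken as the average diagonal element of the pooled covariance;
    this particular choice is an assumption, not specified in the text."""
    p = Sigma_pooled.shape[0]
    sigma2 = np.trace(Sigma_pooled) / p
    Sigma_gamma = gamma * Sigma_pooled + (1.0 - gamma) * sigma2 * np.eye(p)  # (4.14)
    return alpha * Sigma_k + (1.0 - alpha) * Sigma_gamma                     # (4.13)
```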

In Chapter 12, we discuss other regularized versions of LDA, which are more suitable when the data arise from digitized analog signals and images. In Chapter 18 we also deal with very high-dimensional problems, where for example the features are gene-expression measurements in microarray studies. There the methods focus on the case $\gamma = 0$ in (4.14), and other severely regularized versions of LDA.

4.3.2 Computations for LDA

We briefly digress on the computations required for LDA and especially QDA. Compute the eigen-decomposition $\hat{\Sigma}_k=\mathbf{U}_k\mathbf{D}_k\mathbf{U}_k^T$ for each class, where $\mathbf{U}_k$ is $p\times p$ orthonormal and $\mathbf{D}_k$ is a diagonal matrix of positive eigenvalues $d_{k\ell}$. Then the ingredients for $\delta_k(x)$ (4.12) are

  • $(x-\hat{\mu}_k)^T\hat{\Sigma}_k^{-1}(x-\hat{\mu}_k) = [\mathbf{U}_k^T(x-\hat{\mu}_k)]^T\mathbf{D}_k^{-1}[\mathbf{U}_k^T(x-\hat{\mu}_k)]$;
  • $\log|\hat{\Sigma}_k|=\sum_{\ell}\log d_{k\ell}$.

In light of the computational steps outlined above, the LDA classifier can be implemented by the following pair of steps:

  • Sphere the data with respect to the common covariance estimate $\hat{\Sigma}$: $X^* \leftarrow \mathbf{D}^{-\frac{1}{2}}\mathbf{U}^TX$, where $\hat{\Sigma} = \mathbf{U}\mathbf{D}\mathbf{U}^T$. The common covariance estimate of $X^*$ will now be the identity.

    Since $\hat{\Sigma} = \mathbf{U}\mathbf{D}\mathbf{U}^T$, $$\hat{\Sigma}^* = \mathbf{D}^{-\frac{1}{2}}\mathbf{U}^T\hat{\Sigma}\mathbf{U}\mathbf{D}^{-\frac{1}{2}} = \mathbf{D}^{-\frac{1}{2}}\mathbf{U}^T\mathbf{U}\mathbf{D}\mathbf{U}^T\mathbf{U}\mathbf{D}^{-\frac{1}{2}} = \mathbf{I}.$$

  • Classify to the closest class centroid in the transformed space, modulo the effect of the class prior probabilities $\pi_k$ (a code sketch follows the list).

    $$\delta^*_k(x)=-\frac{1}{2}\|x^*-\hat{\mu}^*_k\|_2^2+\log \pi_k$$
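A minimal sketch of this two-step procedure, reusing the estimates from the earlier LDA sketch; helper names are illustrative.

```python
import numpy as np

def lda_sphere_and_classify(X, pi_hat, mu_hat, Sigma_hat):
    """Sphere the data w.r.t. the pooled covariance, then classify to the closest
    centroid modulo log prior: delta*_k(x) = -0.5 ||x* - mu*_k||^2 + log pi_k."""
    D, U = np.linalg.eigh(Sigma_hat)        # Sigma_hat = U diag(D) U^T
    S = U / np.sqrt(D)                      # right-multiplying by S applies D^{-1/2} U^T
    X_star = X @ S
    mu_star = mu_hat @ S
    d2 = ((X_star[:, None, :] - mu_star[None, :, :]) ** 2).sum(axis=2)  # N x K distances
    return np.argmax(-0.5 * d2 + np.log(pi_hat), axis=1)
```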

4.3.3 Reduced-Rank Linear Discriminant Analysis

LDA as a restricted Gaussian classifier allows us to view informative low-dimensional projections of the data. The $K$ centroids in $p$-dimensional input space lie in an affine subspace of dimension at most $K-1$, and if $p$ is much larger than $K$, this will be a considerable drop in dimension. Moreover, in locating the closest centroid, we can ignore distances orthogonal to this subspace, since they contribute equally to each class. Thus there is a fundamental dimension reduction in LDA, namely that we need only consider the data in a subspace of dimension at most $K-1$.

We might then ask for an $L < K-1$ dimensional subspace $H_L \subseteq H_{K-1}$ optimal for LDA in some sense. Fisher defined optimal to mean that the projected centroids are spread out as much as possible in terms of variance. This amounts to finding principal component subspaces of the centroids themselves.

Figure 4.4 shows such an optimal two-dimensional subspace for the vowel data. Here there are eleven classes, each a different vowel sound, in a ten-dimensional input space. The centroids require the full space in this case, since $K-1 = p$, but we have shown an optimal two-dimensional subspace. The dimensions are ordered, so we can compute additional dimensions in sequence.

Figure 4.8 shows four additional pairs of coordinates, also known as canonical or discriminant variables.

In summary, finding the sequence of optimal subspaces for LDA involves the following steps (sketched in code after the list):

  • compute the $K\times p$ matrix of class centroids $\mathbf{M}$ and the common covariance matrix $\mathbf{W}$ ($\mathbf{W}$ for within-class covariance);
  • compute $\mathbf{M}^* = \mathbf{M}\mathbf{W}^{-\frac{1}{2}}$ using the eigen-decomposition of $\mathbf{W}$;
  • compute $\mathbf{B}^*$, the covariance matrix of $\mathbf{M}^*$ ($\mathbf{B}$ for between-class covariance), and its eigen-decomposition $\mathbf{B}^*=\mathbf{V}^*\mathbf{D}_{B}\mathbf{V}^{*T}$. The columns $v_{\ell}^*$ of $\mathbf{V}^*$ in sequence from first to last define the coordinates of the optimal subspaces.
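A minimal NumPy sketch of these steps (as referenced above); the function name and the use of `np.cov` for the between-class covariance are my own choices, not the book's.

```python
import numpy as np

def discriminant_coordinates(M, W):
    """Return the p x p matrix whose columns are v_l = W^{-1/2} v*_l, ordered by
    decreasing between-class variance; Z_l = v_l^T X are the canonical variates."""
    Dw, Uw = np.linalg.eigh(W)                         # W = Uw diag(Dw) Uw^T
    W_inv_sqrt = Uw @ np.diag(1.0 / np.sqrt(Dw)) @ Uw.T
    M_star = M @ W_inv_sqrt                            # whitened centroids (K x p)
    B_star = np.cov(M_star, rowvar=False)              # between-class covariance B*
    Db, Vb = np.linalg.eigh(B_star)
    V_star = Vb[:, np.argsort(Db)[::-1]]               # largest eigenvalues first
    return W_inv_sqrt @ V_star
```

A two-dimensional view such as Figure 4.4 is then `Z = X @ V[:, :2]`, where `V` is the returned matrix.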

Combining all these operations, the $\ell$th discriminant variable is given by $Z_\ell = v_\ell^T X$ with $v_\ell = \mathbf{W}^{-\frac{1}{2}}v_\ell^*$.

Fisher arrived at this decomposition via a different route, without referring to Gaussian distributions at all. He posed the problem:

Find the linear combination $Z = a^TX$ such that the between-class variance is maximized relative to the within-class variance.

The between-class variance matrix is the variance of the class means of $X$, $$\mathbf{B} = \sum_{k=1}^{K}\sum_{g_{i}=k}(\hat{\mu}_{k}-\hat{\mu})(\hat{\mu}_{k}-\hat{\mu})^{T}/(K-1),$$

and the within-class variance is the pooled variance about the means, $$\mathbf{W} = \sum_{k=1}^{K}\sum_{g_{i}=k}(x_{i}-\hat{\mu}_{k})(x_{i}-\hat{\mu}_{k})^{T}/(N-K).$$ The total covariance matrix of $X$, ignoring class information, is $$\mathbf{T} = \sum_{k=1}^K\sum_{g_i=k}(x_i-\hat{\mu})(x_i-\hat{\mu})^T/(N-1).$$ It is easy to prove that $\mathbf{T} = \mathbf{B} + \mathbf{W}$ (for the corresponding unnormalized sums of squares). The between-class variance of $Z$ is $a^T\mathbf{B}a$ and the within-class variance is $a^T\mathbf{W}a$.

Figure 4.9 shows why this criterion makes sense.

Fisher's problem therefore amounts to maximizing the Rayleigh quotient, $$\max_a \frac{a^T\mathbf{B}a}{a^T\mathbf{W}a},$$ or equivalently $$\max_a a^T\mathbf{B}a \ \text{ subject to } \ a^T\mathbf{W}a = 1,$$

which we can rewrite, after the convenient change of basis $a^* = \mathbf{W}^{1/2}a$ so that $a^T\mathbf{B}a = a^{*T}\mathbf{W}^{-1/2}\mathbf{B}\mathbf{W}^{-1/2}a^* = a^{*T}\mathbf{B}^*a^*$, as $$\min_{a^*} -\frac{1}{2}a^{*T}\mathbf{B}^*a^* \ \text{ subject to } \ a^{*T}a^*=1.$$ The Lagrangian for this problem is $$L=-\frac{1}{2}a^{*T}\mathbf{B}^*a^*+\frac{1}{2}\lambda(a^{*T}a^*-1),$$ and the Karush-Kuhn-Tucker conditions give $$\mathbf{B}^*a^*=\lambda a^* \iff \mathbf{W}^{-1}\mathbf{B}a=\lambda a.$$ Thus the optimal $a^*$ corresponds to the largest eigenvalue of $\mathbf{B}^*$, that is $v_{1}^*$, and $a$ is given by the eigenvector of $\mathbf{W}^{-1}\mathbf{B}$ with the largest eigenvalue. Similarly one can find the next direction $v_{2}^*$, and so on, with $v_{\ell}=\mathbf{W}^{-\frac{1}{2}}v_{\ell}^*$. It is not hard to show (Exercise 4.1) that the optimal $a_1$ is identical to $v_1$ defined above.

The $a_\ell$ are referred to as discriminant coordinates, not to be confused with discriminant functions. They are also referred to as canonical variates, since an alternative derivation of these results is through a canonical correlation analysis of the indicator response matrix $\mathbf{Y}$ on the predictor matrix $\mathbf{X}$. This line is pursued in Section 12.5.

To summarize the developments so far:

  • Gaussian classification with common covariances leads to linear decision boundaries. Classification can be achieved by sphering the data with respect to $\mathbf{W}$, and classifying to the closest centroid (modulo $\log\pi_k$) in the sphered space.
  • Since only the relative distances to the centroids count, one can confine the data to the subspace spanned by the centroids in the sphered space.
  • This subspace can be further decomposed into successively optimal subspaces in terms of centroid separation. This decomposition is identical to the decomposition due to Fisher.

One can show that this is a Gaussian classification rule with the additional restriction that the centroids of the Gaussians lie in an $L$-dimensional subspace of $\mathbb{R}^p$. Fitting such a model by maximum likelihood, and then constructing the posterior probabilities using Bayes' theorem, amounts to the classification rule described above (Exercise 4.8).

Exercise 4.8

Gaussian classification dictates the $\log\pi_k$ correction factor in the distance calculation. The reason for this correction can be seen in Figure 4.9. The misclassification rate is based on the area of overlap between the two densities. If the $\pi_k$ are equal (implicit in that figure), then the optimal cut-point is midway between the projected means. If the $\pi_k$ are not equal, moving the cut-point toward the smaller class will improve the error rate. As mentioned earlier for two classes, one can derive the linear rule using LDA (or any other method), and then choose the cut-point to minimize misclassification error over the training data.

Figure 4.10 shows the results. Figure 4.11 shows the decision boundaries for the classifier based on the two-dimensional LDA solution.

There is a close connection between Fisher's reduced-rank discriminant analysis and regression of an indicator response matrix. It turns out that LDA amounts to the regression followed by an eigen-decomposition of $\hat{\mathbf{Y}}^T\mathbf{Y}$. In the case of two classes, there is a single discriminant variable that is identical, up to a scalar multiple, to either of the columns of $\hat{\mathbf{Y}}$. These connections are developed in Chapter 12. A related fact is that if one transforms the original predictors $\mathbf{X}$ to $\hat{\mathbf{Y}}$, then LDA using $\hat{\mathbf{Y}}$ is identical to LDA in the original space (Exercise 4.3).

Exercise 4.3.

4.4 Logistic Regression

The logistic regression model arises from the desire to model the posterior probabilities of the $K$ classes via linear functions in $x$, while at the same time ensuring that they sum to one and remain in $[0,1]$: $$\tag{4.18} \begin{aligned} \Pr(G=k|X=x) &= \frac{\exp(\beta_{k0}+\beta_k^Tx)}{1+\sum_{\ell=1}^{K-1}\exp(\beta_{\ell 0}+\beta_{\ell}^Tx)}, \quad k = 1,\ldots, K-1,\\ \Pr(G=K|X=x) &= \frac{1}{1+\sum_{\ell=1}^{K-1}\exp(\beta_{\ell 0}+\beta_{\ell}^Tx)}. \end{aligned}$$ The entire parameter set is $\theta = \{\beta_{10}, \beta_1^T, \ldots, \beta_{(K-1)0}, \beta_{K-1}^T\}$, and we denote the probabilities by $\Pr(G=k|X=x) = p_k(x;\theta)$.

4.4.1 Fitting Logistic Regression Models

Logistic regression models are usually fit by maximum likelihood, using the conditional likelihood of $G$ given $X$. Since $\Pr(G|X)$ completely specifies the conditional distribution, the multinomial distribution is appropriate. The log-likelihood for $N$ observations is $$\ell(\theta) = \sum_{i=1}^N \log p_{g_i}(x_i;\theta).$$

In the two-class case, via a $0/1$ response $y_i$, the log-likelihood can be written $$\ell(\beta) = \sum_{i=1}^N \big\{ y_i\log p(x_i;\beta) + (1-y_i)\log(1-p(x_i;\beta)) \big\}.$$ Here we assume that the vector of inputs $x_i$ includes the constant term 1 to accommodate the intercept.

Let $\mathbf{X}$ be the $N\times(p+1)$ matrix of $x_i$ values, $\mathbf{p}$ the vector of fitted probabilities with $i$th element $p(x_i;\beta^{\text{old}})$, and $\mathbf{W}$ an $N\times N$ diagonal matrix of weights with $i$th diagonal element $p(x_i;\beta^{\text{old}})(1-p(x_i;\beta^{\text{old}}))$. Then the score and Hessian are $$\frac{\partial\ell(\beta)}{\partial\beta} = \mathbf{X}^T(\mathbf{y}-\mathbf{p}), \qquad \frac{\partial^2\ell(\beta)}{\partial\beta\,\partial\beta^T} = -\mathbf{X}^T\mathbf{W}\mathbf{X}.$$ The Newton step is thus $$\tag{4.26} \begin{aligned} \beta^{\text{new}} &= \beta^{\text{old}} + (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{y}-\mathbf{p})\\ &= (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}(\mathbf{X}\beta^{\text{old}}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p}))\\ &=(\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}\mathbf{z}. \end{aligned}$$ In the second and third lines we have re-expressed the Newton step as a weighted least squares step, with the adjusted response $\mathbf{z} = \mathbf{X}\beta^{\text{old}} + \mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})$. Since at each iteration $\mathbf{p}$ changes, and hence so do $\mathbf{W}$ and $\mathbf{z}$, this algorithm is referred to as iteratively reweighted least squares, or IRLS: each iteration solves the weighted least squares problem $$\beta^{\text{new}} \leftarrow \arg\min_{\beta} (\mathbf{z}-\mathbf{X}\beta)^T\mathbf{W}(\mathbf{z}-\mathbf{X}\beta).$$ It seems that $\beta = 0$ is a good starting value for the iterative procedure, although convergence is never guaranteed. Typically the algorithm does converge, since the log-likelihood is concave, but overshooting can occur. In the rare cases where the log-likelihood decreases, step-size halving will guarantee convergence.
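A minimal NumPy sketch of this IRLS iteration for the two-class case; it assumes $y \in \{0,1\}$ and an intercept column already appended to $\mathbf{X}$, and is an illustration rather than production code.

```python
import numpy as np

def logistic_irls(X, y, n_iter=25, tol=1e-8):
    """Two-class logistic regression fit by IRLS / Newton-Raphson (4.26).
    X is N x (p+1) with an intercept column already included; y is 0/1."""
    beta = np.zeros(X.shape[1])                        # beta = 0 as starting value
    for _ in range(n_iter):
        eta = X @ beta
        prob = 1.0 / (1.0 + np.exp(-eta))              # p(x_i; beta)
        w = np.clip(prob * (1.0 - prob), 1e-10, None)  # diagonal of W (clipped for safety)
        z = eta + (y - prob) / w                       # adjusted response
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```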

For the multiclass case ($K \geq 3$) the Newton algorithm can also be expressed as an iteratively reweighted least squares algorithm, but with a vector of $K-1$ responses and a nondiagonal weight matrix per observation. The latter precludes any simplified algorithms, and in this case it is numerically more convenient to work with the expanded vector $\theta$ directly (Exercise 4.4).

Exercise 4.4.

Alternatively coordinate-descent methods (Section 3.8.6) can be used to maximize the log-likelihood efficiently.

Logistic regression models are used mostly as a data analysis and inference tool, where the goal is to understand the role of the input variables in explaining the outcome. Typically many models are fit in a search for a parsimonious model involving a subset of the variables, possibly with some interaction terms. The following example illustrates some of the issues involved.

  • Example: South African Heart Disease

4.4.3 Quadratic Approximations and Inference

The maximum-likelihood parameter estimates $\hat{\beta}$ satisfy a self-consistency relationship: they are the coefficients of a weighted least squares fit, where the responses are $$z_i = x_i^T\hat{\beta} + \frac{y_i - \hat{p}_i}{\hat{p}_i(1-\hat{p}_i)},$$ and the weights are $w_i = \hat{p}_i(1-\hat{p}_i)$, both depending on $\hat{\beta}$ itself. Apart from providing a convenient algorithm, this connection with least squares has more to offer:

  • The weighted residual sum-of-squares is the familiar Pearson chi-square statistic $$\sum_{i=1}^N \frac{(y_i - \hat{p}_i)^2}{\hat{p}_i(1-\hat{p}_i)},$$ a quadratic approximation to the deviance.
  • Asymptotic likelihood theory says that if the model is correct, then $\hat{\beta}$ is consistent (i.e., converges to the true $\beta$).
  • A central limit theorem then shows that the distribution of $\hat{\beta}$ converges to $N(\beta, (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1})$. This and other asymptotics can be derived directly from the weighted least squares fit by mimicking normal theory inference.

    For the weighted least squares fit, the estimated parameters are linear combinations of the observed values, $\hat{\beta} = (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}\mathbf{y}$. Therefore, an expression for the estimated variance-covariance matrix of the parameter estimates can be obtained by error propagation from the errors in the observations. Let the variance-covariance matrix of the observations be denoted by $\mathbf{M}$ and that of the estimated parameters by $\mathbf{M}^{\beta}$. Then $$\mathbf{M}^{\beta} = (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}\mathbf{M}\mathbf{W}^T\mathbf{X}(\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}.$$ When $\mathbf{W} = \mathbf{M}^{-1}$, this simplifies to $$\mathbf{M}^{\beta} = (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}.$$ When unit weights are used ($\mathbf{W} = \mathbf{I}$, the identity matrix), it is implied that the experimental errors are uncorrelated and all equal: $\mathbf{M} = \sigma^2\mathbf{I}$, where $\sigma^2$ is the a priori variance of an observation. (A short code sketch of these standard errors follows this list.)

  • Model building can be costly for logistic regression models, because each model fitted requires iteration. Popular shortcuts are the Rao score test which tests for inclusion of a term, and the Wald test which can be used to test for exclusion of a term. Neither of these require iterative fitting, and are based on the maximum-likelihood fit of the current model. It turns out that both of these amount to adding or dropping a term from the weighted least squares fit, using the same weights. Such computations can be done efficiently, without recomputing the entire weighted least squares fit.
  • GLM (generalized linear model) objects can be treated as linear model objects, and all the tools available for linear models can be applied automatically.
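Returning to the variance-covariance matrix in the third bullet above, the standard errors follow directly from $(\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}$ evaluated at the fit; a short sketch reusing the illustrative IRLS fit from earlier:

```python
import numpy as np

def logistic_standard_errors(X, beta_hat):
    """Standard errors from the estimated covariance (X^T W X)^{-1} at beta_hat."""
    prob = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))
    w = prob * (1.0 - prob)
    cov = np.linalg.inv(X.T @ (w[:, None] * X))   # asymptotic variance-covariance matrix
    return np.sqrt(np.diag(cov))
```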

4.4.4 L1 Regularized Logistic Regression

For logistic regression, we would maximize a penalized version of (4.20):
$$\tag{4.31}
\max_{\beta} \bigg\{\sum_{i=1}^N\big[y_i\log p(x_i;\beta)+(1-y_i)\log(1-p(x_i;\beta))\big]-\lambda\sum_{j=1}^p|\beta_j|\bigg\}.
$$
The score equations [see (4.24)] for the variables with non-zero coefficients have the form $$x_j^T(\mathbf{y}-\mathbf{p}) = \lambda\cdot\operatorname{sign}(\beta_j),$$ which generalizes (3.58) in Section 3.4.4; the active variables are tied in their generalized correlation with the residuals. Path algorithms such as LAR for the lasso are more difficult, because the coefficient profiles are piecewise smooth rather than piecewise linear. Nevertheless, progress can be made using quadratic approximations.

Figure 4.13 shows the $L_1$ regularization path for the South African heart disease data of Section 4.4.2.

Coordinate descent methods (Section 3.8.6) are very efficient for computing the coefficient profiles on a grid of values for λ .
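A sketch of computing such a profile on a grid of $\lambda$ values using scikit-learn's L1-penalized logistic regression (an outside library, not part of the book; its `C` parameter plays the role of $1/\lambda$ up to scaling):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def l1_logistic_path(X, y, lambdas):
    """Coefficient profiles over a grid of L1 penalties (C = 1/lambda up to scaling)."""
    coefs = []
    for lam in lambdas:
        clf = LogisticRegression(penalty="l1", C=1.0 / lam, solver="liblinear")
        clf.fit(X, y)
        coefs.append(clf.coef_.ravel())
    return np.array(coefs)   # one row of coefficients per value of lambda
```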

4.4.5 Logistic Regression or LDA?

In Section 4.3, we found that the log-posterior odds between class $k$ and $K$ are linear functions of $x$ (4.9): $$\tag{4.33} \begin{aligned} \log \frac{\Pr(G=k|X=x)}{\Pr(G=K|X=x)} &=\log\frac{\pi_k}{\pi_{K}}-\frac{1}{2}(\mu_k+\mu_{K})^T\Sigma^{-1}(\mu_k-\mu_{K})+x^T\Sigma^{-1}(\mu_k-\mu_{K})\\ &= \alpha_{k0}+\alpha_{k}^Tx. \end{aligned}$$ This linearity is a consequence of the Gaussian assumption for the class densities, as well as the assumption of a common covariance matrix. The linear logistic model (4.17) by construction has linear logits: $$\log\frac{\Pr(G=k|X=x)}{\Pr(G=K|X=x)} = \beta_{k0} + \beta_k^Tx.$$

It seems that the models are the same. Although they have exactly the same form, the difference lies in the way the linear coefficients are estimated. The logistic regression model is more general, in that it makes fewer assumptions. We can write the joint density of $X$ and $G$ as $$\Pr(X, G=k) = \Pr(X)\Pr(G=k|X),$$ where $\Pr(X)$ denotes the marginal density of the inputs $X$. For both LDA and logistic regression, the second term on the right has the logit-linear form $$\Pr(G=k|X=x) = \frac{e^{\beta_{k0}+\beta_k^Tx}}{1+\sum_{\ell=1}^{K-1}e^{\beta_{\ell 0}+\beta_{\ell}^Tx}},$$ where we have again arbitrarily chosen the last class as the reference. The logistic regression model leaves the marginal density of $X$ as an arbitrary density function $\Pr(X)$, and fits the parameters of $\Pr(G|X)$ by maximizing the conditional likelihood, the multinomial likelihood with probabilities the $\Pr(G=k|X)$. Although $\Pr(X)$ is totally ignored, we can think of this marginal density as being estimated in a fully nonparametric and unrestricted fashion, using the empirical distribution function, which places mass $1/N$ at each observation.

With LDA we fit the parameters by maximizing the full log-likelihood based on the joint density $$\Pr(X, G=k) = \phi(X;\mu_k,\Sigma)\pi_k,$$ where $\phi$ is the Gaussian density function. Since the linear parameters of the logistic form (4.33) are functions of the Gaussian parameters, we get their maximum-likelihood estimates by plugging in the corresponding estimates. However, unlike in the conditional case, the marginal density $\Pr(X)$ does play a role here. It is a mixture density $$\Pr(X) = \sum_{k=1}^K\pi_k\phi(X;\mu_k,\Sigma),$$ which also involves the parameters.

The additional model assumption: by relying on the additional model assumptions, we have more information about the parameters, and hence can estimate them more efficiently (lower variance). If in fact the true $f_k(x)$ are Gaussian, then in the worst case ignoring this marginal part of the likelihood constitutes a loss of efficiency of about 30% asymptotically in the error rate (Efron, 1975). Paraphrasing: with 30% more data, the conditional likelihood will do as well.

For example, observations far from the decision boundary (which are down-weighted by logistic regression) play a role in estimating the common covariance matrix. This is not all good news, because it also means that LDA is not robust to gross outliers.

From the mixture formulation, it is clear that even unlabeled observations carry information about the parameters.

The marginal likelihood can be thought of as a regularizer, requiring in some sense that class densities be visible from this marginal view. For example, if the data in a two-class logistic regression model can be perfectly separated by a hyperplane, the maximum likelihood estimates of the parameters are undefined (i.e., infinite; see Exercise 4.5). The LDA coefficients for the same data will be well defined, since the marginal likelihood will not permit these degeneracies.

Exercise 4.5.

In practice these assumptions are never correct, and often some of the components of X are qualitative variables. It is generally felt that logistic regression is a safer, more robust bet than the LDA model, relying on fewer assumptions. It is our experience that the models give very similar results, even when LDA is used inappropriately, such as with qualitative predictors.

4.5 Separating Hyperplanes

Separating hyperplane classifiers construct linear decision boundaries that explicitly try to separate the data into different classes as well as possible. They provide the basis for support vector classifiers, discussed in Chapter 12.

Figure 4.14 shows 20 data points in two classes in $\mathbb{R}^2$.

The orange line is the least squares solution to the problem, obtained by regressing the $-1/1$ response $Y$ on $X$ (with intercept); the line is given by $$\{x: \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 = 0\}.$$

This least squares solution does not do a perfect job in separating the points, and makes one error. This is the same boundary found by LDA, in light of its equivalence with linear regression in the two-class case (Section 4.3 and Exercise 4.2).

Classifiers such as (4.39), that compute a linear combination of the input features and return the sign, were called perceptrons in the engineering literature in the late 1950s.

Before we continue, let us digress slightly and review some vector algebra. Figure 4.15 depicts a hyperplane or affine set $L$ defined by the equation $f(x) = \beta_0 + \beta^Tx = 0$; since we are in $\mathbb{R}^2$ this is a line.

Here we list some properties:

  1. For any two points $x_1$ and $x_2$ lying in $L$, $\beta^T(x_1 - x_2) = 0$, and hence $\beta^* = \beta/\|\beta\|$ is the vector normal to the surface of $L$.
  2. For any point $x_0$ in $L$, $\beta^Tx_0 = -\beta_0$.
  3. The signed distance of any point $x$ to $L$ is given by $$\beta^{*T}(x - x_0) = \frac{1}{\|\beta\|}(\beta^Tx + \beta_0) = \frac{1}{\|f'(x)\|}f(x).$$ Hence $f(x)$ is proportional to the signed distance from $x$ to the hyperplane defined by $f(x) = 0$ (see the sketch below).
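Property 3 translates directly into code; a tiny illustrative sketch:

```python
import numpy as np

def signed_distance(x, beta, beta0):
    """Signed distance from x to the hyperplane {x : beta0 + beta^T x = 0}."""
    return (beta0 + beta @ x) / np.linalg.norm(beta)
```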

4.5.1 Rosenblatt’s Perceptron Learning Algorithm

The perceptron learning algorithm tries to find a separating hyperplane by minimizing the distance of misclassified points to the decision boundary. If a response $y_i = 1$ is misclassified, then $x_i^T\beta + \beta_0 < 0$, and the opposite holds for a misclassified response with $y_i = -1$. The goal is to minimize $$D(\beta, \beta_0) = -\sum_{i\in\mathcal{M}} y_i(x_i^T\beta + \beta_0),$$ where $\mathcal{M}$ indexes the set of misclassified points. The quantity is non-negative and proportional to the distance of the misclassified points to the decision boundary defined by $\beta^Tx + \beta_0 = 0$. The gradient (assuming $\mathcal{M}$ is fixed) is given by $$\frac{\partial D(\beta,\beta_0)}{\partial\beta} = -\sum_{i\in\mathcal{M}} y_ix_i, \qquad \frac{\partial D(\beta,\beta_0)}{\partial\beta_0} = -\sum_{i\in\mathcal{M}} y_i.$$ The algorithm in fact uses stochastic gradient descent to minimize this piecewise linear criterion. Hence the misclassified observations are visited in some sequence, and the parameters $\beta, \beta_0$ are updated via $$\tag{4.44} \begin{pmatrix} \beta \\ \beta_0 \end{pmatrix} \leftarrow \begin{pmatrix} \beta \\ \beta_0 \end{pmatrix} + \rho \begin{pmatrix} y_ix_i \\ y_i \end{pmatrix}.$$ Here $\rho$ is the learning rate, which in this case can be taken to be 1 without loss of generality. If the classes are linearly separable, it can be shown that the algorithm converges to a separating hyperplane in a finite number of steps (Exercise 4.6).
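A minimal NumPy sketch of this update rule, assuming $y_i \in \{-1,+1\}$; illustrative only.

```python
import numpy as np

def perceptron(X, y, rho=1.0, max_epochs=1000):
    """Rosenblatt's perceptron via stochastic gradient steps (4.44); y in {-1, +1}.
    Terminates early only if a separating hyperplane is found."""
    beta = np.zeros(X.shape[1])
    beta0 = 0.0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ beta + beta0) <= 0:   # misclassified (or on the boundary)
                beta = beta + rho * yi * xi     # update beta
                beta0 = beta0 + rho * yi        # update intercept
                errors += 1
        if errors == 0:
            break
    return beta, beta0
```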

Exercise 4.6

Figure 4.14 shows two solutions to a toy problem, each started at a different random guess.

There are a number of problems with this algorithm, summarized in Ripley (1996):

  • When the data are separable, there are many solutions, and which one is found depends on the starting values.
  • The “finite” number of steps can be very large. The smaller the gap, the longer the time to find it.
  • When the data are not separable, the algorithm will not converge, and cycles develop. The cycles can be long and therefore hard to detect.

The second problem can often be eliminated by seeking a hyperplane not in the original space, but in a much enlarged space obtained by creating many basis-function transformations of the original variables. This is analogous to driving the residuals in a polynomial regression problem down to zero by making the degree sufficiently large. Perfect separation cannot always be achieved: for example, if observations from two different classes share the same input. It may not be desirable either, since the resulting model is likely to be overfit and will not generalize well. We return to this point at the end of the next section.

A rather elegant solution to the first problem is to add additional constraints to the separating hyperplane.

4.5.2 Optimal Separating Hyperplanes 🤔

The optimal separating hyperplane separates the two classes by maximizing the margin between the two classes on the training data. We need to generalize criterion (4.41). Consider the optimization problem $$\max_{\beta,\beta_0,\|\beta\|=1} M \ \text{ subject to } \ y_i(x_i^T\beta+\beta_0)\geq M, \quad i = 1,\ldots,N.$$ We can get rid of the $\|\beta\| = 1$ constraint by replacing the conditions with $$\frac{1}{\|\beta\|}y_i(x_i^T\beta+\beta_0)\geq M,$$ or equivalently $y_i(x_i^T\beta+\beta_0)\geq M\|\beta\|$. Since for any $\beta$ and $\beta_0$ satisfying these inequalities, any positively scaled multiple satisfies them too, we can arbitrarily set $\|\beta\| = 1/M$. Thus (4.45) is equivalent to $$\min_{\beta,\beta_0}\frac{1}{2}\|\beta\|^2 \ \text{ subject to } \ y_i(x_i^T\beta+\beta_0)\geq 1, \quad i=1,\ldots,N.$$ In light of (4.40), the constraints define an empty slab or margin around the linear decision boundary of thickness $1/\|\beta\|$. Hence we choose $\beta$ and $\beta_0$ to maximize its thickness. This is a convex optimization problem, and the Lagrange function, to be minimized with respect to $\beta$ and $\beta_0$, is $$L_P = \frac{1}{2}\|\beta\|^2 - \sum_{i=1}^N\alpha_i[y_i(x_i^T\beta+\beta_0)-1].$$ Setting the derivatives to zero, we obtain $$\tag{4.50} \begin{aligned} \beta &= \sum_{i=1}^N\alpha_iy_ix_i,\\ 0 &= \sum_{i=1}^N \alpha_iy_i, \end{aligned}$$ and substituting these in (4.49) we obtain the so-called Wolfe dual $$L_D = \sum_{i=1}^N\alpha_i - \frac{1}{2}\sum_{i=1}^N\sum_{k=1}^N\alpha_i\alpha_ky_iy_kx_i^Tx_k \ \text{ subject to } \ \alpha_i\geq 0 \ \text{ and } \ \sum_{i=1}^N\alpha_iy_i = 0.$$ The solution is obtained by maximizing $L_D$ in the positive orthant, a simpler convex optimization problem, for which standard software can be used. In addition the solution must satisfy the Karush-Kuhn-Tucker conditions, which include (4.50), (4.51), (4.52) and $$\alpha_i[y_i(x^T_i\beta + \beta_0) - 1] = 0 \quad \forall i.$$

From these we can see that

  • if $\alpha_i > 0$, then $y_i(x_i^T\beta+\beta_0) = 1$, or in other words, $x_i$ is on the boundary of the slab;
  • if $y_i(x_i^T\beta+\beta_0) > 1$, $x_i$ is not on the boundary of the slab, and $\alpha_i = 0$.

From (4.50) we see that the solution vector $\beta$ is defined in terms of a linear combination of the support points $x_i$, those points defined to be on the boundary of the slab via $\alpha_i > 0$. Figure 4.16 shows the optimal separating hyperplane for our toy example; there are three support points. Likewise, $\beta_0$ is obtained by solving (4.53) for any of the support points.
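In practice the Wolfe dual is solved with standard convex-optimization software; the sketch below instead approximates the hard-margin solution with scikit-learn's linear support vector classifier and a very large cost parameter (an outside-library illustration, not the book's algorithm).

```python
import numpy as np
from sklearn.svm import SVC

def optimal_separating_hyperplane(X, y):
    """Approximate the hard-margin optimal separating hyperplane; y in {-1, +1},
    data assumed linearly separable."""
    clf = SVC(kernel="linear", C=1e10)      # very large C ~ essentially no slack
    clf.fit(X, y)
    beta = clf.coef_.ravel()
    beta0 = clf.intercept_[0]
    return beta, beta0, clf.support_        # support_ indexes the support points
```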

The optimal separating hyperplane produces a function $\hat{f}(x) = x^T\hat{\beta} + \hat{\beta}_0$ for classifying new observations: $$\hat{G}(x) = \operatorname{sign} \hat{f}(x).$$

Although none of the training observations fall in the margin (by construction), this will not necessarily be the case for test observations. The intuition is that a large margin on the training data will lead to good separation on the test data.

Relation to LDA: The description of the solution in terms of support points seems to suggest that the optimal hyperplane focuses more on the points that count, and is more robust to model misspecification. The LDA solution, on the other hand, depends on all of the data, even points far away from the decision boundary. Note, however, that the identification of these support points required the use of all the data. Of course, if the classes are really Gaussian, then LDA is optimal, and separating hyperplanes will pay a price for focusing on the (noisier) data at the boundaries of the classes.

Relation to logistic regression: When a separating hyperplane exists, logistic regression will always find it, since the log-likelihood can be driven to 0 in this case (Exercise 4.5). The logistic regression solution shares some other qualitative features with the separating hyperplane solution. The coefficient vector is defined by a weighted least squares fit of a zero-mean linearized response on the input features, and the weights are larger for points near the decision boundary than for those further away.

Exercise 4.5.

When the data are not separable, there will be no feasible solution to this problem, and an alternative formulation is needed. Again one can enlarge the space using basis transformations, but this can lead to artificial separation through over-fitting. In Chapter 12 we discuss a more attractive alternative known as the support vector machine, which allows for overlap, but minimizes a measure of the extent of this overlap.