This chapter focuses on linear methods for classification, i.e., methods whose decision boundaries are linear. Since the predictor $G(x)$ takes values in a discrete set $\mathcal{G}$, we can divide the input space into regions labeled according to the classification.
Suppose there are $K$ classes, for convenience labeled $1,2,\dots,K$.
Given discriminant functions $\delta_k(x)$ for each class, or posterior probabilities $\Pr(G=k|X=x)$, we classify $x$ to the class with the largest value; the decision boundaries are linear whenever some monotone transformation of these functions is linear in $x$. Several approaches lead to such linear boundaries:
- The regression approach: Suppose the fitted linear model for the $k$th indicator response variable is $\hat{f}_k(x)=\hat{\beta}_{k0}+\hat{\beta}_k^Tx$. The decision boundary between class $k$ and class $\ell$ is the set of points for which $\hat{f}_k(x)=\hat{f}_{\ell}(x)$, that is, the set $\{x:(\hat{\beta}_{k0}-\hat{\beta}_{\ell 0})+(\hat{\beta}_k-\hat{\beta}_{\ell})^Tx = 0\}$, an affine set or hyperplane.
- The posterior probabilities: A commonly used model is
$$\tag{4.1}
\begin{aligned}
\Pr(G=k|X=x) &= \frac{\exp(\beta_{k0}+\beta_k^Tx)}{1+\sum_{\ell=1}^{K-1}\exp(\beta_{\ell 0}+\beta_{\ell}^Tx)}, \quad k = 1,\dots, K-1,\\
\Pr(G=K|X=x) &= \frac{1}{1+\sum_{\ell=1}^{K-1}\exp(\beta_{\ell 0}+\beta_{\ell}^Tx)}.
\end{aligned}
$$
Here the monotone transformation is the logit transformation, $\log[p/(1-p)]$, and in fact we see that
$$ \log\frac{\Pr(G=k|X=x)}{\Pr(G=K|X=x)} = \beta_{k0}+\beta_k^Tx. $$
The decision boundary is the set of points for which the log-odds are zero. We discuss two very popular but different methods that result in linear log-odds or logits: linear discriminant analysis and linear logistic regression. Although they differ in their derivation, the essential difference between them is in the way the linear function is fit to the training data.
- Explicitly model the boundaries between the classes as linear: The first method of this kind is the well-known perceptron; the second is the optimal separating hyperplane. We treat the separable case here, and defer treatment of the nonseparable case to Chapter 12.
While this entire chapter is devoted to linear decision boundaries, there is considerable scope for generalization. For example, we can expand the variable set $X_1,\dots,X_p$ by including their squares and cross-products $X_1^2,X_2^2,\dots,X_1X_2,\dots$, adding $p(p+1)/2$ additional variables; linear functions in the augmented space map down to quadratic functions in the original space. This approach can be used with any basis transformation $h(X)$, where $h:\mathbb{R}^p\mapsto\mathbb{R}^q$ with $q>p$.
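As a concrete illustration (not from the text), here is a minimal numpy sketch of such a quadratic expansion; the helper name `expand_quadratic` is ours.

```python
import numpy as np
from itertools import combinations_with_replacement

def expand_quadratic(X):
    """Augment X (N x p) with all squares and pairwise cross-products of its columns,
    so that linear boundaries in the expanded space are quadratic in the original space."""
    p = X.shape[1]
    extra = [(X[:, i] * X[:, j])[:, None]
             for i, j in combinations_with_replacement(range(p), 2)]
    return np.hstack([X] + extra)   # adds p(p+1)/2 columns
```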
Here each of the response categories is coded via an indicator variable. Thus if $\mathcal{G}$ has $K$ classes, there will be $K$ such indicators $Y_k$, $k=1,\dots,K$, with $Y_k=1$ if $G=k$, else $0$. These are collected together in a vector $Y=(Y_1,\dots,Y_K)$, and the $N$ training instances form an $N\times K$ indicator response matrix $\mathbf{Y}$. We fit a linear regression model to each of the columns of $\mathbf{Y}$ simultaneously, and the fit is given by $\hat{\mathbf{Y}}=\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$. Note that:
- there is a coefficient vector for each response column $\mathbf{y}_k$, and hence a $(p+1)\times K$ coefficient matrix $\hat{\mathbf{B}}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$, where $\mathbf{X}$ is the model matrix with $p+1$ columns (a leading column of 1's for the intercept);
- a new observation with input $x$ is classified as follows: compute the fitted output $\hat{f}(x)^T=(1,x^T)\hat{\mathbf{B}}$, a $K$-vector;
- identify the largest component and classify accordingly: $\hat{G}(x)=\arg\max_{k\in\mathcal{G}}\hat{f}_k(x)$.
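The steps above can be condensed into a short numpy sketch (ours, not from the text); `g` is assumed to hold integer class labels $0,\dots,K-1$.

```python
import numpy as np

def fit_indicator_regression(X, g, K):
    """Regress the N x K indicator response matrix on (1, X); returns the (p+1) x K matrix B_hat."""
    Y = np.eye(K)[g]                                   # indicator response matrix
    X1 = np.hstack([np.ones((len(X), 1)), X])          # model matrix with intercept column
    B_hat, *_ = np.linalg.lstsq(X1, Y, rcond=None)     # (X^T X)^{-1} X^T Y via least squares
    return B_hat

def classify_indicator_regression(B_hat, Xnew):
    F = np.hstack([np.ones((len(Xnew), 1)), Xnew]) @ B_hat  # fitted K-vector f_hat(x) per row
    return F.argmax(axis=1)                                  # largest component wins
```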
What is the rationale for this approach? One formal justification is to view the regression as an estimate of conditional expectation: for the random variable $Y_k$, $E(Y_k|X=x)=\Pr(G=k|X=x)$, so estimating the conditional expectation of each of the $Y_k$ seems a sensible goal.
A more simplistic viewpoint is to construct targets $t_k$ for each class, where $t_k$ is the $k$th column of the $K\times K$ identity matrix, and to fit the linear model by least squares so as to reproduce the appropriate target for each observation.
- The closest target classification rule (4.6) is easily seen to be exactly the same as the maximum fitted component criterion (4.4).
There is a serious problem with the regression approach when the number of classes $K\ge 3$, especially prevalent when $K$ is large. Because of the rigid nature of the regression model, classes can be masked by others.
For the cases in Figure 4.3, if there are $K\ge 3$ classes lined up, polynomial terms up to degree $K-1$ might be needed to resolve them; in a $p$-dimensional input space this would require general polynomial terms and cross-products of total degree $K-1$, $O(p^{K-1})$ terms in all, to resolve such worst-case scenarios.
Decision theory for classification (Section 2.4) tells us that we need to know the class posteriors $\Pr(G|X)$ for optimal classification. Suppose $f_k(x)$ is the class-conditional density of $X$ in class $G=k$, and let $\pi_k$ be the prior probability of class $k$, with $\sum_{k=1}^K\pi_k=1$. A simple application of Bayes theorem gives us
$$ \Pr(G=k|X=x)=\frac{f_k(x)\pi_k}{\sum_{\ell=1}^K f_{\ell}(x)\pi_{\ell}}. $$
Many techniques are based on models for the class densities:
- linear and quadratic discriminant analysis use Gaussian densities;
- more flexible mixtures of Gaussians allow for nonlinear decision boundaries (Section 6.8);
- general nonparametric density estimates for each class density allow the most flexibility (Section 6.6.2);
- Naive Bayes models are a variant of the previous case, and assume that each of the class densities are products of marginal densities; that is, they assume that the inputs are conditionally independent in each class (Section 6.6.3).
Suppose that we model each class density as multivariate Gaussian,
$$ f_k(x)=\frac{1}{(2\pi)^{p/2}|\Sigma_k|^{1/2}}\exp\Big(-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)\Big). $$
Linear discriminant analysis (LDA) arises in the special case when we assume that the classes have a common covariance matrix $\Sigma_k=\Sigma$ for all $k$. In comparing two classes $k$ and $\ell$, it is then sufficient to look at the log-ratio of their posteriors, which is linear in $x$; hence the decision boundary between each pair of classes is a hyperplane.
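Spelling out that log-ratio (the standard LDA algebra, added here for completeness):
$$
\log\frac{\Pr(G=k|X=x)}{\Pr(G=\ell|X=x)}
=\log\frac{\pi_k}{\pi_\ell}
-\frac{1}{2}(\mu_k+\mu_\ell)^T\Sigma^{-1}(\mu_k-\mu_\ell)
+x^T\Sigma^{-1}(\mu_k-\mu_\ell),
$$
an equation linear in $x$. Equivalently, we classify to the class with the largest linear discriminant function
$$
\delta_k(x)=x^T\Sigma^{-1}\mu_k-\frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k+\log\pi_k.
$$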
Notice that the decision boundaries are not the perpendicular bisectors of the line segments joining the centroids. This (perpendicular) would be the case if the covariance $\Sigma$ were spherical, $\sigma^2\mathbf{I}$, and the class priors were equal.
In practice we do not know the parameters of the Gaussian distributions, and will need to estimate them using our training data:
- $\hat{\pi}_k=N_k/N$, where $N_k$ is the number of class-$k$ observations;
- $\hat{\mu}_k=\sum_{g_i=k}x_i/N_k$;
- $\hat{\Sigma}=\sum_{k=1}^K\sum_{g_i=k}(x_i-\hat{\mu}_k)(x_i-\hat{\mu}_k)^T/(N-K)$.
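A minimal numpy sketch of these plug-in estimates and the resulting classification rule (our illustration; `X` and `y` are assumed to be numpy arrays):

```python
import numpy as np

def lda_fit(X, y):
    """Plug-in estimates pi_hat, mu_hat and the pooled covariance Sigma_hat."""
    classes, counts = np.unique(y, return_counts=True)
    N, K = len(y), len(classes)
    pi_hat = counts / N
    mu_hat = np.array([X[y == k].mean(axis=0) for k in classes])
    Sigma_hat = sum((X[y == k] - m).T @ (X[y == k] - m)
                    for k, m in zip(classes, mu_hat)) / (N - K)
    return classes, pi_hat, mu_hat, Sigma_hat

def lda_predict(model, Xnew):
    classes, pi_hat, mu_hat, Sigma_hat = model
    Sinv = np.linalg.inv(Sigma_hat)
    # delta_k(x) = x^T Sinv mu_k - 0.5 * mu_k^T Sinv mu_k + log pi_k
    scores = (Xnew @ Sinv @ mu_hat.T
              - 0.5 * np.einsum('kp,pq,kq->k', mu_hat, Sinv, mu_hat)
              + np.log(pi_hat))
    return classes[np.argmax(scores, axis=1)]
```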
With two classes, the LDA rule classifies to class 2 if
$$ x^T\hat{\Sigma}^{-1}(\hat{\mu}_2-\hat{\mu}_1) > \frac{1}{2}(\hat{\mu}_2+\hat{\mu}_1)^T\hat{\Sigma}^{-1}(\hat{\mu}_2-\hat{\mu}_1)-\log(N_2/N_1), $$
and to class 1 otherwise.
Suppose we code the targets in the two classes as $+1$ and $-1$ respectively; then the coefficient vector from a least squares fit of these targets on $x$ is proportional to the LDA direction $\hat{\Sigma}^{-1}(\hat{\mu}_2-\hat{\mu}_1)$ (Exercise 4.2).
Since this derivation of the LDA direction via least squares does not use a Gaussian assumption for the features, its applicability extends beyond the realm of Gaussian data. However the derivation of the particular intercept or cut-point given in (4.11) does require Gaussian data. Thus it makes sense to instead choose the cut-point that empirically minimizes training error for a given dataset. This is something we have found to work well in practice, but have not seen it mentioned in the literature.
With more than two classes, LDA is not the same as linear regression of the class indicator matrix, and it avoids the masking problems associated with that approach (Hastie et al., 1994). A correspondence between regression and LDA can be established through the notion of optimal scoring, discussed in Section 12.5.
Getting back to the general discriminant problem (4.8), if the $\Sigma_k$ are not assumed to be equal, then the convenient cancellations do not occur; in particular the pieces quadratic in $x$ remain. We then get quadratic discriminant functions (QDA),
$$ \delta_k(x)=-\frac{1}{2}\log|\Sigma_k|-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)+\log\pi_k, $$
and the decision boundary between each pair of classes $k$ and $\ell$ is described by the quadratic equation $\{x:\delta_k(x)=\delta_{\ell}(x)\}$.
Figure 4.6 shows an example (from Figure 4.1 on page 103) where the three classes are Gaussian mixtures (Section 6.8) and the decision boundaries are approximated by quadratic equations in $x$.
For this figure and many similar figures in the book we compute the decision boundaries by an exhaustive contouring method. We compute the decision rule on a fine lattice of points, and then use contouring algorithms to compute the boundaries.
When $p$ is large, QDA requires estimating a separate $p\times p$ covariance matrix for each class, a dramatic increase in the number of parameters relative to LDA.
Friedman (1989) proposed a compromise between LDA and QDA, which allows one to shrink the separate covariances of QDA toward a common covariance as in LDA. These methods are very similar in flavor to ridge regression. The regularized covariance matrices have the form
$$ \hat{\Sigma}_k(\alpha)=\alpha\hat{\Sigma}_k+(1-\alpha)\hat{\Sigma}, $$
where $\hat{\Sigma}$ is the pooled covariance matrix as used in LDA and $\alpha\in[0,1]$ allows a continuum of models between LDA and QDA; in practice $\alpha$ is chosen based on performance on validation data or by cross-validation.
Figure 4.7 shows the results of RDA applied to the vowel data.
Similar modifications allow $\hat{\Sigma}$ itself to be shrunk toward the scalar covariance,
$$ \hat{\Sigma}(\gamma)=\gamma\hat{\Sigma}+(1-\gamma)\hat{\sigma}^2\mathbf{I}, $$
for $\gamma\in[0,1]$. Replacing $\hat{\Sigma}$ by $\hat{\Sigma}(\gamma)$ in the formula above leads to a more general family of covariances $\hat{\Sigma}(\alpha,\gamma)$ indexed by a pair of parameters.
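A small sketch of these regularized estimates (ours; it assumes per-class sample covariances `Sigmas` and class sizes `Ns`, and uses the average eigenvalue $\mathrm{tr}(\hat{\Sigma})/p$ as the scalar variance $\hat{\sigma}^2$):

```python
import numpy as np

def rda_covariances(Sigmas, Ns, alpha, gamma=1.0):
    """Sigma_k(alpha, gamma) = alpha * Sigma_k + (1 - alpha) * Sigma(gamma),
    where Sigma(gamma) = gamma * Sigma_pooled + (1 - gamma) * sigma2 * I."""
    N, K = sum(Ns), len(Ns)
    Sigma = sum((n - 1) * S for n, S in zip(Ns, Sigmas)) / (N - K)   # pooled (LDA) covariance
    p = Sigma.shape[0]
    sigma2 = np.trace(Sigma) / p                                     # scalar shrinkage target
    Sigma_gamma = gamma * Sigma + (1 - gamma) * sigma2 * np.eye(p)
    return [alpha * S + (1 - alpha) * Sigma_gamma for S in Sigmas]
```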
In Chapter 12, we discuss other regularized versions of LDA, which are more suitable when the data arise from digitized analog signals and images. In Chapter 18 we also deal with very high-dimensional problems, where for example the features are gene-expression measurements in microarray studies. There the methods focus on the case $p\gg N$.
We briefly digress on the computations required for LDA and especially QDA. Computing the eigen-decomposition for each class,
$$ \hat{\Sigma}_k=\mathbf{U}_k\mathbf{D}_k\mathbf{U}_k^T, $$
where $\mathbf{U}_k$ is $p\times p$ orthonormal and $\mathbf{D}_k$ is diagonal with positive eigenvalues $d_{k\ell}$, the ingredients for $\delta_k(x)$ are:
- $(x-\hat{\mu}_k)^T\hat{\Sigma}_k^{-1}(x-\hat{\mu}_k)=\big[\mathbf{U}_k^T(x-\hat{\mu}_k)\big]^T\mathbf{D}_k^{-1}\big[\mathbf{U}_k^T(x-\hat{\mu}_k)\big]$;
- $\log|\hat{\Sigma}_k|=\sum_{\ell}\log d_{k\ell}$.
In light of the computational steps outlined above, the LDA classifier can be implemented by the following pair of steps:
- Sphere the data with respect to the common covariance estimate $\hat{\Sigma}=\mathbf{U}\mathbf{D}\mathbf{U}^T$: $X^*\leftarrow\mathbf{D}^{-\frac{1}{2}}\mathbf{U}^TX$. The common covariance estimate of $X^*$ will now be the identity, since
$$ \hat{\Sigma}^* = \mathbf{D}^{-\frac{1}{2}}\mathbf{U}^T\hat{\Sigma}\mathbf{U}\mathbf{D}^{-\frac{1}{2}} = \mathbf{D}^{-\frac{1}{2}}\mathbf{U}^T\mathbf{U}\mathbf{D}\mathbf{U}^T\mathbf{U}\mathbf{D}^{-\frac{1}{2}} = \mathbf{I}. $$
- Classify to the closest class centroid in the transformed space, modulo the effect of the class prior probabilities $\pi_k$:
$$ \delta^*_k(x)=-\frac{1}{2}\|x^*-\hat{\mu}^*_k\|_2^2+\log \pi_k. $$
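The two steps translate directly into a short numpy sketch (ours):

```python
import numpy as np

def lda_by_sphering(X, y, Xnew):
    """Step 1: sphere with Sigma_hat = U D U^T, i.e. x* = D^{-1/2} U^T x.
    Step 2: classify to the closest sphered centroid, corrected by log pi_k."""
    classes, counts = np.unique(y, return_counts=True)
    N, K = len(y), len(classes)
    mu = np.array([X[y == k].mean(axis=0) for k in classes])
    Sigma = sum((X[y == k] - m).T @ (X[y == k] - m)
                for k, m in zip(classes, mu)) / (N - K)
    d, U = np.linalg.eigh(Sigma)                    # Sigma_hat = U D U^T
    T = U / np.sqrt(d)                              # right-multiplication spheres rows: x* = x U D^{-1/2}
    Xs, mus = Xnew @ T, mu @ T
    dist2 = ((Xs[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)
    delta = -0.5 * dist2 + np.log(counts / N)       # delta*_k(x) = -0.5 ||x* - mu*_k||^2 + log pi_k
    return classes[np.argmax(delta, axis=1)]
```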
LDA as a restricted Gaussian classifier allows us to view informative low-dimensional projections of the data. The $K$ centroids in $p$-dimensional input space lie in an affine subspace of dimension at most $K-1$, and if $p$ is much larger than $K$, this is a considerable drop in dimension. Moreover, in locating the closest centroid, we can ignore distances orthogonal to this subspace, since they contribute equally to each class; thus we might as well project the $X^*$ onto this centroid-spanning subspace $H_{K-1}$ and make distance comparisons there.
We might then ask for an $L<K-1$ dimensional subspace $H_L\subseteq H_{K-1}$ that is optimal for LDA in some sense, where optimal means that the projected centroids are spread out as much as possible in terms of variance.
Figure 4.4 shows such an optimal two-dimensional subspace for the vowel data. Here there are eleven classes, each a different vowel sound, in a ten-dimensional input space. The centroids require the full space in this case, since $K-1=p$, but the figure shows an optimal two-dimensional subspace.
Figure 4.8 shows four additional pairs of coordinates, also known as canonical or discriminant variables.
In summary, finding the sequences of optimal subspaces for LDA involves the following steps:
- compute the $K\times p$ matrix of class centroids $\mathbf{M}$ and the common covariance matrix $\mathbf{W}$ (for within-class covariance);
- compute $\mathbf{M}^*=\mathbf{M}\mathbf{W}^{-\frac{1}{2}}$ using the eigen-decomposition of $\mathbf{W}$;
- compute $\mathbf{B}^*$, the covariance matrix of $\mathbf{M}^*$ ($\mathbf{B}$ for between-class covariance), and its eigen-decomposition $\mathbf{B}^*=\mathbf{V}^*\mathbf{D}_B\mathbf{V}^{*T}$. The columns $v_{\ell}^*$ of $\mathbf{V}^*$ in sequence from first to last define the coordinates of the optimal subspaces.
Combining all these operations, the $\ell$th discriminant variable is given by $Z_{\ell}=v_{\ell}^TX$ with $v_{\ell}=\mathbf{W}^{-\frac{1}{2}}v_{\ell}^*$.
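The steps above, written out as a numpy sketch (ours); it returns the first $L$ discriminant variables $Z_{\ell}=v_{\ell}^TX$ for each observation:

```python
import numpy as np

def discriminant_coordinates(X, y, L=2):
    """Reduced-rank LDA coordinates: sphere the centroids with W, then
    eigen-decompose the between-class covariance of the sphered centroids."""
    classes = np.unique(y)
    K, (N, p) = len(classes), X.shape
    M = np.array([X[y == k].mean(axis=0) for k in classes])          # K x p centroid matrix
    W = sum((X[y == k] - m).T @ (X[y == k] - m)
            for k, m in zip(classes, M)) / (N - K)                    # within-class covariance
    d, U = np.linalg.eigh(W)
    W_inv_sqrt = U @ np.diag(1.0 / np.sqrt(d)) @ U.T                  # W^{-1/2}
    Mstar = M @ W_inv_sqrt                                            # sphered centroids M*
    Bstar = np.cov(Mstar.T)                                           # between-class covariance B*
    dB, Vstar = np.linalg.eigh(Bstar)
    Vstar = Vstar[:, np.argsort(dB)[::-1]]                            # order by decreasing eigenvalue
    V = W_inv_sqrt @ Vstar                                            # v_l = W^{-1/2} v*_l
    return X @ V[:, :L]                                               # Z_l = v_l^T X per observation
```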
Fisher arrived at this decomposition via a different route, without referring to Gaussian distributions at all. He posed the problem:
Find the linear combination $Z = a^TX$ such that the between-class variance is maximized relative to the within-class variance.
The between-class variance is the variance of the class means of $Z$, and the within-class variance is the pooled variance about the means. In matrix terms, the between-class covariance is
$$ \mathbf{B} = \sum_{k=1}^{K}\sum_{g_{i}=k}(\hat{\mu}_{k}-\hat{\mu})(\hat{\mu}_{k}-\hat{\mu})^{T}/(K-1), $$
the within-class covariance is the pooled covariance about the class means,
$$ \mathbf{W} = \sum_{k=1}^{K}\sum_{g_{i}=k}(x_{i}-\hat{\mu}_{k})(x_{i}-\hat{\mu}_{k})^{T}/(N-K), $$
and the total covariance matrix of $X$, ignoring class information, is
$$ \mathbf{T} = \sum_{k=1}^K\sum_{g_i=k}(x_i-\hat{\mu})(x_i-\hat{\mu})^T/(N-1). $$
It is easy to prove that $(N-1)\mathbf{T}=(N-K)\mathbf{W}+(K-1)\mathbf{B}$. The between-class variance of $Z=a^TX$ is $a^T\mathbf{B}a$ and the within-class variance is $a^T\mathbf{W}a$.
Figure 4.9 shows why this criterion makes sense.
Fisher's problem therefore amounts to maximizing the Rayleigh quotient,
$$ \max_a \frac{a^T\mathbf{B}a}{a^T\mathbf{W}a}, $$
which can be rewritten, after the convenient change of basis $a^*=\mathbf{W}^{\frac{1}{2}}a$ (so that $a^T\mathbf{B}a = a^{*T}\mathbf{W}^{-1/2}\mathbf{B}\mathbf{W}^{-1/2}a^* = a^{*T}\mathbf{B}^*a^*$ and $a^T\mathbf{W}a=a^{*T}a^*$), as
$$ \min_{a^*}-\frac{1}{2}a^{*T}\mathbf{B}^*a^* \quad\text{subject to } a^{*T}a^*=1. $$
The Lagrangian for this problem is
$$ L=-\frac{1}{2}a^{*T}\mathbf{B}^*a^*+\frac{1}{2}\lambda(a^{*T}a^*-1), $$
and the Karush-Kuhn-Tucker conditions give
$$ \mathbf{B}^*a^*=\lambda a^* \iff \mathbf{W}^{-1}\mathbf{B}a=\lambda a. $$
Thus the optimal $a^*$ is the eigenvector of $\mathbf{B}^*$ corresponding to the largest eigenvalue, that is $v_{1}^*$, and the maximal Rayleigh quotient is given by the largest eigenvalue of $\mathbf{W}^{-1}\mathbf{B}$. Similarly one can find the next direction $v_{2}^*$, and so on, with $v_{\ell}=\mathbf{W}^{-\frac{1}{2}}v_{\ell}^*$. It is not hard to show (Exercise 4.1) that the optimal $a$ is identical to the direction $v_1$ defined above.
The $a_{\ell}$ are referred to as discriminant coordinates, or canonical variates, since an alternative derivation of these results is through a canonical correlation analysis of the indicator response matrix $\mathbf{Y}$ on the predictor matrix $\mathbf{X}$.
To summarize the developments so far:
- Gaussian classification with common covariances leads to linear decision boundaries. Classification can be achieved by sphering the data with respect to $\mathbf{W}$, and classifying to the closest centroid (modulo $\log\pi_k$) in the sphered space.
- Since only the relative distances to the centroids count, one can confine the data to the subspace spanned by the centroids in the sphered space.
- This subspace can be further decomposed into successively optimal subspaces in terms of centroid separation. This decomposition is identical to the decomposition due to Fisher.
One can show that this is a Gaussian classification rule with the additional restriction that the centroids of the Gaussians lie in an $L$-dimensional subspace of $\mathbb{R}^p$ (Exercise 4.8).
Gaussian classification dictates the $\log\pi_k$ correction factor in the distance calculation. The reason for this correction can be seen in Figure 4.9. The misclassification rate is based on the area of overlap between the two densities. If the $\pi_k$ are equal, the optimal cut-point is midway between the projected means; if they are not equal, moving the cut-point toward the smaller class will improve the error rate.
Figure 4.10 shows the results. Figure 4.11 shows the decision boundaries for the classifier based on the two-dimensional LDA solution.
There is a close connection between Fisher's reduced-rank discriminant analysis and regression of an indicator response matrix. It turns out that LDA amounts to the regression followed by an eigen-decomposition of $\hat{\mathbf{Y}}^T\mathbf{Y}$; in the case of two classes, there is a single discriminant variable that is identical, up to a scalar multiple, to either of the columns of $\hat{\mathbf{Y}}$ (see Exercise 4.3).
The logistic regression model arises from the desire to model the posterior probabilities of the $K$ classes via linear functions in $x$, while at the same time ensuring that they sum to one and remain in $[0,1]$.
Logistic regression models are usually fit by maximum likelihood, using the conditional likelihood of $G$ given $X$. Since $\Pr(G|X)$ completely specifies the conditional distribution, the multinomial distribution is appropriate.
In the two-class case the algorithm simplifies considerably. It is convenient to code the two-class response $g_i$ via a 0/1 variable $y_i$, where $y_i=1$ when $g_i=1$ and $y_i=0$ when $g_i=2$, so that the log-likelihood can be written
$$ \ell(\beta)=\sum_{i=1}^N\big\{y_i\log p(x_i;\beta)+(1-y_i)\log(1-p(x_i;\beta))\big\}. $$
Let $\mathbf{y}$ denote the vector of $y_i$ values, $\mathbf{X}$ the $N\times(p+1)$ matrix of $x_i$ values (with a leading 1 for the intercept), $\mathbf{p}$ the vector of fitted probabilities $p(x_i;\beta^{\text{old}})$, and $\mathbf{W}$ the $N\times N$ diagonal weight matrix with entries $p(x_i;\beta^{\text{old}})(1-p(x_i;\beta^{\text{old}}))$. Then the Newton-Raphson step for maximizing $\ell(\beta)$ can be expressed as a weighted least squares step,
$$ \beta^{\text{new}}=(\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}\mathbf{z},\qquad \mathbf{z}=\mathbf{X}\beta^{\text{old}}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p}), $$
with $\mathbf{z}$ known as the adjusted response; the algorithm is referred to as iteratively reweighted least squares (IRLS).
For the multiclass case ($K\ge 3$) the Newton algorithm can also be expressed as an iteratively reweighted least squares algorithm, but with a vector of $K-1$ responses and a nondiagonal weight matrix per observation (Exercise 4.4).
Alternatively coordinate-descent methods (Section 3.8.6) can be used to maximize the log-likelihood efficiently.
Logistic regression models are used mostly as a data analysis and inference tool, where the goal is to understand the role of the input variables in explaining the outcome. Typically many models are fit in a search for a parsimonious model involving a subset of the variables, possibly with some interaction terms. The following example illustrates some of the issues involved.
- Example: South African Heart Disease
The maximum-likelihood parameter estimates $\hat{\beta}$ satisfy a self-consistency relationship: they are the coefficients of a weighted least squares fit, where the responses are the adjusted responses $z_i$ and the weights are $\hat{p}_i(1-\hat{p}_i)$, both depending on $\hat{\beta}$ itself. Apart from providing a convenient algorithm, this connection with least squares has more to offer:
- The weighted residual sum-of-squares is the familiar Pearson chi-square statistic
$$ \sum_{i=1}^N\frac{(y_i-\hat{p}_i)^2}{\hat{p}_i(1-\hat{p}_i)}, $$
a quadratic approximation to the deviance.
- Asymptotic likelihood theory says that if the model is correct, then $\hat{\beta}$ is consistent (i.e., converges to the true $\beta$).
- A central limit theorem then shows that the distribution of $\hat{\beta}$ converges to $N(\beta,(\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1})$. This and other asymptotics can be derived directly from the weighted least squares fit by mimicking normal theory inference. For the weighted least squares fit, the estimated parameter values are linear combinations of the observed values, so an expression for the estimated variance-covariance matrix of the parameter estimates can be obtained by error propagation from the errors in the observations. Let the variance-covariance matrix of the observations be denoted by $\mathbf{M}$ and that of the estimated parameters by $\mathbf{M}^{\beta}$. Then
$$ \mathbf{M}^{\beta} = (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}\mathbf{M}\mathbf{W}^T\mathbf{X}(\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}. $$
When $\mathbf{W}=\mathbf{M}^{-1}$, this simplifies to
$$ \mathbf{M}^{\beta} = (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}. $$
When unit weights are used ($\mathbf{W}=\mathbf{I}$, the identity matrix), it is implied that the experimental errors are uncorrelated and all equal: $\mathbf{M}=\sigma^2\mathbf{I}$, where $\sigma^2$ is the a priori variance of an observation.
- Model building can be costly for logistic regression models, because each model fitted requires iteration. Popular shortcuts are the Rao score test, which tests for inclusion of a term, and the Wald test, which can be used to test for exclusion of a term. Neither of these requires iterative fitting; both are based on the maximum-likelihood fit of the current model. It turns out that both amount to adding or dropping a term from the weighted least squares fit, using the same weights, and such computations can be done efficiently without recomputing the entire weighted least squares fit.
- GLM (generalized linear model) objects can be treated as linear model objects, and all the tools available for linear models can be applied automatically.
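As an illustration of the weighted-least-squares view (our sketch, not the book's code), here is a bare-bones Newton-Raphson / IRLS fit for the two-class model; it also returns the estimated covariance $(\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}$ used for the Wald-type inference described above:

```python
import numpy as np

def logistic_irls(X, y, max_iter=25, tol=1e-8):
    """Two-class logistic regression via iteratively reweighted least squares.
    X: N x p inputs (an intercept column is prepended here); y: 0/1 responses."""
    X1 = np.hstack([np.ones((len(X), 1)), X])
    beta = np.zeros(X1.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X1 @ beta))        # fitted probabilities
        w = np.clip(p * (1 - p), 1e-10, None)       # IRLS weights (diagonal of W)
        z = X1 @ beta + (y - p) / w                 # adjusted response
        XtWX = X1.T @ (w[:, None] * X1)
        beta_new = np.linalg.solve(XtWX, X1.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    cov_beta = np.linalg.inv(XtWX)                  # asymptotic Var(beta_hat) = (X^T W X)^{-1}
    return beta, cov_beta
```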
For logistic regression, we would maximize a penalized version of the log-likelihood, using an $L_1$ penalty as in the lasso:
$$\tag{4.31}
\max_{\beta_0,\beta} \bigg\{\sum_{i=1}^N\big[y_i\log p(x_i;\beta)+(1-y_i)\log(1-p(x_i;\beta))\big]-\lambda\sum_{j=1}^p|\beta_j|\bigg\}.
$$
As with the lasso, we typically do not penalize the intercept term, and standardize the predictors for the penalty to be meaningful.
Figure 4.13 shows the $L_1$ regularization path for the South African heart disease data of Section 4.4.2.
Coordinate descent methods (Section 3.8.6) are very efficient for computing the coefficient profiles on a grid of values for $\lambda$.
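For instance (a sketch using scikit-learn, which is not mentioned in the text), an approximate coefficient path can be traced by refitting over a grid of penalties; scikit-learn's `C` plays the role of $1/\lambda$, and `X` (standardized) and `y` (0/1) are assumed to exist:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

lambdas = np.logspace(-4, 1, 30)
path = np.array([
    LogisticRegression(penalty="l1", C=1.0 / lam, solver="saga", max_iter=5000)
        .fit(X, y).coef_.ravel()
    for lam in lambdas
])                                  # each row: coefficient profile at one lambda
```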
In Section 4.3 we found that the log-posterior odds between class $k$ and class $K$ are linear functions of $x$; this is exactly the linear form of the logits posited by the logistic regression model.
It seems that the models are the same. Although they have exactly the same form, the difference lies in the way the linear coefficients are estimated. The logistic regression model is more general, in that it makes fewer assumptions.
We can write the joint density of $X$ and $G$ as
$$ \Pr(X,G=k)=\Pr(X)\Pr(G=k|X), $$
where $\Pr(X)$ denotes the marginal density of the inputs $X$. The logistic regression model leaves the marginal density of $X$ as an arbitrary density function, and fits the parameters of $\Pr(G|X)$ by maximizing the conditional likelihood.
With LDA we fit the parameters by maximizing the full log-likelihood based on the joint density
$$ \Pr(X,G=k)=\phi(X;\mu_k,\Sigma)\pi_k, $$
where $\phi$ is the Gaussian density function. Standard normal theory leads easily to the estimates $\hat{\mu}_k$, $\hat{\Sigma}$ and $\hat{\pi}_k$ given in Section 4.3. Here, unlike logistic regression, the marginal density of $X$ does play a role: it is the mixture density $\Pr(X)=\sum_{k=1}^K\pi_k\phi(X;\mu_k,\Sigma)$, which also involves the parameters.
What do we gain from the additional model assumptions? By relying on them, we have more information about the parameters, and hence can estimate them more efficiently (lower variance). If in fact the true $f_k(x)$ are Gaussian, then in the worst case ignoring the marginal part of the likelihood constitutes a loss of efficiency of about 30% asymptotically in the error rate (Efron, 1975).
For example, observations far from the decision boundary (which are down-weighted by logistic regression) play a role in estimating the common covariance matrix. This is not all good news, because it also means that LDA is not robust to gross outliers.
From the mixture formulation above, it is clear that even observations without class labels carry information about the parameters. Often it is expensive to generate class labels, but unclassified observations come cheaply; by relying on strong model assumptions such as these, we can use both types of information.
The marginal likelihood can be thought of as a regularizer, requiring in some sense that class densities be visible from this marginal view. For example, if the data in a two-class logistic regression model can be perfectly separated by a hyperplane, the maximum likelihood estimates of the parameters are undefined (i.e., infinite; see Exercise 4.5). The LDA coefficients for the same data will be well defined, since the marginal likelihood will not permit these degeneracies.
In practice these assumptions are never correct, and often some of the components of $X$ are qualitative variables. It is generally felt that logistic regression is a safer, more robust bet than the LDA model, relying on fewer assumptions. It is our experience that the models give very similar results, even when LDA is used inappropriately, such as with qualitative predictors.
Separating hyperplane classifiers construct linear decision boundaries that explicitly try to separate the data into different classes as well as possible. They provide the basis for support vector classifiers, discussed in Chapter 12.
Figure 4.14 shows 20 data points in two classes in $\mathbb{R}^2$. These data can be separated by a linear boundary; included in the figure are two of the infinitely many possible separating hyperplanes.
The orange line is the least squares solution to the problem, obtained by regressing the $-1/1$ response $Y$ on $X$ (with intercept); the line is given by $\{x:\hat{\beta}_0+\hat{\beta}_1x_1+\hat{\beta}_2x_2=0\}$.
This least squares solution does not do a perfect job in separating the points, and makes one error. This is the same boundary found by LDA, in light of its equivalence with linear regression in the two-class case (Section 4.3 and Exercise 4.2).
Classifiers such as (4.39), that compute a linear combination of the input features and return the sign, were called perceptrons in the engineering literature in the late 1950s.
Before we continue, let us digress slightly and review some vector algebra. Figure 4.15 depicts a hyperplane or affine set $L$ defined by the equation $f(x)=\beta_0+\beta^Tx=0$; since we are in $\mathbb{R}^2$ this is a line.
Here we list some properties:
- For any two points $x_1$ and $x_2$ lying in $L$, $\beta^T(x_1-x_2)=0$, and hence $\beta^*=\beta/\|\beta\|$ is the vector normal to the surface of $L$.
- For any point $x_0$ in $L$, $\beta^Tx_0=-\beta_0$.
- The signed distance of any point $x$ to $L$ is given by
$$ \beta^{*T}(x-x_0)=\frac{1}{\|\beta\|}(\beta^Tx+\beta_0)=\frac{1}{\|f'(x)\|}f(x). $$
Hence $f(x)$ is proportional to the signed distance from $x$ to the hyperplane defined by $f(x)=0$.
The perceptron learning algorithm tries to find a separating hyperplane by minimizing the distance of misclassified points to the decision boundary. If a response $y_i=1$ is misclassified, then $x_i^T\beta+\beta_0<0$, and the opposite for a misclassified response with $y_i=-1$. The goal is to minimize
$$ D(\beta,\beta_0)=-\sum_{i\in\mathcal{M}}y_i(x_i^T\beta+\beta_0), $$
where $\mathcal{M}$ indexes the set of misclassified points; this quantity is nonnegative and proportional to the distance of the misclassified points to the decision boundary. The algorithm uses stochastic gradient descent: misclassified observations are visited in some sequence, and the parameters are updated via
$$ \begin{pmatrix}\beta\\ \beta_0\end{pmatrix}\leftarrow\begin{pmatrix}\beta\\ \beta_0\end{pmatrix}+\rho\begin{pmatrix}y_ix_i\\ y_i\end{pmatrix}, $$
where $\rho$ is the learning rate, which can be taken to be 1 without loss of generality.
If the classes are linearly separable, it can be shown that the algorithm converges to a separating hyperplane in a finite number of steps (Exercise 4.6).
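A bare-bones sketch of these updates (ours), with $y_i\in\{-1,+1\}$ and learning rate $\rho$:

```python
import numpy as np

def perceptron(X, y, rho=1.0, max_epochs=1000):
    """Perceptron updates: for each misclassified (x_i, y_i),
    beta <- beta + rho * y_i * x_i and beta0 <- beta0 + rho * y_i."""
    beta, beta0 = np.zeros(X.shape[1]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ beta + beta0) <= 0:   # misclassified (or on the boundary)
                beta += rho * yi * xi
                beta0 += rho * yi
                mistakes += 1
        if mistakes == 0:                        # a separating hyperplane has been found
            break
    return beta, beta0
```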
Figure 4.14 shows two solutions to a toy problem, each started at a different random guess.
There are a number of problems with this algorithm, summarized in Ripley (1996):
- When the data are separable, there are many solutions, and which one is found depends on the starting values.
- The “finite” number of steps can be very large. The smaller the gap, the longer the time to find it.
- When the data are not separable, the algorithm will not converge, and cycles develop. The cycles can be long and therefore hard to detect.
The second problem can often be eliminated by seeking a hyperplane not in the original space, but in a much enlarged space obtained by creating many basis-function transformations of the original variables. This is analogous to driving the residuals in a polynomial regression problem down to zero by making the degree sufficiently large. Perfect separation cannot always be achieved: for example, if observations from two different classes share the same input. It may not be desirable either, since the resulting model is likely to be overfit and will not generalize well. We return to this point at the end of the next section.
A rather elegant solution to the first problem is to add additional constraints to the separating hyperplane.
The optimal separating hyperplane separates the two classes by maximizing the margin between them on the training data. We need to generalize criterion (4.41). Consider the optimization problem
$$ \max_{\beta,\beta_0,\|\beta\|=1} M \quad\text{subject to } y_i(x_i^T\beta+\beta_0)\ge M,\ i=1,\dots,N, $$
which can be rephrased more conveniently as
$$ \min_{\beta,\beta_0}\frac{1}{2}\|\beta\|^2 \quad\text{subject to } y_i(x_i^T\beta+\beta_0)\ge 1,\ i=1,\dots,N, $$
so that $M=1/\|\beta\|$ is the width of the margin on either side of the hyperplane. This is a convex optimization problem; the Karush-Kuhn-Tucker conditions for its Lagrangian (with multipliers $\alpha_i\ge 0$) include $\beta=\sum_i\alpha_iy_ix_i$, $\sum_i\alpha_iy_i=0$ and the complementary-slackness conditions $\alpha_i[y_i(x_i^T\beta+\beta_0)-1]=0$ for all $i$. From these we can see that
- if $\alpha_i>0$, then $y_i(x_i^T\beta+\beta_0)=1$, or in other words, $x_i$ is on the boundary of the slab;
- if $y_i(x_i^T\beta+\beta_0)>1$, then $x_i$ is not on the boundary of the slab, and $\alpha_i=0$.
From (4.50) we see that the solution vector $\beta$ is defined in terms of a linear combination of the support points $x_i$, that is, those points on the boundary of the slab for which $\alpha_i>0$.
The optimal separating hyperplane produces a function $\hat{f}(x)=x^T\hat{\beta}+\hat{\beta}_0$ for classifying new observations: $\hat{G}(x)=\operatorname{sign}\hat{f}(x)$.
Although none of the training observations fall in the margin (by construction), this will not necessarily be the case for test observations. The intuition is that a large margin on the training data will lead to good separation on the test data.
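In practice the solution can be obtained with a quadratic-programming or SVM solver. For example (a sketch, assuming linearly separable numpy arrays `X` and `y` with labels $\pm 1$), a linear support vector machine with a very large cost parameter approximates the hard-margin solution:

```python
import numpy as np
from sklearn.svm import SVC

svm = SVC(kernel="linear", C=1e6).fit(X, y)     # very large C ~ hard margin
beta, beta0 = svm.coef_.ravel(), svm.intercept_[0]
margin = 1.0 / np.linalg.norm(beta)             # M = 1 / ||beta||
support_points = svm.support_                   # indices i with alpha_i > 0
G_hat = np.sign(X @ beta + beta0)               # sign of f_hat(x) on the training points
```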
Relation to LDA: The description of the solution in terms of support points seems to suggest that the optimal hyperplane focuses more on the points that count, and is more robust to model misspecification. The LDA solution, on the other hand, depends on all of the data, even points far away from the decision boundary. Note, however, that the identification of these support points required the use of all the data. Of course, if the classes are really Gaussian, then LDA is optimal, and separating hyperplanes will pay a price for focusing on the (noisier) data at the boundaries of the classes.
Relation to logistic regression: When a separating hyperplane exists, logistic regression will always find it, since the log-likelihood can be driven to 0 in this case (Exercise 4.5). The logistic regression solution shares some other qualitative features with the separating hyperplane solution. The coefficient vector is defined by a weighted least squares fit of a zero-mean linearized response on the input features, and the weights are larger for points near the decision boundary than for those further away.
When the data are not separable, there will be no feasible solution to this problem, and an alternative formulation is needed. Again one can enlarge the space using basis transformations, but this can lead to artificial separation through over-fitting. In Chapter 12 we discuss a more attractive alternative known as the support vector machine, which allows for overlap, but minimizes a measure of the extent of this overlap.