Machine learning study notes covering the math behind all the mainstream tree-based machine learning models: basic decision tree models (ID3, C4.5, CART), boosting models (GBM, AdaBoost, XGBoost, LightGBM), and bagging models (Bagging Tree, Random Forest, ExtraTrees).
Adaptive Boosting (AdaBoost)
One Sentence Summary:
Sequentially add weak base learners to the existing model, adaptively adjusting the sample weights and the weight assigned to each weak base learner.
-
a. What is Forward Stagewise Additive Modeling
Suppose b(x; r_m) is a base learner controlled by the parameter r_m.
beta_m is the coefficient that controls how each weak base learner is added, i.e. its weight in the final model.
Then the final model f(x) based on M weak base learners will be as below:

f(x) = \sum_{m=1}^{M} \beta_m b(x; r_m)
sign(x) is the function that converts real-valued outputs into actual classes {-1, 1}, and the output classifier will be as below:

G(x) = sign(f(x)) = sign\left( \sum_{m=1}^{M} \beta_m b(x; r_m) \right)
The global optimum for the dataset D with N samples is as below:

\min_{\{\beta_m, r_m\}_{m=1}^{M}} \sum_{i=1}^{N} Loss\left( y_i,\ \sum_{m=1}^{M} \beta_m b(x_i; r_m) \right)
For each step, the target is simply to find the beta_m and r_m of the current base learner so as to approximate the global optimum above:

(\beta_m, r_m) = \arg\min_{\beta, r} \sum_{i=1}^{N} Loss\left( y_i,\ f_{m-1}(x_i) + \beta\, b(x_i; r) \right)
To sum up, the Forward Stagewise Additive Modeling Algorithm is as below:
Inputs: dataset D = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, loss function Loss(y, f(x)), base learner set {b(x; r_m)}.
Outputs: final output function f(x).
Algorithm:

(1): Initialize f_0(x) = 0.
(2): For m = 1, 2, ..., M:

- compute the optimal parameters of the current base learner:

  (\beta_m, r_m) = \arg\min_{\beta, r} \sum_{i=1}^{N} Loss\left( y_i,\ f_{m-1}(x_i) + \beta\, b(x_i; r) \right)

- update the model: f_m(x) = f_{m-1}(x) + \beta_m b(x; r_m)

(3): Output the final model: f(x) = f_M(x) = \sum_{m=1}^{M} \beta_m b(x; r_m)
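To make the loop above concrete, here is a minimal Python sketch of forward stagewise additive modeling, assuming squared loss and depth-1 regression trees as the base learners b(x; r_m); the function name forward_stagewise and the default M = 50 are purely illustrative, not part of the notes.

```python
# Minimal sketch: forward stagewise additive modeling with squared loss.
# Each stage fits b_m to the current residual, then line-searches beta_m;
# earlier stages are never revisited (stagewise, not stepwise).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def forward_stagewise(X, y, M=50):
    """Build f(x) = sum_m beta_m * b_m(x) one stage at a time."""
    f = np.zeros(len(y))              # current predictions f_{m-1}(x_i)
    learners, betas = [], []
    for _ in range(M):
        residual = y - f              # under squared loss, b_m fits the residual
        b = DecisionTreeRegressor(max_depth=1).fit(X, residual)
        pred = b.predict(X)
        # closed-form line search for beta_m under squared loss
        beta = float(pred @ residual) / (float(pred @ pred) + 1e-12)
        f += beta * pred              # f_m(x) = f_{m-1}(x) + beta_m * b_m(x)
        learners.append(b)
        betas.append(beta)
    return learners, betas
```

Each iteration only optimizes the newest (beta_m, b_m) while all earlier stages stay fixed, which is exactly the stagewise property described above.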
b. What is the exponential loss function and why we use it in AdaBoost
Suppose y belongs to {-1, 1}; then the exponential loss function is as below:

Loss(y, f(x)) = e^{-y f(x)}

If we take the derivative of the expected loss and set it to 0, we will find that when minimizing the exponential loss function we are effectively fitting a logistic regression (the minimizer f^{*}(x) is one half of the log-odds \frac{1}{2} \log \frac{P(y=1 \mid x)}{P(y=-1 \mid x)}), and we will reach the optimal Bayes error rate:

sign(f^{*}(x)) = \arg\max_{y'} P(y' \mid x)

where:

f^{*}(x) = \arg\min_{f} E_{y \mid x}\left[ e^{-y f(x)} \right] = \frac{1}{2} \log \frac{P(y=1 \mid x)}{P(y=-1 \mid x)}
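Writing out the derivative step explicitly (a standard derivation, following The Elements of Statistical Learning):

```latex
% conditional expectation of the exponential loss
E\left[ e^{-y f(x)} \mid x \right]
  = P(y=1 \mid x)\, e^{-f(x)} + P(y=-1 \mid x)\, e^{f(x)}

% set the derivative with respect to f(x) to zero and solve
\frac{\partial E\left[ e^{-y f(x)} \mid x \right]}{\partial f(x)}
  = -P(y=1 \mid x)\, e^{-f(x)} + P(y=-1 \mid x)\, e^{f(x)} = 0
  \;\Longrightarrow\;
  f^{*}(x) = \frac{1}{2} \log \frac{P(y=1 \mid x)}{P(y=-1 \mid x)}
```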
-
c. Math behind AdaBoost - how to compute the optimal parameters
Suppose that we have finished m-1 iterations and successfully computed f_{m-1} as below:

f_{m-1}(x) = \sum_{j=1}^{m-1} \beta_j b_j(x)

Now we are at iteration m, and we want to find the optimal beta_m and b_m(x; r_m) (written b_m(x) for short) that minimize our exponential loss. Attention: the output of b_m(x) belongs to {-1, 1}, not a probability. Our target is as below:

(\beta_m, b_m) = \arg\min_{\beta, b} \sum_{i=1}^{N} \exp\left( -y_i \left( f_{m-1}(x_i) + \beta\, b(x_i) \right) \right) = \arg\min_{\beta, b} \sum_{i=1}^{N} w_{m,i} \exp\left( -y_i \beta\, b(x_i) \right)
where

w_{m,i} = \exp\left( -y_i f_{m-1}(x_i) \right)

depends on neither beta nor b(x), so at iteration m it acts as a fixed weight on sample i.
c.1. compute the optimal b_m(x)
Only the samples that b(x) misclassifies contribute the larger factor e^{\beta}; since beta > 0, we will have:

b_m(x) = \arg\min_{b} \sum_{i=1}^{N} w_{m,i}\, I\left( y_i \neq b(x_i) \right)

i.e. the optimal b_m(x) is the base learner that minimizes the weighted classification error under the current weights w_{m,i}.
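This step relies on splitting the weighted exponential loss over correctly and wrongly classified samples; spelled out:

```latex
\sum_{i=1}^{N} w_{m,i}\, e^{-\beta y_i b(x_i)}
  = e^{-\beta} \sum_{y_i = b(x_i)} w_{m,i} + e^{\beta} \sum_{y_i \neq b(x_i)} w_{m,i}
  = \left( e^{\beta} - e^{-\beta} \right) \sum_{i=1}^{N} w_{m,i}\, I\left( y_i \neq b(x_i) \right)
    + e^{-\beta} \sum_{i=1}^{N} w_{m,i}
```

For any fixed beta > 0, only the first term depends on b(x), so minimizing it is equivalent to minimizing the weighted error; the same expansion is reused in c.2 to solve for beta_m.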
c.2. compute the optimal beta_m
Plugging the optimal b_m(x) back in and using the expansion above, the loss becomes a function of beta alone, so in order to find the optimal beta we need to set the derivative to 0:

\frac{\partial}{\partial \beta} \left[ \left( e^{\beta} - e^{-\beta} \right) \sum_{i=1}^{N} w_{m,i}\, I\left( y_i \neq b_m(x_i) \right) + e^{-\beta} \sum_{i=1}^{N} w_{m,i} \right] = 0
\;\Longrightarrow\;
e^{2\beta} = \frac{\sum_{i=1}^{N} w_{m,i} - \sum_{i=1}^{N} w_{m,i}\, I\left( y_i \neq b_m(x_i) \right)}{\sum_{i=1}^{N} w_{m,i}\, I\left( y_i \neq b_m(x_i) \right)}
If we define the weighted error rate appearing on the right-hand side of the last equation as e_m, i.e.

e_m = \frac{\sum_{i=1}^{N} w_{m,i}\, I\left( y_i \neq b_m(x_i) \right)}{\sum_{i=1}^{N} w_{m,i}}

then we will have:

\beta_m = \frac{1}{2} \log \frac{1 - e_m}{e_m}
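As a quick numerical sanity check with a hypothetical error rate e_m = 0.3:

```latex
\beta_m = \frac{1}{2} \log \frac{1 - 0.3}{0.3} \approx 0.424
```

A learner no better than random guessing (e_m = 0.5) gets beta_m = 0, and a learner with e_m > 0.5 would get a negative weight, so better learners receive larger weights in the final model.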
c.3. update the weight w_{m+1, i}
Using f_m(x) = f_{m-1}(x) + \beta_m b_m(x), the weight for the next iteration is by definition

w_{m+1, i} = \exp\left( -y_i f_m(x_i) \right) = w_{m, i} \exp\left( -\beta_m y_i b_m(x_i) \right)

But we want to normalize this term to turn w_{m+1, i} into a "probability" that sums up to 1 (similar to softmax):

w_{m+1, i} = \frac{ w_{m, i} \exp\left( -\beta_m y_i b_m(x_i) \right) }{ Z_m }, \quad Z_m = \sum_{i=1}^{N} w_{m, i} \exp\left( -\beta_m y_i b_m(x_i) \right)

And this will not affect the way we compute beta_m and b_m(x), because the normalizing constant cancels in e_m:

e_m = \frac{\sum_{i=1}^{N} w_{m,i}\, I\left( y_i \neq b_m(x_i) \right)}{\sum_{i=1}^{N} w_{m,i}}
-
d. Actual Iterative Algorithm for AdaBoost Tree
Model Input: dataset D = {(x_1, y_1), ..., (x_N, y_N)}, where y_i belongs to {-1, 1}
Model Output: final classifier G(x).
Steps:
(1): Initialize the sample weights uniformly:

W_1 = (w_{1,1}, w_{1,2}, ..., w_{1,N}), \quad w_{1,i} = \frac{1}{N}
(2): for m = 1, 2, ..., M (the final classifier consists of M weak learners):

- fit the weak learner b_m(x) on the dataset D using the current sample weights W_m;
- compute the error rate e_m of b_m(x) on the dataset D:

  e_m = \sum_{i=1}^{N} w_{m,i}\, I\left( y_i \neq b_m(x_i) \right)

  since \sum_{i=1}^{N} w_{m,i} = 1;
- compute the parameter (weight) of b_m(x):

  \beta_m = \frac{1}{2} \log \frac{1 - e_m}{e_m}

- update the weight W_{m+1} for the next weak learner:

  w_{m+1, i} = \frac{ w_{m, i} \exp\left( -\beta_m y_i b_m(x_i) \right) }{ Z_m }, \quad Z_m = \sum_{i=1}^{N} w_{m, i} \exp\left( -\beta_m y_i b_m(x_i) \right)
(3): Build up the final classifier:

G(x) = sign\left( \sum_{m=1}^{M} \beta_m b_m(x) \right)
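To tie the steps together, below is a minimal Python sketch of the loop above, assuming the labels y_i are stored as a NumPy array of {-1, +1} and decision stumps are used as the weak learners b_m(x); the function names adaboost_fit / adaboost_predict are illustrative only.

```python
# Minimal sketch of the discrete AdaBoost loop described above.
# Assumptions: labels y_i in {-1, +1} (NumPy array), decision stumps as weak learners.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=50):
    """Return the lists (betas, learners) defining f(x) = sum_m beta_m * b_m(x)."""
    N = len(y)
    w = np.full(N, 1.0 / N)                    # step (1): w_{1,i} = 1/N
    learners, betas = [], []
    for _ in range(M):                         # step (2): m = 1, ..., M
        b = DecisionTreeClassifier(max_depth=1)
        b.fit(X, y, sample_weight=w)           # fit b_m(x) with current weights W_m
        pred = b.predict(X)
        e = np.sum(w * (pred != y))            # weighted error rate e_m (weights sum to 1)
        e = np.clip(e, 1e-10, 1 - 1e-10)       # guard against division by zero / log(0)
        beta = 0.5 * np.log((1 - e) / e)       # beta_m = 1/2 * log((1 - e_m) / e_m)
        w = w * np.exp(-beta * y * pred)       # unnormalized w_{m+1,i}
        w = w / w.sum()                        # normalize by Z_m
        learners.append(b)
        betas.append(beta)
    return betas, learners

def adaboost_predict(X, betas, learners):
    """Step (3): G(x) = sign(sum_m beta_m * b_m(x))."""
    f = sum(beta * b.predict(X) for beta, b in zip(betas, learners))
    return np.sign(f)
```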
-
e. A deeper look into how AdaBoost updates the weights
Remember that:

w_{m+1, i} = \frac{ w_{m, i} \exp\left( -\beta_m y_i b_m(x_i) \right) }{ Z_m }

So for beta_m > 0, we will have:

w_{m+1, i} = \frac{ w_{m, i}\, e^{-\beta_m} }{ Z_m } \ \text{ if } b_m(x_i) = y_i, \qquad w_{m+1, i} = \frac{ w_{m, i}\, e^{\beta_m} }{ Z_m } \ \text{ if } b_m(x_i) \neq y_i
which means that, if the classification is correct, then the weight of that sample will decrease, but if the classification is wrong, then the weight of that sample will increase.
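After normalization, the ratio between the update factors of a misclassified and a correctly classified sample makes this explicit; with a hypothetical e_m = 0.3, the misclassified sample's relative weight grows by a factor of about 2.33:

```latex
\frac{e^{\beta_m}}{e^{-\beta_m}} = e^{2\beta_m} = \frac{1 - e_m}{e_m} = \frac{0.7}{0.3} \approx 2.33
```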
Scikit-learn Application
AdaBoostClassifier:
class sklearn.ensemble.AdaBoostClassifier(base_estimator=None, n_estimators=50, learning_rate=1.0, algorithm='SAMME.R', random_state=None)
- base_estimator : object, optional (default=None)
  The base estimator from which the boosted ensemble is built. The default base estimator is DecisionTreeClassifier(max_depth=1); you can also use other machine learning models such as SVC. It corresponds to the b_m(x) in the formula.
- n_estimators : integer, optional (default=50)
  The maximum number of estimators at which boosting is terminated; it represents the M in the formula.
- learning_rate : float, optional (default=1.0)
  The learning rate shrinks the contribution of each classifier by learning_rate. For example, suppose the previous prediction f_{m-1} = 1, the learning rate is 0.1, and the next tree's correction is 0.5; then the updated prediction is f_m = 1 + 0.1 * 0.5 = 1.05. Reducing the learning rate forces the weights to change at a smaller pace, so it slows down the training process, but sometimes results in better performance.
- algorithm : {'SAMME', 'SAMME.R'}, optional (default='SAMME.R')
  If 'SAMME.R', use the SAMME.R real boosting algorithm; base_estimator must then support calculation of class probabilities. If 'SAMME', use the SAMME discrete boosting algorithm. The SAMME.R algorithm typically converges faster than SAMME, achieving a lower test error with fewer boosting iterations.
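A minimal usage sketch on synthetic data, only to connect the parameters above to code; all hyper-parameter values here are arbitrary, and note that newer scikit-learn versions rename base_estimator to estimator.

```python
# Minimal AdaBoostClassifier usage sketch on a toy dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = AdaBoostClassifier(
    base_estimator=DecisionTreeClassifier(max_depth=1),  # b_m(x): decision stumps
    n_estimators=100,                                     # M in the formula
    learning_rate=0.5,
    algorithm="SAMME.R",
    random_state=0,
)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```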
Reference
- Freund Y, Schapire R, Abe N. A short introduction to boosting[J]. Journal-Japanese Society For Artificial Intelligence, 1999, 14(771-780): 1612.
- Friedman J, Hastie T, Tibshirani R. The elements of statistical learning[M]. New York: Springer series in statistics, 2001.
- Hang Li. Statistical Learning Method[M]. Tsinghua University Press, 2019. [Chinese]
- Zhihua Zhou. Machine Learning[M]. Tsinghua University Press, 2018. [Chinese]
- Wikipedia contributors. AdaBoost. Wikipedia, The Free Encyclopedia. November 1, 2019, 02:11 UTC. Available at: https://en.wikipedia.org/w/index.php?title=AdaBoost&oldid=923990902. Accessed November 11, 2019.
- Schapire R E. Explaining adaboost[M]//Empirical inference. Springer, Berlin, Heidelberg, 2013: 37-52.
- https://towardsdatascience.com/boosting-algorithm-adaboost-b6737a9ee60c
- https://towardsdatascience.com/understanding-adaboost-2f94f22d5bfe
- https://zhuanlan.zhihu.com/p/41536315 [Chinese]
- https://zhuanlan.zhihu.com/p/37358517 [Chinese]
- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html
- https://chrisalbon.com/machine_learning/trees_and_forests/adaboost_classifier/