Tree-Math

Machine learning study notes containing the math behind all the mainstream tree-based machine learning models, covering basic decision tree models (ID3, C4.5, CART), boosted models (GBM, AdaBoost, XGBoost, LightGBM), and bagging models (Bagging Tree, Random Forest, ExtraTrees).

Boosting Tree Models

Adaptive Boosting (AdaBoost)

One Sentence Summary:
Sequentially add weak base learners to the existing model, adaptively adjusting the weight of each weak base learner.

  • a. What is Forward Stagewise Additive Modeling
    Suppose $b(x;\, r_m)$ is a base learner controlled by the parameter $r_m$, and $\beta_m$ is the weight that controls how each weak base learner is added.
    Then the final model $f(x)$ built from M weak base learners is:

    $$f(x) = \sum_{m=1}^{M} \beta_m\, b(x;\, r_m)$$

    $\mathrm{sign}(\cdot)$ is the function that converts real-valued scores into actual classes, so the output classifier is:

    $$G(x) = \mathrm{sign}\!\left(f(x)\right) = \mathrm{sign}\!\left(\sum_{m=1}^{M} \beta_m\, b(x;\, r_m)\right)$$

    The global optimum over the dataset D with N samples is:

    $$\min_{\{\beta_m,\, r_m\}_{m=1}^{M}} \sum_{i=1}^{N} \mathrm{Loss}\!\left(y_i,\ \sum_{m=1}^{M} \beta_m\, b(x_i;\, r_m)\right)$$

    Solving this jointly is hard, so at each step we only look for the $\beta$ and $r$ of the current base learner that best approximate the global optimum:

    $$(\beta_m,\, r_m) = \arg\min_{\beta,\, r} \sum_{i=1}^{N} \mathrm{Loss}\!\left(y_i,\ f_{m-1}(x_i) + \beta\, b(x_i;\, r)\right)$$

    To sum up, the Forward Stagewise Additive Modeling algorithm is as below:

    Inputs: dataset D = {(x1, y1), (x2, y2), ..., (xN, yN)}, loss function Loss(y, f(x)), base learner set {b(x; r)}.
    Outputs: final output function f(x).
    Algorithm:

    • Initialize the model as $f_0(x) = 0$

    • For m in {1, 2, 3, ..., M}:
      Minimize the loss below to obtain the current base learner and its weight:

      $$(\beta_m,\, r_m) = \arg\min_{\beta,\, r} \sum_{i=1}^{N} \mathrm{Loss}\!\left(y_i,\ f_{m-1}(x_i) + \beta\, b(x_i;\, r)\right)$$

      Then update the function:

      $$f_m(x) = f_{m-1}(x) + \beta_m\, b(x;\, r_m)$$

    • Then we have the final output model f(x) as below:

      $$f(x) = f_M(x) = \sum_{m=1}^{M} \beta_m\, b(x;\, r_m)$$
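To make the generic recipe above concrete, here is a minimal, runnable Python sketch of forward stagewise additive modeling. The concrete choices go beyond these notes and are purely illustrative: squared loss as Loss(y, f(x)), one-dimensional threshold stumps b(x; r) = sign(x - r) as the base learner family, and the helper names fit_stump / forward_stagewise.

```python
import numpy as np

def fit_stump(x, residual):
    """Grid-search the threshold r and coefficient beta that best fit the residual."""
    best = None
    for r in np.unique(x):
        b = np.where(x > r, 1.0, -1.0)               # b(x; r) in {-1, +1}
        beta = np.dot(b, residual) / np.dot(b, b)    # closed-form least-squares coefficient
        loss = np.sum((residual - beta * b) ** 2)
        if best is None or loss < best[0]:
            best = (loss, beta, r)
    return best[1], best[2]

def forward_stagewise(x, y, M=3):
    f = np.zeros_like(y, dtype=float)                # f_0(x) = 0
    model = []
    for _ in range(M):                               # add one weak learner per round
        beta_m, r_m = fit_stump(x, y - f)            # argmin of Loss(y, f_{m-1}(x) + beta*b(x; r))
        f = f + beta_m * np.where(x > r_m, 1.0, -1.0)  # f_m(x) = f_{m-1}(x) + beta_m * b(x; r_m)
        model.append((beta_m, r_m))
    return model

x = np.array([0.1, 0.35, 0.4, 0.6, 0.8, 0.9])
y = np.array([1.0, 1.1, 1.2, 1.8, 2.0, 2.1])
print(forward_stagewise(x, y, M=3))
```

With squared loss, each round simply fits the current residual y - f_{m-1}(x); AdaBoost below is the same loop with the exponential loss instead.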

  • b. What is the exponential loss function and why we use it in AdaBoost
    Suppose $y \in \{-1, 1\}$. The exponential loss function is:

    $$\mathrm{Loss}(y, f(x)) = \exp\!\left(-y\, f(x)\right)$$

    If we take the expectation of this loss conditional on x, set its derivative to 0 and solve, we find that minimizing the exponential loss is essentially fitting half the log-odds, just like a logistic regression, and the resulting classifier reaches the optimal Bayes error rate:

    $$E_{Y \mid x}\!\left[e^{-Y f(x)}\right] = P(Y=1 \mid x)\, e^{-f(x)} + P(Y=-1 \mid x)\, e^{f(x)}$$

    $$\frac{\partial\, E_{Y \mid x}\!\left[e^{-Y f(x)}\right]}{\partial f(x)} = -P(Y=1 \mid x)\, e^{-f(x)} + P(Y=-1 \mid x)\, e^{f(x)} = 0$$

    $$f^{*}(x) = \frac{1}{2} \ln \frac{P(Y=1 \mid x)}{P(Y=-1 \mid x)}
    \quad\Longrightarrow\quad
    \mathrm{sign}\!\left(f^{*}(x)\right) = \arg\max_{y \in \{-1,\,1\}} P(Y = y \mid x)$$

    where the implied conditional probability has the logistic (sigmoid) form:

    $$P(Y=1 \mid x) = \frac{1}{1 + e^{-2 f^{*}(x)}}$$
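A quick numerical sanity check of this result (not part of the original notes; the values of p are made up): for a fixed x with P(Y=1|x) = p, the f(x) that minimizes the expected exponential loss p * e^(-f) + (1-p) * e^(f) should match 0.5 * ln(p / (1-p)).

```python
import numpy as np

for p in (0.6, 0.8, 0.95):
    f_grid = np.linspace(-5, 5, 200001)                           # dense grid of candidate f(x)
    expected_loss = p * np.exp(-f_grid) + (1 - p) * np.exp(f_grid)
    f_numeric = f_grid[np.argmin(expected_loss)]                   # grid minimizer
    f_closed = 0.5 * np.log(p / (1 - p))                           # f*(x) from the derivation above
    print(f"p={p}: numeric minimizer {f_numeric:.4f}, closed form {f_closed:.4f}")
```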

  • c. Math behind AdaBoost: how to compute the optimal parameters
    Suppose we have finished m-1 iterations and already computed $f_{m-1}$ as below:

    $$f_{m-1}(x) = f_{m-2}(x) + \beta_{m-1}\, b_{m-1}(x) = \beta_1 b_1(x) + \beta_2 b_2(x) + \cdots + \beta_{m-1} b_{m-1}(x)$$

    Now we are at iteration m, and we want to find the optimal $\beta_m$ and $b_m(x;\, r_m)$ (written $b_m(x)$ for short) that minimize our exponential loss. Note that the output of $b_m(x)$ is a class in $\{-1, 1\}$, not a probability. Our target is:

    $$(\beta_m,\, b_m) = \arg\min_{\beta,\, b} \sum_{i=1}^{N} \exp\!\left(-y_i \left(f_{m-1}(x_i) + \beta\, b(x_i)\right)\right) = \arg\min_{\beta,\, b} \sum_{i=1}^{N} \bar{w}_{m,i}\, \exp\!\left(-y_i\, \beta\, b(x_i)\right)$$

    where

    $$\bar{w}_{m,i} = \exp\!\left(-y_i\, f_{m-1}(x_i)\right)$$

    c.1. compute the optimal b_m(x)

    For any fixed $\beta > 0$, the loss splits over correctly and incorrectly classified samples:

    $$\sum_{i=1}^{N} \bar{w}_{m,i}\, e^{-y_i \beta\, b(x_i)}
    = e^{-\beta} \sum_{y_i = b(x_i)} \bar{w}_{m,i} + e^{\beta} \sum_{y_i \neq b(x_i)} \bar{w}_{m,i}
    = \left(e^{\beta} - e^{-\beta}\right) \sum_{i=1}^{N} \bar{w}_{m,i}\, I\!\left(y_i \neq b(x_i)\right) + e^{-\beta} \sum_{i=1}^{N} \bar{w}_{m,i}$$

    Since $\beta > 0$, minimizing this over $b$ is the same as minimizing the weighted classification error, so we will have:

    $$b_m(x) = \arg\min_{b} \sum_{i=1}^{N} \bar{w}_{m,i}\, I\!\left(y_i \neq b(x_i)\right)$$

    c.2. compute the optimal beta_m

    Plugging $b_m(x)$ back in, the loss as a function of $\beta$ becomes:

    $$L(\beta) = \left(e^{\beta} - e^{-\beta}\right) \sum_{i=1}^{N} \bar{w}_{m,i}\, I\!\left(y_i \neq b_m(x_i)\right) + e^{-\beta} \sum_{i=1}^{N} \bar{w}_{m,i}$$

    So in order to find the optimal $\beta$, we need to set the derivative to 0:

    $$\frac{\partial L(\beta)}{\partial \beta}
    = \left(e^{\beta} + e^{-\beta}\right) \sum_{i=1}^{N} \bar{w}_{m,i}\, I\!\left(y_i \neq b_m(x_i)\right) - e^{-\beta} \sum_{i=1}^{N} \bar{w}_{m,i} = 0
    \;\Longrightarrow\;
    e^{2\beta} = \frac{\sum_{i=1}^{N} \bar{w}_{m,i} - \sum_{i=1}^{N} \bar{w}_{m,i}\, I\!\left(y_i \neq b_m(x_i)\right)}{\sum_{i=1}^{N} \bar{w}_{m,i}\, I\!\left(y_i \neq b_m(x_i)\right)}$$

    If we write the weighted error rate as $e_m = \sum_{i} \bar{w}_{m,i}\, I\!\left(y_i \neq b_m(x_i)\right) \big/ \sum_{i} \bar{w}_{m,i}$, then we will have:

    $$\beta_m = \frac{1}{2} \ln \frac{1 - e_m}{e_m}$$

    c.3. update the weight w_{m+1, i}

    $$\bar{w}_{m+1,i} = \exp\!\left(-y_i\, f_m(x_i)\right) = \exp\!\left(-y_i\left(f_{m-1}(x_i) + \beta_m\, b_m(x_i)\right)\right) = \bar{w}_{m,i}\, \exp\!\left(-y_i\, \beta_m\, b_m(x_i)\right)$$

    But we want to normalize this term so that the weights $w_{m+1,i}$ form a "probability" distribution that sums to 1 (similar to softmax):

    $$w_{m+1,i} = \frac{\bar{w}_{m+1,i}}{\sum_{j=1}^{N} \bar{w}_{m+1,j}}$$

    And this will not affect the way we compute $\beta_m$ and $b_m(x)$, because it does not change $e_m$:

    $$e_m = \frac{\sum_{i=1}^{N} \bar{w}_{m,i}\, I\!\left(y_i \neq b_m(x_i)\right)}{\sum_{i=1}^{N} \bar{w}_{m,i}} = \sum_{i=1}^{N} w_{m,i}\, I\!\left(y_i \neq b_m(x_i)\right)$$
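As a toy numerical walk-through of steps c.1 to c.3 above (the labels, weak-learner outputs and starting weights below are made up purely for illustration):

```python
import numpy as np

y   = np.array([1, 1, -1, -1, 1])        # true labels in {-1, +1}
b_m = np.array([1, -1, -1, -1, 1])       # outputs of the chosen weak learner b_m(x_i)
w   = np.full(5, 0.2)                    # current normalized weights w_{m,i}

# c.1 / c.2: weighted error e_m and learner weight beta_m = 0.5 * ln((1 - e_m) / e_m)
e_m = np.sum(w * (b_m != y))
beta_m = 0.5 * np.log((1 - e_m) / e_m)

# c.3: unnormalized update, then normalize so the weights sum to 1 again
w_unnorm = w * np.exp(-beta_m * y * b_m)
w_next = w_unnorm / w_unnorm.sum()

print(e_m, beta_m)   # 0.2, ~0.693
print(w_next)        # the one misclassified sample ends up with a larger weight (0.5 vs 0.125)
```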

  • d. Actual Recurrent Algorithm for AdaBoost Tree
    Model input: dataset $D = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, where $y_i \in \{-1, 1\}$
    Model output: final classifier $G(x)$

    Steps:

    (1) Initialize the sample weights $T_1$ uniformly:

    $$T_1 = (w_{1,1}, \ldots, w_{1,i}, \ldots, w_{1,N}), \qquad w_{1,i} = \frac{1}{N}, \quad i = 1, 2, \ldots, N$$

    (2) For m = 1, 2, ..., M (the final classifier consists of M weak learners):

    • use the dataset D with weight distribution $T_m$ to train the weak learner $b_m$:

      $$T_m = (w_{m,1}, \ldots, w_{m,i}, \ldots, w_{m,N})$$

      $$b_m(x):\ \mathcal{X} \rightarrow \{-1, +1\}$$

    • compute the weighted error rate $e_m$ of $b_m(x)$ on the dataset D:

      $$e_m = P\!\left(b_m(x_i) \neq y_i\right) = \sum_{i=1}^{N} w_{m,i}\, I\!\left(b_m(x_i) \neq y_i\right)$$

      since the weights are already normalized:

      $$\sum_{i=1}^{N} w_{m,i} = 1$$

    • compute the weight $\beta_m$ of $b_m(x)$:

      $$\beta_m = \frac{1}{2} \ln \frac{1 - e_m}{e_m}$$

    • update the weights $T_{m+1}$ for the next weak learner:

      $$w_{m+1,i} = \frac{w_{m,i}\, \exp\!\left(-\beta_m\, y_i\, b_m(x_i)\right)}{Z_m}, \quad i = 1, 2, \ldots, N$$

      $$Z_m = \sum_{i=1}^{N} w_{m,i}\, \exp\!\left(-\beta_m\, y_i\, b_m(x_i)\right)$$

      $$T_{m+1} = (w_{m+1,1}, \ldots, w_{m+1,i}, \ldots, w_{m+1,N})$$

    (3) Build up the final classifier:

    $$G(x) = \mathrm{sign}\!\left(f(x)\right) = \mathrm{sign}\!\left(\sum_{m=1}^{M} \beta_m\, b_m(x)\right)$$
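Below is a compact NumPy sketch of the recurrent algorithm above. It goes beyond the notes in one respect: the weak learner is fixed to a depth-1 decision stump sign(s * (x_j - t)) found by exhaustive search, and the function names (fit_stump_weighted, adaboost, ...) and the tiny dataset are illustrative.

```python
import numpy as np

def fit_stump_weighted(X, y, w):
    """Return the stump (feature j, threshold t, sign s) with the lowest weighted error."""
    best = (np.inf, None)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1.0, -1.0):
                pred = np.where(s * (X[:, j] - t) > 0, 1.0, -1.0)
                err = np.sum(w * (pred != y))
                if err < best[0]:
                    best = (err, (j, t, s))
    return best[1]

def stump_predict(stump, X):
    j, t, s = stump
    return np.where(s * (X[:, j] - t) > 0, 1.0, -1.0)

def adaboost(X, y, M=20):
    N = len(y)
    w = np.full(N, 1.0 / N)                      # (1) T_1: uniform weights
    learners = []
    for _ in range(M):                           # (2) boosting rounds
        stump = fit_stump_weighted(X, y, w)      # train b_m on the weighted data
        pred = stump_predict(stump, X)
        e_m = np.sum(w * (pred != y))            # weighted error rate
        e_m = np.clip(e_m, 1e-10, 1 - 1e-10)     # guard against division by zero
        beta_m = 0.5 * np.log((1 - e_m) / e_m)   # learner weight
        w = w * np.exp(-beta_m * y * pred)       # re-weight the samples
        w /= w.sum()                             # normalize (divide by Z_m)
        learners.append((beta_m, stump))
    return learners

def adaboost_predict(learners, X):
    f = sum(beta * stump_predict(stump, X) for beta, stump in learners)
    return np.sign(f)                            # (3) G(x) = sign(sum beta_m * b_m(x))

# Tiny illustrative dataset: no single stump can separate it, but the ensemble can
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0, 1.0])
learners = adaboost(X, y, M=5)
print(adaboost_predict(learners, X))
```

The comments map the code back onto the three numbered steps: uniform initialization, M re-weighting rounds, and the sign of the weighted sum as the final classifier G(x).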

  • e. A deeper look into how AdaBoost updates the weights

    Remember that:

    $$w_{m+1,i} = \frac{w_{m,i}\, \exp\!\left(-\beta_m\, y_i\, b_m(x_i)\right)}{Z_m}$$

    Since $y_i\, b_m(x_i) = 1$ when the prediction is correct and $-1$ when it is wrong, for $\beta_m > 0$ we will have:

    $$w_{m+1,i} =
    \begin{cases}
    w_{m,i}\, e^{-\beta_m} / Z_m, & b_m(x_i) = y_i \\
    w_{m,i}\, e^{\beta_m} / Z_m, & b_m(x_i) \neq y_i
    \end{cases}$$

    which means that if a sample is classified correctly, its weight decreases in the next round, while if it is misclassified, its weight increases.
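    As a concrete illustration (the error rate 0.25 below is made up), the ratio between the updated weight of a misclassified sample and that of a correctly classified one depends only on $e_m$:

    $$e_m = 0.25 \;\Rightarrow\; \beta_m = \frac{1}{2} \ln \frac{1 - e_m}{e_m} = \frac{1}{2} \ln 3 \approx 0.549$$

    $$\frac{w_{m+1,i}\ (\text{misclassified})}{w_{m+1,j}\ (\text{correctly classified})}
    = \frac{w_{m,i}\, e^{\beta_m} / Z_m}{w_{m,j}\, e^{-\beta_m} / Z_m}
    = \frac{w_{m,i}}{w_{m,j}}\, e^{2\beta_m}
    = \frac{w_{m,i}}{w_{m,j}} \cdot \frac{1 - e_m}{e_m}
    = 3\, \frac{w_{m,i}}{w_{m,j}}$$

    So with $e_m = 0.25$, a misclassified sample carries three times as much relative weight in the next round as an equally weighted, correctly classified one.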

Scikit-learn Application

AdaBoostClassifier:
class sklearn.ensemble.AdaBoostClassifier(base_estimator=None, n_estimators=50, learning_rate=1.0, algorithm='SAMME.R', random_state=None)

  • base_estimator : object, optional (default=None)
    The base estimator from which the boosted ensemble is built. The default is DecisionTreeClassifier(max_depth=1); other estimators such as SVC can also be used. It corresponds to b_m(x) in the formulas above.

  • n_estimators : integer, optional (default=50)
    The maximum number of estimators at which boosting is terminated; it corresponds to M in the formulas above.

  • learning_rate : float, optional (default=1.)
    Learning rate shrinks the contribution of each classifier by learning_rate. For example, if the previous prediction is f_{m-1} = 1, the learning rate is 0.1, and the next tree's correction is 0.5, then the updated prediction is f_m = 1 + 0.1 * 0.5 = 1.05. Reducing the learning rate forces the ensemble to change at a smaller pace; this slows down the training process, but sometimes results in better performance.

  • algorithm : {'SAMME', 'SAMME.R'}, optional (default='SAMME.R')
    If ‘SAMME.R’ then use the SAMME.R real boosting algorithm. base_estimator must support calculation of class probabilities. If ‘SAMME’ then use the SAMME discrete boosting algorithm. The SAMME.R algorithm typically converges faster than SAMME, achieving a lower test error with fewer boosting iterations.
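A minimal usage sketch of the class described above (toy data from make_classification; the parameter values are illustrative, not tuned recommendations). Note that recent scikit-learn releases rename base_estimator to estimator; the call below follows the older signature quoted above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = AdaBoostClassifier(
    base_estimator=DecisionTreeClassifier(max_depth=1),  # b_m(x): decision stump
    n_estimators=100,                                     # M boosting rounds
    learning_rate=0.5,
    algorithm="SAMME.R",
    random_state=0,
)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```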

Reference

  1. Freund Y, Schapire R, Abe N. A short introduction to boosting[J]. Journal-Japanese Society For Artificial Intelligence, 1999, 14(771-780): 1612.
  2. Friedman J, Hastie T, Tibshirani R. The elements of statistical learning[M]. New York: Springer series in statistics, 2001.
  3. Hang Li. Statistical Learning Method[M]. Tsinghua University Press, 2019. [Chinese]
  4. Zhihua Zhou. Machine Learning[M]. Tsinghua University Press, 2018. [Chinese]
  5. Wikipedia contributors. AdaBoost. Wikipedia, The Free Encyclopedia. November 1, 2019, 02:11 UTC. Available at: https://en.wikipedia.org/w/index.php?title=AdaBoost&oldid=923990902. Accessed November 11, 2019.
  6. Schapire R E. Explaining adaboost[M]//Empirical inference. Springer, Berlin, Heidelberg, 2013: 37-52.
  7. https://towardsdatascience.com/boosting-algorithm-adaboost-b6737a9ee60c
  8. https://towardsdatascience.com/understanding-adaboost-2f94f22d5bfe
  9. https://zhuanlan.zhihu.com/p/41536315 [Chinese]
  10. https://zhuanlan.zhihu.com/p/37358517 [Chinese]
  11. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html
  12. https://chrisalbon.com/machine_learning/trees_and_forests/adaboost_classifier/