
Tree-Math

Machine learning study notes containing the math behind all the mainstream tree-based machine learning models, covering basic decision tree models (ID3, C4.5, CART), boosted models (GBM, AdaBoost, XGBoost, LightGBM), and bagging models (Bagging Tree, Random Forest, ExtraTrees).

Base Decision Tree

C4.5 Model

One Sentence Summary:
Recursively split the dataset on the feature that yields the maximum information gain ratio rather than the maximum information gain.

  • a. What is information gain ratio

    The information gain ratio $g_R(D, A)$ of dataset $D$ with respect to feature $A$ is defined as:

    $$g_R(D, A) = \frac{g(D, A)}{H_A(D)}$$

    where the information gain $g(D, A) = H(D) - H(D|A)$ uses the same entropy of dataset $D$ and entropy of $D$ conditioned on feature $A$ as discussed in ID3:

    $$H(D) = -\sum_{k=1}^{K} \frac{|C_k|}{|D|} \log_2 \frac{|C_k|}{|D|}$$

    $$H(D|A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} H(D_i)$$

    Here we introduce the intrinsic value of $A$, written $H_A(D)$, as below:

    $$H_A(D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}$$
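The two building blocks above, entropy and intrinsic value, can be sketched in Python as follows (a minimal illustration; the function names are my own, not from the notes). The intrinsic value $H_A(D)$ is simply the entropy of the feature's own value distribution:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(D) of a sequence of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def intrinsic_value(feature_values):
    """Intrinsic value H_A(D): entropy of the feature's value distribution."""
    return entropy(feature_values)
```

For example, a perfectly balanced binary label list has entropy 1 bit, and a feature that splits the data into two equal halves has intrinsic value 1 bit.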

  • b. How to compute information gain ratio

    Suppose that dataset $D$ has $K$ categories $C_1, C_2, \ldots, C_K$, and that feature $A$ splits the dataset into $n$ subsets $D_1, D_2, \ldots, D_n$. Let $D_{ik}$ denote the subset of samples of category $k$ within subset $D_i$.

    1. Compute $H(D)$ of dataset $D$, where $|C_k|$ denotes the number of samples in $C_k$:

       $$H(D) = -\sum_{k=1}^{K} \frac{|C_k|}{|D|} \log_2 \frac{|C_k|}{|D|}$$

    2. Compute $H(D|A)$ of dataset $D$ conditioned on feature $A$. Splitting on feature $A$ gives $D_1, D_2, \ldots, D_n$, $n$ parts in total, so $H(D|A)$ is as below:

       $$H(D|A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} H(D_i) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \sum_{k=1}^{K} \frac{|D_{ik}|}{|D_i|} \log_2 \frac{|D_{ik}|}{|D_i|}$$

    3. Compute the information gain $g(D, A)$ and then the information gain ratio $g_R(D, A)$:

       $$g(D, A) = H(D) - H(D|A)$$

       $$g_R(D, A) = \frac{g(D, A)}{H_A(D)}$$
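The three steps above can be combined into one function. Below is a minimal Python sketch (the function name and argument layout are my own, not from the notes): it groups the labels by feature value to form the subsets $D_1, \ldots, D_n$, computes $H(D|A)$, and divides the information gain by the intrinsic value:

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy H(D) of a sequence of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(feature_values, labels):
    """Information gain ratio g_R(D, A) for one categorical feature A."""
    n = len(labels)
    # Step 0: group labels by feature value, forming subsets D_1 .. D_n.
    subsets = defaultdict(list)
    for v, y in zip(feature_values, labels):
        subsets[v].append(y)
    # Step 2: conditional entropy H(D|A) = sum_i |D_i|/|D| * H(D_i).
    h_cond = sum(len(s) / n * entropy(s) for s in subsets.values())
    # Step 3: gain g(D, A) = H(D) - H(D|A), divided by intrinsic value H_A(D).
    gain = entropy(labels) - h_cond
    iv = entropy(feature_values)
    return gain / iv if iv > 0 else 0.0
```

A feature that perfectly separates two balanced classes gets ratio 1.0; a feature independent of the label gets 0.0.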

  • c. The actual C4.5 model & algorithm
    During each split, we find the feature that yields the maximum information gain ratio $g_R(D, A)$.

    Model Input: dataset $D$, feature set $A$, stopping threshold $e$.
    Model Output: C4.5 decision tree.

    Recursive Algorithm: (very similar to ID3)
    (1) If all the samples in $D$ already belong to a single class $C_k$, stop splitting at this node and set this leaf node to class $C_k$.

    (2.1) If $A$ is empty, stop splitting at this node and set this leaf node to the class that has the most samples.
    (2.2) If $A$ is not empty, loop through all the features in $A$ and compute the information gain ratio for each of them using the equations above. Suppose that among them the maximum information gain ratio is $g_a$, obtained by splitting on feature $a$.
    (2.2.1) If $g_a > e$, split on feature $a$, breaking the dataset into separate subsets $D_i$ based on the distinct values of feature $a$. For each subset in $\{D_1, D_2, \ldots, D_n\}$, treat $D_i$ as the new dataset $D$, treat $A - \{a\}$ as the new feature set $A$, and recursively continue this splitting process.
    (2.2.2) If $g_a \le e$, stop splitting at this node and set this leaf node to the class that has the most samples.
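The recursive algorithm above can be sketched in Python. This is a minimal illustration under my own assumptions (rows are dicts mapping feature name to value, leaves are class labels, internal nodes are dicts keyed by `(feature, value)` pairs), not a full C4.5 implementation (no pruning, no continuous features, no missing values):

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, f):
    """Information gain ratio of splitting on feature f."""
    subsets = defaultdict(list)
    for row, y in zip(rows, labels):
        subsets[row[f]].append(y)
    n = len(labels)
    h_cond = sum(len(s) / n * entropy(s) for s in subsets.values())
    iv = entropy([row[f] for row in rows])
    return (entropy(labels) - h_cond) / iv if iv > 0 else 0.0

def build_tree(rows, labels, features, e=1e-3):
    # (1) Pure node: all samples already belong to one class.
    if len(set(labels)) == 1:
        return labels[0]
    # (2.1) No features left: leaf labeled with the majority class.
    if not features:
        return Counter(labels).most_common(1)[0][0]
    # (2.2) Pick the feature a with the maximum information gain ratio.
    a = max(features, key=lambda f: gain_ratio(rows, labels, f))
    # (2.2.2) Gain ratio below threshold e: stop, use the majority class.
    if gain_ratio(rows, labels, a) <= e:
        return Counter(labels).most_common(1)[0][0]
    # (2.2.1) Split on each value of a and recurse with A - {a}.
    groups = defaultdict(lambda: ([], []))
    for row, y in zip(rows, labels):
        groups[row[a]][0].append(row)
        groups[row[a]][1].append(y)
    rest = [f for f in features if f != a]
    return {(a, v): build_tree(sub_rows, sub_labels, rest, e)
            for v, (sub_rows, sub_labels) in groups.items()}
```

For example, on four rows where feature `x` perfectly predicts the label, the tree is a single split on `x` with two pure leaves.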

Reference

  1. Quinlan J R. C4.5: Programs for Machine Learning[M]. Elsevier, 2014.
  2. Hang Li. Statistical Learning Method[M]. Tsinghua University Press, 2019. [Chinese]
  3. Zhihua Zhou. Machine Learning[M]. Tsinghua University Press, 2018. [Chinese]
  4. Wikipedia contributors. (2019, February 16). C4.5 algorithm. In Wikipedia, The Free Encyclopedia. Retrieved 15:40, November 11, 2019, from https://en.wikipedia.org/w/index.php?title=C4.5_algorithm&oldid=883549387
  5. Ian H. Witten; Eibe Frank; Mark A. Hall (2011). "Data Mining: Practical machine learning tools and techniques, 3rd Edition". Morgan Kaufmann, San Francisco. p. 191.
  6. https://towardsdatascience.com/what-is-the-c4-5-algorithm-and-how-does-it-work-2b971a9e7db0