Machine learning study notes covering the math behind all the mainstream tree-based machine learning models, including basic decision tree models (ID3, C4.5, CART), boosted models (GBM, AdaBoost, Xgboost, LightGBM), and bagging models (Bagging Tree, Random Forest, ExtraTrees).
C4.5 Model
One Sentence Summary:
Split the dataset recursively on the feature that yields the maximum information gain ratio, rather than the maximum information gain.
a. What is information gain ratio
The information gain ratio gr(D, A) of dataset D conditioned on feature A is defined as:

$$gr(D, A) = \frac{g(D, A)}{H_A(D)}$$

where g(D, A) = H(D) - H(D|A) is the information gain and H_A(D) is the entropy of D with respect to the values of feature A (the split information).
The entropy H(D) of dataset D and the conditional entropy H(D|A) of dataset D conditioned on feature A are the same as discussed in ID3:

$$H(D) = -\sum_{k=1}^{K}\frac{|C_k|}{|D|}\log_2\frac{|C_k|}{|D|}, \qquad H(D|A) = \sum_{i=1}^{n}\frac{|D_i|}{|D|}H(D_i)$$
b. How to compute information gain ratio
Suppose that the dataset D has K classes C1, C2, ..., CK, and that feature A splits the dataset into n subsets D1, D2, ..., Dn.
Let Dik denote the set of samples in subset Di that belong to class Ck.
compute H(D) of dataset D

$$H(D) = -\sum_{k=1}^{K}\frac{|C_k|}{|D|}\log_2\frac{|C_k|}{|D|}$$

Here |·| denotes the number of samples in a set, so |C_k| is the number of samples of class Ck and |D| is the total number of samples in D.
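As a concrete illustration of this formula, below is a minimal Python sketch (not part of the original notes; the function name `entropy` and the list-of-labels input are illustrative choices) that computes H(D) from the class labels of a dataset.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(D) = -sum_k (|Ck|/|D|) * log2(|Ck|/|D|), where labels is the list of class labels of D."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

# Example: 9 samples of class "yes" and 5 of class "no"
print(entropy(["yes"] * 9 + ["no"] * 5))  # ~0.940
```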
compute H(D|A) of dataset D conditioned on feature A

Suppose we split the dataset on feature A into n parts D1, D2, ..., Dn. Then H(D|A) is:

$$H(D|A) = \sum_{i=1}^{n}\frac{|D_i|}{|D|}H(D_i) = -\sum_{i=1}^{n}\frac{|D_i|}{|D|}\sum_{k=1}^{K}\frac{|D_{ik}|}{|D_i|}\log_2\frac{|D_{ik}|}{|D_i|}$$
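A matching Python sketch for H(D|A) is below; it assumes the `entropy` helper from the previous snippet is already defined, and the function and argument names are again illustrative.

```python
from collections import defaultdict

def conditional_entropy(feature_values, labels):
    """H(D|A) = sum_i (|Di|/|D|) * H(Di), grouping samples by their value of feature A."""
    groups = defaultdict(list)  # value of A -> labels of the subset Di
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    total = len(labels)
    return sum((len(subset) / total) * entropy(subset) for subset in groups.values())

# Example: feature "outlook" for 5 samples and their class labels
outlook = ["sunny", "sunny", "rain", "rain", "overcast"]
label = ["no", "no", "yes", "yes", "yes"]
print(conditional_entropy(outlook, label))  # 0.0, every subset is pure
```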
compute the information gain ratio gr(D, A) of feature A

Below are the formulas for the information gain g(D, A), the split information H_A(D) of feature A, and the information gain ratio gr(D, A):

$$g(D, A) = H(D) - H(D|A)$$

$$H_A(D) = -\sum_{i=1}^{n}\frac{|D_i|}{|D|}\log_2\frac{|D_i|}{|D|}$$

$$gr(D, A) = \frac{g(D, A)}{H_A(D)}$$
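Combining the two helpers above, a minimal sketch of the information gain, the split information H_A(D), and the gain ratio could look like this (it assumes `entropy` and `conditional_entropy` from the previous snippets; the guard against zero split information is my own defensive choice):

```python
def information_gain(feature_values, labels):
    """g(D, A) = H(D) - H(D|A)."""
    return entropy(labels) - conditional_entropy(feature_values, labels)

def split_information(feature_values):
    """H_A(D): the entropy of D with respect to the values of feature A."""
    return entropy(feature_values)

def gain_ratio(feature_values, labels):
    """gr(D, A) = g(D, A) / H_A(D); returns 0 when A takes a single value (H_A(D) = 0)."""
    h_a = split_information(feature_values)
    return 0.0 if h_a == 0 else information_gain(feature_values, labels) / h_a

# Example: gain ratio of feature "outlook" on the same 5-sample toy dataset
outlook = ["sunny", "sunny", "rain", "rain", "overcast"]
label = ["no", "no", "yes", "yes", "yes"]
print(gain_ratio(outlook, label))  # ~0.638
```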
c. The actual C4.5 model & algorithm
During each split, we select the feature that yields the maximum information gain ratio gr(D, A).
Model Input: dataset D, feature set A, stopping threshold e.
Model Output: C4.5 decision tree.
Recursive Algorithm: (very similar to ID3)
(1) If all the samples in D belong to a single class Ck, stop splitting at this node and label this leaf node with class Ck.
(2.1) If A is empty, stop splitting at this node and label this leaf node with the class that has the most samples.
(2.2) If A is not empty, loop through all the features in A and compute the information gain ratio for each of them using the equations above. Suppose the maximum information gain ratio is gra, obtained by splitting on feature a.
(2.2.1) If gra > threshold e, split on feature a and break the dataset into subsets Di, one for each distinct value of feature a. For each subset in {D1, D2, ..., Dn}, treat Di as the new dataset D, treat A - {a} as the new feature set A, and recursively continue this splitting process.
(2.2.2) If gra <= threshold e, stop splitting at this node and label this leaf node with the class that has the most samples. A minimal code sketch of this recursion is given below.
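To tie the algorithm steps above to code, here is a minimal recursive sketch in Python. It mirrors only the categorical-split recursion described above and assumes the `gain_ratio` helper sketched earlier; the names `build_c45`, `majority_class`, and the dict-based row/node format are illustrative choices, not the original C4.5 implementation.

```python
from collections import Counter

def majority_class(labels):
    """Return the most frequent class label."""
    return Counter(labels).most_common(1)[0][0]

def build_c45(rows, labels, features, threshold):
    """Recursively build a C4.5-style tree for categorical features.

    rows:      list of dicts mapping feature name -> value
    labels:    class labels aligned with rows
    features:  set of feature names still available for splitting (feature set A)
    threshold: stopping threshold e on the information gain ratio
    Returns a class label (leaf) or a node dict {"feature": name, "children": {value: subtree}}.
    """
    # (1) all samples already belong to a single class -> leaf with that class
    if len(set(labels)) == 1:
        return labels[0]
    # (2.1) no features left -> leaf with the majority class
    if not features:
        return majority_class(labels)
    # (2.2) choose the feature a with the maximum information gain ratio gra
    best = max(features, key=lambda f: gain_ratio([r[f] for r in rows], labels))
    # (2.2.2) gain ratio not above the threshold -> leaf with the majority class
    if gain_ratio([r[best] for r in rows], labels) <= threshold:
        return majority_class(labels)
    # (2.2.1) split on feature a and recurse on each subset Di with feature set A - {a}
    node = {"feature": best, "children": {}}
    for value in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        node["children"][value] = build_c45(
            [rows[i] for i in idx],
            [labels[i] for i in idx],
            features - {best},
            threshold,
        )
    return node
```

For example, `build_c45(rows, labels, set(rows[0]), threshold=0.01)` would grow a tree on a small categorical dataset. Note that the full C4.5 algorithm also handles continuous attributes (by searching binary thresholds), missing values, and post-pruning, none of which are shown in this sketch.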
References
- Quinlan J R. C4.5: Programs for Machine Learning[M]. Elsevier, 2014.
- Hang Li. Statistical Learning Method[M]. Tsinghua University Press, 2019. [Chinese]
- Zhihua Zhou. Machine Learning[M]. Tsinghua University Press, 2018. [Chinese]
- Wikipedia contributors. (2019, February 16). C4.5 algorithm. In Wikipedia, The Free Encyclopedia. Retrieved 15:40, November 11, 2019, from https://en.wikipedia.org/w/index.php?title=C4.5_algorithm&oldid=883549387
- Ian H. Witten; Eibe Frank; Mark A. Hall (2011). "Data Mining: Practical machine learning tools and techniques, 3rd Edition". Morgan Kaufmann, San Francisco. p. 191.
- https://towardsdatascience.com/what-is-the-c4-5-algorithm-and-how-does-it-work-2b971a9e7db0