Machine learning study notes covering the math behind all the mainstream tree-based machine learning models: basic decision tree models (ID3, C4.5, CART), boosted models (GBM, AdaBoost, Xgboost, LightGBM), and bagging models (Bagging Tree, Random Forest, ExtraTrees).
ID3 Model
One Sentence Summary:
Splitting the dataset recursively on the feature that yields the maximum information gain.
-
a. What is information gain
We define entropy H(X) as a metric to reflect the uncertainty of a random variable. Suppose that a random variable X follows the distribution below:

$$P(X = x_i) = p_i, \quad i = 1, 2, \dots, n$$

Then the entropy of X is:

$$H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i$$

A higher entropy value means a more uncertain variable. For a dataset D and a feature A, the conditional entropy H(D|A) measures how much uncertainty remains in D after it is split by A, so the information gain g(D, A) of dataset D conditioned on feature A is:

$$g(D, A) = H(D) - H(D \mid A)$$
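As a quick sanity check (a minimal sketch, not part of the original notes), the entropy of a discrete distribution can be computed directly from its probabilities:

```python
import numpy as np

def entropy(probs):
    """H(X) = -sum(p_i * log2(p_i)) for a discrete distribution."""
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]  # treat 0 * log(0) as 0
    return -np.sum(probs * np.log2(probs))

# A fair coin is maximally uncertain (1 bit); a biased coin less so.
print(entropy([0.5, 0.5]))  # 1.0
print(entropy([0.9, 0.1]))  # ~0.47
```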
-
b. How to compute information gain
Suppose that the dataset D has K categories C1, C2, ..., CK, and that feature A splits the dataset into n subsets D1, D2, ..., Dn. Let Dik denote the samples of category Ck inside subset Di. Then the quantities in section a are computed as:

$$H(D) = -\sum_{k=1}^{K} \frac{|C_k|}{|D|} \log_2 \frac{|C_k|}{|D|}$$

$$H(D \mid A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} H(D_i) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \sum_{k=1}^{K} \frac{|D_{ik}|}{|D_i|} \log_2 \frac{|D_{ik}|}{|D_i|}$$

$$g(D, A) = H(D) - H(D \mid A)$$

A small worked example in code is given below.
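Putting the two formulas together, here is a small illustrative example (the toy dataset and helper names are invented for illustration) that computes g(D, A) for one categorical feature:

```python
import numpy as np
from collections import Counter

def entropy_from_labels(labels):
    """H(D): entropy of the class labels of a dataset."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))

def information_gain(labels, feature_values):
    """g(D, A) = H(D) - H(D|A) for a single categorical feature A."""
    total = len(labels)
    h_d_given_a = 0.0
    for value in set(feature_values):
        subset = [y for y, x in zip(labels, feature_values) if x == value]
        h_d_given_a += len(subset) / total * entropy_from_labels(subset)
    return entropy_from_labels(labels) - h_d_given_a

# Hypothetical toy data: class label vs. one categorical feature.
labels = ["yes", "yes", "no", "no", "yes", "no"]
income = ["high", "high", "low", "low", "high", "high"]
print(information_gain(labels, income))  # ~0.46
```

-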
c. The actual ID3 model & algorithm
During each split, we select the feature that gives the maximum information gain g(D, A).

Model Input: dataset D, feature set A, stopping threshold e.
Model Output: ID3 decision tree.

Recursive Algorithm:
(1) If every sample in D belongs to a single class Ck, stop splitting at this node and predict every sample in D as Ck.

(2.1) If A is empty, stop splitting at this node and predict D as the class that has the most samples.

(2.2) If A is not empty, loop through the feature set A and compute the information gain for each feature using the equations in a and b. Suppose the maximum information gain is ga, obtained by splitting on feature a.

(2.2.1) If ga > threshold e, split on feature a, dividing the dataset into subsets {D1, D2, ..., Dm} based on the different values of the categorical feature a. For each subset Di, treat Di as the new dataset D and A - {a} as the new feature set A, and repeat the splitting process.

(2.2.2) If ga <= threshold e, stop splitting at this node and set this leaf node to the class that has the most samples.

(3) The tree stops growing when no further splitting happens in any of the subsets.
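To make the recursion concrete, here is a minimal Python sketch of the procedure above. It reuses the hypothetical information_gain helper from section b; names such as build_id3 and the dict-of-dicts tree representation are illustrative choices, not part of the original algorithm description.

```python
from collections import Counter

def build_id3(data, labels, features, epsilon=1e-3):
    """data: list of dicts (feature name -> value); labels: class labels;
    features: remaining feature names; epsilon: stopping threshold e."""
    # (1) Every sample belongs to one class: return a leaf with that class.
    if len(set(labels)) == 1:
        return labels[0]
    # (2.1) No features left: return the majority class.
    if not features:
        return Counter(labels).most_common(1)[0][0]
    # (2.2) Pick the feature with the maximum information gain.
    gains = {f: information_gain(labels, [row[f] for row in data]) for f in features}
    best = max(gains, key=gains.get)
    # (2.2.2) Gain below the threshold: stop and return the majority class.
    if gains[best] <= epsilon:
        return Counter(labels).most_common(1)[0][0]
    # (2.2.1) Split on the best feature and recurse on each value subset
    #         with the reduced feature set A - {a}.
    tree = {best: {}}
    for value in set(row[best] for row in data):
        rows = [i for i, row in enumerate(data) if row[best] == value]
        tree[best][value] = build_id3([data[i] for i in rows],
                                      [labels[i] for i in rows],
                                      [f for f in features if f != best],
                                      epsilon)
    return tree
```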
Reference
- Quinlan J R. Induction of Decision Trees[J]. Machine Learning, 1986.
- Quinlan J R. C4.5: Programs for Machine Learning[M]. Elsevier, 2014.
- Hang Li. Statistical Learning Method[M]. Tsinghua University Press, 2019. [Chinese]
- Zhihua Zhou. Machine Learning[M]. Tsinghua University Press, 2018. [Chinese]
- Wikipedia contributors. ID3 algorithm. Wikipedia, The Free Encyclopedia. October 23, 2019, 17:33 UTC. Available at: https://en.wikipedia.org/w/index.php?title=ID3_algorithm&oldid=922686642. Accessed November 11, 2019.
- https://medium.com/machine-learning-guy/an-introduction-to-decision-tree-learning-id3-algorithm-54c74eb2ad55
- https://towardsdatascience.com/decision-trees-introduction-id3-8447fd5213e9