forked from janchorowski/ml_uwr
Jan Chorowski committed Oct 31, 2019 · 1 parent d1b3b36 · commit bab0ddf
Showing 2 changed files with 8 additions and 0 deletions.
@@ -0,0 +1,7 @@
Good materials for Decision Trees are provided by:
- The argmax.ai/TUM course: https://argmax.ai/ml-course/lecture-02-trees/
- The CS229 lecture notes: http://cs229.stanford.edu/notes/cs229-notes-dt.pdf

For Random Forests, an excellent reference is Breiman's paper and tutorial:
- https://link.springer.com/article/10.1023/A:1010933404324
- https://www.stat.berkeley.edu/~breiman/Using_random_forests_v4.0.pdf
02_naive_bayes.ipynb
@@ -0,0 +1 @@
{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"name":"02_naive_bayes.ipynb","provenance":[],"collapsed_sections":[]}},"cells":[{"cell_type":"markdown","metadata":{"id":"jvMZ7p-XlTNz","colab_type":"text"},"source":["# Naive Bayes classifiers"]},{"cell_type":"markdown","metadata":{"id":"ZTwQ-esHXO9R","colab_type":"text"},"source":["A naive Bayes classifier uses the Bayes theorem to classify data. It is frequently used to filter out SPAM documents. To classify a document as SPAM/NONSPAM (HAM) we want\n","\n","\\begin{equation}\n"," p(\\text{SPAM}|\\text{TEXT})\n","\\end{equation}\n","\n","using the Bayes theorem we get\n","\n","\\begin{equation}\n"," p(\\text{SPAM}|\\text{TEXT}) =\n"," \\frac{p(\\text{TEXT}|\\text{SPAM})p(\\text{SPAM})}{p(\\text{TEXT})}\n","\\end{equation}\n","\n","The Bayes theorem allows us to express a classification problem as a generation problem: we will create a model for generating texts ($p(\\text{TEXT}|\\text{SPAM})$) and combine it with the prior probability of getting a spam ($p(\\text{SPAM})$). However, we will not need $p(\\text{TEXT})$: the probability of ever seeing a given document.\n","\n","To estimate $p(\\text{TEXT}|\\text{SPAM})$ we need to define a data generation model. A text is a sequence of words:\n","$$\n","\\text{TEXT} = W_1, W_2, W_3,\\ldots,W_n.\n","$$\n","Thus, \n","$$\n","p(\\text{TEXT}|\\text{SPAM}) = p(W_1|\\text{SPAM})p(W_2|W_1,\\text{SPAM})p(W_n|W_1, ..., W_{n-1}, \\text{SPAM})\n","$$\n","We will further simply this by (naively) assuming that \n","\n","\\begin{equation}\n"," \\begin{split}\n"," p(\\text{TEXT}|\\text{SPAM}) &= (W_1|\\text{SPAM})p(W_2|W_1,\\text{SPAM})p(W_n|W_1, ..., W_{n-1}, \\text{SPAM}) \\\\\n"," &\\approx p(W_1|\\text{SPAM})p(W_2|\\text{SPAM})p(W_n|\\text{SPAM}) \\\\\n"," &= \\prod_{W_i \\in \\text{TEXT}}p(W_i|\\text{SPAM})\n","\\end{split}\n","\\end{equation}\n","\n","This corresponds to a generative models in which the sender first flips a biased coin to see if the generated document will be a spam or ham one. Then, he picks a box labeled spam or ham. Finally, the sender draws with replacement words from the appropriate box.\n","\n","The full sampling model has the following parameters:\n","1. $\\phi$ - the probability of generating a SPAM.\n","2. $\\theta_{w,s}$ - the probability of generating word $w$ in a SPAM document, $\\sum_w \\theta_{w,s}=1$,\n","3. $\\theta_{w,h}$ - the probability of generating word $w$ in a HAM document, $\\sum_w \\theta_{w,h}=1$.\n","\n","All parameters are easy to estimate using maximum likelihood principle:\n","1. $\\phi = p(\\text{SPAM}=s)$ is just the fraction of all spams in our corpus.\n","2. $\\theta_{w,s} = p(W=w|SPAM=s)$ is the fraction of the number of occurrences of word $w$ in all spams.\n","3. $\\theta_{w,h} = p(W=w|SPAM=h)$ is the fraction of the number of occurrences of word $w$ in all non-spams.\n","\n","The derivation is somewhat tedious, and requires the use of Langrange multipliers:\n"]},{"cell_type":"markdown","metadata":{"id":"NQOwEUn7YKk_","colab_type":"text"},"source":["Example:\n","\n","suppose our corpus has 4 documents:\n","1. \"buy much now\": SPAM\n","2. \"many dollars gain\": SPAM\n","3. \"like you much\": HAM\n","4. 
\"do your nice homework\": HAM\n","\n","Then:\n","$\\phi = p(\\text{SPAM}=s) = 2/4 = 0.5$\n","\n","$\\theta_{w,h}$ is given by the following table\n","\n","| | buy | much | now | dollars | gain | like | you/your | do | homework | nice |\n","|------|-----|------|-----|---------|------|------|----------|-----|----------|------|\n","| SPAM | 1/6 | 2/6 | 1/6 | 1/6 | 1/6 | 0/6 | 0/6 | 0/6 | 0/6 | 0/6 |\n","| HAM | 0/7 | 1/7 | 0/7 | 0/7 | 0/7 | 1/7 | 2/7 | 1/7 | 1/7 | 1/7 |\n","\n","To classify a new phrase \"much much gain\" we compute\n","\n","$$\n","\\begin{split}\n","p(\\text{SPAM} = s | \\text{\"much much gain\"}) &= p(\\text{SPAM}=s) p(\\text{much}|\\text{SPAM}=s)p(\\text{much}|\\text{SPAM}=s)p(\\text{gain}|\\text{SPAM}=s) / p(\\text{TEXT} = \\text{\"much much gain\"}) = \\\\\n","&= 1/2 \\cdot 2/6 \\cdot 2/6 \\cdot 1/6 \\cdot 1/Z = 4/216 \\cdot 1/Z\n","\\end{split}\n","$$\n","\n","$$\n","\\begin{split}\n","p(\\text{HAM} = s | \\text{\"much much gain\"}) &= p(\\text{HAM}=s) p(\\text{much}|\\text{HAM}=s)p(\\text{much}|\\text{HAM}=s)p(\\text{gain}|\\text{HAM}=s) / p(\\text{TEXT} = \\text{\"much much gain\"}) = \\\\\n","&= 1/2 \\cdot 1/7 \\cdot 1/7 \\cdot 0/6 \\cdot 1/Z = 0\n","\\end{split}\n","$$\n","\n","Thus the text is classified as SPAM."]},{"cell_type":"markdown","metadata":{"id":"i8I5sfiTfz3N","colab_type":"text"},"source":[" In fact all text with word \"gain\" will never be classified as non-spam, because \"gain\" never appeared in a non-spam document. This is a problem with modeling: for rare words, we are using maximum likelihood estimation to compute the frequencies. However MLE estimation doesn't work well with low data counts.\n"," \n"," Inspired by the Bayesian approach to polling (estimating counts) a common technique is called Laplace smoothing - assuming that each word in the vocabulary was seen at least once (or even a fraction of times) in each kind of document.\n"," \n"," With Laplace smoothing (assuming each word occurred 0.5 times in spam and 0.5 times in ham) the table becomes\n"," \n","| | buy | much | now | dollars | gain | like | you/your | do | homework | nice |\n","|------|-----|------|-----|---------|------|------|----------|-----|----------|------|\n","| SPAM | 1.5/11 | 2.5/11 | 1.5/11 | 1.5/11 | 1.5/11 | 0.5/11 | 0.5/11 | 0.5/11 | 0.5/11 | 0.5/11 |\n","| HAM | 0.5/12 | 1.5/12 | 0.5/12 | 0.5/12 | 0.5/12 | 1.5/12 | 2.5/12 | 1.5/12 | 1.5/12 | 1.5/12 |\n","\n","Now:\n","\n","$$\n","\\begin{split}\n","p(\\text{SPAM} = s | \\text{\"much much gain\"}) &= 1/2 \\cdot 2.5/11 \\cdot 2.5/11 \\cdot 1.5/11 \\cdot 1/Z = 9.375/2662 \\cdot 1/Z \\\\\n","p(\\text{SPAM} = h | \\text{\"much much gain\"}) &= 1/2 \\cdot 1.5/12 \\cdot 1.5/12 \\cdot 0.5/12 \\cdot 1/Z = 1.125/3456 \\cdot 1/Z\n","\\end{split}\n","$$\n","\n","Since $p(\\text{SPAM} = s | \\text{\"much much gain\"}) + p(\\text{SPAM} = h | \\text{\"much much gain\"}) = 1$ we can work out that $Z=0.00385$ and\n","$$\n","\\begin{split}\n","p(\\text{SPAM} = s | \\text{\"much much gain\"}) &= 9.375/2662 \\cdot 1/Z = 0.915 = 91.5\\%\\\\\n","p(\\text{SPAM} = h | \\text{\"much much gain\"}) &= 1.125/3456 \\cdot 1/Z = 0.085 = 8.5\\%\n","\\end{split}\n","$$\n","\n","Thus the modelpredicts that with $91.5\\%$ the new text is SPAM."]},{"cell_type":"code","metadata":{"id":"C200U28dRM8a","colab_type":"code","colab":{}},"source":[""],"execution_count":0,"outputs":[]}]} |