From bab0ddf6367019474922f3121abdb2ef39538d53 Mon Sep 17 00:00:00 2001
From: Jan Chorowski
Date: Thu, 31 Oct 2019 22:24:13 +0100
Subject: [PATCH] Lecture notes

---
 lectures/03_04_materials.md   | 7 +++++++
 lectures/03_naive_bayes.ipynb | 1 +
 2 files changed, 8 insertions(+)
 create mode 100644 lectures/03_04_materials.md
 create mode 100644 lectures/03_naive_bayes.ipynb

diff --git a/lectures/03_04_materials.md b/lectures/03_04_materials.md
new file mode 100644
index 0000000..829cd77
--- /dev/null
+++ b/lectures/03_04_materials.md
@@ -0,0 +1,7 @@
+Good materials on decision trees are provided by:
+- The argmax.ai/TUM course: https://argmax.ai/ml-course/lecture-02-trees/
+- The CS229 lecture notes: http://cs229.stanford.edu/notes/cs229-notes-dt.pdf
+
+For Random Forests, excellent references are Breiman's paper and his tutorial:
+- https://link.springer.com/article/10.1023/A:1010933404324
+- https://www.stat.berkeley.edu/~breiman/Using_random_forests_v4.0.pdf
diff --git a/lectures/03_naive_bayes.ipynb b/lectures/03_naive_bayes.ipynb
new file mode 100644
index 0000000..834dfd3
--- /dev/null
+++ b/lectures/03_naive_bayes.ipynb
@@ -0,0 +1 @@
+{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"name":"03_naive_bayes.ipynb","provenance":[],"collapsed_sections":[]}},"cells":[{"cell_type":"markdown","metadata":{"id":"jvMZ7p-XlTNz","colab_type":"text"},"source":["# Naive Bayes classifiers"]},{"cell_type":"markdown","metadata":{"id":"ZTwQ-esHXO9R","colab_type":"text"},"source":["A naive Bayes classifier uses Bayes' theorem to classify data. It is frequently used to filter out SPAM documents. To classify a document as SPAM/NONSPAM (HAM) we want to compute\n","\n","\\begin{equation}\n"," p(\\text{SPAM}|\\text{TEXT})\n","\\end{equation}\n","\n","Using Bayes' theorem we get\n","\n","\\begin{equation}\n"," p(\\text{SPAM}|\\text{TEXT}) =\n"," \\frac{p(\\text{TEXT}|\\text{SPAM})p(\\text{SPAM})}{p(\\text{TEXT})}\n","\\end{equation}\n","\n","Bayes' theorem allows us to express a classification problem as a generation problem: we will build a model for generating texts ($p(\\text{TEXT}|\\text{SPAM})$) and combine it with the prior probability of a document being spam ($p(\\text{SPAM})$). Notably, we will not need $p(\\text{TEXT})$, the probability of ever seeing a given document, because it is the same for both classes and cancels out when we compare them.\n","\n","To estimate $p(\\text{TEXT}|\\text{SPAM})$ we need to define a data generation model. A text is a sequence of words:\n","$$\n","\\text{TEXT} = W_1, W_2, W_3,\\ldots,W_n.\n","$$\n","Thus, by the chain rule,\n","$$\n","p(\\text{TEXT}|\\text{SPAM}) = p(W_1|\\text{SPAM})p(W_2|W_1,\\text{SPAM})\\cdots p(W_n|W_1, \\ldots, W_{n-1}, \\text{SPAM})\n","$$\n","We will further simplify this by (naively) assuming that the words are independent given the class:\n","\n","\\begin{equation}\n"," \\begin{split}\n"," p(\\text{TEXT}|\\text{SPAM}) &= p(W_1|\\text{SPAM})p(W_2|W_1,\\text{SPAM})\\cdots p(W_n|W_1, \\ldots, W_{n-1}, \\text{SPAM}) \\\\\n"," &\\approx p(W_1|\\text{SPAM})p(W_2|\\text{SPAM})\\cdots p(W_n|\\text{SPAM}) \\\\\n"," &= \\prod_{W_i \\in \\text{TEXT}}p(W_i|\\text{SPAM})\n","\\end{split}\n","\\end{equation}\n","\n","This corresponds to a generative model in which the sender first flips a biased coin to decide whether the generated document will be spam or ham. Then, the sender picks the box labeled spam or ham accordingly. Finally, the sender draws words with replacement from that box.\n","\n","The full sampling model has the following parameters:\n","1. $\\phi$ - the probability of generating a SPAM document,\n","2. $\\theta_{w,s}$ - the probability of generating word $w$ in a SPAM document, $\\sum_w \\theta_{w,s}=1$,\n","3. $\\theta_{w,h}$ - the probability of generating word $w$ in a HAM document, $\\sum_w \\theta_{w,h}=1$.\n","\n","All parameters are easy to estimate using the maximum likelihood principle:\n","1. $\\phi = p(\\text{SPAM}=s)$ is just the fraction of spam documents in our corpus.\n","2. $\\theta_{w,s} = p(W=w|\\text{SPAM}=s)$ is the number of occurrences of word $w$ in all spams divided by the total number of words in all spams.\n","3. $\\theta_{w,h} = p(W=w|\\text{SPAM}=h)$ is the number of occurrences of word $w$ in all non-spams divided by the total number of words in all non-spams.\n","\n","The derivation is somewhat tedious and requires the use of Lagrange multipliers."]},
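{"cell_type":"markdown","metadata":{},"source":["The counting estimates above can be written in a few lines of Python. The cell below is only a minimal sketch, not a reference implementation: the function name `fit_naive_bayes` and the input format (tokenized documents plus a list of boolean spam labels) are illustrative choices.\n"]},{"cell_type":"code","metadata":{},"source":["from collections import Counter\n","\n","# Minimal sketch of the maximum-likelihood estimates: everything is plain counting.\n","def fit_naive_bayes(documents, is_spam):\n","    # documents: list of lists of words, is_spam: list of booleans (True = SPAM)\n","    phi = sum(is_spam) / len(is_spam)          # fraction of SPAM documents\n","    spam_counts, ham_counts = Counter(), Counter()\n","    for words, spam in zip(documents, is_spam):\n","        (spam_counts if spam else ham_counts).update(words)\n","    n_spam_words = sum(spam_counts.values())   # total number of words in SPAM documents\n","    n_ham_words = sum(ham_counts.values())     # total number of words in HAM documents\n","    theta_spam = {w: c / n_spam_words for w, c in spam_counts.items()}\n","    theta_ham = {w: c / n_ham_words for w, c in ham_counts.items()}\n","    return phi, theta_spam, theta_ham"],"execution_count":0,"outputs":[]},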
{"cell_type":"markdown","metadata":{"id":"NQOwEUn7YKk_","colab_type":"text"},"source":["Example:\n","\n","Suppose our corpus has 4 documents:\n","1. \"buy much now\": SPAM\n","2. \"many dollars gain\": SPAM\n","3. \"like you much\": HAM\n","4. \"do your nice homework\": HAM\n","\n","Then:\n","$\\phi = p(\\text{SPAM}=s) = 2/4 = 0.5$\n","\n","$\\theta_{w,s}$ and $\\theta_{w,h}$ are given by the following table (treating \"much\"/\"many\" and \"you\"/\"your\" as single words):\n","\n","| | buy | much/many | now | dollars | gain | like | you/your | do | homework | nice |\n","|------|-----|-----------|-----|---------|------|------|----------|-----|----------|------|\n","| SPAM | 1/6 | 2/6 | 1/6 | 1/6 | 1/6 | 0/6 | 0/6 | 0/6 | 0/6 | 0/6 |\n","| HAM | 0/7 | 1/7 | 0/7 | 0/7 | 0/7 | 1/7 | 2/7 | 1/7 | 1/7 | 1/7 |\n","\n","To classify a new phrase \"much much gain\" we compute\n","\n","$$\n","\\begin{split}\n","p(\\text{SPAM} = s | \\text{\"much much gain\"}) &= p(\\text{SPAM}=s) p(\\text{much}|\\text{SPAM}=s)p(\\text{much}|\\text{SPAM}=s)p(\\text{gain}|\\text{SPAM}=s) / p(\\text{TEXT} = \\text{\"much much gain\"}) = \\\\\n","&= 1/2 \\cdot 2/6 \\cdot 2/6 \\cdot 1/6 \\cdot 1/Z = 4/432 \\cdot 1/Z\n","\\end{split}\n","$$\n","\n","$$\n","\\begin{split}\n","p(\\text{SPAM} = h | \\text{\"much much gain\"}) &= p(\\text{SPAM}=h) p(\\text{much}|\\text{SPAM}=h)p(\\text{much}|\\text{SPAM}=h)p(\\text{gain}|\\text{SPAM}=h) / p(\\text{TEXT} = \\text{\"much much gain\"}) = \\\\\n","&= 1/2 \\cdot 1/7 \\cdot 1/7 \\cdot 0/7 \\cdot 1/Z = 0\n","\\end{split}\n","$$\n","\n","Thus the text is classified as SPAM."]},
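{"cell_type":"markdown","metadata":{},"source":["As a sanity check, the sketch below reproduces this example with the `fit_naive_bayes` helper defined above. Merging \"much\"/\"many\" and \"you\"/\"your\" mirrors the table; the variable and function names are illustrative.\n"]},{"cell_type":"code","metadata":{},"source":["corpus = [\n","    (\"buy much now\".split(), True),             # SPAM\n","    (\"many dollars gain\".split(), True),        # SPAM\n","    (\"like you much\".split(), False),           # HAM\n","    (\"do your nice homework\".split(), False),   # HAM\n","]\n","# Merge the word variants exactly as in the table above.\n","merge = {\"many\": \"much\", \"your\": \"you\"}\n","documents = [[merge.get(w, w) for w in words] for words, _ in corpus]\n","is_spam = [label for _, label in corpus]\n","\n","phi, theta_spam, theta_ham = fit_naive_bayes(documents, is_spam)\n","\n","def unnormalized_score(words, prior, theta):\n","    # prior times the product of per-word probabilities; the shared 1/Z factor is omitted.\n","    score = prior\n","    for w in words:\n","        score *= theta.get(w, 0.0)  # an unseen word gets probability 0 under plain MLE\n","    return score\n","\n","new_text = \"much much gain\".split()\n","print(unnormalized_score(new_text, phi, theta_spam))     # 1/2 * 2/6 * 2/6 * 1/6 = 4/432\n","print(unnormalized_score(new_text, 1 - phi, theta_ham))  # 0.0, because \"gain\" never occurs in HAM"],"execution_count":0,"outputs":[]},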
{"cell_type":"markdown","metadata":{"id":"i8I5sfiTfz3N","colab_type":"text"},"source":["In fact, any text containing the word \"gain\" will never be classified as non-spam, because \"gain\" never appeared in a non-spam document. This is a modeling problem: for rare words we are estimating the frequencies by maximum likelihood, and MLE does not work well with low counts.\n","\n","A common remedy, inspired by the Bayesian approach to polling (estimating proportions from counts), is Laplace smoothing: we assume that each word in the vocabulary was seen at least once (or even a fraction of a time) in each kind of document.\n","\n","With Laplace smoothing (assuming each word occurred 0.5 extra times in spam and 0.5 extra times in ham) the table becomes\n","\n","| | buy | much/many | now | dollars | gain | like | you/your | do | homework | nice |\n","|------|--------|-----------|--------|---------|--------|--------|----------|--------|----------|--------|\n","| SPAM | 1.5/11 | 2.5/11 | 1.5/11 | 1.5/11 | 1.5/11 | 0.5/11 | 0.5/11 | 0.5/11 | 0.5/11 | 0.5/11 |\n","| HAM | 0.5/12 | 1.5/12 | 0.5/12 | 0.5/12 | 0.5/12 | 1.5/12 | 2.5/12 | 1.5/12 | 1.5/12 | 1.5/12 |\n","\n","Now:\n","\n","$$\n","\\begin{split}\n","p(\\text{SPAM} = s | \\text{\"much much gain\"}) &= 1/2 \\cdot 2.5/11 \\cdot 2.5/11 \\cdot 1.5/11 \\cdot 1/Z = 9.375/2662 \\cdot 1/Z \\\\\n","p(\\text{SPAM} = h | \\text{\"much much gain\"}) &= 1/2 \\cdot 1.5/12 \\cdot 1.5/12 \\cdot 0.5/12 \\cdot 1/Z = 1.125/3456 \\cdot 1/Z\n","\\end{split}\n","$$\n","\n","Since $p(\\text{SPAM} = s | \\text{\"much much gain\"}) + p(\\text{SPAM} = h | \\text{\"much much gain\"}) = 1$ we can work out that $Z \\approx 0.00385$ and\n","$$\n","\\begin{split}\n","p(\\text{SPAM} = s | \\text{\"much much gain\"}) &= 9.375/2662 \\cdot 1/Z \\approx 0.915 = 91.5\\% \\\\\n","p(\\text{SPAM} = h | \\text{\"much much gain\"}) &= 1.125/3456 \\cdot 1/Z \\approx 0.085 = 8.5\\%\n","\\end{split}\n","$$\n","\n","Thus the model predicts that the new text is SPAM with probability $91.5\\%$."]},
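{"cell_type":"markdown","metadata":{},"source":["The smoothed computation can be checked with a small variation of the earlier sketch, reusing the toy corpus and the `unnormalized_score` helper from the cells above. This is only an illustrative sketch; the function name and the `alpha` pseudo-count parameter are assumptions, with `alpha=0.5` matching the table.\n"]},{"cell_type":"code","metadata":{},"source":["def fit_naive_bayes_smoothed(documents, is_spam, alpha=0.5):\n","    # Same counting as before, but every vocabulary word gets an extra pseudo-count alpha.\n","    vocab = {w for words in documents for w in words}\n","    phi = sum(is_spam) / len(is_spam)\n","    spam_counts, ham_counts = Counter(), Counter()\n","    for words, spam in zip(documents, is_spam):\n","        (spam_counts if spam else ham_counts).update(words)\n","    spam_total = sum(spam_counts.values()) + alpha * len(vocab)\n","    ham_total = sum(ham_counts.values()) + alpha * len(vocab)\n","    theta_spam = {w: (spam_counts[w] + alpha) / spam_total for w in vocab}\n","    theta_ham = {w: (ham_counts[w] + alpha) / ham_total for w in vocab}\n","    return phi, theta_spam, theta_ham\n","\n","phi, theta_spam, theta_ham = fit_naive_bayes_smoothed(documents, is_spam)\n","spam_score = unnormalized_score(new_text, phi, theta_spam)\n","ham_score = unnormalized_score(new_text, 1 - phi, theta_ham)\n","Z = spam_score + ham_score\n","print(spam_score / Z, ham_score / Z)  # approximately 0.915 and 0.085"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"C200U28dRM8a","colab_type":"code","colab":{}},"source":[""],"execution_count":0,"outputs":[]}]}
\ No newline at end of file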