forked from janchorowski/ml_uwr
Jan Chorowski committed Oct 31, 2019 · 1 parent d1b3b36 · commit bab0ddf
Showing 2 changed files with 8 additions and 0 deletions.
@@ -0,0 +1,7 @@
Good materials for Decision Trees are provided by:
- The argmax.ai/TUM course: https://argmax.ai/ml-course/lecture-02-trees/
- The CS229 lecture notes: http://cs229.stanford.edu/notes/cs229-notes-dt.pdf

For Random Forests, an excellent reference is Breiman's paper and tutorial:
- https://link.springer.com/article/10.1023/A:1010933404324
- https://www.stat.berkeley.edu/~breiman/Using_random_forests_v4.0.pdf
02_naive_bayes.ipynb
@@ -0,0 +1 @@
{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"name":"02_naive_bayes.ipynb","provenance":[],"collapsed_sections":[]}},"cells":[{"cell_type":"markdown","metadata":{"id":"jvMZ7p-XlTNz","colab_type":"text"},"source":["# Naive Bayes classifiers"]},{"cell_type":"markdown","metadata":{"id":"ZTwQ-esHXO9R","colab_type":"text"},"source":["A naive Bayes classifier uses the Bayes theorem to classify data. It is frequently used to filter out SPAM documents. To classify a document as SPAM/NONSPAM (HAM) we want\n","\n","\\begin{equation}\n"," p(\\text{SPAM}|\\text{TEXT})\n","\\end{equation}\n","\n","using the Bayes theorem we get\n","\n","\\begin{equation}\n"," p(\\text{SPAM}|\\text{TEXT}) =\n"," \\frac{p(\\text{TEXT}|\\text{SPAM})p(\\text{SPAM})}{p(\\text{TEXT})}\n","\\end{equation}\n","\n","The Bayes theorem allows us to express a classification problem as a generation problem: we will create a model for generating texts ($p(\\text{TEXT}|\\text{SPAM})$) and combine it with the prior probability of getting a spam ($p(\\text{SPAM})$). However, we will not need $p(\\text{TEXT})$: the probability of ever seeing a given document.\n","\n","To estimate $p(\\text{TEXT}|\\text{SPAM})$ we need to define a data generation model. A text is a sequence of words:\n","$$\n","\\text{TEXT} = W_1, W_2, W_3,\\ldots,W_n.\n","$$\n","Thus, \n","$$\n","p(\\text{TEXT}|\\text{SPAM}) = p(W_1|\\text{SPAM})p(W_2|W_1,\\text{SPAM})p(W_n|W_1, ..., W_{n-1}, \\text{SPAM})\n","$$\n","We will further simply this by (naively) assuming that \n","\n","\\begin{equation}\n"," \\begin{split}\n"," p(\\text{TEXT}|\\text{SPAM}) &= (W_1|\\text{SPAM})p(W_2|W_1,\\text{SPAM})p(W_n|W_1, ..., W_{n-1}, \\text{SPAM}) \\\\\n"," &\\approx p(W_1|\\text{SPAM})p(W_2|\\text{SPAM})p(W_n|\\text{SPAM}) \\\\\n"," &= \\prod_{W_i \\in \\text{TEXT}}p(W_i|\\text{SPAM})\n","\\end{split}\n","\\end{equation}\n","\n","This corresponds to a generative models in which the sender first flips a biased coin to see if the generated document will be a spam or ham one. Then, he picks a box labeled spam or ham. Finally, the sender draws with replacement words from the appropriate box.\n","\n","The full sampling model has the following parameters:\n","1. $\\phi$ - the probability of generating a SPAM.\n","2. $\\theta_{w,s}$ - the probability of generating word $w$ in a SPAM document, $\\sum_w \\theta_{w,s}=1$,\n","3. $\\theta_{w,h}$ - the probability of generating word $w$ in a HAM document, $\\sum_w \\theta_{w,h}=1$.\n","\n","All parameters are easy to estimate using maximum likelihood principle:\n","1. $\\phi = p(\\text{SPAM}=s)$ is just the fraction of all spams in our corpus.\n","2. $\\theta_{w,s} = p(W=w|SPAM=s)$ is the fraction of the number of occurrences of word $w$ in all spams.\n","3. $\\theta_{w,h} = p(W=w|SPAM=h)$ is the fraction of the number of occurrences of word $w$ in all non-spams.\n","\n","The derivation is somewhat tedious, and requires the use of Langrange multipliers:\n"]},{"cell_type":"markdown","metadata":{"id":"NQOwEUn7YKk_","colab_type":"text"},"source":["Example:\n","\n","suppose our corpus has 4 documents:\n","1. \"buy much now\": SPAM\n","2. \"many dollars gain\": SPAM\n","3. \"like you much\": HAM\n","4. 
\"do your nice homework\": HAM\n","\n","Then:\n","$\\phi = p(\\text{SPAM}=s) = 2/4 = 0.5$\n","\n","$\\theta_{w,h}$ is given by the following table\n","\n","| | buy | much | now | dollars | gain | like | you/your | do | homework | nice |\n","|------|-----|------|-----|---------|------|------|----------|-----|----------|------|\n","| SPAM | 1/6 | 2/6 | 1/6 | 1/6 | 1/6 | 0/6 | 0/6 | 0/6 | 0/6 | 0/6 |\n","| HAM | 0/7 | 1/7 | 0/7 | 0/7 | 0/7 | 1/7 | 2/7 | 1/7 | 1/7 | 1/7 |\n","\n","To classify a new phrase \"much much gain\" we compute\n","\n","$$\n","\\begin{split}\n","p(\\text{SPAM} = s | \\text{\"much much gain\"}) &= p(\\text{SPAM}=s) p(\\text{much}|\\text{SPAM}=s)p(\\text{much}|\\text{SPAM}=s)p(\\text{gain}|\\text{SPAM}=s) / p(\\text{TEXT} = \\text{\"much much gain\"}) = \\\\\n","&= 1/2 \\cdot 2/6 \\cdot 2/6 \\cdot 1/6 \\cdot 1/Z = 4/216 \\cdot 1/Z\n","\\end{split}\n","$$\n","\n","$$\n","\\begin{split}\n","p(\\text{HAM} = s | \\text{\"much much gain\"}) &= p(\\text{HAM}=s) p(\\text{much}|\\text{HAM}=s)p(\\text{much}|\\text{HAM}=s)p(\\text{gain}|\\text{HAM}=s) / p(\\text{TEXT} = \\text{\"much much gain\"}) = \\\\\n","&= 1/2 \\cdot 1/7 \\cdot 1/7 \\cdot 0/6 \\cdot 1/Z = 0\n","\\end{split}\n","$$\n","\n","Thus the text is classified as SPAM."]},{"cell_type":"markdown","metadata":{"id":"i8I5sfiTfz3N","colab_type":"text"},"source":[" In fact all text with word \"gain\" will never be classified as non-spam, because \"gain\" never appeared in a non-spam document. This is a problem with modeling: for rare words, we are using maximum likelihood estimation to compute the frequencies. However MLE estimation doesn't work well with low data counts.\n"," \n"," Inspired by the Bayesian approach to polling (estimating counts) a common technique is called Laplace smoothing - assuming that each word in the vocabulary was seen at least once (or even a fraction of times) in each kind of document.\n"," \n"," With Laplace smoothing (assuming each word occurred 0.5 times in spam and 0.5 times in ham) the table becomes\n"," \n","| | buy | much | now | dollars | gain | like | you/your | do | homework | nice |\n","|------|-----|------|-----|---------|------|------|----------|-----|----------|------|\n","| SPAM | 1.5/11 | 2.5/11 | 1.5/11 | 1.5/11 | 1.5/11 | 0.5/11 | 0.5/11 | 0.5/11 | 0.5/11 | 0.5/11 |\n","| HAM | 0.5/12 | 1.5/12 | 0.5/12 | 0.5/12 | 0.5/12 | 1.5/12 | 2.5/12 | 1.5/12 | 1.5/12 | 1.5/12 |\n","\n","Now:\n","\n","$$\n","\\begin{split}\n","p(\\text{SPAM} = s | \\text{\"much much gain\"}) &= 1/2 \\cdot 2.5/11 \\cdot 2.5/11 \\cdot 1.5/11 \\cdot 1/Z = 9.375/2662 \\cdot 1/Z \\\\\n","p(\\text{SPAM} = h | \\text{\"much much gain\"}) &= 1/2 \\cdot 1.5/12 \\cdot 1.5/12 \\cdot 0.5/12 \\cdot 1/Z = 1.125/3456 \\cdot 1/Z\n","\\end{split}\n","$$\n","\n","Since $p(\\text{SPAM} = s | \\text{\"much much gain\"}) + p(\\text{SPAM} = h | \\text{\"much much gain\"}) = 1$ we can work out that $Z=0.00385$ and\n","$$\n","\\begin{split}\n","p(\\text{SPAM} = s | \\text{\"much much gain\"}) &= 9.375/2662 \\cdot 1/Z = 0.915 = 91.5\\%\\\\\n","p(\\text{SPAM} = h | \\text{\"much much gain\"}) &= 1.125/3456 \\cdot 1/Z = 0.085 = 8.5\\%\n","\\end{split}\n","$$\n","\n","Thus the modelpredicts that with $91.5\\%$ the new text is SPAM."]},{"cell_type":"code","metadata":{"id":"C200U28dRM8a","colab_type":"code","colab":{}},"source":[""],"execution_count":0,"outputs":[]}]} |