forked from janchorowski/ml_uwr
Commit
Jan Chorowski
committed
Oct 10, 2019
1 parent
ecb3bee
commit dbccc25
Showing
3 changed files
with
2 additions
and
0 deletions.
Binary file not shown.
Large diffs are not rendered by default.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"name":"02_entropy.ipynb","provenance":[],"collapsed_sections":[]},"kernelspec":{"name":"python3","display_name":"Python 3"}},"cells":[{"cell_type":"markdown","metadata":{"id":"j2l0x3S2XOXs","colab_type":"text"},"source":["# An intuitive introduction to entropy\n","\n","Let $X$ be a discrete random variable (RV) taking values from a set $\\mathcal{X}$ with probability mass function $P(x)$.\n","\n","*Definition:* the entropy $H(X)$ of the discrete random variable $X$ is\n","\\begin{equation}\n","H(X) = \\sum_{x\\in\\mathcal{X}}P(x)\\log \\frac{1}{P(x)} = -\\sum_{x\\in\\mathcal{X}}P(x)\\log P(x).\n","\\end{equation}\n","\n","How can we make sense of this definition? We'll argue below, rather informally, that the entropy of an RV provides a lower bound on the amount of information provided by the RV, which we'll define as the average number of bits required to transmit the value the RV has taken.\n","\n","As a motivating example, consider asking your friend for advice. 
The probabilities of the possible answers are given in the table below:\n","\n","| $x$ | $P(x)$ |\n","|----------|--------|\n","| OK | $1/2$ |\n","| Average | $1/4$ |\n","| Bad | $1/8$ |\n","| Terrible | $1/8$ |\n","\n","To transmit your friend's answer you must introduce an *encoding*, e.g.:\n","\n","| $x$ | $P(x)$ | Code 1 |\n","|----------|--------|--------|\n","| OK | $1/2$ | 00 |\n","| Average | $1/4$ | 01 |\n","| Bad | $1/8$ | 10 |\n","| Terrible | $1/8$ | 11 |\n","\n","Under this encoding, we spend 2 bits per answer.\n","\n","However, we could also consider a variable-length code that uses shorter codewords for more frequent symbols:\n","\n","| $x$ | $P(x)$ | Code 2 |\n","|----------|--------|--------|\n","| OK | $1/2$ | 0 |\n","| Average | $1/4$ | 10 |\n","| Bad | $1/8$ | 110 |\n","| Terrible | $1/8$ | 111 |\n","\n","Under this encoding the average number of bits needed to encode an answer is:\n","\\begin{equation}\n","\\mathbb{E}[L] = \\frac{1}{2} \\cdot 1 + \\frac{1}{4} \\cdot 2 + \\frac{1}{8} \\cdot 3 + \\frac{1}{8} \\cdot 3 = \\frac{7}{4}\n","\\end{equation}\n","\n","Thus, the new code is more efficient: it uses 1.75 bits per answer on average instead of 2. Is it the best we can do?"]},{"cell_type":"markdown","metadata":{"id":"uOpiKOWvn92s","colab_type":"text"},"source":["### The code space\n","\n","We'll now try to formalize the coding task, i.e. the assignment of code lengths to possible values of the RV. \n","\n","Let's first observe an important property of our code: in a variable-length code, no codeword can be a prefix of another one. Otherwise, decoding would be ambiguous. Therefore, whenever a value is assigned a codeword of length $L$, a fraction $1/2^L$ of the code space is reserved and not available to other codewords.\n","\n","This can be visualised as a code space. 
Below, we indicate the codewords assigned to symbols in the example and grey out the codewords that are not available because shorter codewords are in use:\n","\n","<img src=\"\" />\n","\n","We can observe that the length 1 codeword for \"OK\" uses $1/2$ of the code space, the length 2 codeword for \"Average\" uses $1/4$, and the two length 3 codewords for \"Bad\" and \"Terrible\" each use $1/8$ of the code space.\n","\n","In general, a codeword of length $L$ uses $1/2^L$ of the code space. Equivalently, assigning a fraction $f$ of the code space to a symbol gives it a codeword of length $L=\\log_2(1/f)$."]},{"cell_type":"markdown","metadata":{"id":"T_jOnTuVo0jj","colab_type":"text"},"source":["Assuming that codeword lengths may be fractional (i.e. a symbol may be assigned a fraction of a bit), our optimal coding problem can be formulated as partitioning the code space into four regions (one for each value of the RV) such that the average code length is minimised. \n","\n","Formally, let $p_1, p_2, p_3, p_4$ be the probabilities assigned to the 4 symbols and let $f_1, f_2, f_3, f_4$ be the code space fractions assigned to them.\n","\n","We want to:\n","\\begin{align}\n","\\text{minimize } &p_1 \\log_2 \\frac{1}{f_1} + p_2 \\log_2 \\frac{1}{f_2} + p_3 \\log_2 \\frac{1}{f_3} + p_4 \\log_2 \\frac{1}{f_4} \\\\\n","\\text{subject to: } & f_1 + f_2 + f_3 + f_4 = 1\n","\\end{align}\n","\n","For simplicity, we will solve this problem for the case of only two symbols:\n","\\begin{align}\n","\\text{minimize } &p_1 \\log_2 \\frac{1}{f_1} + p_2 \\log_2 \\frac{1}{f_2} \\\\\n","\\text{subject to: } & f_1 + f_2 = 1\n","\\end{align}\n","\n","Notice first that $p_2 = 1-p_1$ and likewise $f_2 = 1-f_1$. 
Then our minimization objective becomes\n","\\begin{equation}\n","\\text{minimize } C = p_1 \\log_2 \\frac{1}{f_1} + (1-p_1) \\log_2 \\frac{1}{1-f_1} \n","\\end{equation}\n","\n","To find the minimum over $f_1$ we compute the derivative of $C$ with respect to $f_1$ and set it to zero:\n","\\begin{equation}\n","\\frac{\\partial C}{\\partial f_1} = \\frac{p_1}{\\log 2}\\frac{-1}{f_1} + \\frac{1 - p_1}{\\log 2}\\frac{1}{1 - f_1} = 0\n","\\end{equation}\n","\n","Multiplying both sides by $\\log 2 \\cdot f_1 (1- f_1)$ and rearranging we obtain:\n","\\begin{align}\n","p_1(1-f_1) &= (1-p_1)f_1 \\\\\n","p_1 - p_1f_1 & =f_1 - p_1f_1 \\\\\n","f_1 &= p_1\n","\\end{align}\n","\n","Thus the optimal fraction of code space allocated to symbol 1 is $p_1$, the probability assigned to this symbol, and the optimal code length is $\\log_2(\\frac{1}{p_1})$!\n","\n","We now see that the entropy \n","\\begin{equation}\n","H(X) = \\sum_{x\\in\\mathcal{X}}P(x)\\log \\frac{1}{P(x)}\n","\\end{equation}\n","is simply the average code length under this optimal allocation!\n","\n","### A note about the logarithm base\n","\n","It is customary to compute the entropy using natural logarithms, which gives its value in \"nats\". If $\\log_2$ is used, the entropy has units of bits and lower-bounds the average number of bits needed to transmit a value of the RV."]},{"cell_type":"markdown","metadata":{"id":"KOBFV5X6aeHg","colab_type":"text"},"source":["# Further Reading\n","1. Chris Olah, \"Visual Information Theory\": https://colah.github.io/posts/2015-09-Visual-Information/\n","2. T. M. Cover and J. A. Thomas, \"Elements of Information Theory\", chapter 2"]}]} |
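
The notebook's numbers can be checked with a few lines of Python. The sketch below is not part of the commit. All function and variable names (`entropy_bits`, `expected_length`, `code_space_used`, `two_symbol_cost`) are hypothetical helpers introduced here for illustration. It compares the two example codes against the entropy of the distribution and does a coarse grid search over $f_1$ for the two-symbol problem, assuming $p_1 = 0.3$ as an arbitrary test value.

```python
import math

# Distribution and codes copied from the notebook's example tables.
probs = {"OK": 1 / 2, "Average": 1 / 4, "Bad": 1 / 8, "Terrible": 1 / 8}
code1 = {"OK": "00", "Average": "01", "Bad": "10", "Terrible": "11"}
code2 = {"OK": "0", "Average": "10", "Bad": "110", "Terrible": "111"}


def entropy_bits(p):
    """Entropy in bits: sum of p(x) * log2(1 / p(x)) over values with p(x) > 0."""
    return sum(px * math.log2(1 / px) for px in p.values() if px > 0)


def expected_length(p, code):
    """Average codeword length, in bits, under distribution p."""
    return sum(p[x] * len(code[x]) for x in p)


def code_space_used(code):
    """Total fraction of the code space reserved: sum of 2^(-L) over codewords."""
    return sum(2.0 ** -len(word) for word in code.values())


def two_symbol_cost(p1, f1):
    """Objective C from the two-symbol problem, for 0 < f1 < 1."""
    return p1 * math.log2(1 / f1) + (1 - p1) * math.log2(1 / (1 - f1))


print("entropy       :", entropy_bits(probs))
print("E[L] of Code 1:", expected_length(probs, code1))
print("E[L] of Code 2:", expected_length(probs, code2))
print("code space    :", code_space_used(code1), code_space_used(code2))

# Coarse grid search over f1 (p1 = 0.3 is an arbitrary illustrative choice).
p1 = 0.3
grid = [i / 1000 for i in range(1, 1000)]
best_f1 = min(grid, key=lambda f1: two_symbol_cost(p1, f1))
print("best f1 on grid:", best_f1)
```

For this distribution, Code 2 has the same average length as the entropy in bits, which is what the derivation predicts when every probability is a power of $1/2$. The grid search is only a numerical sanity check of $f_1 = p_1$, not a replacement for the derivative argument.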