#[fit] Ai 1
#[fit] Losses,
#[fit] Learning
##[fit] Minimize
##[fit] Cost
- how many dollars will you spend?
- what is your creditworthiness?
- how many people will vote for Bernie t days before the election?
- used to predict probabilities for classification
- causal modeling in econometrics
Linear Regression
Cost Function:
###[fit] MINIMIZE SQUARED ERROR. It's Convex!
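Concretely, in the notation used later in this deck, the mean squared error cost being minimized is:

$$\renewcommand{\v}[1]{\mathbf #1}
J(\v{w}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \v{w}\cdot\v{x_i})^2$$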
Basically: go in the direction opposite to the derivative.
Consider the objective function: $$ J(x) = x^2-6x+5 $$
```python
gradient = fprime(old_x)       # derivative of J at the current point
move = gradient * step         # scale by the step size (learning rate)
current_x = old_x - move       # move opposite to the gradient
```
where `step` is the learning rate.
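A minimal runnable version of this loop for the objective $$ J(x) = x^2-6x+5 $$ below (the step size and starting point are arbitrary choices, not from the slides):

```python
J = lambda x: x**2 - 6*x + 5     # the objective
fprime = lambda x: 2*x - 6       # its derivative

step = 0.1                       # learning rate (arbitrary choice)
old_x = 0.0                      # arbitrary starting point

for i in range(100):
    gradient = fprime(old_x)     # slope at the current point
    move = gradient * step
    current_x = old_x - move     # step opposite to the gradient
    old_x = current_x

print(current_x, J(current_x))   # approaches the minimum at x = 3, J = -4
```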
ENTIRE DATASET NEEDED
```python
for i in range(n_epochs):
    params_grad = evaluate_gradient(loss_function, data, params)
    params = params - learning_rate * params_grad
```
ONE POINT AT A TIME
For Linear Regression:
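For squared error on a single example (a sketch in the deck's notation), the gradient used in each update is:

$$\renewcommand{\v}[1]{\mathbf #1}
\nabla_{\v{w}} \, (y_i - \v{w}\cdot\v{x_i})^2 = -2\,(y_i - \v{w}\cdot\v{x_i})\,\v{x_i},$$

so each step nudges the weights in the direction of the per-example residual times the features.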
```python
for i in range(nb_epochs):
    np.random.shuffle(data)
    for example in data:
        params_grad = evaluate_gradient(loss_function, example, params)
        params = params - learning_rate * params_grad
```
Mini-Batch SGD (the most commonly used)
```python
for i in range(nb_epochs):
    np.random.shuffle(data)
    for batch in get_batches(data, batch_size=50):
        params_grad = evaluate_gradient(loss_function, batch, params)
        params = params - learning_rate * params_grad
```
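`get_batches` (like `evaluate_gradient`) is pseudocode rather than a library function; a minimal sketch, assuming `data` is a numpy array of rows that has already been shuffled as above, could be:

```python
import numpy as np

def get_batches(data, batch_size=50):
    """Yield successive mini-batches of rows from an already shuffled array."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]
```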
[.autoscale: true]
- the risk surface changes at each gradient calculation
- thus the updates are noisy
- the cumulated risk is smoother and can be used to compare against SGD
- epochs are now the number of times you revisit the full dataset
- shuffle in between epochs to provide even more stochasticity
#[fit] Ok, so where did this
#[fit] Mean Squared Loss
#[fit] come from?
Assume each observation is generated with Gaussian noise around the linear prediction:
$$
\renewcommand{\v}[1]{\mathbf #1}
y_i \sim N(\v{w}\cdot\v{x_i}, \sigma^2) .$$
We can then write the likelihood:
$$
\renewcommand{\v}[1]{\mathbf #1}
\cal{L} = p(\v{y} | \v{x}, \v{w}, \sigma) = \prod_i p(\v{y}_i | \v{x}_i, \v{w}, \sigma)$$
$$\renewcommand{\v}[1]{\mathbf #1}
\cal{L} = (2\pi\sigma^2)^{(-n/2)} e^{\frac{-1}{2\sigma^2} \sum_i (y_i - \v{w}\cdot\v{x}_i)^2} .$$
The log likelihood is then:
$$\renewcommand{\v}[1]{\mathbf #1}
\ell = \frac{-n}{2} log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_i (y_i - \v{w}\cdot\v{x}_i)^2 .$$
If all you care about are the parameters, the first term is a constant: maximizing the log likelihood is the same as minimizing the sum of squared errors.
##Minimize Loss or Cost:
Replace "maximize the log likelihood" with "minimize the negative log likelihood", i.e. minimize the sum of squared errors.

So: minimizing the mean squared error is exactly maximum likelihood estimation under the Gaussian noise model.
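A quick numerical check of this equivalence (the synthetic data and gradient-descent settings below are illustrative assumptions): gradient descent on the mean squared error recovers the same weights as the closed-form least-squares / maximum-likelihood fit.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, n)])   # intercept + one feature
w_true = np.array([2.0, -3.0])
y = X @ w_true + rng.normal(0, 0.5, n)                     # Gaussian noise, as assumed above

# closed-form least squares (maximizes the Gaussian likelihood)
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# gradient descent on the mean squared error
w = np.zeros(2)
eta = 0.1
for _ in range(2000):
    grad = -2 * X.T @ (y - X @ w) / n
    w -= eta * grad

print(w_ls, w)   # the two estimates agree up to numerical tolerance
```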
#[fit]CLASSIFICATION
- will a customer churn?
- is this a check? For how much?
- a man or a woman?
- will this customer buy?
- do you have cancer?
- is this spam?
- whose picture is this?
what is this text about?[^1]
##[fit] MLE for Logistic Regression
- example of a Generalized Linear Model (GLM)
- "Squeeze" linear regression through a Sigmoid function
- this bounds the output to be a probability
This function is plotted below:
```python
import numpy as np
import matplotlib.pyplot as plt

h = lambda z: 1./(1+np.exp(-z))   # the sigmoid
zs = np.arange(-5, 5, 0.1)
plt.plot(zs, h(zs), alpha=0.5);
```
Identify:
$$\renewcommand{\v}[1]{\mathbf #1} z = \v{w}\cdot\v{x}$$
Then, the conditional probabilities of the two classes, given a particular sample's features
$$\renewcommand{\v}[1]{\mathbf #1} \v{x},$$ are:
$$\begin{eqnarray}
\renewcommand{\v}[1]{\mathbf #1}
P(y=1 | \v{x}) &=& h(\v{w}\cdot\v{x}) \\
P(y=0 | \v{x}) &=& 1 - h(\v{w}\cdot\v{x}).
\end{eqnarray}$$
These two can be written together as
$$\renewcommand{\v}[1]{\mathbf #1} P(y|\v{x}, \v{w}) = h(\v{w}\cdot\v{x})^y \left(1 - h(\v{w}\cdot\v{x}) \right)^{(1-y)} $$
BERNOULLI!!
Multiplying over the samples we get:
$$\renewcommand{\v}[1]{\mathbf #1} P(y|\v{x},\v{w}) = P(\{y_i\} | \{\v{x}_i\}, \v{w}) = \prod_{y_i \in \cal{D}} P(y_i|\v{x_i}, \v{w}) = \prod_{y_i \in \cal{D}} h(\v{w}\cdot\v{x_i})^{y_i} \left(1 - h(\v{w}\cdot\v{x_i}) \right)^{(1-y_i)}$$
Indeed, it's important to realize that a particular sample can be thought of as a draw from some "true" probability distribution.

Maximum likelihood estimation maximizes the likelihood of the sample y, or alternately the log likelihood,
$$\renewcommand{\v}[1]{\mathbf #1} {\cal L} = P(y \mid \v{x},\v{w}).$$
$$\renewcommand{\v}[1]{\mathbf #1} \ell = log(P(y \mid \v{x},\v{w}))$$
Thus
$$\renewcommand{\v}[1]{\mathbf #1} \begin{eqnarray}
\ell &=& log\left(\prod_{y_i \in \cal{D}} h(\v{w}\cdot\v{x_i})^{y_i} \left(1 - h(\v{w}\cdot\v{x_i}) \right)^{(1-y_i)}\right)\\
&=& \sum_{y_i \in \cal{D}} log\left(h(\v{w}\cdot\v{x_i})^{y_i} \left(1 - h(\v{w}\cdot\v{x_i}) \right)^{(1-y_i)}\right)\\
&=& \sum_{y_i \in \cal{D}} log\,h(\v{w}\cdot\v{x_i})^{y_i} + log\,\left(1 - h(\v{w}\cdot\v{x_i}) \right)^{(1-y_i)}\\
&=& \sum_{y_i \in \cal{D}} \left ( y_i log(h(\v{w}\cdot\v{x_i})) + ( 1 - y_i) log(1 - h(\v{w}\cdot\v{x_i})) \right )
\end{eqnarray}$$
[.autoscale: true]
The negative of this log likelihood (NLL) is also called the cross-entropy:

$$\renewcommand{\v}[1]{\mathbf #1} NLL = - \sum_{y_i \in \cal{D}} \left ( y_i log(h(\v{w}\cdot\v{x_i})) + ( 1 - y_i) log(1 - h(\v{w}\cdot\v{x_i})) \right )$$

This loss is convex. There are many ways to minimize it, but we can just use gradient descent :-)
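A minimal gradient-descent sketch for this NLL (the synthetic data, learning rate, and iteration count are illustrative assumptions); the gradient of the averaged NLL works out to the mean of (h(w·x_i) − y_i) x_i over the data.

```python
import numpy as np

h = lambda z: 1. / (1 + np.exp(-z))                       # the sigmoid from above

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])     # intercept + one feature
w_true = np.array([-1.0, 2.0])
y = (rng.uniform(size=n) < h(X @ w_true)).astype(float)   # Bernoulli labels

w = np.zeros(2)
eta = 0.5
for _ in range(5000):
    p = h(X @ w)
    grad = X.T @ (p - y) / n      # gradient of the averaged NLL
    w -= eta * grad

print(w)                          # approaches w_true
```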
Identify $$\renewcommand{\v}[1]{\mathbf #1} p_{1i} = P(y_i = 1 \mid \v{x_i})$$ and $$\renewcommand{\v}[1]{\mathbf #1} p_{2i} = P(y_i = 0 \mid \v{x_i})$$ as two separate probabilities constrained to add to 1. That is, $$p_{1i} + p_{2i} = 1$$:

$$\renewcommand{\v}[1]{\mathbf #1} p_{1i} = \frac{e^{\v{w_1} \cdot \v{x}}}{e^{\v{w_1} \cdot \v{x}} + e^{\v{w_2} \cdot \v{x}}}$$

$$\renewcommand{\v}[1]{\mathbf #1} p_{2i} = \frac{e^{\v{w_2} \cdot \v{x}}}{e^{\v{w_1} \cdot \v{x}} + e^{\v{w_2} \cdot \v{x}}}$$

This is the softmax formulation. You can translate both coefficient vectors by a fixed amount without any change in these probabilities (a quick numerical check follows below).
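The invariance, numerically (the scores are arbitrary): adding the same constant to both scores leaves the softmax probabilities unchanged.

```python
import numpy as np

def softmax(scores):
    # subtracting the max is the usual numerical-stability trick;
    # it is itself an instance of the shift invariance
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

scores = np.array([1.3, -0.4])     # w_1.x and w_2.x for one sample (arbitrary values)
print(softmax(scores))
print(softmax(scores + 7.0))       # identical probabilities
```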
$$\renewcommand{\v}[1]{\mathbf #1}
\v{z}^1 = \v{x_i}$$
$$\renewcommand{\v}[1]{\mathbf #1}
\v{z}^2 = (z^2_1, z^2_2) = (\v{w_1} \cdot \v{x_i}, \v{w_2} \cdot \v{x_i}) = (\v{w_1} \cdot \v{z^1_i}, \v{w_2} \cdot \v{z^1_i})$$
$$\renewcommand{\v}[1]{\mathbf #1}
\v{z}^3 = (z^3_1, z^3_2) = \left( LSM_1(z^2_1, z^2_2), LSM_2(z^2_1, z^2_2) \right)$$
$$\renewcommand{\v}[1]{\mathbf #1}
z^4 = NLL(\v{z}^3) = NLL(z^3_1, z^3_2) = - \sum_i \left( \mathbb{1}_1(y_i)z^3_1(i) + \mathbb{1}_2(y_i)z^3_2(i) \right)$$
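A numpy sketch of this forward pass for the linear (softmax-regression) case just defined, taking LSM to be the log-softmax; the input, weights, and label are arbitrary illustrative values:

```python
import numpy as np

def log_softmax(z):
    z = z - np.max(z)                 # shift invariance, used for numerical stability
    return z - np.log(np.sum(np.exp(z)))

x_i = np.array([1.0, 0.5])            # z^1: the input features (arbitrary)
w1 = np.array([0.3, -1.2])            # per-class weight vectors (arbitrary)
w2 = np.array([-0.7, 0.8])
y_i = 0                               # index of the true class for this sample

z2 = np.array([w1 @ x_i, w2 @ x_i])   # z^2: one linear score per class
z3 = log_softmax(z2)                  # z^3: log-probabilities (LSM_1, LSM_2)
z4 = -z3[y_i]                         # z^4: the NLL picks out the true class via the indicator
print(z2, z3, z4)
```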
$$\renewcommand{\v}[1]{\mathbf #1}
\v{z}^2 = (z^2_1, z^2_2) = (NN(\v{w_1} \cdot \v{x_i}), NN(\v{w_2} \cdot \v{x_i})) = (NN(\v{w_1} \cdot \v{z^1_i}), NN(\v{w_2} \cdot \v{z^1_i}))$$
Simply replace the linear regression here by a neural network.
Now find gradients for SGD.
#[fit]CLASSIFICATION
#[fit]BY LINEAR SEPARATION
#Which line?

Different Algorithms, different lines.

SVM uses max-margin[^1]
#DISCRIMINATIVE CLASSIFIER
- are these classifiers any good?
- they are discriminative and draw boundaries, but that's it
- could it be better to have a classifier that captured the generative process of the data?
- that's Ai 2
We want to obtain gradients of the loss with respect to each parameter. For example:

First we do the Forward Pass. Say we have 1 sample: (x=0.1, y=5). Initialize the parameters.

Now it is time to find the gradients, for example, of the loss with respect to each weight.

The basic idea is to gather, via the chain rule, all the parts of the computation that flow into the parameter we are differentiating with respect to.

The parameters get updated. Now we repeat the forward pass.

That's it! Wait for convergence.
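A minimal sketch of one such forward/backward/update cycle on the single sample (x=0.1, y=5), assuming purely for illustration a one-feature linear model `y_hat = w*x + b` with squared loss and an arbitrary initialization:

```python
x, y = 0.1, 5.0
w, b = 1.0, 0.0                    # arbitrary initialization
eta = 0.1

# forward pass
y_hat = w * x + b
loss = (y - y_hat) ** 2

# backward pass: gather the chain-rule pieces flowing into each parameter
dloss_dyhat = -2 * (y - y_hat)
dyhat_dw, dyhat_db = x, 1.0
grad_w = dloss_dyhat * dyhat_dw
grad_b = dloss_dyhat * dyhat_db

# update, then repeat the forward pass
w -= eta * grad_w
b -= eta * grad_b
print(w, b, loss)
```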
Our recipe has been:
- Train a neural network until it overfits.
- Then add "regularization"
But what is regularization?

It's a way to make parameters behave, to constrain them. Some options (a code sketch of a few of these follows the list):
- Dropout
- L2/L1 Regularization: kernel regularization in Keras (also called weight decay)
- Early Stopping
- Simply add more data
- Data Augmentation
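A hedged Keras sketch combining the first three of these (the layer sizes, rates, and 20-feature input are illustrative assumptions, not values from these slides):

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    keras.Input(shape=(20,)),                               # 20 input features: assumed
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)), # L2 / weight decay
    layers.Dropout(0.5),                                    # dropout
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                           restore_best_weights=True)
# model.fit(X_train, y_train, validation_split=0.2, epochs=200, callbacks=[early_stop])
```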
Dropout is a kind of pseudo layer.
- choose a probability $p$ of keeping a connection
- randomly sever the remaining fraction of the connections from the previous layer in the training phase; this forces other connections to take up the slack and prevents them from over-specializing
- in the testing phase, multiply the learned weights by this probability but don't sever connections (a toy sketch follows below)
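A toy numpy sketch of this train/test behavior (here `p_keep` is assumed to be the probability of keeping a connection, matching the classic non-inverted formulation):

```python
import numpy as np

rng = np.random.default_rng(0)
p_keep = 0.8                                  # probability of keeping an activation

def dropout_train(h):
    mask = rng.random(h.shape) < p_keep       # randomly sever a (1 - p_keep) fraction
    return h * mask

def dropout_test(h):
    return h * p_keep                         # keep everything, scale by the probability

h = rng.standard_normal(5)                    # activations from the previous layer
print(dropout_train(h))
print(dropout_test(h))
```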
##[fit] Optimization
- has one parameter, the learning rate
- the learning rate is how much of the gradient to use in your parameter updates
- the more you use, the faster you can train, and thus you don't get stuck in local minima
- but if you keep it large, you bounce around stochastically
- it would be better to have a learning rate schedule, or adaptive learning rates, perhaps different for each parameter
- SGD oscillates in high-curvature areas
- averaging the gradient over a few steps helps, as it eliminates oscillations
- momentum helps as well to keep you going straight down:
$$v_{t+1} = \alpha v_t - \eta g_t, \quad w_{t+1} = w_t + v_{t+1}$$
- Adagrad: $$r_{t+1, j} = r_{t, j} + g_{tj}^2, \quad w_{t+1, j} = w_{t, j} - \eta \frac{g_{tj}}{(\delta + \sqrt{r_{tj}})}$$ has greater progress along gently sloped directions
- RMSprop uses an exponentially weighted average for the gradient accumulation. The change to Adagrad: $$r_{t+1, j} = \rho r_{t, j} + (1 - \rho) g_{tj}^2$$
- Adam adds momentum to RMSprop and works well in practice (a toy numpy sketch of these update rules follows the list below):
$$v_{t+1, j} = \rho_v v_{t, j} + (1 - \rho_v)g_{tj}$$, $$r_{t+1, j} = \rho_r r_{t, j} + (1 - \rho_r)g_{tj}^2$$, $$ w_{t+1, j} = w_{t, j} - \eta \frac{v_{tj}}{(\delta + \sqrt{r_{tj}})}$$
- standardize your inputs (features) so that no one input dominates
- we'll come back to Adam (later)
- parameter initialization (Xavier vs He) (later)
- Batch normalization (later)
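A toy numpy sketch of these per-parameter update rules (the hyperparameter values are common defaults assumed for illustration; Adam's bias correction is omitted to stay close to the formulas above):

```python
import numpy as np

eta, alpha = 0.01, 0.9              # learning rate, momentum
rho_r, rho_v, delta = 0.999, 0.9, 1e-8

def momentum_step(w, v, g):
    v = alpha * v - eta * g
    return w + v, v

def adagrad_step(w, r, g):
    r = r + g**2                    # accumulate squared gradients
    return w - eta * g / (delta + np.sqrt(r)), r

def rmsprop_step(w, r, g):
    r = rho_r * r + (1 - rho_r) * g**2      # exponentially weighted accumulation
    return w - eta * g / (delta + np.sqrt(r)), r

def adam_step(w, v, r, g):
    v = rho_v * v + (1 - rho_v) * g         # momentum-like first moment
    r = rho_r * r + (1 - rho_r) * g**2      # RMSprop-like second moment
    return w - eta * v / (delta + np.sqrt(r)), v, r

# one step on the gradient of J(w) = w^2 (so g = 2w)
w = np.array([5.0])
v = np.zeros(1)
r = np.zeros(1)
w, v, r = adam_step(w, v, r, 2 * w)
print(w)
```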
[^1]: image from code in http://bit.ly/1Azg29G