Skip to content

Files

Latest commit

f520d5e · Sep 29, 2019

History

History
593 lines (340 loc) · 13.3 KB

lecture.md

File metadata and controls

593 lines (340 loc) · 13.3 KB

footer:30%, filtered autoscale: true

inline


#[fit] Ai 1


#[fit] Losses,

and

#[fit] Learning


##[fit] Minimize ##[fit] Cost



Regression: continuous prediction

  • how many dollars will you spend?
  • what is your creditworthiness
  • how many people will vote for Bernie t days before election
  • use to predict probabilities for classification
  • causal modeling in econometrics

fit, right


Linear Regression

y ^ = f θ ( x ) = θ T x

Cost Function:

R ( θ ) = 1 2 i = 1 m ( f θ ( x ( i ) y ( i ) ) 2

###[fit] MINIMIZE SQUARED ERROR. Its Convex!



Gradient ascent (descent)

basically go opposite the direction of the derivative.

Consider the objective function: $$ J(x) = x^2-6x+5 $$

gradient = fprime(old_x)
move = gradient * step
current_x = old_x - move

right, fit


good step size

inline


too big step size

inline


too small step size

inline


Gradient Descent

θ := θ η θ R ( θ ) = θ η i = 1 m R i ( θ )

where η is the learning rate.

ENTIRE DATASET NEEDED

for i in range(n_epochs):
  params_grad = evaluate_gradient(loss_function, data, params)
  params = params - learning_rate * params_grad`

Linear Regression: Gradient Descent

θ j := θ j + α i = 1 m ( y ( i ) f θ ( x ( i ) ) ) x j ( i )

right


Stochastic Gradient Descent

θ := θ α θ R i ( θ )

ONE POINT AT A TIME

For Linear Regression:

θ j := θ j + α ( y ( i ) f θ ( x ( i ) ) ) x j ( i )

for i in range(nb_epochs):
  np.random.shuffle(data)
  for example in data:
    params_grad = evaluate_gradient(loss_function, example, params)
    params = params - learning_rate * params_grad

Mini-Batch SGD (the most used)

θ := θ η θ J ( θ ; x ( i : i + n ) ; y ( i : i + n ) )

for i in range(mb_epochs):
  np.random.shuffle(data)
  for batch in get_batches(data, batch_size=50):
    params_grad = evaluate_gradient(loss_function, batch, params)
    params = params - learning_rate * params_grad

[.autoscale: true]

Mini-Batch: do some at a time

  • the risk surface changes at each gradient calculation
  • thus things are noisy
  • cumulated risk is smoother, can be used to compare to SGD
  • epochs are now the number of times you revisit the full dataset
  • shuffle in-between to provide even more stochasticity

inline


fit, inline


#[fit] Ok, so where did this #[fit] Mean Squared Loss #[fit] come from?


Gaussian Distribution assumption !

inline


Gaussian Distribution assumption

You can't use 'macro parameter character #' in math mode

$$\renewcommand{\v}[1]{\mathbf #1}$$
Each y i is gaussian distributed with mean w x i (the y predicted by the regression line) and variance σ 2 :

You can't use 'macro parameter character #' in math mode

$$
\renewcommand{\v}[1]{\mathbf #1}
y_i \sim N(\v{w}\cdot\v{x_i}, \sigma^2) .$$

N ( μ , σ 2 ) = 1 σ 2 π e ( y μ ) 2 / 2 σ 2 ,


We can then write the likelihood:

You can't use 'macro parameter character #' in math mode

$$
\renewcommand{\v}[1]{\mathbf #1}
\cal{L} = p(\v{y} | \v{x}, \v{w}, \sigma) = \prod_i p(\v{y}_i | \v{x}_i, \v{w}, \sigma)$$

You can't use 'macro parameter character #' in math mode

$$\renewcommand{\v}[1]{\mathbf #1}
\cal{L} =  (2\pi\sigma^2)^{(-n/2)} e^{\frac{-1}{2\sigma^2} \sum_i (y_i -  \v{w}\cdot\v{x}_i)^2} .$$

The log likelihood then is given by:

You can't use 'macro parameter character #' in math mode

$$\renewcommand{\v}[1]{\mathbf #1}
\ell = \frac{-n}{2} log(2\pi\sigma^2) - \frac{1}{2\sigma^2}  \sum_i (y_i -  \v{w}\cdot\v{x}_i)^2 .$$


Maximize ?

No. Minimize using Gradient Descent!

If all you care for are the parameters w , then:

##Minimize Loss or Cost:

L o s s = 1 N i ( y i \v w \v x i ) 2 .


Where's the neural network?

Replace \v w \v x i ) 2 with N N ( \v w \v x i ) 2 ) .

So:

L o s s = 1 N i ( y i N N , ( \v w \v x i ) 2 ) .


The Loss is now NOT CONVEX!

inline


#[fit]CLASSIFICATION

  • will a customer churn?
  • is this a check? For how much?
  • a man or a woman?
  • will this customer buy?
  • do you have cancer?
  • is this spam?
  • whose picture is this?
  • what is this text about?1

fit, left


1-D classification problem

inline


1-D Using Linear regression

inlineinline


##[fit] MLE for Logistic Regression

  • example of a Generalized Linear Model (GLM)
  • "Squeeze" linear regression through a Sigmoid function
  • this bounds the output to be a probability

Sigmoid function

This function is plotted below:

h = lambda z: 1./(1+np.exp(-z))
zs=np.arange(-5,5,0.1)
plt.plot(zs, h(zs), alpha=0.5);

right, fit

Identify:

You can't use 'macro parameter character #' in math mode

$$\renewcommand{\v}[1]{\mathbf #1} z = \v{w}\cdot\v{x}$$
and $$ \renewcommand{\v}[1]{\mathbf #1} h(\v{w}\cdot\v{x})$$ with the probability that the sample is a '1' ( y = 1 ).


Then, the conditional probabilities of y = 1 or y = 0 given a particular sample's features

You can't use 'macro parameter character #' in math mode

$$\renewcommand{\v}[1]{\mathbf #1} \v{x}$$
are:

You can't use 'macro parameter character #' in math mode

$$\begin{eqnarray}
\renewcommand{\v}[1]{\mathbf #1}
P(y=1 | \v{x}) &=& h(\v{w}\cdot\v{x}) \\
P(y=0 | \v{x}) &=& 1 - h(\v{w}\cdot\v{x}).
\end{eqnarray}$$

These two can be written together as

You can't use 'macro parameter character #' in math mode

$$\renewcommand{\v}[1]{\mathbf #1} P(y|\v{x}, \v{w}) = h(\v{w}\cdot\v{x})^y \left(1 - h(\v{w}\cdot\v{x}) \right)^{(1-y)} $$

BERNOULLI!!


Multiplying over the samples we get:

$$\renewcommand{\v}[1]{\mathbf #1} P(y|\v{x},\v{w}) = P({y_i} | {\v{x}i}, \v{w}) = \prod{y_i \in \cal{D}} P(y_i|\v{x_i}, \v{w}) = \prod_{y_i \in \cal{D}} h(\v{w}\cdot\v{x_i})^{y_i} \left(1 - h(\v{w}\cdot\v{x_i}) \right)^{(1-y_i)}$$

Indeed its important to realize that a particular sample can be thought of as a draw from some "true" probability distribution.

maximum likelihood estimation maximises the likelihood of the sample y, or alternately the log-likelihood,

You can't use 'macro parameter character #' in math mode

$$\renewcommand{\v}[1]{\mathbf #1} {\cal L} = P(y \mid \v{x},\v{w}).$$
OR
You can't use 'macro parameter character #' in math mode

$$\renewcommand{\v}[1]{\mathbf #1} \ell = log(P(y \mid \v{x},\v{w}))$$


Thus

You can't use 'macro parameter character #' in math mode

$$\renewcommand{\v}[1]{\mathbf #1} \begin{eqnarray}
\ell &=& log\left(\prod_{y_i \in \cal{D}} h(\v{w}\cdot\v{x_i})^{y_i} \left(1 - h(\v{w}\cdot\v{x_i}) \right)^{(1-y_i)}\right)\\
&=& \sum_{y_i \in \cal{D}} log\left(h(\v{w}\cdot\v{x_i})^{y_i} \left(1 - h(\v{w}\cdot\v{x_i}) \right)^{(1-y_i)}\right)\\
&=& \sum_{y_i \in \cal{D}} log,h(\v{w}\cdot\v{x_i})^{y_i} + log,\left(1 - h(\v{w}\cdot\v{x_i}) \right)^{(1-y_i)}\\
&=& \sum_{y_i \in \cal{D}} \left ( y_i log(h(\v{w}\cdot\v{x})) + ( 1 - y_i) log(1 - h(\v{w}\cdot\v{x})) \right )
\end{eqnarray}$$


[.autoscale: true]

Logistic Regression: NLL

The negative of this log likelihood (NLL), also called cross-entropy.

You can't use 'macro parameter character #' in math mode

$$\renewcommand{\v}[1]{\mathbf #1} NLL = - \sum_{y_i \in \cal{D}} \left ( y_i log(h(\v{w}\cdot\v{x})) + ( 1 - y_i) log(1 - h(\v{w}\cdot\v{x})) \right )$$

This loss is convex. Many ways can be chosen to minimize it, but we can just use gradient descent :-)


1-D Using Logistic regression

inlineinline


In diagram:

inline


Softmax formulation

  • Identify p i and 1 p i as two separate probabilities constrained to add to 1. That is p 1 i = p i ; p 2 i = 1 p i .

  • You can't use 'macro parameter character #' in math mode

    $$\renewcommand{\v}[1]{\mathbf #1}
    p_{1i} = \frac{e^{\v{w_1} \cdot \v{x}}}{e^{\v{w_1} \cdot \v{x}} + e^{\v{w_2} \cdot \v{x}}}$$

  • You can't use 'macro parameter character #' in math mode

    $$\renewcommand{\v}[1]{\mathbf #1}
    p_{2i} = \frac{e^{\v{w_2} \cdot \v{x}}}{e^{\v{w_1} \cdot \v{x}} + e^{\v{w_2} \cdot \v{x}}}$$

  • Can translate coefficients by fixed amount ψ without any change


NLL for Softmax

L = i p 1 i 1 1 ( y i ) p 2 i 1 2 ( y i )

N L L = i ( 1 1 ( y i ) l o g ( p 1 i ) + 1 2 ( y i ) l o g ( p 2 i ) )


Units diagram for Softmax

inline


Write it in Layers

inline


Equations, layer by layer

You can't use 'macro parameter character #' in math mode

$$\renewcommand{\v}[1]{\mathbf #1}
\v{z}^1 = \v{x_i}$$

You can't use 'macro parameter character #' in math mode

$$\renewcommand{\v}[1]{\mathbf #1}
\v{z}^2 = (z^2_1, z^2_2) = (\v{w_1} \cdot \v{x_i}, \v{w_2} \cdot \v{x_i}) = (\v{w_1} \cdot \v{z^1_i}, \v{w_2} \cdot \v{z^1_i})$$

You can't use 'macro parameter character #' in math mode

$$\renewcommand{\v}[1]{\mathbf #1}
\v{z}^3 = (z^3_1, z^3_2) = \left( LSM_1(z^2_1, z^2_2), LSM_2(z^2_1, z^2_2) \right)$$

You can't use 'macro parameter character #' in math mode

$$\renewcommand{\v}[1]{\mathbf #1}
z^4 = NLL(\v{z}^3) = NLL(z^3_1, z^3_2) = - \sum_i \left( \mathbb1_1(y_i)z^3_1(i) + \mathbb1_2(y_i)z^3_1(i) \right)$$


Dude, where is my network?

You can't use 'macro parameter character #' in math mode

$$\renewcommand{\v}[1]{\mathbf #1}
\v{z}^2 = (z^2_1, z^2_2) = (NN(\v{w_1} \cdot \v{x_i}), NN(\v{w_2} \cdot \v{x_i})) = (NN(\v{w_1} \cdot \v{z^1_i}), NN(\v{w_2} \cdot \v{z^1_i}))$$

Simply replace the linear regression here by a neural network.

Now find gradients for SGD.


#[fit]CLASSIFICATION

#[fit]BY LINEAR SEPARATION

#Which line?

  • Different Algorithms, different lines.

  • SVM uses max-margin1

fit, right


#DISCRIMINATIVE CLASSIFIER P ( y | x ) : P ( m a l e | h e i g h t , w e i g h t )

inline, fitinline, fit


Discriminative Learning

  • are these classifiers any good?
  • they are discriminative and draw boundaries, but thats it
  • could it be better to have a classifier that captured the generative process of the data?
  • Thats Ai 2

Calculating Gradients for learning: Backprop (Intuition)


inline


Simple MLP for regression

inline


fit


Forward Pass

We want to obtain gradients. For example: C o s t p a r a m = z 6 w 1

First we do the Forward Pass. Say we have 1 sample: (x=0.1, y=5). Initialize b 1 , w 1 , b 2 , w 2 , w o 1 , w o 2 , b o . Then, plugging in the numbers wi ll give us some Cost ( z 5 , z 6 ).

inline


Backward Pass

Now it is time to find the gradients, for eg, z 6 w 1

The basic idea is to gather all parts that go to w 1 , and so on and so forth. Now we perform GD (SGD) with some learning rate.

The parameters get updated. Now we repeat the forward pass.

Thats it! Wait for convergence.

right, fit


Losses in Neural Nets

inline


How to fit?

Our recipe has been:

  • Train a neural network until it overfits.
  • Then add "regularization"

But what is Regularization.

Its a way to make parameters behave, to constrain them.


Many styles of regularization

  • Dropout
  • L2/L1 Regularization: kernel regularization in Keras (also called weight decay)
  • Early Stopping
  • Simply add more data
  • Data Augmentation

Dropout

left, fit

Dropout is a kind of pseudo layer.

  • choose a probability p
  • randomly sever p of the connections from the previous layer in the training phase
  • this forces other connections to take up the slack and prevents them from over-specializing
  • in the testing phase multiply the learned weights with this probability but dont sever connections

L2 Norm

inlineinline


Early Stopping

inline


##[fit] Optimization


The workhorse: SGD

  • has one parameter, the learning rate
  • the learning rate is how much of the gradient to use in your parameter updates
  • the more you use, the faster you can train, and thus you dont get stuck in local minima
  • but if you keep it large you bounce around stochastically

Messing with SGD

  • it would be better to have a learning rate schedule, or adaptive learning rates, perhaps different for each parameter

  • SGD oscillates in high curvature areas

  • averaging the gradient: $$$$ over a few steps helps as it eliminates oscillations

  • momentum helps as well to keep you going straight down:

    $$v_{t+1} = \alpha v_t - \eta t, w{t+1} = w_t + v$$

right, fit


Other Variants

  • Adagrad: $$r_{t+1, j} = r_{t, j} + {tj}^2, w{t+1} = w_t - \eta \frac{{tj}}{(\delta + \sqrt{r{tj}})}$$ has greater progress along gently sloped directions
  • RMSprop uses exponentially weighted average for gradient accumulation. Change to Adagrad: r t + 1 , j = ρ r t , j + ( 1 ρ ) t j 2
  • Adam adds momentum to RMSprop works well in practice:

$$r_{t+1, j} = \rho_r r_{t, j} + (1 - \rho_r){tj}^2$$, $$v{t+1, j} = \rho_v v_{t, j} + (1 - \rho_v){tj}^2$$, $$ w{t+1} = w_t - \eta \frac{v_{tj}}{(\delta + \sqrt{r_{tj}})}$$


Other considerations

  • standardize your inputs(features) so no one input dominates
  • we'll come back to Adam (later)
  • parameter inittialization (Xavier vs He) (later)
  • Batch normalization (later)

Footnotes

  1. image from code in http://bit.ly/1Azg29G 2