Presented by Dr. Tommi Jaakkola, MIT, Nov 4 2021
Notes by Emily Liu
Generative models are useful because they allow us to generate diverse, high-quality objects (images, time series, sentences, molecules, etc.).
To do this, we estimate a latent variable model
$$ P(x; \theta) = \int P(x | z, \theta)\, P(z)\, dz $$
Consider a single value of $x$. For any distribution $Q(z)$ over the latent variable, the log-likelihood is lower-bounded by the ELBO:
$$ \log P(x; \theta) \geq E_{Q(z)}[\log P(x | z, \theta)] - KL(Q(z) \| P(z)) $$
If we run the EM algorithm with no constraints on $Q$, maximizing the ELBO results in $Q(z) = P(z | x, \theta)$, the exact posterior.
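To see why the unconstrained optimum is the exact posterior, note the standard decomposition of the log-likelihood (same notation as above):
$$ \log P(x; \theta) = \underbrace{E_{Q(z)}[\log P(x | z, \theta)] - KL(Q(z) \| P(z))}_{\text{ELBO}} + KL(Q(z) \| P(z | x, \theta)) $$
The last term is nonnegative and vanishes exactly when $Q(z) = P(z | x, \theta)$, so the unconstrained maximizer of the ELBO is the posterior, and at that point the bound is tight.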
However, this posterior is intractable to compute for complex models, so we restrict $Q$ to a simpler family and optimize within it.
Let us consider a simple case with two latent variables, $z_1$ and $z_2$, restricting $Q$ to the factored (mean-field) family $Q(z_1, z_2) = Q_1(z_1) Q_2(z_2)$.
We optimize iteratively (first fix $Q_2$ and maximize the ELBO over $Q_1$, then fix $Q_1$ and maximize over $Q_2$, and repeat).
The $Q_1$-update can be carried out in closed form while $Q_2$ is held fixed.
Solving, we get that
$$ Q_1(z_1) \propto \exp\left( E_{Q_2}[\log P(x, z_1, z_2)] \right) $$
and symmetrically for the $Q_2$-update.
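To see why this is the maximizer, note that with $Q_2$ held fixed, the $Q_1$-dependent part of the ELBO is a negative KL divergence up to a constant:
$$ E_{Q_1 Q_2}[\log P(x, z_1, z_2)] - E_{Q_1}[\log Q_1(z_1)] = -KL\left( Q_1(z_1) \,\Big\|\, \tfrac{1}{Z} \exp\left( E_{Q_2}[\log P(x, z_1, z_2)] \right) \right) + \log Z $$
where $Z$ normalizes the distribution on the right; the expression is maximized by setting $Q_1$ equal to that normalized exponential.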
Variational autoencoders utilize a parametric model $Q(z | x, \phi)$, the encoder, as the approximate posterior.
Generative models generate new objects as follows:
- First, sample $z \sim P(z)$ from a simple fixed distribution.
- Then, map the resulting $z$ value through a deep model $x = g(z; \theta)$.
- The generated objects tend to be noisy; in other words, $P(x | z, \theta) = N(x; g(z; \theta), \sigma^2 I)$.
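A minimal sketch of this sampling procedure, assuming a small PyTorch decoder (the architecture, dimensions, and noise level are illustrative, not from the lecture):

```python
import torch
import torch.nn as nn

latent_dim, data_dim, sigma = 16, 784, 0.1  # illustrative sizes and noise level

# A small decoder g(z; theta): maps a latent vector to the mean of P(x | z, theta).
decoder = nn.Sequential(
    nn.Linear(latent_dim, 256),
    nn.ReLU(),
    nn.Linear(256, data_dim),
)

# Sample z ~ P(z) from the simple fixed prior N(0, I).
z = torch.randn(1, latent_dim)

# Map z through the deep model to get the mean g(z; theta).
mean = decoder(z)

# The generated object is noisy: x ~ N(g(z; theta), sigma^2 I).
x = mean + sigma * torch.randn_like(mean)
```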
This model is difficult to train since we don't know which $z$ value produced each observed $x$.
The ELBO objective is optimized as follows:
- Given $x$, pass $x$ through the encoder to get $\mu(x; \phi)$ and $\sigma(x; \phi)$ that specify $Q(z|x, \phi)$.
- Sample $\epsilon \sim N(0, I)$ so that $z_{\phi} = \mu(x; \phi) + \sigma(x; \phi) \odot \epsilon$ corresponds to a sample from the encoder distribution $Q(z|x, \phi)$ (the reparameterization trick).
- Update the decoder using this $z_{\phi}$:
$$ \theta = \theta + \eta \nabla_\theta \log P(x | z_\phi, \theta) $$
- Update the encoder to match the stochastic ELBO criterion:
$$ \phi = \phi + \eta \nabla_\phi \left[ \log P(x | z_\phi, \theta) - KL(Q_{z|x, \phi} \| P_z) \right] $$
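A minimal sketch of this update, assuming a diagonal-Gaussian encoder and a Gaussian decoder with fixed $\sigma$ (the PyTorch modules, shapes, and hyperparameters are illustrative, not from the lecture). Here both networks are updated together on a single joint negative-ELBO loss, which is equivalent to the two gradient steps above since the KL term does not depend on $\theta$:

```python
import torch
import torch.nn as nn

latent_dim, data_dim, sigma = 16, 784, 0.1  # illustrative sizes and decoder noise

encoder = nn.Linear(data_dim, 2 * latent_dim)  # outputs [mu(x; phi), log sigma(x; phi)]
decoder = nn.Linear(latent_dim, data_dim)      # g(z; theta)
optimizer = torch.optim.SGD(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.randn(32, data_dim)  # stand-in for a minibatch of data

# Encoder: mu(x; phi) and sigma(x; phi) specify Q(z | x, phi).
mu, log_std = encoder(x).chunk(2, dim=-1)
std = log_std.exp()

# Reparameterization: z_phi = mu + sigma * eps with eps ~ N(0, I).
eps = torch.randn_like(std)
z = mu + std * eps

# log P(x | z, theta) for the Gaussian decoder N(x; g(z; theta), sigma^2 I), up to a constant.
recon = decoder(z)
log_px_z = -((x - recon) ** 2).sum(dim=-1) / (2 * sigma ** 2)

# KL(Q(z | x, phi) || P(z)) in closed form for a diagonal Gaussian against N(0, I).
kl = 0.5 * (mu ** 2 + std ** 2 - 2 * log_std - 1).sum(dim=-1)

# Maximizing the stochastic ELBO = minimizing its negation; gradients flow to both theta and phi.
loss = -(log_px_z - kl).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```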
- Deep generative models realize objects from latent randomization by approximating the probability space $P(x | z) P(z)$ and drawing samples $z$ and $x$ from it.
- If we had the latent random vector $z$ together with the image as pairs, optimizing the architecture would be easy, as it would become a supervised learning task. Since we don't have the pairs, we resort to EM.
- Exact EM is not tractable for these complex distributions, so we need an encoder network to infer the latent variable $z$ (the output of the E step).
- VAEs use an encoder network to infer $z$ from $x$ and a decoder network to map $z$ back to $x$. The decoder is also the generator network.
- The VAE architecture is learned by maximizing a lower bound on the log likelihood of the data via the ELBO criterion.