Presented by Dr. Pulkit Agarwal, MIT, Nov 16 2021
Notes by Emily Liu
Much of current deep learning is performed according to an objective function designed by a human (MSE, cross entropy, VLB, etc.). However, these human-designed loss functions may not be ideal for the task. For example, image encoding/decoding tasks can achieve low loss but still produce images that lack detail when sampled from the latent space. The problem is not that the network is performing poorly; it is that the loss function fails to capture the level of detail necessary. This motivates the need to design (or learn) a better objective function.
Recall that a generative adversarial network consists of a generator $G$, which maps a noise vector $z$ to an output image, and a discriminator $D$, which tries to distinguish generated images from real ones.
To emphasize that we compare the generator output against the true target for a given input, we also condition on the input image $x$: the generator produces $G(x, z)$ and the discriminator is shown $(x, \cdot)$ pairs, giving the conditional GAN loss $$ L_{cGAN}(G, D) = \mathbb{E}_{x, y}[\log D(x, y)] + \mathbb{E}_{x, z}[\log(1 - D(x, G(x, z)))] $$
The optimal generator $G^*$ for the conditional GAN is given by $$ G^* = \arg\min_G \max_D L_{cGAN}(G, D) + \lambda L_{L1}(G) $$
where $L_{L1}(G) = \mathbb{E}_{x, y, z}[\lVert y - G(x, z) \rVert_1]$ is an L1 reconstruction term and $\lambda$ is a hyperparameter weighting it against the adversarial loss.
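As a rough sketch of how this objective translates into training code (a minimal sketch, assuming hypothetical generator and discriminator networks `G` and `D`; the noise $z$ is omitted for brevity):

```python
import torch
import torch.nn.functional as F

def cgan_losses(G, D, x, y, lam=100.0):
    """One step of the conditional GAN + L1 objective.

    x: input image batch, y: target image batch.
    G(x) -> generated image; D(x, img) -> real/fake logits.
    """
    fake = G(x)

    # Discriminator: real (x, y) pairs -> 1, fake (x, G(x)) pairs -> 0.
    d_real = D(x, y)
    d_fake = D(x, fake.detach())
    loss_D = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))

    # Generator: fool the discriminator + stay close to the target in L1.
    loss_G = (F.binary_cross_entropy_with_logits(D(x, fake), torch.ones_like(d_fake))
              + lam * F.l1_loss(fake, y))
    return loss_G, loss_D
```

The discriminator here plays the role of the learned, structured loss: it is trained to separate real $(x, y)$ pairs from generated $(x, G(x))$ pairs, while the generator is trained both to fool it and to match the target under L1.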
Conditional GANs (and other structured, learned objectives) have been shown to outperform deep learning with unstructured objectives (such as least-squares regression) on image generation tasks.
Flow models are another type of generative model in which we attempt to learn both the mapping from data $x$ to a latent representation $z$ and its inverse mapping from $z$ back to $x$ at the same time, using an invertible network.
For example, let $z$ be drawn from a simple base distribution such as a standard Gaussian, and let $x = f(z)$ for an invertible mapping $f$; sampling is then easy, but we also want to evaluate the density $p(x)$ that this induces.
There is an easier way to compute the relationship between $p(x)$ and $p(z)$: the change-of-variables formula, $$ p(x) = p(z) \left| \det \frac{\partial z}{\partial x} \right| $$ so the log-likelihood of the data is the base log-density of $z$ plus the log-determinant of the Jacobian of the mapping.
We want to maximize this log-likelihood, which includes the log-determinant of the Jacobian, while still maintaining both computational efficiency (a general determinant takes cubic time in the dimension) and expressivity (making the Jacobian too simple limits the complexity of the mapping we would like to learn).
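To make the efficiency point concrete, here is a small check (illustrative, not from the talk): for a triangular Jacobian the log-determinant is just the sum of the logs of the diagonal entries, an $O(D)$ operation, whereas a general determinant costs $O(D^3)$.

```python
import numpy as np

D = 512
rng = np.random.default_rng(0)

# A lower-triangular Jacobian, like the one a coupling layer produces.
J = np.tril(rng.normal(size=(D, D)))
np.fill_diagonal(J, np.abs(np.diag(J)) + 1.0)  # keep the diagonal positive

# General determinant: O(D^3) via LU decomposition.
sign, logdet_general = np.linalg.slogdet(J)

# Triangular shortcut: O(D), just the diagonal entries.
logdet_triangular = np.sum(np.log(np.diag(J)))

assert np.allclose(logdet_general, logdet_triangular)
```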
To achieve this, we use coupling layers. We split the input into two halves: the first half is passed through unchanged, while the second half is transformed by scale and shift functions that depend only on the first half. The resulting Jacobian is triangular, so its determinant is just the product of its diagonal entries.
The issue with this setup is that half the image is modeled well and the other half is modeled poorly. Therefore, we will need to perform this operation twice, switching which half is passed through unchanged.
If we apply multiple of these mappings in sequence, the composition remains invertible and its log-determinant is simply the sum of the per-layer log-determinants, so we can build an expressive model while keeping the likelihood exact and cheap to compute.
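A minimal sketch of one such coupling layer (an affine coupling in the style of RealNVP; the small MLPs for the scale and shift are illustrative choices, not details from the talk):

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """y1 = x1;  y2 = x2 * exp(s(x1)) + t(x1).  The Jacobian is triangular."""

    def __init__(self, dim, hidden=256):
        super().__init__()
        half = dim // 2
        self.s = nn.Sequential(nn.Linear(half, hidden), nn.ReLU(), nn.Linear(hidden, half))
        self.t = nn.Sequential(nn.Linear(half, hidden), nn.ReLU(), nn.Linear(hidden, half))

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        s, t = self.s(x1), self.t(x1)
        y2 = x2 * torch.exp(s) + t
        # log|det J| = sum of the log-scales (the diagonal of the triangular Jacobian).
        log_det = s.sum(dim=-1)
        return torch.cat([x1, y2], dim=-1), log_det

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=-1)
        s, t = self.s(y1), self.t(y1)
        x2 = (y2 - t) * torch.exp(-s)
        return torch.cat([y1, x2], dim=-1)
```

Stacking several of these layers, switching which half passes through unchanged between them, yields an invertible network whose exact log-likelihood is the base log-density of $z$ plus the sum of the per-layer `log_det` terms.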
Although samples from flow models are not as crisp as those from GANs, they are useful in that they provide exact likelihoods, whereas GANs provide no likelihood at all and variational objectives (like the VLB mentioned above) provide only lower bounds.
The logic behind the diffusion model is that if we gradually add small bits of noise to an input image, we will eventually get an isotropic Gaussian. If we are able to reverse these steps (using a neural network), then we can theoretically go from an isotropic Gaussian back to a real image.
To formalize, the image at step $t$ is given by $x_t$, where $x_0$ is the original image and each forward step adds a small amount of Gaussian noise: $$ q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I) $$ for a variance schedule $\beta_1, \dots, \beta_T$.
We can express all $x_t$ directly in terms of $x_0$ by composing these Gaussian steps: with $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, $$ q(x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t) I) $$
Then, we reparametrize using $\epsilon \sim \mathcal{N}(0, I)$: $$ x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon $$ so a noisy image at any timestep can be sampled in a single step.
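A minimal sketch of this forward (noising) process, assuming a linear $\beta$ schedule (the schedule values here are illustrative, not from the talk):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # illustrative linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)    # \bar{\alpha}_t

def q_sample(x0, t, noise=None):
    """Sample x_t ~ q(x_t | x_0) in one shot via the reparametrization."""
    if noise is None:
        noise = torch.randn_like(x0)
    ab = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over image dims
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise
```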
To reverse this, we want to calculate $p_\theta(x_{t-1} \mid x_t)$, the distribution of the slightly less noisy image given the noisier one; for small $\beta_t$ each reverse step is approximately Gaussian, so we parametrize its mean with a neural network.
Note: because we know the noise $\epsilon$ that was added in the forward process, the network can simply be trained to predict that noise from $x_t$ and $t$, which reduces training to a mean-squared-error regression problem.
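A sketch of the resulting training step (assuming a hypothetical noise-prediction network `eps_model`, e.g. a U-Net, and reusing the `q_sample` helper and schedule `T` from the sketch above):

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(eps_model, x0):
    """One DDPM-style step: noise the image at a random timestep, predict the noise back."""
    t = torch.randint(0, T, (x0.shape[0],))      # random timestep per example
    noise = torch.randn_like(x0)
    x_t = q_sample(x0, t, noise)                 # forward process (see above)
    pred = eps_model(x_t, t)                     # network predicts the added noise
    return F.mse_loss(pred, noise)               # simple regression objective
```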
- Conditional GANs treat the discriminator as a structured learnable loss function that better fits the image generation task.
- Flow models learn forward encodings and their inverse decodings at the same time; coupling layers are used to ensure the Jacobian is triangular and therefore its determinant is efficient to compute.
- Diffusion models gradually add noise to an image until it becomes an isotropic Gaussian and attempt to learn the reverse transformation.