Diffusion models: an introduction
Introduction
Diffusion models have taken the world by storm thanks to their impressive generative capabilities. That is not their only advantage, however. With tongue slightly in cheek, traditional statistical methods can be classified as either inflexible (classical statistics), computationally expensive (MCMC), or non-analytical (boosted trees). From this perspective, diffusion models are a significant outlier: they are extremely flexible, provide access to the full posterior (and conditional) distributions, and are computationally cheaper than many competing methods.
General structure

Diffusion models are composed of two separate processes, the forward and the backward.
Forward diffusion
In general, a diffusion process can be characterized by a Markov diffusion kernel towards a final, analytically tractable distribution:
$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = T_\pi(\mathbf{x}_t \mid \mathbf{x}_{t-1}; \beta_t),$$
where $\beta_t$ is the diffusion rate at step $t$.
Since it is a first-order Markov process, the full joint over a trajectory factorizes:
$$q(\mathbf{x}_{1:T} \mid \mathbf{x}_0) = \prod_{t=1}^{T} q(\mathbf{x}_t \mid \mathbf{x}_{t-1}).$$
Usually, we use a Gaussian diffusion, for which the transition kernel is a closed-form Gaussian (cf. the one-step update in Kalman filtering):
$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\!\left(\mathbf{x}_t;\; \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\; \beta_t \mathbf{I}\right).$$
Moreover, because compositions of Gaussian transitions remain Gaussian, defining $\alpha_t = 1-\beta_t$ and $\bar\alpha_t = \prod_{s=1}^{t}\alpha_s$ we have:
$$q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\!\left(\mathbf{x}_t;\; \sqrt{\bar\alpha_t}\,\mathbf{x}_0,\; (1-\bar\alpha_t)\mathbf{I}\right),$$
meaning any state of the forward process can be sampled knowing just the initial state and the variance schedule. More generally, the theory underpinning Langevin dynamics guarantees that any smooth distribution can be gradually corrupted into Gaussian noise, so the initial distribution can be almost arbitrarily complex; this is what gives the model its expressive power.
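
To make the closed-form corruption concrete, here is a minimal NumPy sketch; the linear schedule, the number of steps and the toy data point are illustrative assumptions, not values tied to any particular paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative variance schedule (assumed values).
T = 1000
betas = np.linspace(1e-4, 0.02, T)      # beta_t
alphas = 1.0 - betas                    # alpha_t = 1 - beta_t
alpha_bars = np.cumprod(alphas)         # alpha_bar_t = prod_{s<=t} alpha_s

def q_sample(x0, t):
    """Sample x_t ~ q(x_t | x_0) directly, without simulating the intermediate steps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

x0 = rng.standard_normal(16)            # a toy "data point"
x_mid = q_sample(x0, t=499)             # partially corrupted
x_end = q_sample(x0, t=T - 1)           # essentially pure Gaussian noise
```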
Reverse diffusion
The reverse process is characterized by a new, learned transition probability $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$, trained to undo one step of the forward corruption.
Like before, the joint factorizes over the trajectory:
$$p_\theta(\mathbf{x}_{0:T}) = p(\mathbf{x}_T)\prod_{t=1}^{T} p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t), \qquad p(\mathbf{x}_T) = \mathcal{N}(\mathbf{x}_T; \mathbf{0}, \mathbf{I}).$$
For Gaussian (and binomial) diffusion, the reverse transitions stay in the same distributional family as the forward ones (provided the $\beta_t$ are small); however, there is no closed form for their parameters, so they have to be learned:
$$p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\!\left(\mathbf{x}_{t-1};\; \mu_\theta(\mathbf{x}_t, t),\; \Sigma_\theta(\mathbf{x}_t, t)\right).$$
A natural training target is the likelihood of the original data:
$$p_\theta(\mathbf{x}_0) = \int p_\theta(\mathbf{x}_{0:T})\, d\mathbf{x}_{1:T}.$$
This integral is computationally intractable! A trick borrowed from annealed importance sampling is to compare the relative probability of the backward and forward trajectories, averaged over the forward process:
$$p_\theta(\mathbf{x}_0) = \int q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)\; p(\mathbf{x}_T) \prod_{t=1}^{T} \frac{p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{q(\mathbf{x}_t \mid \mathbf{x}_{t-1})}\, d\mathbf{x}_{1:T}.$$
In the limit of very small step sizes $\beta_t$, the forward and reverse transitions share the same functional form, and the expression above can be estimated using only samples from the tractable forward process.
Training
We maximize the expected log-likelihood of the original data:
$$L = \mathbb{E}_{q(\mathbf{x}_0)}\left[\log p_\theta(\mathbf{x}_0)\right].$$
It can be lower bounded by a closed-form expression:
$$L \ge -\,\mathbb{E}_q\!\left[ D_{KL}\!\big(q(\mathbf{x}_T \mid \mathbf{x}_0)\,\|\,p(\mathbf{x}_T)\big) + \sum_{t>1} D_{KL}\!\big(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)\,\|\,p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\big) - \log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1) \right],$$
where $D_{KL}$ denotes the Kullback-Leibler divergence. Since every distribution involved is Gaussian, each term can be evaluated in closed form.
Additionally, conditioning the forward process posteriors on $\mathbf{x}_0$ makes them tractable:
$$q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\!\left(\mathbf{x}_{t-1};\; \tilde\mu_t(\mathbf{x}_t, \mathbf{x}_0),\; \tilde\beta_t \mathbf{I}\right),$$
with
$$\tilde\mu_t(\mathbf{x}_t, \mathbf{x}_0) = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\,\mathbf{x}_0 + \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\,\mathbf{x}_t, \qquad \tilde\beta_t = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\,\beta_t.$$
The goal of training is therefore to estimate the reverse Markov transition densities $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ so that they match these forward-process posteriors.
As mentioned, in the Gaussian and binomial cases the reverse process stays in the same family, so the task amounts to estimating its parameters.
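
As a sanity check on the formulas above, here is a minimal NumPy sketch of the tractable posterior $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$; the schedule values are illustrative assumptions:

```python
import numpy as np

# Illustrative variance schedule (assumed values).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def forward_posterior(x_t, x0, t):
    """Mean and variance of q(x_{t-1} | x_t, x_0) for a 0-based step index t >= 1."""
    ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
    mean = (np.sqrt(ab_prev) * betas[t] / (1.0 - ab_t)) * x0 \
         + (np.sqrt(alphas[t]) * (1.0 - ab_prev) / (1.0 - ab_t)) * x_t
    var = (1.0 - ab_prev) / (1.0 - ab_t) * betas[t]
    return mean, var
```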
Variance schedule
Since the forward-process variances $\beta_t$ only control how quickly the data are corrupted, they can be treated as hyperparameters rather than learned. Ho et al. fix them to a linear schedule increasing from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$ over $T = 1000$ steps (Sohl-Dickstein et al. learned the schedule instead), chosen so that $\bar\alpha_T \approx 0$, i.e. $\mathbf{x}_T$ carries essentially no information about $\mathbf{x}_0$.
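
A quick numerical check of that last claim (the schedule below mirrors the values quoted above, but is only a sketch):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)        # linear variance schedule
alpha_bars = np.cumprod(1.0 - betas)      # cumulative signal retention

# sqrt(alpha_bar_T) is the fraction of the original signal left at the final step.
print(np.sqrt(alpha_bars[-1]))            # ~ 6e-3, i.e. essentially no signal remains
```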
Estimating $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$
Our goal is to learn the following:
$$p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\!\left(\mathbf{x}_{t-1};\; \mu_\theta(\mathbf{x}_t, t),\; \Sigma_\theta(\mathbf{x}_t, t)\right).$$
For the variance, it is simply set to be isotropic, $\Sigma_\theta(\mathbf{x}_t, t) = \sigma_t^2\mathbf{I}$, with the diagonal entries either fixed at a constant (either $\sigma_t^2 = \beta_t$ or $\sigma_t^2 = \tilde\beta_t$; Ho et al. report similar results for both) or, in later work, learned.
The mean is, of course, learned in all implementations and proceeds as follows: by inspecting the form of the forward posterior mean $\tilde\mu_t$ and substituting $\mathbf{x}_0 = \frac{1}{\sqrt{\bar\alpha_t}}\left(\mathbf{x}_t - \sqrt{1-\bar\alpha_t}\,\boldsymbol\epsilon\right)$, the mean can be reparameterized as
$$\mu_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(\mathbf{x}_t, t)\right),$$
meaning we instead learn $\epsilon_\theta(\mathbf{x}_t, t)$, an estimator of the noise entering the mean term at step $t$.
This can be further simplified by setting the weight of each term in the lower bound to one, which gives the simplified objective used by Ho et al.:
$$L_{\text{simple}} = \mathbb{E}_{t,\,\mathbf{x}_0,\,\boldsymbol\epsilon}\left[ \left\| \boldsymbol\epsilon - \epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\,\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\,\boldsymbol\epsilon,\; t\right) \right\|^2 \right].$$
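
A minimal sketch of a single Monte Carlo evaluation of this objective; `eps_model` is a hypothetical stand-in for whatever network predicts the noise, and the schedule values are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative variance schedule (assumed values).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def eps_model(x_t, t):
    """Hypothetical stand-in for a noise-prediction network epsilon_theta(x_t, t)."""
    return np.zeros_like(x_t)

def simple_loss(x0):
    """One Monte Carlo sample of L_simple for a single data point."""
    t = rng.integers(T)                                 # uniformly sampled step
    eps = rng.standard_normal(x0.shape)                 # the noise to be predicted
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return np.mean((eps - eps_model(x_t, t)) ** 2)      # squared error on the noise

print(simple_loss(rng.standard_normal(16)))
```

In an actual implementation this quantity would be minimized with stochastic gradient descent over mini-batches of data.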
Architecture
Ho et al. used a U-Net whose backbone follows PixelCNN++ to estimate $\epsilon_\theta(\mathbf{x}_t, t)$, with the time step $t$ supplied to the network through a sinusoidal embedding.


How does it work in practice?
Example - recovery

Example - generation
Ho et al. used the modified network for both conditional and unconditional generation.
Unconditional generation was performed by estimating $\epsilon_\theta(\mathbf{x}_t, t)$ and running the learned reverse chain, starting from pure Gaussian noise $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
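
A minimal sketch of this ancestral sampling loop, in the same notation as above; the schedule and the placeholder `eps_model` are illustrative assumptions, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative variance schedule (assumed values).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_model(x_t, t):
    """Hypothetical stand-in for a trained noise-prediction network epsilon_theta(x_t, t)."""
    return np.zeros_like(x_t)

def sample(shape=(16,)):
    """Ancestral sampling: start from pure noise and apply the learned reverse kernel T times."""
    x = rng.standard_normal(shape)                      # x_T ~ N(0, I)
    for t in reversed(range(T)):
        eps_hat = eps_model(x, t)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        noise = rng.standard_normal(shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise            # sigma_t^2 = beta_t variant
    return x
```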

For conditional generation, the authors selected a

Imagen & DALL-E 2
Ho and Salimans improved the above procedure by introducing the notion of guiding the model during training on labeled data, i.e. estimating both the conditional noise $\epsilon_\theta(\mathbf{x}_t, t, c)$ for a label $c$ and the unconditional $\epsilon_\theta(\mathbf{x}_t, t)$ with a single network, and combining the two predictions at sampling time (classifier-free guidance).
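
A minimal sketch of how the two noise predictions are combined at sampling time; the guidance weight and the placeholder arrays are illustrative assumptions:

```python
import numpy as np

def guided_eps(eps_cond, eps_uncond, w=3.0):
    """Classifier-free guidance: extrapolate from the unconditional towards the conditional prediction."""
    return (1.0 + w) * eps_cond - w * eps_uncond

# Placeholder predictions a network might return for one latent x_t.
eps_cond = np.full(4, 0.2)
eps_uncond = np.full(4, 0.1)
print(guided_eps(eps_cond, eps_uncond))   # [0.5 0.5 0.5 0.5] for w = 3
```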
This guidance mechanism was used by Nichol et al. in GLIDE, which uses information extracted from text as the guiding signal, combining a transformer text encoder with the previously described architecture.
This approach has been used to construct Imagen, which uses additional diffusion models to up-sample the image created by the guided diffusion process. The text embeddings are provided by a pretrained transformer’s encoder.

The other diffusion-based model to make waves recently - DALL-E 2 - uses a somewhat more complex approach:

First, it re-uses a model called CLIP (Contrastive Language-Image Pre-training) to construct a mapping between captions and images (top of the schematic above). In practice, the net result of this is a joint embedding of text and images in a representation space.
In generation, this model is frozen and a version of the GLIDE guided diffusion model is used to generate images starting from the image representation space (as opposed to, for instance, random noise). As before, additional up-samplers are used in the decoder.
To generate from text prompts, we need to map the caption text embeddings to the above mentioned image embeddings, which are the starting point for the decoder. This is done with an additional diffusion model called the prior, which generates multiple possible embeddings. In other words, this is a generative model of the image embeddings given the text embeddings. The prior trains a decoder-only transformer to predict the conditional reverse process, as opposed to the U-net used in other examples.

References
- The original paper: Sohl-Dickstein et al. - Deep Unsupervised Learning using Nonequilibrium Thermodynamics
- Performant implementation: Ho et al. - Denoising Diffusion Probabilistic Models
- A good blog post: Weng, Lilian. (Jul 2021). What are diffusion models? Lil’Log.
- A summary of recent developments: https://maciejdomagala.github.io/generative_models/2022/06/06/The-recent-rise-of-diffusion-based-models.html
- DALL-E 2 initial paper: Ramesh et al. - Hierarchical Text-Conditional Image Generation with CLIP Latents