Anish Diwan

Energy-Based Models, Score Matching & Diffusion

Generative modelling is a class of machine learning methods that aims to generate new (previously unseen) information by learning the underlying distribution of some given unlabelled data. These notes aim to explain the background work and intuition behind implicit generative models using examples and animations.


Let us say that you are a cat fanatic and are trying to generate images of cats to fuel your never-ending need to see a new cat every day. First, let's highlight a rather big assumption that you make about pictures of cats. You assume that in the space of all possible `N`x`N` images, `mathbb "R"^{N^2}` (for an `N`x`N` image this space is `N^2`-dimensional), all images of cats lie in some unknown subspace `X`, and that there is some probability density function that describes the probability of sampling an image vector `x in mathbb "R"^{N^2}` from `X`. Generative models can generate new pictures of cats by essentially learning this probability density function and then sampling image vectors that have a high probability of coming from `X`.



Description: A 2D projection of the subspace of cat images `X`, where each point is a vector representing a cat picture. The probability density is implicit in the density of the actual data in the space (regions where many samples lie close together have high probability; the white space has essentially zero probability)

There are two main families of generative models: those that directly learn the underlying probability density of the data (likelihood-based methods) and those that learn the sampling process from the underlying probability density of the data (implicit generative models) [1]. This article focuses on the latter kind.


Boltzmann Distribution & Score-Based Models


Before we continue the discussion on generating cat pictures, we first need to take a quick detour into statistical physics. Let's say that I have a balloon filled with atoms of some gas (say it's helium, so it exists as a bunch of inert He atoms) at temperature `K`. I know that there are `N` atoms in the balloon. The temperature of a gas in a volume is the average kinetic energy of all atoms in that volume (for now let's make the gross assumption that temperature and kinetic energy have the same units, so I will use these two terms interchangeably). So, knowing that the average energy is `K`, I want to know the energy of each atom in this balloon. This is very difficult to compute exactly. What if I instead pose the problem as: what is the probability that a certain atom in this balloon has energy `E(He)`? This is much easier to compute, and the probability distribution of the energies of atoms in a gas is given by the Maxwell-Boltzmann distribution.



Description: Probability distribution for an energy function (here, the kinetic energy of atoms in a balloon)

The Boltzmann distribution is a probability distribution that gives the probability that a system will be in a certain state, given the "energy" of the state. A general form of this for any (even non-physical) system is the Gibbs measure.


`P(X = x) = \frac{e^{-E(x)}}{Z}`


Here, `P` is the probability that the system is in state `x in X`, which has energy `E(x)`. `Z` is a normalising factor that makes sure that the probability `P` over all such samples `x` integrates to 1.


`\int P(x) dx = 1`
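To make the Gibbs measure concrete, here is a minimal numerical sketch (assuming Python/NumPy and an arbitrary quadratic toy energy function, neither of which appears in the original text) that turns an energy function on a 1D grid into a normalised probability density:

```python
import numpy as np

# Hypothetical toy energy function: low energy (high probability) near x = 1.
def energy(x):
    return (x - 1.0) ** 2

# Discretise a 1D state space and compute unnormalised Boltzmann weights e^{-E(x)}.
x = np.linspace(-5.0, 5.0, 1001)
weights = np.exp(-energy(x))

# The normalising factor Z is the integral of e^{-E(x)} over x,
# approximated here by a simple Riemann sum on the grid.
dx = x[1] - x[0]
Z = weights.sum() * dx
p = weights / Z

# The resulting density integrates to (approximately) 1, as required.
print(p.sum() * dx)  # ~1.0
```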


But why bother with the energy function? It can be argued that learning such an energy function for a dataset of samples is more tractable than learning the probability density function (this argument is made mathematically precise in the next subsection). We can assume that there exists some energy function that assigns low energy to vectors of cat images `x` in our subspace of cat images `X` and high energy to any other sample. Say the energy function `E(x)` looks something like the following, where dark regions indicate high energy (it is unlikely that a sample exists there). The key point is that the probability density function `p(x)` must integrate to 1 across all samples `x in X`, while no such requirement exists for an energy function. It is hence easier to learn the energy function than it is to learn the probability density function.



Description: Energy function for some hypothetical 2D projection of a space of cat images


Score-Based Models


The following excerpt is taken from [1] as I think it very clearly explains the mathematics of score-based models. Suppose we are given a dataset `{ x_1, x_2, x_3, .. , x_N }` where each point is drawn independently from an underlying data distribution `p(x)`. Given this dataset, the goal of generative modelling is to fit a model to the data distribution such that we can synthesize new data points at will by sampling from the distribution. To build such a generative model, we first need a way to represent a probability distribution. One such way, as in likelihood-based models, is to directly model the probability density function (p.d.f.) or probability mass function (p.m.f.).


Let `f_{theta}(x) in mathbb "R"` be a real-valued function (think of this as an energy function from the Gibbs measure, also called an unnormalised probabilistic model) parameterized by a learnable parameter `theta`. We can define the p.d.f of data points with energy `f_{theta}(x)` using the Gibbs measure as


`p_{theta}(x) = \frac{e^{-f_{theta}(x)}}{Z_{theta}}` where, `Z_{theta} > 0` is the normalising constant. Likelihood-based methods find these parameters `theta` by maximising the log-likelihood of the data  `{ x_1, x_2, x_3, .. , x_N }` (more on this in the appendix)


`\text{argmax}_{theta} sum_{i=1}^{N} log p_{theta}(x_i)`


However, maximising the log-likelihood requires the probability distribution `p_{theta}(x)` to be normalised. Figuring out this normalising factor `Z_{theta}` is typically intractable. This means that likelihood-based models are often restricted. This is where score-based models come in. We define the score as the gradient of the log probability `p_{theta}(x)`


`\text{score function} = nabla_{x} log p_{theta}(x) = nabla_{x} log (\frac{e^{-f_{theta}(x)}}{Z_{theta}}) = nabla_{x}(-f_{theta}(x) - log (Z_{theta})) = -nabla_{x} f_{theta}(x) - 0 = -nabla_{x}f_{theta}(x)`

(gradient of normalizing factor w.r.t `x` is zero as it is only a function of `theta`).
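As a minimal sketch of this relationship (assuming PyTorch and a small, hypothetical MLP standing in for the unnormalised energy model `f_{theta}`, neither of which is given in the original), the score of a batch of samples can be obtained by differentiating the learned energy with respect to the input and negating the result:

```python
import torch
import torch.nn as nn

# Hypothetical energy model f_theta: maps a 2D sample to a scalar energy.
energy_net = nn.Sequential(nn.Linear(2, 64), nn.SiLU(), nn.Linear(64, 1))

def score(x):
    """score(x) = grad_x log p_theta(x) = -grad_x f_theta(x)."""
    x = x.requires_grad_(True)
    total_energy = energy_net(x).sum()  # sum over the batch; each sample only affects its own gradient
    grad_energy = torch.autograd.grad(total_energy, x, create_graph=True)[0]
    return -grad_energy                 # the score is the negative energy gradient

samples = torch.randn(8, 2)             # a batch of 8 two-dimensional samples
print(score(samples).shape)             # torch.Size([8, 2])
```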


To provide further intuition behind the score function, consider a single data sample from the cat distribution that has some amount of noise added to it. The noisy sample is shown below in green and its original position in the data distribution is shown in black. Further, the underlying energy function is shown as a contour in the background, with light colours indicating low energy. As shown above, the gradient of the log probability of the sample with respect to the sample is the negative gradient of the energy of the sample with respect to the sample. This gradient is the direction of steepest increase in the log probability, which is also the direction of steepest descent of the sample's energy. In essence, the score acts as a denoising vector: following it returns the noisy sample to its position in the distribution.



Description: A figure showing the score function

By modelling the score function instead of the density function, we can sidestep the difficulty of intractable normalising constants. This means that a score-based model is independent of the normalising factor! This significantly expands the family of models that we can tractably use, since we don't need any special architectures to make the normalising constant tractable. Similar to likelihood-based models, we can train score-based models by minimising the Fisher divergence between the model and the data distributions. If `s_{theta}(x)` is the score-based model (which learns the score function to return the score for a sample `x` with energy `f_{theta}(x)`), we learn the score function by minimising `mathbb "E"_{p(x)} [norm(nabla_{x}log p(x) - s_{theta}(x))^2]`. Of course, we don't know the true distribution `p(x)` (otherwise this whole exercise would be futile). We hence use score matching to construct the optimisation target. Score matching objectives can directly be estimated on a dataset and optimised with stochastic gradient descent, analogous to the log-likelihood objective for training likelihood-based models (with known normalising constants). We can thus train the score-based model by minimising a score matching objective, without requiring adversarial optimisation.



Score Matching


So far we have seen that the score is the gradient of the log probability of a sample which is the negative gradient of the energy function. We aim to learn this score parameterised by parameters `theta`. To do this we can minimise the Fisher divergence between the gradient of the log probability w.r.t the data sample and the output of our model (which is the negative gradient of the energy).


`s_{theta}(x)` is what we learn using gradient descent on the Fisher divergence (we learn a "good" `theta`), and it equals `-nabla_{x}f_{theta}(x)`. The Fisher divergence is given by


`\frac{1}{2} mathbb "E"_{p_{data}} [norm(nabla_{x}log(p_{data}(x)) - nabla_{x}log(p_{theta}(x)))^2]`


Intuitively, this means that, given samples drawn from the data distribution, you want to minimise the expectation (over those samples) of the squared L2 norm (distance) between the gradient of the log probability of the sample under the data distribution and the gradient of the log probability of the sample under your learnt distribution. You do this minimisation by tweaking your `theta`. In the end, you learn "good" parameters `theta` such that the Fisher divergence is very close to zero.
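As a hedged illustration of this objective (assuming PyTorch, and taking the data distribution to be a standard Gaussian purely so that its true score `nabla_{x} log p_{data}(x) = -x` is available in closed form; neither choice comes from the original text), a score model could be fit by minimising the Fisher divergence directly:

```python
import torch
import torch.nn as nn

# Hypothetical score model s_theta: directly predicts the score of a 2D sample.
score_net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 2))
optimiser = torch.optim.Adam(score_net.parameters(), lr=1e-3)

for step in range(1000):
    x = torch.randn(128, 2)        # samples from p_data = N(0, I)
    true_score = -x                # for a standard Gaussian, grad_x log p_data(x) = -x
    # Monte Carlo estimate of the Fisher divergence: 1/2 E[ || true score - model score ||^2 ]
    loss = 0.5 * ((score_net(x) - true_score) ** 2).sum(dim=1).mean()
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
```

Of course, this only works because the toy data distribution's score is known analytically; for real data it is not, which is exactly the problem discussed next.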


But hold on, the Fisher divergence takes in the gradient of the log probability of the data distribution. We don't know the data distribution! If we did then we wouldn't have to do this learning business.


Denoising Score Matching


Denoising score matching (DSM) draws an analogy between denoising autoencoders and the score function. In general, a denoising autoencoder aims to convert a noisy sample `tilde x` back to its original form `x` by learning the added noise. Just like the example in the previous figure, say we have perturbed a sample from our distribution by adding some amount of Gaussian noise controlled by some variance.


`tilde x = x + epsilon` where `epsilon ~ mathcal "N"(0, sigma^2 . mathbb "I")`


The score matching objective (Fisher divergence) can then be defined with respect to this noisy distribution as


`\frac{1}{2} mathbb "E"_{p_{tilde x}} [norm(nabla_{tilde x}log(p_{tilde x}(tilde x)) - s_{theta}(tilde x))^2]`


[3] shows that this objective can be rewritten in terms of the conditional probability of sampling `tilde x` given `x`. The objective for DSM is hence


`\frac{1}{2} mathbb "E"_{p_{tilde x, x}} [norm(nabla_{tilde x}log(p_{tilde x, x}(tilde x | x)) - s_{theta}(tilde x))^2]`


The benefit of this is that while `nabla_{tilde x} log p_{tilde x}(tilde x)` is complex and cannot be evaluated analytically, `p_{tilde x, x}(tilde x | x)` is a normal distribution that can be evaluated analytically. In particular, since `tilde x | x ~ mathcal "N"(x, sigma^2 . mathbb "I")`, we have `nabla_{tilde x} log p_{tilde x, x}(tilde x | x) = -(tilde x - x)/sigma^2`. Intuitively, this gradient corresponds to the direction of moving from `tilde x` back to the original `x` (i.e., denoising it), and we want our score-based model `s_{theta}(tilde x)` to match it as best as it can.
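Here is a minimal sketch of the resulting DSM loss (assuming PyTorch; `score_net` is a hypothetical network that maps a noisy sample to a predicted score, and `sigma` is the standard deviation of the Gaussian perturbation):

```python
import torch

def dsm_loss(score_net, x, sigma=0.1):
    """Denoising score matching loss for one batch of clean samples x of shape (B, D)."""
    noise = torch.randn_like(x) * sigma
    x_tilde = x + noise                             # x~ = x + eps, eps ~ N(0, sigma^2 I)
    # For the Gaussian perturbation kernel, grad_{x~} log p(x~ | x) = -(x~ - x) / sigma^2.
    target = -(x_tilde - x) / sigma ** 2
    predicted = score_net(x_tilde)
    return 0.5 * ((predicted - target) ** 2).sum(dim=1).mean()
```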


Sliced Score Matching


Apart from DSM, score matching can also be done with only the energy function. This results from a mathematical trick which shows that you don't need to know the data distribution to minimise the Fisher divergence, and it leads to a variant of score matching called sliced score matching. Here's the derivation of the score matching objective for 1D samples [2].


  • Before we get started, let me lay down some known facts. From the previous section, we know that the gradient of the log probability of a sample (either from the learnt or data distribution) is the score that we try to learn. `text{score} = nabla_{x}log(p_{theta}(x))`


  • We also know that for a continuous random variable, the expectation of some function `f` of that variable over the data distribution is the integral of that function weighted by the probability density. `mathbb E_{x ~ p_{data}(x)} [f(x)] = int p_{data}(x) . f(x) dx`


  • From integration by parts we know that `int f(x).g'(x) dx = f(x).g(x) - int f'(x).g(x) dx`


  • Lastly, we know that `nabla_{x}log f(x) = \frac{1}{f(x)} . nabla_{x} f(x)`


Let's get started by rewriting the Fisher divergence equation

`\frac{1}{2} mathbb "E"_{p_{data}} [norm(nabla_{x}log(p_{data}(x)) - nabla_{x}log(p_{theta}(x)))^2]`


Expanding the expectation


`= \frac{1}{2} int p_{data}(x) (nabla_{x}log(p_{data}(x)) - nabla_{x}log(p_{theta}(x)) )^2 dx`


Expanding `(a - b)^2` and splitting the integral between the individual components


` = \frac{1}{2}  int p_{data}(x) (nabla_{x}log p_{data}(x))^2 dx` `+` `\frac{1}{2}  int p_{data}(x)  (nabla_{x}log p_{theta}(x))^2 dx` `-` `2 . \frac{1}{2} int p_{data}(x) nabla_{x}log p_{data}(x) . nabla_{x}log p_{theta}(x) dx`


The first part of this expression is only a function of `p_{data}(x)`, which stays constant and hence has no effect on the minimisation of the Fisher divergence. We can disregard it. The second part of the expression can be converted back to an expectation, `\frac{1}{2} mathbb "E"_{p_{data}} [(nabla_{x}log p_{theta}(x))^2]`. The third part (keeping its leading minus sign) can be integrated by parts.


`- int p_{data}(x) . nabla_{x}log p_{data}(x) . nabla_{x}log p_{theta}(x) . dx`


Rewriting the gradient of log probability of `p_{data}(x)`


`= - int p_{data}(x) . \frac{1}{p_{data}(x)} . nabla_{x}p_{data}(x) . nabla_{x}log p_{theta}(x) . dx`


`= - int nabla_{x}log p_{theta}(x) . nabla_{x}p_{data}(x) . dx`


Integrating by parts


`= - p_{data}(x) . nabla_{x}log p_{theta}(x) |_{-oo}^{oo}` `+` `int p_{data}(x) . nabla_{x}^2 log p_{theta}(x) dx`


Assuming that `p_{data}(x)` tends to zero as `x` tends to (plus or minus) infinity, the first (boundary) term vanishes. We can then turn the second term back into an expectation.


`= mathbb "E"_{p_{data}} [nabla_{x}^2 log p_{theta}(x)]`


Putting everything together, the Fisher divergence comes out as


 `\frac{1}{2} mathbb "E"_{p_{data}} [(nabla_{x}log p_{theta}(x))^2] + mathbb "E"_{p_{data}} [nabla_{x}^2 log p_{theta}(x)] + text{const}`


We see that this expression does not depend on `nabla_{x} log p_{data}(x)` and can hence be minimised tractably. When working with multi-dimensional data, the second-derivative term becomes the trace of the Hessian of `log p_{theta}(x)`, which is computationally expensive to evaluate. Sliced score matching builds on this gap by projecting the scores onto random directions before comparing them. In practice, most methods either have the network directly predict the score (or a denoising vector) or learn the energy function and then explicitly compute its gradient, which is then compared to the artificially added noise using DSM.
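For 1D samples, the derived objective can be estimated directly from data with automatic differentiation. A minimal sketch (assuming PyTorch and a hypothetical scalar energy model `f_{theta}`, neither of which the text specifies):

```python
import torch
import torch.nn as nn

# Hypothetical 1D energy model f_theta(x); log p_theta(x) = -f_theta(x) - log Z_theta.
f_theta = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))

def score_matching_loss(x):
    """Estimate 1/2 E[(d/dx log p_theta)^2] + E[d^2/dx^2 log p_theta] on a batch x of shape (B, 1)."""
    x = x.requires_grad_(True)
    log_p = -f_theta(x).sum()                                               # -f_theta(x), up to the constant log Z_theta
    score = torch.autograd.grad(log_p, x, create_graph=True)[0]             # first derivative of log p_theta
    score_grad = torch.autograd.grad(score.sum(), x, create_graph=True)[0]  # second derivative of log p_theta
    return (0.5 * score ** 2 + score_grad).mean()

data = torch.randn(256, 1)          # stand-in batch of 1D data samples
loss = score_matching_loss(data)
loss.backward()                     # gradients flow into f_theta's parameters
```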



Langevin Dynamics & Diffusion



References


[1] Song Y. Generative Modeling by Estimating Gradients of the Data Distribution | 2021 May [cited 2023 Dec 6]. Available from: https://yang-song.net/blog/2021/score/

[2] Song Y. Sliced Score Matching: A Scalable Approach to Density and Score Estimation | 2019 July [cited 2023 Dec 6]. Available from: https://yang-song.net/blog/2019/ssm/

[3] Vincent, P. (2011). A connection between score matching and denoising autoencoders. Neural computation, 23(7), 1661-1674.


