Learning the integral of a diffusion model

S sander.ai ↗

▲ 161 points • 23 comments • by benanne • 3w ago • HN discussion ↗

Sampling from a diffusion model is an iterative process: at each step, the denoiser estimates the tangent direction to a path through input space. We move along this path by repeatedly taking small steps in this direction, effectively calculating an integral across noise levels. This gradually transforms samples from a simple noise distribution into samples from a target distribution, and traces out the path that connects them. Can we train neural networks to directly predict this integral instead, in order to speed up sampling? Yes we can – welcome to the world of flow maps!

Ever since the rise of diffusion models, people have sought ways to make them faster and cheaper to sample from. About two years ago, I wrote a blog post about diffusion distillation, which is one of the main tools used to reduce the number of steps required to obtain high-quality samples. Although the core principles underlying various distillation methods have not changed, a lot of new variants have popped up since.

In this blog post, I want to take a closer look at flow maps. While diffusion models describe paths between noise and data by predicting the tangent direction at each point along the path, flow maps are instead able to predict any point on a path from any other point on that same path. They can be used for faster sampling, but they also have some other tricks up their sleeve, enabling more efficient reward-based learning and improved sampling steerability, among other things. They have recently become a very popular subject of study.

While it is relatively straightforward to define what a flow map is, there turn out to be many different ways to build and train them. On top of that, as with diffusion itself, the literature is once again rife with different formalisms and terminology, which makes for a confusing experience when trying to learn how everything fits together. I will do my best to clear things up a bit, based primarily on the taxonomy proposed by Boffi et al. [1, 2].

Flow maps build on the ideas behind diffusion models, and as usual, I will assume some familiarity with these ideas. Being comfortable with vector calculus will also help to understand how they are trained, but if that’s not you, hopefully the other parts of this blog post will still be interesting to you. You may want to consider (re-)reading some of my earlier blog posts for context (e.g. Perspectives on diffusion).

Alternatively, Chieh-Hsin Lai and colleagues recently published a comprehensive monograph on diffusion models [3], which combines math and rigour with intuitive explanations – highly recommended, both as a refresher and as a starting point.

Below is a table of contents.

Charting paths from noise to data
Three notions of consistency
To backprop or not to backprop?
Training flow maps from scratch
Flow maps in practice
Applications and extensions
Alternative strategies
Closing thoughts
Acknowledgements
References

Charting paths from noise to data

The key to understanding flow maps is the perspective of diffusion models as defining a bijection between noise and data, with unique paths connecting pairs of samples from each distribution, in such a way that they never cross each other. Therefore, let’s first take a closer look at diffusion sampling algorithms, and build towards flow maps from there.

Sampling from diffusion models

There are many different sampling algorithms available for diffusion models nowadays, but they all fall into one of two categories: stochastic or deterministic. The miracle of deterministic sampling is something I have written about before, but it is worth recapping here, as it is fundamental to the development of flow maps.

The gist of it is as follows: if we have a denoiser model that predicts the expected value of the clean original data \(\hat{\mathbf{x}}_0 = \mathbb{E}\left[ \mathbf{x}_0 \mid \mathbf{x}_t \right]\), given a noisy observation \(\mathbf{x}_t\), we can construct two distinct iterative generative procedures.

The stochastic one is the most intuitive: at each iteration, we sample from a conditional distribution of slightly less noisy examples, given the current noisy observation, \(p(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\), to reverse the corruption process one step at a time. Conveniently, we can construct an approximation of this distribution using the denoiser model prediction \(\hat{\mathbf{x}}_0\). The smaller the interval between the noise levels at time steps \(t\) and \(t-1\), the more accurate the approximation will be. After many iterations, the noise fades, and we end up with a sample from the clean data distribution at \(t=0\). This is, in a nutshell, how the original DDPM [4] algorithm works. Sampling algorithms based on the stochastic differential equation (SDE) formalism of diffusion models [5] produce similar stochastic trajectories in input space.
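A minimal sketch of the stochastic procedure, under assumptions that are not from the text above: the corruption process is taken to be \(\mathbf{x}_t = \mathbf{x}_0 + \sigma_t \boldsymbol{\epsilon}\) (a variance-exploding schedule rather than DDPM's own), the data is a hypothetical 1D mixture of two Gaussians so that the denoiser can be written in closed form instead of being learned, and the names `denoise`, `noise_levels` and `stochastic_sample` are illustrative. Each step samples from the Gaussian \(p(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)\) with \(\mathbf{x}_0\) replaced by the denoiser prediction.

```python
import math
import random

# Hypothetical 1D target: an equal mixture of two Gaussians (illustrative values).
MEANS = (-2.0, 2.0)
WEIGHTS = (0.5, 0.5)
DATA_STD = 0.3


def denoise(x_t, sigma):
    """Exact E[x_0 | x_t] for the mixture above, assuming x_t = x_0 + sigma * noise."""
    var = DATA_STD**2 + sigma**2  # variance of x_t within each mixture component
    log_r = [math.log(w) - 0.5 * (x_t - mu) ** 2 / var for w, mu in zip(WEIGHTS, MEANS)]
    top = max(log_r)
    r = [math.exp(v - top) for v in log_r]  # unnormalised component responsibilities
    shrink = DATA_STD**2 / var
    cond = [mu + shrink * (x_t - mu) for mu in MEANS]  # E[x_0 | x_t, component]
    return sum(ri * ci for ri, ci in zip(r, cond)) / sum(r)


def noise_levels(sigma_max=10.0, sigma_min=0.01, steps=200):
    """Geometrically spaced noise levels, from sigma_max down to sigma_min."""
    ratio = (sigma_min / sigma_max) ** (1.0 / steps)
    return [sigma_max * ratio**i for i in range(steps + 1)]


def stochastic_sample(rng, sigmas):
    """Repeatedly draw a slightly less noisy x, given the current one."""
    x = rng.gauss(0.0, sigmas[0])  # start from (approximately) pure noise
    for s, s_next in zip(sigmas[:-1], sigmas[1:]):
        x0_hat = denoise(x, s)
        keep = (s_next / s) ** 2
        # Gaussian p(x_next | x, x_0), with the unknown x_0 replaced by x0_hat.
        mean = x0_hat + keep * (x - x0_hat)
        std = s_next * math.sqrt(1.0 - keep)
        x = rng.gauss(mean, std)
    return x


rng = random.Random(0)
sigmas = noise_levels()
samples = [stochastic_sample(rng, sigmas) for _ in range(400)]
```

Plotting a histogram of `samples` should show two clusters around the mixture means; using fewer steps makes the per-step Gaussian approximation coarser.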

The deterministic procedure does not involve drawing random samples at any point, except at the very start: given the current noisy observation \(\mathbf{x}_t\) and the prediction \(\hat{\mathbf{x}}_0\) from the denoiser, there is a deterministic update rule that gives us \(\mathbf{x}_{t-1}\), which we can recursively apply until we get to \(\mathbf{x}_0\). Because every step of the procedure is deterministic, no randomness enters after the initial draw: from a given starting point \(\mathbf{x}_t\), we can only ever end up in one specific end point \(\mathbf{x}_0\). Such an update rule can be derived in the probabilistic framework (i.e. DDIM [6]), or using the ordinary differential equation (ODE) formalism [5].

The default sampling algorithm used in Flow Matching [7] is another instance of the deterministic procedure. Here, the neural network is typically parameterised to predict the velocity \(\mathbf{v}_t = \mathbb{E}\left[\mathbf{x}_T - \mathbf{x}_0 \mid \mathbf{x}_t \right]\) instead of the clean input \(\mathbb{E}\left[ \mathbf{x}_0 \mid \mathbf{x}_t \right]\) (with \(t=T\) the time step corresponding to the maximal noise level, i.e. pure Gaussian noise). However, as there is a linear relationship between \(\mathbf{v}_t\), \(\hat{\mathbf{x}}_0\) and \(\mathbf{x}_t\), this just yields a variant of the same underlying algorithm (see also this discussion of different diffusion model output parameterisations in an earlier blog post).

All these algorithms have in common that the marginal distributions of noisy examples \(p(\mathbf{x}_t)\) at each time step \(t\) are preserved: the distribution of \(\mathbf{x}_t\) does not depend on whether you chose to use a deterministic or stochastic sampling algorithm! This is of course not true at all for the conditional distributions \(p(\mathbf{x}_t \mid \mathbf{x}_T)\), which collapse to delta distributions in the deterministic case (all probability mass is on a single option). This preservation of the marginal distributions is also true for the special cases \(p(\mathbf{x}_0)\) and \(p(\mathbf{x}_T)\), at the data and noise sides respectively. If we look at specific individual examples rather than distributions, however, the path in input space traced out by the sampling process will look quite different.
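To make the deterministic procedure and the linear relationship between velocity and denoiser prediction concrete, here is a minimal sketch under stated assumptions: the interpolation \(\mathbf{x}_t = (1 - t)\,\mathbf{x}_0 + t\,\mathbf{x}_T\) with \(T = 1\) and \(\mathbf{x}_T \sim \mathcal{N}(0, 1)\) (a common Flow Matching choice, assumed here), a hypothetical 1D mixture of two Gaussians as the data, and illustrative function names. Under that interpolation, \(\mathbf{x}_T - \mathbf{x}_0 = (\mathbf{x}_t - \mathbf{x}_0)/t\), so \(\mathbf{v}_t = (\mathbf{x}_t - \hat{\mathbf{x}}_0)/t\).

```python
import math

# Hypothetical 1D target: an equal mixture of two Gaussians (illustrative values).
MEANS = (-2.0, 2.0)
WEIGHTS = (0.5, 0.5)
DATA_STD = 0.3


def denoise(x_t, t):
    """Exact E[x_0 | x_t], assuming x_t = (1 - t) * x_0 + t * x_T with x_T ~ N(0, 1)."""
    var = ((1.0 - t) * DATA_STD) ** 2 + t**2  # variance of x_t within a component
    gain = (1.0 - t) * DATA_STD**2 / var      # Cov(x_0, x_t) / Var(x_t)
    log_r = [math.log(w) - 0.5 * (x_t - (1.0 - t) * mu) ** 2 / var
             for w, mu in zip(WEIGHTS, MEANS)]
    top = max(log_r)
    r = [math.exp(v - top) for v in log_r]  # unnormalised component responsibilities
    cond = [mu + gain * (x_t - (1.0 - t) * mu) for mu in MEANS]
    return sum(ri * ci for ri, ci in zip(r, cond)) / sum(r)


def velocity(x_t, t):
    """E[x_T - x_0 | x_t], obtained linearly from the denoiser output (needs t > 0)."""
    return (x_t - denoise(x_t, t)) / t


def deterministic_sample(x_T, steps=200):
    """Euler integration of dx/dt = v(x, t) from t = 1 (noise) down to t = 0 (data)."""
    x = x_T
    for i in range(steps, 0, -1):
        t, t_next = i / steps, (i - 1) / steps
        x = x + (t_next - t) * velocity(x, t)
    return x


# The only randomness would be the choice of x_T: the same start gives the same end.
end_a = deterministic_sample(1.0)
end_b = deterministic_sample(1.0)
```

Since `velocity` is just a rescaled difference between `x_t` and `denoise(x_t, t)`, swapping one parameterisation for the other changes how the network output is interpreted, not the path being traced.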

Below is a visualisation of the sampling process: stochastic on the left, deterministic on the right. I decided to show this for both a 1D example (top) and a 2D example (bottom), because I believe the insights they provide are complementary. In both cases, the target distribution is a mixture of two Gaussians. We start with samples from our noise distribution, which is a single Gaussian. As sampling progresses, the distribution gradually transforms into the target mixture. The path a single sample traverses is quite jagged and erratic in the stochastic case, but smooth and gently curved in the deterministic case. Two very different microscopic behaviours give rise to the exact same macroscopic behaviour!

[Figure: Visualisation of stochastic (left) and deterministic (right) diffusion sampling for a mixture of two Gaussians in 1D (top) and 2D (bottom). Stochastic algorithms produce jagged sample paths, deterministic algorithms provide a smoother ride.]

Dead reckoning: tracking paths with a diffusion model

An important implication of the existence of deterministic sampling algorithms is that there must be a deterministic bijective mapping between individual samples from the noise and data distributions. Each noise sample is associated with a single specific data sample, and vice versa. Starting from a noise sample, we can follow a path through input space that leads us to the corresponding data sample. We do this simply by following the tangent direction to the path at each point, which is predicted by the denoiser. Note that we can also use the same tangent direction to guide us along the path in reverse, from data to noise.

The diagram below shows a sample from the noise distribution \(\mathbf{x}_T\), the corresponding data sample \(\mathbf{x}_0\), the path through input space connecting them, and an intermediate point on the path \(\mathbf{x}_t\). It also shows the denoiser prediction \(\hat{\mathbf{x}}_0\) at this point, which corresponds to the tangent direction to the path. If you’ve read my previous posts on the geometry of guidance or distillation, you will probably be familiar with this type of diagram. The former post also contains a warning about the dangers of representing high-dimensional objects in 2D, which bears repeating: great care should be taken when drawing conclusions from 2D intuitions!
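The idea of following the tangent direction in either direction can be sketched in a few lines. This is an illustration under assumptions, not the method of any particular paper: the corruption process is taken to be \(\mathbf{x}_t = \mathbf{x}_0 + \sigma_t \boldsymbol{\epsilon}\), the data is a hypothetical 1D mixture of two Gaussians with a closed-form denoiser, and `follow_path` is an illustrative name. One Euler step of \(\mathrm{d}\mathbf{x}/\mathrm{d}\sigma = (\mathbf{x} - \hat{\mathbf{x}}_0)/\sigma\) is applied per pair of noise levels, and the same update is used from noise to data and from data to noise.

```python
import math

# Hypothetical 1D target: an equal mixture of two Gaussians (illustrative values).
MEANS = (-2.0, 2.0)
WEIGHTS = (0.5, 0.5)
DATA_STD = 0.3


def denoise(x_t, sigma):
    """Exact E[x_0 | x_t] for the mixture above, assuming x_t = x_0 + sigma * noise."""
    var = DATA_STD**2 + sigma**2  # variance of x_t within each mixture component
    log_r = [math.log(w) - 0.5 * (x_t - mu) ** 2 / var for w, mu in zip(WEIGHTS, MEANS)]
    top = max(log_r)
    r = [math.exp(v - top) for v in log_r]  # unnormalised component responsibilities
    shrink = DATA_STD**2 / var
    cond = [mu + shrink * (x_t - mu) for mu in MEANS]  # E[x_0 | x_t, component]
    return sum(ri * ci for ri, ci in zip(r, cond)) / sum(r)


def follow_path(x, sigmas):
    """Move along the path through the given noise levels, one tangent step at a time."""
    for s, s_next in zip(sigmas[:-1], sigmas[1:]):
        x0_hat = denoise(x, s)  # the only inputs are the current position and noise level
        x = x0_hat + (s_next / s) * (x - x0_hat)
    return x


steps = 2000
down = [10.0 * (0.01 / 10.0) ** (i / steps) for i in range(steps + 1)]  # noise -> data
up = down[::-1]                                                         # data -> noise

x_T = 5.0
x_0 = follow_path(x_T, down)       # data sample associated with this noise sample
x_T_back = follow_path(x_0, up)    # retrace the same path in reverse

# In 1D, distinct starting points keep their order, so they reach distinct end points.
starts = (-15.0, -5.0, -0.5, 0.5, 5.0, 15.0)
ends = [follow_path(x, down) for x in starts]
```

With a finite number of steps the reverse traversal only approximately recovers `x_T`; the discrepancy is a discretisation error, not a property of the underlying paths.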

[Figure: Diagram showing a noise sample, the corresponding data sample, the path connecting them, an intermediate point on the path and the denoiser prediction at that point, tangent to the path.]

Using denoiser predictions to traverse these paths is memoryless: the only inputs to the denoiser are the current position in input space and the current noise level, from which it predicts a direction to move in, \(\hat{\mathbf{x}}_0 = f(\mathbf{x}_t, t)\). It is also myopic: the denoiser doesn’t get to peek ahead at the eventual destination \(\mathbf{x}_0\), it just says where to go next. It is not able to use any other information: no previously visited positions or previously predicted directions, no start- or endpoints, just where we are currently in the sampling process, and nothing else. This way of characterising paths brings to mind navigation through dead reckoning.

It follows that the path between a specific pair of noise and data samples that are connected in this way must be unique: if there were more than one path leading to a particular data sample, there would be multiple valid tangent directions at the point where these paths separate from each other. For the same reason, paths between different pairs of samples can never cross each other, because that would introduce ambiguity at the crossing point. It is not possible for the denoiser to distinguish between multiple crossing paths, because it only knows its current position, not which path it is on. This is shown in the diagram below.

[Figure: Diagram showing a hypothetical alternative path passing through the same intermediate point, which creates ambiguity about the tangent direction.]

Technically, this argument only demonstrates that paths cannot cross in \((\mathbf{x}_t, t)\)-space, but they could still cross in \(\mathbf{x}_t\)-space in theory, if the two paths in question arrive at the same point in input space at different time steps \(t\). In practice, we can ignore this edge case, because the distributions of noisy intermediate samples \(p(\mathbf{x}_t)\) for two sufficiently different time steps will have basically no overlap. In fact, some recent papers [8, 9] suggest that not feeding the current noise level into the denoiser often works just as well or even better, because in a high-dimensional input space, it is able to infer the noise level from \(\mathbf{x}_t\) itself.
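A toy illustration of why the noise level becomes inferable from \(\mathbf{x}_t\) as dimensionality grows, under assumptions chosen purely for convenience: the corruption process is \(\mathbf{x}_t = \mathbf{x}_0 + \sigma \boldsymbol{\epsilon}\), and every data coordinate is drawn independently from a hypothetical mixture of two Gaussians, so that the mean squared coordinate of \(\mathbf{x}_t\) concentrates around \(\mathbb{E}[x_0^2] + \sigma^2\). The estimator `estimate_sigma` is a hand-written stand-in; a trained denoiser would have to pick up on such cues implicitly.

```python
import math
import random

# Hypothetical per-coordinate data distribution: equal mixture of two Gaussians.
MEANS = (-2.0, 2.0)
DATA_STD = 0.3
# Second moment of one data coordinate: (mean)^2 + variance.
DATA_SECOND_MOMENT = 2.0**2 + DATA_STD**2


def noisy_example(rng, dim, sigma):
    """Draw x_t = x_0 + sigma * noise with i.i.d. mixture-distributed coordinates."""
    return [rng.gauss(rng.choice(MEANS), DATA_STD) + sigma * rng.gauss(0.0, 1.0)
            for _ in range(dim)]


def estimate_sigma(x_t):
    """Guess the noise level from x_t alone, via its mean squared coordinate."""
    mean_square = sum(v * v for v in x_t) / len(x_t)
    return math.sqrt(max(mean_square - DATA_SECOND_MOMENT, 0.0))


rng = random.Random(0)
for dim in (2, 4096):
    for sigma in (1.0, 2.0):
        estimate = estimate_sigma(noisy_example(rng, dim, sigma))
        print(f"dim={dim:5d}  true sigma={sigma:.1f}  estimate={estimate:.2f}")
```

In two dimensions the estimate fluctuates heavily from one draw to the next, whereas in a few thousand dimensions it lands close to the true value, which is one way to read the claim that \(p(\mathbf{x}_t)\) at sufficiently different noise levels barely overlap.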