I've been learning all about diffusion models, and my goodness, the math took me by surprise lol. I'm writing this to reinforce what I learnt and to help anyone else working through the same material.

The General Idea

The core idea behind diffusion models is surprisingly straightforward: you train a neural network to predict a noise term drawn from a Gaussian distribution. Concretely, you corrupt an input (an image, say) by gradually adding noise over $T$ steps, then train the network to reverse that process — predicting and removing noise step by step until you recover something clean.

The Forward Process

The forward process runs from $t = 0$ to $t = T$ (typically $T = 1000$), slowly corrupting the input into pure Gaussian noise. The tricky part is that we need to be able to compute $x_t$ — the noisy image at an arbitrary time step $t$ — without running the process step-by-step from $x_0$.

The naive approach would be to just run the process step by step, adding a little noise each time. That works, but during training you need to jump to an arbitrary $x_t$ directly from $x_0$, which calls for a closed-form expression. Fortunately, Gaussians are closed under addition, and that is exactly what makes such an expression possible; the math below builds toward it.

Notation: The Gaussian

A Gaussian (normal) distribution is written as:

$$\mathcal{N}(\mu,\, \sigma^2)$$

Let $\beta_t \in (0, 1)$ be a small number, typically chosen to grow as $t$ increases. The sequence of $\beta_t$ values is called the noise schedule (more on that in a moment).
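For concreteness, here is a minimal sketch of one common choice, the linear schedule from the original DDPM paper, where $\beta_t$ ramps from $10^{-4}$ to $0.02$ over $T = 1000$ steps. The exact endpoints are a design choice rather than anything forced by the math:

```python
import numpy as np

# A minimal sketch of a linear noise schedule (endpoint values from the DDPM
# paper; any small, slowly increasing sequence in (0, 1) works in principle).
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # beta_1 ... beta_T

print(betas[0], betas[-1])           # 0.0001  0.02
```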

The forward transition — the distribution of $x_t$ given $x_{t-1}$ — is defined as:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\; \sqrt{1-\beta_t}\, x_{t-1},\; \beta_t \mathbf{I}\right)$$

This gives you the next noisy image $x_t$ given the previous one, $x_{t-1}$.

The Sampling Form

The familiar PDF form is:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
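As a quick numerical sanity check (nothing diffusion-specific here, and assuming NumPy and SciPy are available), evaluating this formula by hand agrees with `scipy.stats.norm.pdf`:

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 2.0, 1.5
x = np.linspace(-4, 8, 200)

# The PDF written out by hand ...
by_hand = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# ... matches the library implementation.
assert np.allclose(by_hand, norm.pdf(x, loc=mu, scale=sigma))
```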

The PDF form is unwieldy when we want to isolate the randomness. But where does the sampling form actually come from? The answer is the change of variables formula.

Start with a standard normal $z \sim \mathcal{N}(0, 1)$ and define $x = \mu + \sigma z$. To find the distribution of $x$, we use the change of variables formula. For a 1D transformation $x = g(z)$:

$$p_x(x) = p_z(z) \left|\frac{dz}{dx}\right|$$

Inverting $x = \mu + \sigma z$ gives $z = \frac{x - \mu}{\sigma}$, so the derivative is:

$$\frac{dz}{dx} = \frac{1}{\sigma}$$

Substituting both into the formula:

$$p_x(x) = p_z\!\left(\frac{x-\mu}{\sigma}\right) \cdot \left|\frac{1}{\sigma}\right|$$

Now expand using the standard normal PDF $p_z(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$:

$$p_x(x) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{\!2}\right) \cdot \frac{1}{\sigma}$$

Rearranging (assuming $\sigma > 0$):

$$p_x(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

That is exactly $\mathcal{N}(\mu, \sigma^2)$. So the sampling form is not a trick — it is a direct consequence of change of variables. Sampling $z \sim \mathcal{N}(0,1)$ and computing $x = \mu + \sigma z$ gives you a sample from $\mathcal{N}(\mu, \sigma^2)$. This is the equivalent sampling form:

$$x = \mu + \sigma\,\varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, 1)$$
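A small empirical check of this (the "reparameterization trick", in ML jargon): drawing standard normal noise and computing $\mu + \sigma\varepsilon$ produces samples with the expected mean and standard deviation. The numbers here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 1.5

# Sample standard normal noise, then shift and scale it.
eps = rng.standard_normal(1_000_000)
x = mu + sigma * eps

print(x.mean(), x.std())   # ~2.0 and ~1.5, i.e. samples from N(mu, sigma^2)
```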

Plugging our transition parameters into the sampling form gives:

$$x_t = \sqrt{1 - \beta_t}\; x_{t-1} + \sqrt{\beta_t}\; \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \mathbf{I})$$

Note that the noise term is scaled by $\sqrt{\beta_t}$, not $\beta_t$ — because $\beta_t$ is the variance, and we need the standard deviation to scale the noise sample. Why both coefficients must be square roots specifically is covered in the variance preservation section below.
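In code, one forward step is just that line. A sketch, assuming `x_prev` is a NumPy array of any shape (a normalised image, say):

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_step(x_prev, beta_t):
    """One step of the forward process: sample x_t from q(x_t | x_{t-1})."""
    eps = rng.standard_normal(x_prev.shape)   # fresh noise at every step
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * eps

# e.g. x1 = forward_step(x0, betas[0]) for some normalised image x0
```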

Substitution: Introducing $\alpha_t$

To clean things up, define:

$$\alpha_t = 1 - \beta_t$$

The transition becomes:

$$x_t = \sqrt{\alpha_t}\; x_{t-1} + \sqrt{1 - \alpha_t}\; \varepsilon$$

This is the same equation — just written more compactly.

The Key Insight: Any $x_t$ Directly from $x_0$

The equation above gives $x_t$ from $x_{t-1}$ — one step back. What we want is $x_t$ directly from $x_0$, so we can sample any noise level in a single operation during training. Let's expand a few steps and look for the pattern.

$t = 0$: $\quad x_0 = x_0 \quad$ (the original input image)

$t = 1$:

$$x_1 = \sqrt{\alpha_1}\; x_0 + \sqrt{1 - \alpha_1}\; \varepsilon_1$$

$t = 2$:

$$x_2 = \sqrt{\alpha_2}\; x_1 + \sqrt{1 - \alpha_2}\; \varepsilon_2$$

Now substitute $x_1$ into the expression for $x_2$ and expand:

$$x_2 = \sqrt{\alpha_1 \alpha_2}\; x_0 + \underbrace{\sqrt{\alpha_2}\sqrt{1-\alpha_1}\;\varepsilon_1 + \sqrt{1-\alpha_2}\;\varepsilon_2}_{\text{linear combination of Gaussians}}$$

The remaining terms (after $x_0$) don't depend on $x_0$ — they are a linear combination of independent Gaussians. By the closure property of Gaussians:

A sum of independent Gaussians is itself a Gaussian.

Those terms collapse into a single Gaussian noise term whose variance is the sum of the individual variances: $\alpha_2(1 - \alpha_1) + (1 - \alpha_2) = 1 - \alpha_1\alpha_2$ (a quick symbolic check of this follows the list below). Notice that this is one minus the squared coefficient on $x_0$, so the only thing we need to track is the product of the $\alpha$'s. Continuing:

  • $t = 2$: coefficient is $\sqrt{\alpha_1 \alpha_2}$
  • $t = 3$: coefficient would be $\sqrt{\alpha_1 \alpha_2 \alpha_3}$
  • $t = k$: coefficient is $\sqrt{\alpha_1 \alpha_2 \cdots \alpha_k}$
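Here is the quick check promised above: the variance of the collapsed noise term at $t = 2$ really does simplify to $1 - \alpha_1\alpha_2$. A sketch using SymPy, with nothing diffusion-specific about it:

```python
import sympy as sp

a1, a2 = sp.symbols("alpha_1 alpha_2", positive=True)

# Independent Gaussians: variances add. The two noise terms in the t = 2
# expansion contribute alpha_2*(1 - alpha_1) and (1 - alpha_2) respectively.
combined_var = a2 * (1 - a1) + (1 - a2)

print(sp.simplify(combined_var))   # -> 1 - alpha_1*alpha_2
```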

Define the cumulative product:

$$\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$$
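With a schedule in hand, $\bar{\alpha}_t$ is just a cumulative product over the $\alpha$'s. Continuing the hypothetical linear schedule from earlier:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # same hypothetical linear schedule as before
alphas = 1.0 - betas
alphas_bar = np.cumprod(alphas)        # alphas_bar[t-1] = alpha_1 * ... * alpha_t

print(alphas_bar[0], alphas_bar[-1])   # ~0.9999 at t = 1, close to 0 at t = T
```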

Then the closed-form expression for $x_t$ given $x_0$ is:

$$\boxed{x_t = \sqrt{\bar{\alpha}_t}\; x_0 + \sqrt{1 - \bar{\alpha}_t}\; \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \mathbf{I})}$$

This is the key result of the forward process. Given any input image $x_0$, you can sample the noisy version at any time step $t$ in a single operation — no need to iterate through all previous steps. As $t \to T$, $\bar{\alpha}_t \to 0$ and the image becomes pure noise; as $t \to 0$, $\bar{\alpha}_t \to 1$ and you recover the original image.
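Below is a sketch of the closed-form sampler, plus a sanity check that jumping straight to $x_t$ agrees in distribution with running the chain step by step. The name `q_sample` and the 1-D "data" are my own choices for illustration, not anything standard:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)               # hypothetical linear schedule
alphas_bar = np.cumprod(1.0 - betas)

def q_sample(x0, t):
    """Sample x_t directly from x_0 via the closed form (t is 1-indexed)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[t - 1]) * x0 + np.sqrt(1.0 - alphas_bar[t - 1]) * eps

# Sanity check: iterating the one-step transition up to t should agree, in
# distribution, with the single jump. Compare each result's correlation with x0.
x0 = rng.standard_normal(100_000)                # stand-in for unit-variance data
t = 300

x_iter = x0.copy()
for s in range(t):
    eps = rng.standard_normal(x_iter.shape)
    x_iter = np.sqrt(1.0 - betas[s]) * x_iter + np.sqrt(betas[s]) * eps

x_jump = q_sample(x0, t)

print(np.corrcoef(x0, x_iter)[0, 1])             # ~ sqrt(alphas_bar[t-1])
print(np.corrcoef(x0, x_jump)[0, 1])             # ~ the same value
print(np.sqrt(alphas_bar[t - 1]))
```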

Why the Square Roots? Variance Preservation

Both coefficients are square roots because of a requirement called variance preservation. We want each forward step to maintain unit variance — if $\operatorname{Var}(x_{t-1}) = 1$, we need $\operatorname{Var}(x_t) = 1$ as well. This keeps the signal scale stable as noise accumulates.

Starting from $x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1 - \alpha_t}\, \varepsilon$, using $\operatorname{Var}(aX) = a^2 \operatorname{Var}(X)$ and the fact that variances of independent terms add:

$$\operatorname{Var}(x_t) = \left(\sqrt{\alpha_t}\right)^2 \operatorname{Var}(x_{t-1}) + \left(\sqrt{1 - \alpha_t}\right)^2 \operatorname{Var}(\varepsilon) = \alpha_t \operatorname{Var}(x_{t-1}) + (1 - \alpha_t)\, \operatorname{Var}(\varepsilon)$$

Now assume $\operatorname{Var}(x_{t-1}) = 1$ (by induction, with $x_0$ being normalised data), and recall $\varepsilon \sim \mathcal{N}(0, \mathbf{I})$ so $\operatorname{Var}(\varepsilon) = 1$:

$$\operatorname{Var}(x_t) = \alpha_t \cdot 1 + (1 - \alpha_t) \cdot 1 = 1 \checkmark$$

Variance is perfectly preserved at every step. To see why the square roots are necessary, consider what happens without them — say we used $x_t = \alpha_t\, x_{t-1} + (1 - \alpha_t)\, \varepsilon$ instead:

$$\operatorname{Var}(x_t) = \alpha_t^2 + (1 - \alpha_t)^2 < 1 \qquad \text{for } 0 < \alpha_t < 1$$

The squared terms no longer sum to 1, so the variance shrinks and the signal scale drifts across steps. The square roots are precisely what makes the two coefficients sum to 1 after squaring.
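A quick numerical illustration of both cases (a sketch; the array `x` is just stand-in unit-variance data): with the square-root coefficients the variance stays near 1 across the whole chain, while with the naive coefficients it collapses:

```python
import numpy as np

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)            # hypothetical linear schedule

x_good = rng.standard_normal(50_000)             # unit-variance "data"
x_bad = x_good.copy()

for beta in betas:
    alpha = 1.0 - beta
    eps = rng.standard_normal(x_good.shape)
    x_good = np.sqrt(alpha) * x_good + np.sqrt(beta) * eps   # variance-preserving
    x_bad = alpha * x_bad + (1.0 - alpha) * eps              # no square roots

print(x_good.var())   # ~1.0
print(x_bad.var())    # far below 1: the signal scale has drifted
```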

Conclusion

This covers the math behind the forward diffusion process: how to corrupt data with Gaussians, why the square roots are necessary for variance stability, and how to derive the closed-form $q(x_t \mid x_0)$ that makes training tractable. The next part is the reverse process, which is just as math-heavy; that will come in the next article.