In my last post I ran through the forward process of Gaussian Diffusion. In this post I'll break down the backward process. When I was working through this I found many steps were either omitted or just assumed to be known, so I've tried to fill in the blanks of other explanations as much as possible, e.g. putting each technique's definition before its use. I will still skip some of the routine algebra, as I think most readers will be comfortable with that.

Step 1 — Why We Condition on $x_0$

We cannot directly compute the reverse distribution:

$$q(x_{t-1} \mid x_t)$$

because evaluating it would require integrating over all possible images. So we "cheat" by conditioning on the clean image $x_0$, which makes the math tractable:

$$q(x_{t-1} \mid x_t,\, x_0)$$

We will eliminate $x_0$ later by substituting in the noise prediction from our model.

Step 2 — Apply Bayes' Theorem

Bayes' theorem in three-variable form is:

$$P(A \mid B, C) = \frac{P(B \mid A, C)\cdot P(A \mid C)}{P(B \mid C)}$$

Set $A = x_{t-1}$, $B = x_t$, $C = x_0$. Then:

$$q(x_{t-1} \mid x_t, x_0) = \frac{q(x_t \mid x_{t-1}, x_0)\cdot q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)}$$

The forward process is a Markov chain: each state $x_t$ depends only on the immediately preceding state $x_{t-1}$, not on any earlier history. Formally, for any $s < t-1$:

$$q(x_t \mid x_{t-1}, x_s) = q(x_t \mid x_{t-1})$$

Taking $s = 0$, this lets us drop $x_0$ from the first factor, because once $x_{t-1}$ is known, $x_0$ adds no further information about $x_t$:

$$q(x_t \mid x_{t-1}, x_0) = q(x_t \mid x_{t-1})$$

The denominator $q(x_t \mid x_0)$ does not depend on $x_{t-1}$, so it is just a normalisation constant. Dropping it gives the proportionality:

$$q(x_{t-1} \mid x_t, x_0) \propto q(x_t \mid x_{t-1})\cdot q(x_{t-1} \mid x_0)$$

Working up to proportionality lets us keep only the terms that depend on $x_{t-1}$, which makes things simpler; the normalisation constant can be recovered at the end because the result must integrate to one.

Step 3 — Write the Gaussians

Recall from the forward process derivation that, using $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\; \sqrt{\alpha_t}\,x_{t-1},\; (1-\alpha_t)\mathbf{I}\right)$$ $$q(x_{t-1} \mid x_0) = \mathcal{N}\!\left(x_{t-1};\; \sqrt{\bar{\alpha}_{t-1}}\,x_0,\; (1-\bar{\alpha}_{t-1})\mathbf{I}\right)$$

These are both Gaussians with known means and variances, exactly the form we need to multiply together, and we will do that multiplication in log-space. Why log-space? Partly because the algebra is nicer: the log turns a product of exponentials into a sum of quadratics in $x_{t-1}$. Partly it's one of those "Because Computers" reasons: raw density values can get really large or really small, and working on a log scale keeps everything numerically stable.
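
To make the symbols concrete, here is a minimal NumPy sketch of the schedule quantities these two Gaussians are built from. The linear schedule and its endpoints ($T = 1000$, $\beta$ from $10^{-4}$ to $0.02$) are common DDPM defaults I'm assuming here, not something fixed by the derivation.

```python
import numpy as np

# Assumed linear beta schedule (common DDPM defaults); betas[0] is beta_1.
T = 1000
betas = np.linspace(1e-4, 0.02, T)      # beta_1 ... beta_T
alphas = 1.0 - betas                    # alpha_t = 1 - beta_t
alpha_bars = np.cumprod(alphas)         # alpha_bar_t = prod_{s<=t} alpha_s

t = 500  # an arbitrary timestep, 2 <= t <= T

# q(x_t | x_{t-1}): mean sqrt(alpha_t) * x_{t-1}, variance (1 - alpha_t) * I
step_mean_scale = np.sqrt(alphas[t - 1])
step_var = 1.0 - alphas[t - 1]

# q(x_{t-1} | x_0): mean sqrt(alpha_bar_{t-1}) * x_0, variance (1 - alpha_bar_{t-1}) * I
skip_mean_scale = np.sqrt(alpha_bars[t - 2])
skip_var = 1.0 - alpha_bars[t - 2]
```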

Step 4 — Work in Log-Space

For a Gaussian, $\log \mathcal{N}(x;\,\mu,\,\sigma^2) \propto -\frac{(x-\mu)^2}{2\sigma^2}$. Taking the log of the proportionality from Step 2 and keeping only the $x_{t-1}$-dependent parts:

$$\log q(\cdot) \propto -\frac{(x_t - \sqrt{\alpha_t}\,x_{t-1})^2}{2(1-\alpha_t)} -\frac{(x_{t-1} - \sqrt{\bar{\alpha}_{t-1}}\,x_0)^2}{2(1-\bar{\alpha}_{t-1})}$$

Why can we omit the $1/(\sigma\sqrt{2\pi})$ part of the PDF? Proportionality again: that factor does not depend on $x_{t-1}$, so it is just a constant.

Expand the Terms

Term 1 — expand the square in the numerator:

$$-\frac{\alpha_t\,x_{t-1}^2 - 2\sqrt{\alpha_t}\,x_t\,x_{t-1} + x_t^2}{2(1-\alpha_t)}$$

The $x_t^2$ piece does not involve $x_{t-1}$, so it folds into the constant we are ignoring.

Term 2 — expand similarly:

$$-\frac{x_{t-1}^2 - 2\sqrt{\bar{\alpha}_{t-1}}\,x_0\,x_{t-1} + \bar{\alpha}_{t-1}\,x_0^2}{2(1-\bar{\alpha}_{t-1})}$$

Again the $x_0^2$ piece is constant in $x_{t-1}$ and can be dropped.

Collect Terms in $x_{t-1}$

Grouping the $x_{t-1}^2$ and $x_{t-1}$ coefficients:

$$\propto -\frac{1}{2} \left( \frac{\alpha_t}{1-\alpha_t} + \frac{1}{1-\bar{\alpha}_{t-1}} \right)x_{t-1}^2 + \left( \frac{\sqrt{\alpha_t}\,x_t}{1-\alpha_t} + \frac{\sqrt{\bar{\alpha}_{t-1}}\,x_0}{1-\bar{\alpha}_{t-1}} \right)x_{t-1}$$

Recall that the full log of a Gaussian is:

$$\log \mathcal{N}(x;\,\mu,\,\sigma^2) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2}$$

Expanding the square and dropping all terms that do not depend on $x$:

$$\propto -\frac{x^2}{2\sigma^2} + \frac{\mu\,x}{\sigma^2} \;=\; -\frac{1}{2\sigma^2}x^2 + \frac{\mu}{\sigma^2}x$$

This is exactly the log of a Gaussian in $x_{t-1}$. We can read off the variance and mean by matching to the standard form $-\frac{1}{2\sigma^2}x^2 + \frac{\mu}{\sigma^2}x$.
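
If you would rather not trust my expansion, here is a small SymPy check (my own sanity check, not part of the standard derivation) that expands the two quadratics from Step 4 and reads off the $x_{t-1}^2$ and $x_{t-1}$ coefficients:

```python
import sympy as sp

x_prev, x_t, x_0 = sp.symbols("x_prev x_t x_0")
a_t, abar_prev = sp.symbols("alpha_t alphabar_prev", positive=True)

# The two log-density terms from Step 4, constants already dropped.
log_q = (
    -(x_t - sp.sqrt(a_t) * x_prev) ** 2 / (2 * (1 - a_t))
    - (x_prev - sp.sqrt(abar_prev) * x_0) ** 2 / (2 * (1 - abar_prev))
)

expanded = sp.expand(log_q)
coeff_sq = expanded.coeff(x_prev, 2)   # should match -1/(2 * sigma^2)
coeff_lin = expanded.coeff(x_prev, 1)  # should match mu / sigma^2

claimed_sq = -sp.Rational(1, 2) * (a_t / (1 - a_t) + 1 / (1 - abar_prev))
claimed_lin = sp.sqrt(a_t) * x_t / (1 - a_t) + sp.sqrt(abar_prev) * x_0 / (1 - abar_prev)

print(sp.simplify(coeff_sq - claimed_sq))    # 0
print(sp.simplify(coeff_lin - claimed_lin))  # 0
```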

Step 5 — Posterior Variance

Look at what the collected expression gives us for the $x_{t-1}^2$ coefficient:

$$-\frac{1}{2}\left(\frac{\alpha_t}{1-\alpha_t} + \frac{1}{1-\bar{\alpha}_{t-1}}\right)x_{t-1}^2$$

In the standard Gaussian log-form, that coefficient is $-\frac{1}{2\sigma^2}$, so the whole bracket equals $\frac{1}{\sigma^2}$, the reciprocal variance (the precision), not the variance itself. That is why we write $\frac{1}{\tilde{\beta}_t}$ rather than $\tilde{\beta}_t$: the algebra hands us the reciprocal variance, and we invert at the end to get the variance. This confused me for a bit, which is why I emphasise it.

Matching the $x_{t-1}^2$ coefficient to $\frac{1}{\tilde{\beta}_t}$:

$$\frac{1}{\tilde{\beta}_t} = \frac{\alpha_t}{1-\alpha_t} + \frac{1}{1-\bar{\alpha}_{t-1}}$$

Combining over a common denominator:

$$\frac{1}{\tilde{\beta}_t} = \frac{\alpha_t(1-\bar{\alpha}_{t-1}) + (1-\alpha_t)}{(1-\alpha_t)(1-\bar{\alpha}_{t-1})}$$

The numerator simplifies using $\bar{\alpha}_t = \alpha_t\bar{\alpha}_{t-1}$:

$$\alpha_t - \alpha_t\bar{\alpha}_{t-1} + 1 - \alpha_t = 1 - \alpha_t\bar{\alpha}_{t-1} = 1 - \bar{\alpha}_t$$

Inverting gives the posterior variance:

$$\boxed{ \tilde{\beta}_t = \frac{(1-\alpha_t)(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t} }$$

Since $1-\alpha_t = \beta_t$, this can also be written as $\tilde{\beta}_t = \frac{\beta_t(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}$.
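
In code the posterior variance is just a per-timestep array you can precompute. A minimal NumPy sketch, reusing the same assumed linear schedule as above (the convention $\bar{\alpha}_0 = 1$ makes $\tilde{\beta}_1 = 0$):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)                    # assumed linear schedule, as before
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)
alpha_bars_prev = np.append(1.0, alpha_bars[:-1])     # alpha_bar_{t-1}, with alpha_bar_0 = 1

# tilde_beta_t = beta_t * (1 - alpha_bar_{t-1}) / (1 - alpha_bar_t)
posterior_variance = betas * (1.0 - alpha_bars_prev) / (1.0 - alpha_bars)

# Sanity check against the reciprocal-variance form from the matching step
# (skip t = 1, where 1 / (1 - alpha_bar_0) would divide by zero).
precision = alphas[1:] / (1.0 - alphas[1:]) + 1.0 / (1.0 - alpha_bars_prev[1:])
assert np.allclose(posterior_variance[1:], 1.0 / precision)
```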

Step 6 — Posterior Mean

Matching the $x_{t-1}$ coefficient to $\frac{\tilde{\mu}_t}{\tilde{\beta}_t}$:

$$\frac{\tilde{\mu}_t} {\tilde{\beta}_t} = \left( \frac{\sqrt{\alpha_t}\,x_t}{1-\alpha_t} + \frac{\sqrt{\bar{\alpha}_{t-1}}\,x_0}{1-\bar{\alpha}_{t-1}} \right)$$

Multiply by $\tilde{\beta}_t$:

$$\tilde{\mu}_t = \tilde{\beta}_t \left( \frac{\sqrt{\alpha_t}\,x_t}{1-\alpha_t} + \frac{\sqrt{\bar{\alpha}_{t-1}}\,x_0}{1-\bar{\alpha}_{t-1}} \right)$$

Substituting in $\tilde{\beta}_t$ and simplifying each term:

$$\tilde{\mu}_t = \frac{(1-\bar{\alpha}_{t-1})\sqrt{\alpha_t}}{1-\bar{\alpha}_t}\,x_t + \frac{(1-\alpha_t)\sqrt{\bar{\alpha}_{t-1}}}{1-\bar{\alpha}_t}\,x_0$$

This is the posterior mean as a weighted combination of $x_t$ and $x_0$. We still need to eliminate $x_0$, because we do not have it at inference time; in fact $x_0$ is exactly the result that inference is working towards.
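
In code the two weights become precomputed per-timestep arrays. A sketch in the same NumPy setup as before (the coefficient names are mine):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)                    # assumed linear schedule, as before
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)
alpha_bars_prev = np.append(1.0, alpha_bars[:-1])     # alpha_bar_{t-1}, with alpha_bar_0 = 1

# tilde_mu_t = coef_xt[t-1] * x_t + coef_x0[t-1] * x_0
coef_xt = np.sqrt(alphas) * (1.0 - alpha_bars_prev) / (1.0 - alpha_bars)
coef_x0 = np.sqrt(alpha_bars_prev) * betas / (1.0 - alpha_bars)

def posterior_mean(x_t, x_0, t):
    """Posterior mean tilde_mu_t as a weighted combination of x_t and x_0 (t is 1-based)."""
    return coef_xt[t - 1] * x_t + coef_x0[t - 1] * x_0
```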

Step 7 — Eliminate $x_0$

From the closed-form forward process derived in Part 1:

$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\varepsilon$$

Solving for $x_0$:

$$x_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\varepsilon}{\sqrt{\bar{\alpha}_t}}$$
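
As an aside, this rearrangement is also what you use in code to turn a noise estimate into a "predicted $x_0$". A minimal sketch (the helper name is my own):

```python
import numpy as np

def predict_x0_from_eps(x_t, eps, alpha_bar_t):
    """Invert the closed-form forward process: x_0 = (x_t - sqrt(1 - abar_t) * eps) / sqrt(abar_t)."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)
```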

Substituting this expression for $x_0$ into $\tilde{\mu}_t$ and collecting terms (the algebra is tedious; the key identity is $\sqrt{\bar{\alpha}_{t-1}}/\sqrt{\bar{\alpha}_t} = 1/\sqrt{\alpha_t}$, after which the $x_t$ coefficients combine via $\alpha_t(1-\bar{\alpha}_{t-1}) + (1-\alpha_t) = 1-\bar{\alpha}_t$, just as in Step 5):

$$\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\varepsilon \right)$$
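
It is easy to check numerically that this matches the weighted $x_t$/$x_0$ form from Step 6. A small self-contained sketch with scalar values and the same assumed schedule:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # assumed linear schedule, as before
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

t = 500                                  # any 2 <= t <= T; betas[t - 1] is beta_t
a_t, abar_t, abar_prev = alphas[t - 1], alpha_bars[t - 1], alpha_bars[t - 2]

x_0 = rng.normal()                       # a pretend clean sample (scalar for simplicity)
eps = rng.normal()                       # the true noise
x_t = np.sqrt(abar_t) * x_0 + np.sqrt(1 - abar_t) * eps   # closed-form forward process

# Step 6 form: weighted combination of x_t and x_0.
mu_from_x0 = (np.sqrt(a_t) * (1 - abar_prev) * x_t
              + np.sqrt(abar_prev) * (1 - a_t) * x_0) / (1 - abar_t)

# Step 7 form: x_0 eliminated in favour of the noise.
mu_from_eps = (x_t - (1 - a_t) / np.sqrt(1 - abar_t) * eps) / np.sqrt(a_t)

assert np.isclose(mu_from_x0, mu_from_eps)
```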

Final Result

Using $1 - \alpha_t = \beta_t$ and replacing the true noise $\varepsilon$ with the model's noise prediction $\varepsilon_\theta(x_t, t)$:

$$\boxed{ \tilde{\mu}_t(x_t,\,\varepsilon) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\varepsilon \right) }$$ $$\boxed{ \tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\varepsilon_\theta(x_t,\,t) \right) }$$

Combined with the posterior variance $\tilde{\beta}_t$, this defines the full reverse step distribution:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\; \tilde{\mu}_t,\; \tilde{\beta}_t\,\mathbf{I}\right)$$

Putting It Together: The Denoising Loop

At each reverse step $t$ from $T$ down to $1$ (the full loop is sketched in code after this list):

  1. Predict the noise: run the neural network forward to get $\varepsilon_\theta(x_t, t)$.
  2. Compute the mean: $$\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}}\!\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\varepsilon_\theta(x_t,t)\right)$$
  3. Sample the next state: $$x_{t-1} \sim \mathcal{N}(\tilde{\mu}_t,\;\tilde{\beta}_t\,\mathbf{I})$$ (At $t = 1$ the stochastic noise term is usually dropped and you just take $x_0 = \tilde{\mu}_1$.)
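
Here is what that loop looks like as a minimal NumPy sketch. `eps_model` stands in for whatever trained network you have (any callable taking $(x_t, t)$ and returning a noise prediction), and the schedule is the same assumed linear one as above.

```python
import numpy as np

def sample(eps_model, shape, T=1000, seed=0):
    """Ancestral sampling sketch: start from x_T ~ N(0, I), then take T reverse steps."""
    rng = np.random.default_rng(seed)

    betas = np.linspace(1e-4, 0.02, T)                # assumed linear schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    alpha_bars_prev = np.append(1.0, alpha_bars[:-1])
    posterior_var = betas * (1.0 - alpha_bars_prev) / (1.0 - alpha_bars)   # tilde_beta_t

    x = rng.standard_normal(shape)                    # x_T
    for t in range(T, 0, -1):
        i = t - 1                                     # array index for timestep t
        eps = eps_model(x, t)                         # 1. predict the noise
        mu = (x - betas[i] / np.sqrt(1.0 - alpha_bars[i]) * eps) / np.sqrt(alphas[i])  # 2. posterior mean
        if t > 1:                                     # 3. sample x_{t-1}; drop the noise at the final step
            x = mu + np.sqrt(posterior_var[i]) * rng.standard_normal(shape)
        else:
            x = mu
    return x                                          # this is x_0
```

Running `sample(lambda x, t: np.zeros_like(x), (4, 4))` with a dummy model that predicts zero noise is a quick way to sanity-check the shapes and indexing before plugging in a real network.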

Conclusion

Starting from Bayes' theorem and the Gaussian distributions established in Part 1, we derived the complete posterior $q(x_{t-1} \mid x_t, x_0)$: a Gaussian with analytically tractable mean $\tilde{\mu}_t$ and variance $\tilde{\beta}_t$.