Diffusion #2: The Math Of The Gaussian Backward Pass
In my last post I ran through the forward process of Gaussian Diffusion. In this post I'll break down the backward process. When I was working through it, I found many steps were either omitted or simply assumed to be known, so I've tried to fill in the blanks of other explanations as much as possible, e.g. putting each technique's definition before its use. I will still skip routine algebra steps, as I think most readers will be comfortable with those.
Step 1 — Why We Condition on $x_0$
We cannot directly compute the reverse distribution:
$$q(x_{t-1} \mid x_t)$$because evaluating it would require integrating over all possible images. So we "cheat" by conditioning on the clean image $x_0$, which makes the math tractable:
$$q(x_{t-1} \mid x_t,\, x_0)$$We will eliminate $x_0$ later by substituting in the noise prediction from our model.
Step 2 — Apply Bayes' Theorem
Bayes' theorem in three-variable form is:
$$P(A \mid B, C) = \frac{P(B \mid A, C)\cdot P(A \mid C)}{P(B \mid C)}$$Set $A = x_{t-1}$, $B = x_t$, $C = x_0$. Then:
$$q(x_{t-1} \mid x_t, x_0) = \frac{q(x_t \mid x_{t-1}, x_0)\cdot q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)}$$The forward process is a Markov chain: each state $x_t$ depends only on the immediately preceding state $x_{t-1}$, not on any earlier history. Formally, for any $s < t-1$:
$$q(x_t \mid x_{t-1}, x_s) = q(x_t \mid x_{t-1})$$This allows us to drop $x_0$ from the first factor: once $x_{t-1}$ is known, conditioning on $x_0$ adds no information:
$$q(x_t \mid x_{t-1}, x_0) = q(x_t \mid x_{t-1})$$The denominator $q(x_t \mid x_0)$ does not depend on $x_{t-1}$, so it is just a normalisation constant. Dropping it gives the proportionality:
$$q(x_{t-1} \mid x_t, x_0) \propto q(x_t \mid x_{t-1})\cdot q(x_{t-1} \mid x_0)$$Working up to proportionality like this isolates the terms that actually depend on $x_{t-1}$, which keeps the algebra manageable.
Step 3 — Write the Gaussians
Recall from the forward process derivation that, using $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\; \sqrt{\alpha_t}\,x_{t-1},\; (1-\alpha_t)\mathbf{I}\right)$$ $$q(x_{t-1} \mid x_0) = \mathcal{N}\!\left(x_{t-1};\; \sqrt{\bar{\alpha}_{t-1}}\,x_0,\; (1-\bar{\alpha}_{t-1})\mathbf{I}\right)$$These are all Gaussians with known means and variances — exactly the form we need to multiply together in log-space. Why log-space? Partly it's algebra: taking logs turns a product of exponentials into a sum of quadratics, which is much easier to manipulate. Partly it's one of those "Because Computers" reasons: raw probability values can get vanishingly small or very large, and log-scale keeps everything numerically stable.
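As a quick sanity check on these two kernels, here is a minimal NumPy sketch; the linear beta schedule values are my own illustrative choice, not something fixed by the derivation:

```python
import numpy as np

# Illustrative linear beta schedule (values are an assumption, not from the post).
T = 1000
betas = np.linspace(1e-4, 0.02, T)       # beta_1 .. beta_T
alphas = 1.0 - betas                     # alpha_t = 1 - beta_t
alpha_bars = np.cumprod(alphas)          # alpha_bar_t = prod_{s<=t} alpha_s

# The two Gaussians above, as (mean scale, variance) pairs at an arbitrary step:
t = 500
q_xt_given_xtm1 = (np.sqrt(alphas[t]), 1.0 - alphas[t])          # q(x_t | x_{t-1})
q_xtm1_given_x0 = (np.sqrt(alpha_bars[t - 1]), 1.0 - alpha_bars[t - 1])  # q(x_{t-1} | x_0)
```

Note that $\bar{\alpha}_t$ shrinks monotonically towards zero, which is why late-step $x_t$ is almost pure noise.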
Step 4 — Work in Log-Space
For a Gaussian, $\log \mathcal{N}(x;\,\mu,\,\sigma^2) \propto -\frac{(x-\mu)^2}{2\sigma^2}$. Taking the log of the proportionality from Step 2 and treating only the $x_{t-1}$-dependent parts:
$$\log q(\cdot) \propto -\frac{(x_t - \sqrt{\alpha_t}\,x_{t-1})^2}{2(1-\alpha_t)} -\frac{(x_{t-1} - \sqrt{\bar{\alpha}_{t-1}}\,x_0)^2}{2(1-\bar{\alpha}_{t-1})}$$Why can we omit the $1/(\sigma\sqrt{2\pi})$ part of the PDF? Again, proportionality: nothing in it depends on $x_{t-1}$, so it's just a constant.
Expand the Terms
Term 1 — expand the square in the numerator:
$$-\frac{\alpha_t\,x_{t-1}^2 - 2\sqrt{\alpha_t}\,x_t\,x_{t-1} + x_t^2}{2(1-\alpha_t)}$$The $x_t^2$ piece does not involve $x_{t-1}$, so it folds into the constant we are ignoring.
Term 2 — expand similarly:
$$-\frac{x_{t-1}^2 - 2\sqrt{\bar{\alpha}_{t-1}}\,x_0\,x_{t-1} + \bar{\alpha}_{t-1}\,x_0^2}{2(1-\bar{\alpha}_{t-1})}$$Again the $x_0^2$ piece is constant in $x_{t-1}$ and can be dropped.
Collect Terms in $x_{t-1}$
Grouping the $x_{t-1}^2$ and $x_{t-1}$ coefficients:
$$\propto -\frac{1}{2} \left( \frac{\alpha_t}{1-\alpha_t} + \frac{1}{1-\bar{\alpha}_{t-1}} \right)x_{t-1}^2 + \left( \frac{\sqrt{\alpha_t}\,x_t}{1-\alpha_t} + \frac{\sqrt{\bar{\alpha}_{t-1}}\,x_0}{1-\bar{\alpha}_{t-1}} \right)x_{t-1}$$Recall that the full log of a Gaussian is:
$$\log \mathcal{N}(x;\,\mu,\,\sigma^2) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2}$$Expanding the square and dropping all terms that do not depend on $x$:
$$\propto -\frac{x^2}{2\sigma^2} + \frac{\mu\,x}{\sigma^2} \;=\; -\frac{1}{2\sigma^2}x^2 + \frac{\mu}{\sigma^2}x$$This is exactly the log of a Gaussian in $x_{t-1}$. We can read off the variance and mean by matching to the standard form $-\frac{1}{2\sigma^2}x^2 + \frac{\mu}{\sigma^2}x$.
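The coefficient-matching trick is easy to verify numerically. A small sketch (the means and variances are arbitrary illustrative values): add two Gaussian log-densities, read $\sigma^2$ and $\mu$ off the $x^2$ and $x$ coefficients, and confirm the recovered mean against a brute-force grid maximum.

```python
import numpy as np

# Two Gaussians with illustrative parameters (not from the post).
mu1, var1 = 0.7, 0.3
mu2, var2 = -0.2, 0.5

# Sum of the two quadratic log-densities is A*x^2 + B*x (+ const).
A = -0.5 * (1.0 / var1 + 1.0 / var2)   # coefficient of x^2
B = mu1 / var1 + mu2 / var2            # coefficient of x

var_post = -1.0 / (2.0 * A)            # match A = -1/(2 sigma^2)
mu_post = B * var_post                 # match B = mu / sigma^2

# Brute-force check: the summed log-density should peak at mu_post.
xs = np.linspace(-5, 5, 200001)
logp = -(xs - mu1) ** 2 / (2 * var1) - (xs - mu2) ** 2 / (2 * var2)
mu_grid = xs[np.argmax(logp)]
```

This is exactly the manipulation we are about to perform on the two forward-process Gaussians.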
Step 5 — Posterior Variance
Look at what the collected expression gives us for the $x_{t-1}^2$ coefficient:
$$-\frac{1}{2}\left(\frac{\alpha_t}{1-\alpha_t} + \frac{1}{1-\bar{\alpha}_{t-1}}\right)x_{t-1}^2$$In the standard Gaussian log-form, that coefficient is $-\frac{1}{2\sigma^2}$, so the whole bracket equals $\frac{1}{\sigma^2}$, the reciprocal variance (the precision), not the variance itself. That is why we write $\frac{1}{\tilde{\beta}_t}$ rather than $\tilde{\beta}_t$: the algebra hands us the reciprocal variance, and we invert at the end to get the variance. This confused me for a bit, which is why I emphasise it.
Matching the $x_{t-1}^2$ coefficient to $\frac{1}{\tilde{\beta}_t}$:
$$\frac{1}{\tilde{\beta}_t} = \frac{\alpha_t}{1-\alpha_t} + \frac{1}{1-\bar{\alpha}_{t-1}}$$Combining over a common denominator:
$$\frac{1}{\tilde{\beta}_t} = \frac{\alpha_t(1-\bar{\alpha}_{t-1}) + (1-\alpha_t)}{(1-\alpha_t)(1-\bar{\alpha}_{t-1})}$$The numerator simplifies using $\bar{\alpha}_t = \alpha_t\bar{\alpha}_{t-1}$:
$$\alpha_t - \alpha_t\bar{\alpha}_{t-1} + 1 - \alpha_t = 1 - \alpha_t\bar{\alpha}_{t-1} = 1 - \bar{\alpha}_t$$Inverting gives the posterior variance:
$$\boxed{ \tilde{\beta}_t = \frac{(1-\alpha_t)(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t} }$$Since $1-\alpha_t = \beta_t$, this can also be written as $\tilde{\beta}_t = \frac{\beta_t(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}$.
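The boxed formula can be checked numerically against the summed-precision form it was derived from; the schedule below is an illustrative assumption:

```python
import numpy as np

# Illustrative schedule (an assumption, not from the post).
betas = np.linspace(1e-4, 0.02, 1000)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

t = 400  # any 1 <= t < T works

# Reciprocal variance straight from the collected x_{t-1}^2 coefficient.
precision = alphas[t] / (1 - alphas[t]) + 1.0 / (1 - alpha_bars[t - 1])
beta_tilde_from_precision = 1.0 / precision

# The boxed closed form.
beta_tilde = (1 - alphas[t]) * (1 - alpha_bars[t - 1]) / (1 - alpha_bars[t])
```

A side observation: since $(1-\bar{\alpha}_{t-1})/(1-\bar{\alpha}_t) < 1$, the posterior variance $\tilde{\beta}_t$ is always smaller than $\beta_t$.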
Step 6 — Posterior Mean
Matching the $x_{t-1}$ coefficient to $\frac{\tilde{\mu}_t}{\tilde{\beta}_t}$:
$$\frac{\tilde{\mu}_t} {\tilde{\beta}_t} = \left( \frac{\sqrt{\alpha_t}\,x_t}{1-\alpha_t} + \frac{\sqrt{\bar{\alpha}_{t-1}}\,x_0}{1-\bar{\alpha}_{t-1}} \right)$$Multiply both sides by $\tilde{\beta}_t$:
$$\tilde{\mu}_t = \tilde{\beta}_t \left( \frac{\sqrt{\alpha_t}\,x_t}{1-\alpha_t} + \frac{\sqrt{\bar{\alpha}_{t-1}}\,x_0}{1-\bar{\alpha}_{t-1}} \right)$$Substituting in $\tilde{\beta}_t$ and simplifying each term:
$$\tilde{\mu}_t = \frac{(1-\bar{\alpha}_{t-1})\sqrt{\alpha_t}}{1-\bar{\alpha}_t}\,x_t + \frac{(1-\alpha_t)\sqrt{\bar{\alpha}_{t-1}}}{1-\bar{\alpha}_t}\,x_0$$This is the posterior mean as a weighted combination of $x_t$ and $x_0$. We need to eliminate $x_0$, since we do not have it at inference time; $x_0$ is, after all, the very thing inference is working towards.
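Both closed forms can be verified end-to-end by brute force: multiply the two Gaussians on a dense grid of $x_{t-1}$ values, normalise, and compare the empirical mean and variance with $\tilde{\mu}_t$ and $\tilde{\beta}_t$. All the numbers below are illustrative assumptions.

```python
import numpy as np

# Illustrative scalars (assumptions): one step's alphas and a 1-D "image".
alpha_t, alpha_bar_tm1 = 0.98, 0.5
alpha_bar_t = alpha_t * alpha_bar_tm1       # alpha_bar_t = alpha_t * alpha_bar_{t-1}
x0, xt = 0.3, -0.1

# Unnormalised log posterior over a dense grid of x_{t-1} values.
xs = np.linspace(-10, 10, 400001)
log_q = (-(xt - np.sqrt(alpha_t) * xs) ** 2 / (2 * (1 - alpha_t))
         - (xs - np.sqrt(alpha_bar_tm1) * x0) ** 2 / (2 * (1 - alpha_bar_tm1)))
w = np.exp(log_q - log_q.max())
w /= w.sum()
mean_grid = (w * xs).sum()
var_grid = (w * (xs - mean_grid) ** 2).sum()

# The closed forms derived above.
beta_tilde = (1 - alpha_t) * (1 - alpha_bar_tm1) / (1 - alpha_bar_t)
mu_tilde = ((1 - alpha_bar_tm1) * np.sqrt(alpha_t) / (1 - alpha_bar_t) * xt
            + (1 - alpha_t) * np.sqrt(alpha_bar_tm1) / (1 - alpha_bar_t) * x0)
```

The grid mean and variance should agree with the closed forms to several decimal places.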
Step 7 — Eliminate $x_0$
From the closed-form forward process derived in Part 1:
$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\varepsilon$$Solving for $x_0$:
$$x_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\varepsilon}{\sqrt{\bar{\alpha}_t}}$$Substituting this into the expression for $\tilde{\mu}_t$ and collecting terms (the algebra is tedious, the $x_t$ terms from both parts combine cleanly):
$$\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\varepsilon \right)$$Final Result
Using $1 - \alpha_t = \beta_t$ and replacing the true noise $\varepsilon$ with the model's noise prediction $\varepsilon_\theta(x_t, t)$:
$$\boxed{ \tilde{\mu}_t(x_t,\,\varepsilon) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\varepsilon \right) }$$ $$\boxed{ \tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\varepsilon_\theta(x_t,\,t) \right) }$$Combined with the posterior variance $\tilde{\beta}_t$, this defines the full reverse step distribution:
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\; \tilde{\mu}_t,\; \tilde{\beta}_t\,\mathbf{I}\right)$$Putting It Together: The Denoising Loop
At each reverse step $t$ from $T$ down to $1$:
- Predict the noise: run the neural network forward to get $\varepsilon_\theta(x_t, t)$.
- Compute the mean: $$\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}}\!\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\varepsilon_\theta(x_t,t)\right)$$
- Sample the next state: $$x_{t-1} \sim \mathcal{N}(\tilde{\mu}_t,\;\tilde{\beta}_t\,\mathbf{I})$$ (At $t = 1$ the stochastic noise term is usually dropped and you just take $x_0 = \tilde{\mu}_1$.)
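The loop above can be sketched in a few lines of NumPy, under stated assumptions: the "model" here is an oracle that returns the true noise (a real sampler would call a trained network $\varepsilon_\theta$ instead), and the schedule, dimensionality, and seed are my own illustrative choices. The final lines also cross-check that the Step 6 and Step 7 mean formulas agree numerically.

```python
import numpy as np

# Sketch of the reverse loop. eps_oracle stands in for a trained eps_theta;
# the schedule, dimension (4), and seed are illustrative assumptions.
rng = np.random.default_rng(0)
T = 50
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

x0_true = rng.normal(size=4)  # stand-in "clean image"

def eps_oracle(x_t, t):
    # Invert x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps for eps.
    return (x_t - np.sqrt(alpha_bars[t]) * x0_true) / np.sqrt(1 - alpha_bars[t])

x = rng.normal(size=4)  # x_T ~ N(0, I)
for t in range(T - 1, -1, -1):
    eps = eps_oracle(x, t)
    mu = (x - (1 - alphas[t]) / np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
    if t > 0:
        beta_tilde = (1 - alphas[t]) * (1 - alpha_bars[t - 1]) / (1 - alpha_bars[t])
        x = mu + np.sqrt(beta_tilde) * rng.normal(size=4)  # sample x_{t-1}
    else:
        x = mu  # final step: take the mean, no added noise

# Cross-check: the x0-form mean (Step 6) equals the eps-form mean (Step 7).
t = 20
x_t = np.sqrt(alpha_bars[t]) * x0_true + np.sqrt(1 - alpha_bars[t]) * rng.normal(size=4)
eps = eps_oracle(x_t, t)
mu_eps = (x_t - (1 - alphas[t]) / np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
mu_x0 = ((1 - alpha_bars[t - 1]) * np.sqrt(alphas[t]) / (1 - alpha_bars[t]) * x_t
         + (1 - alphas[t]) * np.sqrt(alpha_bars[t - 1]) / (1 - alpha_bars[t]) * x0_true)
```

Swapping `eps_oracle` for a trained network is the only change needed to turn this sketch into a DDPM-style sampler.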
Conclusion
Starting from Bayes' theorem and the Gaussian distributions established in Part 1, we derived the complete posterior $q(x_{t-1} \mid x_t, x_0)$: a Gaussian with analytically tractable mean $\tilde{\mu}_t$ and variance $\tilde{\beta}_t$.