Moving from discrete-time diffusion to continuous time.

The forward DDPM update for a discrete time step $t \in \{1,\dots,T\}$ is given by:

\begin{equation} x(t)=\sqrt{1-\beta(t)}x(t-1)+\sqrt{\beta(t)}\epsilon_t \end{equation}

where $0<\beta(t)<1$ and $\epsilon_t \sim \mathcal{N}(0,I)$.
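As a sanity check, the discrete update can be simulated directly. A minimal NumPy sketch (the constant schedule $\beta_t = 0.02$ is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def ddpm_forward_step(x_prev, beta_t, rng):
    # x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps_t,  eps_t ~ N(0, I)
    eps = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * eps

# Repeated steps drive any starting point toward N(0, I):
x = np.full(10_000, 3.0)        # 10k copies of a scalar data point x_0 = 3
for _ in range(1000):
    x = ddpm_forward_step(x, beta_t=0.02, rng=rng)
print(x.mean(), x.var())        # both approach 0 and 1
```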

Now I want to change variables to make time continuous and to put the trajectory on $\tau \in[0,1]$ instead of $t \in \{1,\dots,T\}$. So I define $\tau:=\frac{t}{T}\rightarrow d\tau = \frac{1}{T} dt$, and also $x'(\tau):=x(\tau T)$, $\epsilon'(\tau)\sim \mathcal{N}(0,I)$, and $\beta'(\tau) d\tau:=\beta(t)$. (This last definition is tricky and not trivial: as we increase the number of steps to make the process continuous, we must shrink the amount of drift and noise added at each step, so $\beta$ gets scaled by $d\tau$.) Now we can rewrite the forward DDPM formula as:

\begin{equation} x'(\tau)=\sqrt{1-\beta'(\tau)d\tau}\,x'(\tau-1/T)+\sqrt{\beta'(\tau)d\tau}\,\epsilon'(\tau) \end{equation}

Now use the Taylor expansion $\sqrt{1-u}\approx 1-u/2$ (the dropped terms are $O(d\tau^2)$ and vanish in the limit) to get:

\begin{equation} x'(\tau)=(1-0.5\beta'(\tau)d\tau)x'(\tau-1/T)+\sqrt{\beta'(\tau)d\tau}\,\epsilon'(\tau) \end{equation}

\begin{equation} x'(\tau)-x'(\tau-1/T)=-0.5\beta'(\tau)d\tau\, x'(\tau-1/T)+\sqrt{\beta'(\tau)d\tau}\,\epsilon'(\tau) \end{equation}

Now define $\Delta \tau := 1/T$ and take $\lim_{T \to \infty}$ to get:

\begin{equation} dx'(\tau)=-0.5\beta'(\tau) x'(\tau) d\tau +\sqrt{\beta'(\tau)d \tau}\,\epsilon'(\tau) \end{equation}

If we define $dW(\tau):=\sqrt{d\tau}\,\epsilon'(\tau)$, the increment of a Wiener (Brownian motion) process, drop all the primes, and rename $\tau$ back to $t$, we get the Itô SDE form of the forward DDPM process: \begin{equation} dx(t)=-0.5\beta(t) x(t) dt +\sqrt{\beta(t)}dW(t) \end{equation}

where $t\in [0,1]$ and $\beta$ is a function of $t$: a noise schedule that starts near 0 and increases along the trajectory; its exact shape is a design choice.
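The forward SDE can be checked numerically with a simple Euler-Maruyama discretization. The linear schedule below (rising from 0.1 to 20) is an assumed choice borrowed from common VP-SDE setups, not something fixed by the derivation:

```python
import numpy as np

rng = np.random.default_rng(1)

def beta(t, b_min=0.1, b_max=20.0):
    # assumed linear noise schedule on t in [0, 1]
    return b_min + t * (b_max - b_min)

# Euler-Maruyama on dx = -1/2 beta(t) x dt + sqrt(beta(t)) dW
N = 1000
dt = 1.0 / N
x = np.full(20_000, 2.0)     # many copies of a scalar data point x(0) = 2
for i in range(N):
    t = i * dt
    x = x - 0.5 * beta(t) * x * dt + np.sqrt(beta(t) * dt) * rng.standard_normal(x.shape)

# The exact marginal of this SDE is N(x0 * exp(-1/2 ∫beta), 1 - exp(-∫beta));
# with ∫_0^1 beta = 10.05 that is essentially N(0, 1).
print(x.mean(), x.var())
```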

Now for the reverse process.

Define $f(x,t):=-0.5\beta(t)x(t)$ and $g(t):=\sqrt{\beta(t)}$. Then we can write the forward process as: \begin{equation}dx(t)=f(x,t)dt + g(t)dW(t) \end{equation}
This shows how to sample (from a normal distribution) the next differential change in $x$, given the current value of $x$ and time $t$. Let's rewrite:

\begin{equation} x(t+dt)=x(t) + f(x,t)dt + g(t)dW(t) \end{equation} And that means:

\begin{equation} p(x(t+dt)| x(t)) = \mathcal{N}(x(t)+f(x,t)dt, g(t)^2 dt I) \end{equation}

Define the later (more noised) point as $y:=x(t+dt)$ and the earlier one as $x:=x(t)$. By Bayes' rule $p(x|y)\propto p(y|x)p(x)$, and since $f(x,t)\approx f(y,t)$ (the difference only contributes at higher order in $dt$), we have:

\begin{equation} p(y|x)\propto \exp\left(-\frac{||y-x-f(y,t)dt||^2}{2g(t)^2 dt}\right) \end{equation}

Let's expand $\log p(x)$ around $y$ to first order:

\begin{equation} \log p(x) = \log p(y)+ (x-y)^T \nabla_x \log p(y) \end{equation}

Taking the log of Bayes' rule (the term $-\log p(y)$ is constant in $x$ and absorbed into $C$):

\begin{equation} \log p(x|y) = \log p(y|x) +\log p(x) + C \end{equation}

So we have:

\begin{equation} \log p(x|y) = -\frac{||y-x-f(y,t)dt||^2}{2g(t)^2 dt} + (x-y)^T \nabla_x \log p(y) + C \end{equation}

Define the score function $s:=\nabla_x \log p(y)$ and abbreviate $f:=f(y,t)$ and $g:=g(t)$ to get:

\begin{equation} \log p(x|y) = -\frac{||y-x-fdt||^2}{2g^2 dt} + (x-y)^T s + C \end{equation}

Expand the square and drop the $O(dt^2)$ term:

$(x-y+fdt)^T(x-y+fdt)=||x-y||^2+2(x-y)^Tfdt + O(dt^2)$

\begin{equation} \log p(x|y) = -\frac{||x-y||^2+2(x-y)^Tfdt}{2g^2 dt} + (x-y)^T s + C \end{equation}

We want to shape this into the log-density of a normal distribution. First move the score term into the fraction:

\begin{equation} \log p(x|y) = -\frac{||x-y||^2+2(x-y)^Tfdt - 2(x-y)^T s\, g^2dt}{2g^2 dt} + C \end{equation}

\begin{equation} \log p(x|y) = -\frac{||x-y||^2+2(x-y)^T(fdt-sg^2dt)}{2g^2 dt} + C \end{equation}

Then complete the square; the added term is constant in $x$ and gets absorbed into $C'$:

\begin{equation} \log p(x|y) = -\frac{||x-y||^2+2(x-y)^T(fdt-sg^2dt)+||fdt-sg^2dt||^2}{2g^2 dt} + C' \end{equation}

\begin{equation} \log p(x|y) = -\frac{||x-(y-fdt+sg^2dt)||^2}{2g^2 dt} + C' \end{equation}

So we now have the reverse-time transition (one infinitesimal step of the reverse SDE):

\begin{equation} p(x|y) = \mathcal{N}(y-fdt+sg^2dt,\; g^2dt\, I) \end{equation}

So we can sample backwards, but only if we know the score function $s:=\nabla_x \log p(y)$. We train a neural network to predict the score at each time step, and then sample backwards using the formula above.
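A sketch of what one reverse step looks like in code; the function names and the constant schedule are illustrative assumptions, and `score_fn` stands in for whatever score estimate is available. As a sanity check, when $p$ is the standard normal its score is $s(x)=-x$, and $\mathcal{N}(0,I)$ should be (approximately) preserved by the reverse dynamics:

```python
import numpy as np

def reverse_step(y, t, dt, score_fn, beta_fn, rng):
    """One reverse-time step: x ~ N(y - f(y,t)dt + g(t)^2 s dt, g(t)^2 dt I)."""
    f = -0.5 * beta_fn(t) * y                      # forward drift f(y, t)
    g2 = beta_fn(t)                                # g(t)^2 = beta(t)
    mean = y - f * dt + g2 * score_fn(y, t) * dt
    return mean + np.sqrt(g2 * dt) * rng.standard_normal(y.shape)

rng = np.random.default_rng(2)
y = rng.standard_normal(20_000)                    # samples from N(0, 1)
for i in range(100):
    y = reverse_step(y, t=1.0 - i * 0.01, dt=0.01,
                     score_fn=lambda x, t: -x,     # exact score of N(0, 1)
                     beta_fn=lambda t: 0.5, rng=rng)
print(y.mean(), y.var())                           # should stay near 0 and 1
```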

How to learn the score function.

From the direct (closed-form) DDPM formula, with $\epsilon \sim \mathcal{N}(0,I)$, we know that:

\begin{equation} x_t=\alpha_t x_0 + \sigma_t \epsilon \rightarrow p(x_t|x_0) = \mathcal{N}(\alpha_t x_0, \sigma_t^2 I) \end{equation}

\begin{equation} \log p(x_t|x_0) = -\frac{||x_t-\alpha_t x_0||^2}{2\sigma_t^2} + C \rightarrow \nabla_{x_t} \log p(x_t|x_0) = -\frac{x_t-\alpha_t x_0}{\sigma_t^2}=-\frac{\sigma_t \epsilon}{\sigma_t^2} \end{equation}

So we have

\begin{equation} \nabla_{x_t} log p(x_t|x_0) =-\frac{ \epsilon}{\sigma_t} \end{equation}

So a network trained to predict the noise $\epsilon$ can be directly rescaled (by $-1/\sigma_t$) to predict the score function.
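A minimal sketch of this rescaling, with a synthetic $(x_0, \epsilon)$ pair standing in for the network's input and output (the values of $\alpha_t$ and $\sigma_t$ are arbitrary):

```python
import numpy as np

def score_from_eps(eps_pred, sigma_t):
    # From nabla log p(x_t | x_0) = -eps / sigma_t: a network that predicts
    # eps gives the score estimate s(x_t, t) = -eps_pred / sigma_t.
    return -eps_pred / sigma_t

# Check against the exact conditional score on a synthetic pair:
rng = np.random.default_rng(3)
alpha_t, sigma_t = 0.8, 0.6
x0 = rng.standard_normal(5)
eps = rng.standard_normal(5)
x_t = alpha_t * x0 + sigma_t * eps
exact = -(x_t - alpha_t * x0) / sigma_t**2   # nabla_{x_t} log p(x_t | x_0)
print(np.allclose(score_from_eps(eps, sigma_t), exact))  # True
```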

Equivalence between the conditional score and the unconditional score.

We know that \begin{equation} p(x)=\int_{x_0} p(x|x_0)p(x_0)dx_0 \rightarrow \nabla_x p(x) = \int_{x_0} \nabla_x p(x|x_0) p(x_0) dx_0 \end{equation}

Divide by $p(x)$ and use the log-derivative trick $\nabla_x p(x|x_0) = p(x|x_0)\nabla_x \log p(x|x_0)$ to get:

\begin{equation} \nabla_x \log p(x) = \frac{\nabla_x p(x)}{p(x)} = \frac{1}{p(x)} \int_{x_0} p(x|x_0) \nabla_x \log p(x|x_0) p(x_0) dx_0 \end{equation}

So

\begin{equation} \nabla_x \log p(x) = \int_{x_0} \frac{p(x|x_0) p(x_0)}{p(x)} \nabla_x \log p(x|x_0) dx_0= \int_{x_0} p(x_0|x) \nabla_x \log p(x|x_0) dx_0 \end{equation}

\begin{equation} \nabla_x \log p(x) = \mathbb{E}_{x_0|x}\left[\nabla_x \log p(x|x_0)\right] \end{equation}

This is the connection between the unconditional score function and the conditional score.
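This identity can be checked numerically on a toy model. Everything below (the two-point prior on $x_0$, the values of $\alpha$ and $\sigma$) is an illustrative assumption: the left side is computed by numerically differentiating $\log p(x)$, the right side by posterior-weighting the conditional scores.

```python
import numpy as np

# Toy model: x_0 is +1 or -1 with equal probability, p(x | x_0) = N(alpha*x_0, sigma^2)
alpha, sigma = 0.8, 0.5
centers = np.array([1.0, -1.0]) * alpha

def normal_pdf(x, mu, sig):
    return np.exp(-0.5 * ((x - mu) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))

def p(x):
    # marginal p(x) = sum_{x_0} p(x | x_0) p(x_0)
    return 0.5 * normal_pdf(x, centers[0], sigma) + 0.5 * normal_pdf(x, centers[1], sigma)

def score_lhs(x, h=1e-5):
    # unconditional score via numerical differentiation of log p
    return (np.log(p(x + h)) - np.log(p(x - h))) / (2 * h)

def score_rhs(x):
    # E_{x_0 | x} [ nabla_x log p(x | x_0) ]: posterior-weighted conditional scores
    w = np.array([0.5 * normal_pdf(x, c, sigma) for c in centers])
    w /= w.sum()
    cond_scores = np.array([-(x - c) / sigma**2 for c in centers])
    return float(w @ cond_scores)

print(score_lhs(0.3), score_rhs(0.3))  # agree to numerical precision
```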

Solvers.

Once we have the reverse-time transition derived above:

\begin{equation} p(x|y) = \mathcal{N}(y-fdt+sg^2dt,\; g^2dt\, I) \end{equation}

We start at time $t=1$ from a completely noised image (pure Gaussian noise) and integrate backwards to time $t=0$ to get the final image. We can use any SDE solver for this, for example Euler-Maruyama or Heun's method.
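A minimal end-to-end sketch with Euler-Maruyama, assuming a linear $\beta$ schedule and 1-D Gaussian data $x_0 \sim \mathcal{N}(m_0, s_0^2)$, so the exact score is available in closed form and no network is needed; starting from pure noise at $t=1$, the reverse integration should recover the data distribution:

```python
import numpy as np

rng = np.random.default_rng(4)

def beta(t, b_min=0.1, b_max=20.0):
    return b_min + t * (b_max - b_min)            # assumed linear schedule

def int_beta(t, b_min=0.1, b_max=20.0):
    return b_min * t + 0.5 * (b_max - b_min) * t**2

m0, s0 = 1.5, 0.5                                 # toy data: x_0 ~ N(m0, s0^2)

def score(x, t):
    # Exact score of the marginal p(x(t)) for Gaussian data under this SDE:
    # x(t) ~ N(a_t m0, a_t^2 s0^2 + (1 - a_t^2)) with a_t = exp(-1/2 ∫beta)
    a = np.exp(-0.5 * int_beta(t))
    var = a**2 * s0**2 + (1.0 - a**2)
    return -(x - a * m0) / var

# Euler-Maruyama on the reverse step p(x | y) = N(y - f dt + s g^2 dt, g^2 dt):
N = 1000
dt = 1.0 / N
x = rng.standard_normal(20_000)                   # start at t = 1 from pure noise
for i in range(N):
    t = 1.0 - i * dt
    g2 = beta(t)
    f = -0.5 * g2 * x
    x = x - f * dt + g2 * score(x, t) * dt + np.sqrt(g2 * dt) * rng.standard_normal(x.shape)

print(x.mean(), x.std())  # should approach m0 = 1.5 and s0 = 0.5
```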