Adam Optimizer

The Adam optimizer maintains two exponential moving averages (EMAs), one of the gradient $g_t$ and one of its elementwise square:

\begin{equation} m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \tag{1} \end{equation}

\begin{equation} v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \tag{2} \end{equation}

with initialization:

\begin{equation} m_0 = 0, \quad v_0 = 0 \tag{3} \end{equation}

The parameter update is:

\begin{equation} w_{t+1} = w_t - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \tag{4} \end{equation}

where $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected estimates:

\begin{equation} \hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \tag{5} \end{equation}
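
Putting (1)–(5) together, here is a minimal NumPy sketch of a single Adam step (the function name and signature are illustrative; the default hyperparameters are the ones suggested in the Adam paper):

```python
import numpy as np

def adam_step(w, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update of parameters w given gradient g at step t (t >= 1)."""
    m = beta1 * m + (1 - beta1) * g                  # eq. (1): first-moment EMA
    v = beta2 * v + (1 - beta2) * g**2               # eq. (2): second-moment EMA
    m_hat = m / (1 - beta1**t)                       # eq. (5): bias correction
    v_hat = v / (1 - beta2**t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)   # eq. (4): parameter update
    return w, m, v
```

The state $m$ and $v$ starts at zeros, matching (3); why that zero start forces the corrections in (5) is the subject of the rest of this note.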


Why do we need this bias correction?

A proper EMA assigns the weights (listed starting from the most recent value):

\begin{equation} (1 - \beta),\; \beta(1 - \beta),\; \beta^2(1 - \beta), \dots \tag{6} \end{equation}

These sum to 1 (a geometric series):

\begin{equation} \sum_{i=0}^{\infty} (1 - \beta)\beta^i = 1 \tag{7} \end{equation}

So an EMA is a proper weighted average only if it has been running forever.

But in Adam, we start from:

\begin{equation} m_0 = 0 \tag{8} \end{equation}

Let’s unroll the recursion.


Expanding the EMA

From (1), writing $\beta$ for $\beta_1$ (the identical argument applies to $v_t$ with $\beta_2$):

\begin{equation} m_1 = (1 - \beta) g_1 \tag{9} \end{equation}

\begin{equation} m_2 = \beta (1 - \beta) g_1 + (1 - \beta) g_2 \tag{10} \end{equation}

\begin{equation} m_3 = \beta^2 (1 - \beta) g_1 + \beta (1 - \beta) g_2 + (1 - \beta) g_3 \tag{11} \end{equation}

In general:

\begin{equation} m_t = (1 - \beta)\sum_{i=0}^{t-1} \beta^i g_{t-i} \tag{12} \end{equation}
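
As a quick numerical sanity check (a throwaway sketch, not part of any library), the closed form (12) agrees with running the recursion (1) directly:

```python
import numpy as np

rng = np.random.default_rng(0)
beta, t = 0.9, 5
g = rng.normal(size=t)  # g[0] holds g_1, ..., g[t-1] holds g_t

# Run the recursion, starting from m_0 = 0.
m = 0.0
for g_i in g:
    m = beta * m + (1 - beta) * g_i

# Closed form (12): (1 - beta) * sum_i beta^i * g_{t-i}.
closed = (1 - beta) * sum(beta**i * g[t - 1 - i] for i in range(t))

print(np.isclose(m, closed))  # True
```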


The Bias

The weights are correct relative to each other (see the proper EMA above), but their sum is:

\begin{equation} \sum_{i=0}^{t-1} (1 - \beta)\beta^i = 1 - \beta^t \tag{13} \end{equation}

So:

\begin{equation} m_t = (1 - \beta^t) \cdot \text{(true EMA)} \tag{14} \end{equation}

This means:

\begin{equation} \mathbb{E}[m_t] = (1 - \beta^t)\mu \tag{15} \end{equation}

where $\mu$ is the true average gradient (assuming $\mathbb{E}[g_i] = \mu$ at every step).

So $m_t$ is biased toward zero, especially at small $t$.
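
The bias is easy to see numerically. With a constant gradient $g_t = 1$ (so $\mu = 1$), $m_t$ traces out $1 - \beta^t$ exactly, crawling up from zero; a small illustrative check:

```python
beta, m = 0.9, 0.0
for t in range(1, 11):
    m = beta * m + (1 - beta) * 1.0               # constant gradient g_t = 1
    print(t, round(m, 4), round(1 - beta**t, 4))  # the two columns match
```

At $t = 1$ the estimate is $0.1$, a tenth of the true mean; even at $t = 10$ it is only about $0.65$.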


Bias Correction

To fix this, divide by the missing mass:

\begin{equation} \hat{m}_t = \frac{m_t}{1 - \beta^t} \tag{16} \end{equation}

Then:

\begin{equation} \mathbb{E}[\hat{m}_t] = \mu \tag{17} \end{equation}

This gives an unbiased estimate of the EMA.

The same applies to $v_t$.
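
Continuing the constant-gradient check from above, the correction recovers $\mu = 1$ at every step:

```python
beta, m = 0.9, 0.0
for t in range(1, 6):
    m = beta * m + (1 - beta) * 1.0  # constant gradient, so mu = 1
    m_hat = m / (1 - beta**t)        # eq. (16): divide by the missing mass
    print(t, round(m_hat, 12))       # 1.0 at every step (up to float rounding)
```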


Why Not Initialize Differently?

An intuitive alternative would be to initialize the average to the first gradient we see:

\begin{equation} m_1 = g_1 \tag{18} \end{equation}

Then:

\begin{equation} m_t = \beta^{t-1} g_1 + (1 - \beta)\sum_{i=2}^{t} \beta^{t-i} g_i \tag{19} \end{equation}

Now the weights sum to 1 (that’s good), but:

  • The first gradient $g_1$ gets weight $\beta^{t-1}$ instead of the $(1 - \beta)\beta^{t-1}$ a correct EMA would assign, so it is overweighted by a factor of $1/(1 - \beta)$ (i.e. 10 for $\beta = 0.9$).

To fix it, you would need to:

  1. Subtract $\beta^t g_1$, reducing the weight of $g_1$ from $\beta^{t-1}$ to $(1 - \beta)\beta^{t-1}$ (this breaks the total sum, of course), and
  2. Renormalize everything by dividing by $1 - \beta^t$.

This effectively requires remembering the first gradient forever: an extra gradient-sized buffer in memory, plus extra FLOPs for the additional subtraction at every step.
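
Worse, carrying the two steps through shows that this repair just re-derives Adam's correction. Since $\beta^{t-1} - \beta^t = (1 - \beta)\beta^{t-1}$, subtracting $\beta^t g_1$ from (19) and dividing by $1 - \beta^t$ gives:

\begin{equation} \frac{m_t - \beta^t g_1}{1 - \beta^t} = \frac{(1 - \beta)\sum_{i=1}^{t} \beta^{t-i} g_i}{1 - \beta^t} \end{equation}

The numerator's weights now match (6) and sum to $1 - \beta^t$, so the division renormalizes them. But the numerator is exactly the zero-initialized $m_t$ of (12), which makes the whole expression equal to the bias-corrected $\hat{m}_t$ of (16), obtained at a higher cost.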


Why Adam Chooses Zero Initialization

Setting:

\begin{equation} m_0 = 0 \tag{20} \end{equation}

gives:

  • Correct relative weighting
  • Simple recursion
  • No need to store history

The only issue is the scaling factor $(1 - \beta^t)$, which is easy to correct.

So Adam uses:

\begin{equation} \hat{m}_t = \frac{m_t}{1 - \beta^t} \tag{21} \end{equation}

instead of trying to fix initialization.


Intuition in One Line

Bias correction exists because the EMA starts from zero, so its total weight is too small; dividing by $(1 - \beta^t)$ renormalizes it back to a proper average.