Recall that in Maximum Likelihood we would like to optimize the parameters $\theta$ of some model, so that the likelihood of a sample $S$ we observed is maximized:

\[\begin{equation} \label{eq:1} \begin{aligned} \theta^* = \arg\max_{\theta} \mathcal{L} (S;\theta) \end{aligned} \end{equation}\]

Sometimes we have a generative model with parameters $\theta$ that generates the outputs we observe, and sometimes we have a model with parameters $\theta$ that takes an input $i$ and generates an output (prediction) that we observe.

Let’s assume that sample S is a set of observations $ x_i $ .

If we’re dealing with a predictor model, then we also have an input $i_i$ for each output $ x_i $, as well as a prediction for each input. Let’s denote the prediction as $\hat{x_i}$. If we postulate that the deviation of the observed output from our prediction, for a given input and parameters, follows a Gaussian distribution, then we can write:

\[\begin{equation} \label{eq:2} \begin{aligned} p(x_i | \hat{x_i}(i_i, \theta)) = \frac{1}{\sigma \sqrt{2\pi}} e ^ {-\frac{(x_i - \hat{x_i}(i_i, \theta))^2}{2 \sigma^2}} \end{aligned} \end{equation}\]
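As a quick sanity check of this assumption, here is a minimal sketch (hypothetical numbers, with $\sigma = 1$) that evaluates the Gaussian density above for a couple of residuals: predictions close to the observation get higher likelihood.

```python
import numpy as np

def gaussian_likelihood(x, x_hat, sigma=1.0):
    """Density of observing x given prediction x_hat, under the Gaussian assumption above."""
    return np.exp(-(x - x_hat) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# The closer the prediction is to the observation, the higher the likelihood.
print(gaussian_likelihood(1.0, 0.9))  # ~0.397 (small residual)
print(gaussian_likelihood(1.0, 3.0))  # ~0.054 (large residual)
```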

Again, this is an assumption that we make. Now, we have a sample $S$, and for each instance in the sample, we have a prediction $ \hat{x_i}(i_i, \theta) $ that is based on the input $i_i$ and the parameters of the model $\theta$. The probability/likelihood of seeing this sample is therefore:

\[\begin{equation} \label{eq:3} \begin{aligned} \mathcal{L} (S= \{ x_i \} | \hat{x_i}(i_i, \theta)) = \prod_{i} \frac{1}{\sigma \sqrt{2\pi}} e ^ {-\frac{(x_i - \hat{x_i}(i_i, \theta))^2}{2 \sigma^2}} \end{aligned} \end{equation}\]

So we want to optimize it and find the maximum likelihood with respect to $ \theta $. We can’t touch the inputs or the observed outputs; we can only play with $\theta$.

\[\begin{equation} \label{eq:4} \begin{aligned} \theta^* = \arg\max_{\theta} \mathcal{L} ( \underbrace{ S}_{\text{observed}}; \underbrace{ \hat{x_i}(i_i, \theta)}_{\text{predicted}} ) \overbrace{=}^{\text{log is monotonically increasing}} \arg\max_{\theta} \log \mathcal{L} (S; \hat{x_i}(i_i, \theta)) \overbrace{=}^{\text{log of product}} \end{aligned} \end{equation}\] \[\begin{equation} \begin{aligned} \arg\max_{\theta} \sum_{i} \log \frac{1}{\sigma \sqrt{2\pi}} e ^ {-\frac{(x_i - \hat{x_i}(i_i, \theta))^2}{2 \sigma^2}} = \end{aligned} \end{equation}\] \[\begin{equation} \begin{aligned} \arg\max_{\theta} \sum_{i} \left( \log \frac{1}{\sigma \sqrt{2\pi}} + \log e ^ {-\frac{(x_i - \hat{x_i}(i_i, \theta))^2}{2 \sigma^2}} \right) \overbrace{=}^{\text{first term is constant in } \theta} \end{aligned} \end{equation}\] \[\begin{equation} \label{eq:6} \begin{aligned} \arg\max_{\theta} \sum_{i} {-\frac{(x_i - \hat{x_i}(i_i, \theta))^2}{2 \sigma^2}} \overbrace{=}^{\frac{1}{2\sigma^2} > 0 \text{ is constant}} \arg\max_{\theta} \sum_{i} {-(x_i - \hat{x_i}(i_i, \theta))^2} \end{aligned} \end{equation}\] \[\begin{equation} \label{eq:7} \begin{aligned} \theta^* = \arg\min_{\theta} \sum_{i} {(x_i - \hat{x_i}(i_i, \theta))^2} \end{aligned} \end{equation}\]

And this is exactly the Squared Error loss function. The optimal parameters $ \theta^* $ are obtained when we minimize the squared distance/error between our predictions $\hat{x_i}(i_i, \theta)$, which depend on the parameters, and their respective outputs (observations). So, under a Gaussian assumption on the distance between prediction and observation, the maximum likelihood is obtained when we minimize the squared distance between our predictions and observations. $\square $
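To make this equivalence concrete, here is a minimal sketch. The synthetic data and the one-parameter linear model $\hat{x_i} = \theta \, i_i$ are my own assumptions for illustration, not part of the derivation above. It scans values of $\theta$ and confirms that the Gaussian log-likelihood peaks exactly where the squared error bottoms out:

```python
import numpy as np

rng = np.random.default_rng(0)
inputs = rng.uniform(-1, 1, size=50)
outputs = 2.0 * inputs + rng.normal(0, 0.3, size=50)  # true theta = 2.0, Gaussian noise

thetas = np.linspace(0.0, 4.0, 401)
# residuals[k, n] = x_n - theta_k * i_n
residuals = outputs[None, :] - thetas[:, None] * inputs[None, :]

sse = (residuals ** 2).sum(axis=1)  # squared-error objective
sigma = 0.3
log_lik = (-residuals ** 2 / (2 * sigma ** 2)
           - np.log(sigma * np.sqrt(2 * np.pi))).sum(axis=1)  # Gaussian log-likelihood

# Both criteria pick the same theta.
print(thetas[np.argmin(sse)], thetas[np.argmax(log_lik)])
```

This works for any $\sigma$, since the log-likelihood is just $-\frac{1}{2\sigma^2}$ times the squared error plus a constant.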

Deriving Cross Entropy loss from Maximum Likelihood

Here, we look at a different model, one that predicts probabilities instead of continuous scalars (as in regression): $ \hat{x_i}(i_i, \theta) \in [0,1]$, while the true observation $x_i$ can be either 0 or 1.

Here $ \hat{x_i}(i_i, \theta) $ means: the probability that the observed label is 1.

For a single example, if the observation is 1, then the output $\hat{x_i}(i_i, \theta)$ is the likelihood of that observation: the chance that the true label is 1, given our current model, is $\hat{x_i}(i_i, \theta)$. For example, if the observation is 1 and the prediction is a probability of 0.9, it means that the likelihood that we observed an actual output (observable) value of 1, given our current model, is 0.9. If the prediction was 0.2, the probability/likelihood of observing 1 is only 0.2, which is less likely.
If the observable is 0, the likelihood is $1-\hat{x_i}(i_i, \theta)$; for example, if we predicted 0.2, then the likelihood of observing 0 is 0.8.
Combining the two cases into one equation, we get:

\[\begin{equation} \label{eq:8} \begin{aligned} p(x_i | \hat{x_i}(i_i, \theta)) = \hat{x_i}(i_i, \theta) ^ {x_i}(1-\hat{x_i}(i_i, \theta))^{1-x_i} \end{aligned} \end{equation}\]
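A short sketch (using the hypothetical prediction of 0.2 from the example above) showing that this single expression indeed reproduces both cases:

```python
def bernoulli_likelihood(x, x_hat):
    """Likelihood of observed label x (0 or 1) under predicted probability x_hat."""
    return x_hat ** x * (1 - x_hat) ** (1 - x)

print(bernoulli_likelihood(1, 0.2))  # 0.2 -- likelihood of observing 1
print(bernoulli_likelihood(0, 0.2))  # 0.8 -- likelihood of observing 0
```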

This is different from the regression (Gaussian) model we presented above:

\[\begin{equation} \label{eq:9} \begin{aligned} p(x_i | \hat{x_i}(i_i, \theta)) = \frac{1}{\sigma \sqrt{2\pi}} e ^ {-\frac{(x_i - \hat{x_i}(i_i, \theta))^2}{2 \sigma^2}} \end{aligned} \end{equation}\]

Now we can write the likelihood of seeing the whole sample:

\[\begin{equation} \label{eq:10} \begin{aligned} \mathcal{L} (S= \{ x_i \} | \hat{x_i}(i_i, \theta)) = \prod_{i} \hat{x_i}(i_i, \theta) ^ {x_i}(1-\hat{x_i}(i_i, \theta))^{1-x_i} \end{aligned} \end{equation}\]

and then

\[\begin{equation} \label{eq:11} \begin{aligned} \theta^* = \arg\max_{\theta} \log \mathcal{L} (S; \hat{x_i}(i_i, \theta)) = \arg\max_{\theta} \sum_{i} \left[ x_i \log \hat{x_i}(i_i, \theta) + (1-x_i) \log (1-\hat{x_i}(i_i, \theta)) \right] \end{aligned} \end{equation}\] \[\begin{equation} \label{eq:12} \begin{aligned} \theta^* = \arg\min_{\theta} - \sum_{i} \left[ x_i \log \hat{x_i}(i_i, \theta) + (1-x_i) \log (1-\hat{x_i}(i_i, \theta)) \right] \end{aligned} \end{equation}\]

And this is the cross entropy for the case of two possible outcomes (two classes), so it is called Binary Cross Entropy.
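As a sketch (the labels and predicted probabilities below are made up for illustration), the negative log of the product likelihood and the binary cross entropy sum are the same number:

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0])                # observed labels x_i
x_hat = np.array([0.9, 0.2, 0.7, 0.6, 0.1])  # predicted probabilities

# Negative log of the Bernoulli product likelihood.
neg_log_product = -np.log(np.prod(x_hat ** x * (1 - x_hat) ** (1 - x)))
# Binary cross entropy, summed over the sample.
bce = -np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))

print(neg_log_product, bce)  # identical (up to floating-point rounding)
```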

If we have more than two classes/outcomes, then we will have a separate prediction for each outcome/class and also a separate observation for each outcome.

Here $ \hat{x_{ij}}(i_i, \theta) $ means: the probability that class $j$ of example $i$ is positive (i.e., 1).

Then, our Cross Entropy loss will be:

\[\begin{equation} \label{eq:13} \begin{aligned} \theta^* = \arg\min_{\theta} - \sum_{i} \sum_{j} x_{ij} \: \log \: \hat{x_{ij}}(i_i, \theta) \end{aligned} \end{equation}\]

where $i$ iterates over the samples and $j$ over the different classes. $\square $
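And a final sketch of that double sum, with one-hot labels and predicted class probabilities invented for illustration: only the log-probability assigned to each sample’s true class contributes.

```python
import numpy as np

# x[i, j] = 1 iff sample i belongs to class j (one-hot observations)
x = np.array([[1, 0, 0],
              [0, 0, 1]])
# x_hat[i, j] = predicted probability that sample i belongs to class j
x_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])

cross_entropy = -np.sum(x * np.log(x_hat))
print(cross_entropy)  # -(log 0.7 + log 0.6), about 0.867
```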

To summarize, we’ve seen how the two popular loss functions (MSE, Cross Entropy) are derived using the same general idea of maximum likelihood.