Squared Error Loss and Cross Entropy Loss: Derivation from Maximum Likelihood and Interpretation as an Average
Deriving Squared Error Minimization from Likelihood Maximization
Recall that in Maximum Likelihood we would like to optimize the parameters $\theta$ of some model so that the probability of the sample S we observed is maximized:
\[\begin{equation} \label{eq:1} \begin{aligned} \theta^* = \arg\max_{\theta} \mathcal{L} (S;\theta) \end{aligned} \end{equation}\]Sometimes we have a generative model with parameters $\theta$ that generates the outputs we observe, and sometimes we have a model with parameters $\theta$ that takes an input i and generates an output (a prediction) that we compare against the observed output.
Let’s assume that the sample S is a set of observations $x_i$.
If we’re dealing with a predictor model, then we also have an input $i_i$ for each output $x_i$, and a prediction for each input. Let’s denote the prediction as $\hat{x_i}$. If we postulate that the deviation of the observed output from our prediction, for a given input and parameters, follows a Gaussian distribution, then we can write:
\[\begin{equation} \label{eq:2} \begin{aligned} p(x_i | \hat{x_i}(i_i, \theta)) = \frac{1}{\sigma \sqrt{2\pi}} e ^ {-\frac{(x_i - \hat{x_i}(i_i, \theta))^2}{2 \sigma^2}} \end{aligned} \end{equation}\]Again, this is an assumption that we make. Now, we have a sample S, and for each instance in the sample, we have a prediction $\hat{x_i}(i_i, \theta)$ that is based on the input $i_i$ and the parameters of the model $\theta$. The probability/likelihood of seeing this sample is therefore:
\[\begin{equation} \label{eq:3} \begin{aligned} \mathcal{L} (S= \{ x_i \} | \hat{x_i}(i_i, \theta)) = \prod_{i} \frac{1}{\sigma \sqrt{2\pi}} e ^ {-\frac{(x_i - \hat{x_i}(i_i, \theta))^2}{2 \sigma^2}} \end{aligned} \end{equation}\]We want to maximize this likelihood with respect to $\theta$. We can’t touch the inputs or the observed outputs; we can only play with $\theta$.
\[\begin{equation} \label{eq:4} \begin{aligned} \theta^* = \arg\max_{\theta} \mathcal{L} ( \underbrace{ S}_{\text{observed}}; \underbrace{ \hat{x_i}(i_i, \theta)}_{\text{predicted}} ) \overbrace{=}^{\text{monotonic increasing}} \arg\max_{\theta} log \mathcal{L} (S; \hat{x_i}(i_i, \theta)) \overbrace{=}^{\text{log of product}} \end{aligned} \end{equation}\] \[\begin{equation} \begin{aligned} \arg\max_{\theta} \sum_{i} log \: \frac{1}{\sigma \sqrt{2\pi}} e ^ {-\frac{(x_i - \hat{x_i}(i_i, \theta))^2}{2 \sigma^2}} = \end{aligned} \end{equation}\] \[\begin{equation} \begin{aligned} \arg\max_{\theta} \sum_{i} (log \: \frac{1}{\sigma \sqrt{2\pi}} \: + log \: e ^ {-\frac{(x_i - \hat{x_i}(i_i, \theta))^2}{2 \sigma^2}} )= \end{aligned} \end{equation}\] \[\begin{equation} \label{eq:6} \begin{aligned} \arg\max_{\theta} \sum_{i} {-\frac{(x_i - \hat{x_i}(i_i, \theta))^2}{2 \sigma^2}} = \arg\max_{\theta} \sum_{i} {-(x_i - \hat{x_i}(i_i, \theta))^2} \end{aligned} \end{equation}\] \[\begin{equation} \label{eq:7} \begin{aligned} \theta^* = \arg\min_{\theta} \sum_{i} {(x_i - \hat{x_i}(i_i, \theta))^2} \end{aligned} \end{equation}\](In the steps above, the constant term $log \: \frac{1}{\sigma \sqrt{2\pi}}$ and the positive factor $\frac{1}{2 \sigma^2}$ do not depend on $\theta$, so dropping them does not change the argmax.) And this is exactly the Squared Error loss function. The optimal parameters $\theta^*$ are obtained when we minimize the squared distance/error between our predictions $\hat{x_i}(i_i, \theta)$, which are based on the parameters, and their respective observed outputs. So, under the Gaussian assumption on the deviation between prediction and observation, the maximum likelihood is obtained when we minimize the squared distance between our predictions and observations. $\square $
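As a quick sanity check, here is a small numerical sketch (my own toy example, not part of the derivation: the linear model $\hat{x_i} = \theta \cdot i_i$, the data, the noise level and the NumPy grid search are all made up for illustration). Maximizing the Gaussian log-likelihood over a grid of candidate $\theta$ values picks the same $\theta$ as minimizing the sum of squared errors:

```python
# Toy check: for a made-up model x_hat_i = theta * i_i with Gaussian noise, the theta
# that maximizes the Gaussian log-likelihood equals the theta that minimizes the SSE.
import numpy as np

rng = np.random.default_rng(0)
inputs = rng.uniform(-2, 2, size=200)                      # the inputs i_i
true_theta, sigma = 1.7, 0.5                               # made-up ground truth
outputs = true_theta * inputs + rng.normal(0, sigma, 200)  # the observed x_i

thetas = np.linspace(0.0, 3.0, 601)                        # candidate parameters

def gaussian_log_likelihood(theta):
    residuals = outputs - theta * inputs
    return np.sum(-np.log(sigma * np.sqrt(2 * np.pi)) - residuals**2 / (2 * sigma**2))

def sum_squared_errors(theta):
    return np.sum((outputs - theta * inputs) ** 2)

best_by_likelihood = thetas[np.argmax([gaussian_log_likelihood(t) for t in thetas])]
best_by_squared_error = thetas[np.argmin([sum_squared_errors(t) for t in thetas])]
print(best_by_likelihood, best_by_squared_error)           # the same theta on the grid
```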
Conditional expectation is the best predictor when we minimize Squared Error loss
In the previous section we started with likelihood maximization under the Gaussian assumption and arrived at squared loss minimization. Here, we start with the Squared Error loss and ask: what is the best predictor that minimizes it?
Consider two random variables x (the input) and y (the output). We ask: what is the best predictor function $f^*(x)$ so that:
\[\begin{equation} \begin{aligned} f^* = \arg\min_{f} \mathbb{E}_{x,y}[(y-f(x))^2] \end{aligned} \end{equation}\]The subscripts on the expectation denote that it is taken over the joint distribution of x and y. In other words, we try to find the best prediction given some input x.
Let’s define some shorthand for convenience: $f=f(x), \; z= \mathbb{E}_y[y|x]=z(x)$.
Let’s expand the square and then add the following terms, each of which equals zero: $(z^2+z^2-2z^2)$, $(2yz-2yz)$, $(2zf-2zf)$:
\[\begin{equation} \begin{aligned} \mathbb{E}_{x,y}[(y-f(x))^2] = \mathbb{E}_{x,y}[ y^2-2yf+f^2 ] = \mathbb{E}_{x,y}[ y^2+z^2-2yz+z^2+f^2-2zf+2yz-2yf-2z^2+2zf ] \end{aligned} \end{equation}\]Now let’s group the terms:
\[\begin{equation} \begin{aligned} = \mathbb{E}_{x,y}[(y-z)^2] + \mathbb{E}_{x,y}[(z-f)^2]+ 2\mathbb{E}_{x,y}[(y-z)(z-f)] \end{aligned} \end{equation}\]So we have three terms. The first one does not depend on the predictor $f(x)$, so it does not affect the argmin; it is usually not zero and cannot be reduced by the predictor (it is the irreducible noise). The last term, as we will show below, is equal to zero. So the middle term, which is non-negative, is minimized (it becomes zero) when $f=z=\mathbb{E}_y[y|x]$, meaning that the best predictor is the one that, given some input x, predicts the conditional expectation of y, regardless of what the joint distribution of x and y is.
So now we need to show that the last term is indeed zero. First, using the law of total expectation, we can break the joint expectation: $\mathbb{E}_{x,y}[\cdot]=\mathbb{E}_x[\mathbb{E}_y[\cdot|x]]$
\[\begin{equation} \begin{aligned} \mathbb{E}_{x,y}[(y-z)(z-f)] = \mathbb{E}_x[\mathbb{E}_y[(y-z)(z-f)|x]] \end{aligned} \end{equation}\]Now, since the inner expectation is over y, we can pull out $(z-f)$, which is a function of x only:
\[\begin{equation} \begin{aligned} = \mathbb{E}_x[(z-f)\quad \mathbb{E}_y[(y-z)|x] \:] \end{aligned} \end{equation}\]Use the fact that expectation is linear to split the inner term:
\[\begin{equation} \begin{aligned} = \mathbb{E}_x[(z-f)\quad (\mathbb{E}_y[y|x] - \mathbb{E}_y[z|x]) \:] = \mathbb{E}_x[(z-f)\quad (\mathbb{E}_y[y|x] - \mathbb{E}_y[\mathbb{E}_y[y|x] \: |x]) \:] \end{aligned} \end{equation}\] \[\begin{equation} \begin{aligned} = \mathbb{E}_x[(z-f)\quad (\mathbb{E}_y[y|x] - \mathbb{E}_y[y|x] \cdot \mathbb{E}_y[1|x]) \:] = \mathbb{E}_x[(z-f)\quad (\mathbb{E}_y[y|x] - \mathbb{E}_y[y|x]) \:] = \mathbb{E}_x[(z-f) \cdot 0 \:] = 0 \end{aligned} \end{equation}\]So we can see it is indeed zero. $\quad\square $
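As a quick Monte Carlo sanity check (my own sketch; the joint distribution $y = x^2 + \text{noise}$ and the candidate predictors below are made up for illustration), the conditional expectation $\mathbb{E}[y|x] = x^2$ indeed achieves the lowest mean squared error:

```python
# Monte Carlo illustration with a made-up joint distribution: y = x**2 + noise,
# so E[y|x] = x**2. The conditional mean achieves the lowest mean squared error,
# close to the irreducible noise variance 0.3**2 = 0.09.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=100_000)
y = x**2 + rng.normal(0, 0.3, size=x.shape)

candidates = {
    "conditional mean f(x) = x**2": x**2,
    "identity         f(x) = x": x,
    "constant         f(x) = mean(y)": np.full_like(x, y.mean()),
}
for name, prediction in candidates.items():
    print(name, np.mean((y - prediction) ** 2))
```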
If we had used the absolute error as our loss function, the optimal predictor would output the median of y given the input x.
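A quick numerical check of this claim as well (again my own sketch, with made-up skewed data): among constant predictors, the squared error is minimized near the mean, while the absolute error is minimized near the median.

```python
# Made-up skewed data: the constant minimizing the squared error is (approximately)
# the mean, while the constant minimizing the absolute error is (approximately) the median.
import numpy as np

rng = np.random.default_rng(2)
y = rng.exponential(scale=1.0, size=100_000)   # mean ~ 1.0, median ~ ln(2) ~ 0.693

candidates = np.linspace(0.0, 2.0, 201)
best_squared = candidates[np.argmin([np.mean((y - c) ** 2) for c in candidates])]
best_absolute = candidates[np.argmin([np.mean(np.abs(y - c)) for c in candidates])]
print(best_squared, y.mean())       # both close to 1.0
print(best_absolute, np.median(y))  # both close to 0.693
```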
Deriving Cross Entropy Loss Minimization from Likelihood Maximization
Here, we look at a different model, one that predicts probabilities instead of real-valued scalars (regression): $\hat{x_i}(i_i, \theta) \in [0,1]$, and the true observation $x_i$ can be either 0 or 1.
Here $\hat{x_i}(i_i, \theta)$ means: the probability that the observed label is 1.
For a single example, if the observation is 1, then the likelihood of that observation under our current model is $\hat{x_i}(i_i, \theta)$: the chance that the true label is 1, given our current model, is $\hat{x_i}(i_i, \theta)$. For example, if the observation is 1 and the prediction is a probability of 0.9, then the likelihood of the actual observed value 1, given our current model, is 0.9. If the prediction were 0.2, the likelihood of observing 1 would be only 0.2, i.e. less likely.
If the observation is 0, the likelihood is $1-\hat{x_i}(i_i, \theta)$; for example, if we predicted 0.2, then the likelihood of observing 0 is 0.8.
Combining the two cases into one equation, we get:
\[\begin{equation} \label{eq:8} \begin{aligned} p(x_i | \hat{x_i}(i_i, \theta)) = \hat{x_i}(i_i, \theta) ^ {x_i}(1-\hat{x_i}(i_i, \theta))^{1-x_i} \end{aligned} \end{equation}\]
This is different from the regression (Gaussian) model we presented above:
\[\begin{equation} \label{eq:9} \begin{aligned} p(x_i | \hat{x_i}(i_i, \theta)) = \frac{1}{\sigma \sqrt{2\pi}} e ^ {-\frac{(x_i - \hat{x_i}(i_i, \theta))^2}{2 \sigma^2}} \end{aligned} \end{equation}\]Now we can write the likelihood of seeing the whole sample:
\[\begin{equation} \label{eq:10} \begin{aligned} \mathcal{L} (S= \{ x_i \} | \hat{x_i}(i_i, \theta)) = \prod_{i} \hat{x_i}(i_i, \theta) ^ {x_i}(1-\hat{x_i}(i_i, \theta))^{1-x_i} \end{aligned} \end{equation}\]and then, taking the log and maximizing as before:
\[\begin{equation} \label{eq:11} \begin{aligned} \theta^* = \arg\max_{\theta} log \mathcal{L} (S; \hat{x_i}(i_i, \theta)) =\arg\max_{\theta} \sum_{i} x_i log \hat{x_i}(i_i, \theta) + (1-x_i) log (1-\hat{x_i}(i_i, \theta)) \end{aligned} \end{equation}\] \[\begin{equation} \label{eq:12} \begin{aligned} \theta^* =\arg\min_{\theta} - \sum_{i} x_i log \hat{x_i}(i_i, \theta) + (1-x_i) log (1-\hat{x_i}(i_i, \theta)) \end{aligned} \end{equation}\]And this is the cross entropy loss for two possible outcomes (two classes), which is why it is called Binary Cross Entropy.
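As a small sanity check (my own sketch; the labels below are made up), for a constant predicted probability p the Binary Cross Entropy is exactly the negative Bernoulli log-likelihood, and the minimizing p is the empirical frequency of 1s, i.e. the maximum likelihood estimate:

```python
# Made-up labels: the p that minimizes the binary cross entropy over a grid is the
# empirical frequency of 1s, which is the Bernoulli maximum likelihood estimate.
import numpy as np

labels = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])  # observed x_i (7 ones out of 10)

def binary_cross_entropy(p):
    return -np.sum(labels * np.log(p) + (1 - labels) * np.log(1 - p))

probabilities = np.linspace(0.01, 0.99, 99)
best_p = probabilities[np.argmin([binary_cross_entropy(p) for p in probabilities])]
print(best_p, labels.mean())  # both 0.7
```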
If we have more than two classes/outcomes, then we will have a separate prediction for each outcome/class and also a separate (one-hot) observation for each outcome.
Here $\hat{x_{ij}}(i_i, \theta)$ means: the probability that class j is the true class of example i (i.e., that $x_{ij} = 1$).
Then, our Cross Entropy loss will be:
\[\begin{equation} \label{eq:13} \begin{aligned} \theta^* =\arg\min_{\theta} - \sum_{i} \sum_{j} x_{ij} \: log \: \hat{x_{ij}}(i_i, \theta) \end{aligned} \end{equation}\]where i iterates over the samples and j over the classes. $\square $
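For concreteness, here is a minimal sketch of computing this loss (my own illustration; the one-hot observations and predicted probabilities below are made up):

```python
# x[i, j] is the one-hot observation for sample i, p[i, j] the predicted probability
# that sample i belongs to class j (rows sum to 1).
import numpy as np

x = np.array([[1, 0, 0],
              [0, 0, 1],
              [0, 1, 0]])          # observed classes, one-hot encoded

p = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.1, 0.8],
              [0.3, 0.4, 0.3]])    # predicted class probabilities

cross_entropy = -np.sum(x * np.log(p))  # sum over samples i and classes j
print(cross_entropy)                    # -(log 0.7 + log 0.8 + log 0.4) ~ 1.50
```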
To summarize, we’ve seen how the two popular loss functions (MSE and Cross Entropy) are derived using the same general idea of maximum likelihood.