The Paradox

We know that when we train a model to minimize the squared error loss, the optimal predictor is the one that predicts the conditional mean given the input. See the proof here.
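
For completeness, here is a sketch of the standard argument (the usual decomposition; assuming the relevant expectations exist): for any predictor \(f\),

\[
E\big[(Y - f(X))^2\big] \;=\; E\big[(Y - E[Y \mid X])^2\big] \;+\; E\big[(E[Y \mid X] - f(X))^2\big],
\]

because the cross term vanishes by the tower property of conditional expectation. The second term is minimized (at zero) by choosing \(f(x) = E[Y \mid X = x]\).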

Now, let’s say that somebody gives us two trained models that we run on a test set (example 1):

The test set is \([\,10^5 \times \{x{=}a,\ y{=}1\},\ \{x{=}b,\ y{=}10^4\}\,]\), i.e., \(10^5\) samples with \(x=a, y=1\) and one sample with \(x=b, y=10^4\), so the average of the true labels is \(\approx 1.1\).

Model A predicts \([\,10^5 \times \{x{=}a,\ p{=}0\},\ \{x{=}b,\ p{=}10^4\}\,]\), with a total SE (squared error) of \(10^5\), a total AE (absolute error) of \(10^5\), and an ME (mean error) of \(\approx 1\).

Model B predicts \([\,10^5 \times \{x{=}a,\ p{=}1\},\ \{x{=}b,\ p{=}0\}\,]\), with a total SE of \(10^8\), a total AE of \(10^4\), and an ME of \(\approx 0.1\).

We need to choose which model is best, and we care about the total sum of predictions being as close as possible to the total sum of labels (equivalently, the average of the predictions being close to the average of the labels). Based on squared error, Model A is the better one; however, its average prediction (\(\approx 0.1\)) is the worse of the two, far from the average of the true labels (\(\approx 1.1\)), while Model B's average prediction is \(\approx 1.0\).
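
As a quick sanity check, here is a minimal NumPy sketch (my own, not from the post; the helper `report` is just for printing) that reproduces the numbers for example 1:

```python
import numpy as np

# Example 1: 10^5 samples with x=a, y=1, and one sample with x=b, y=10^4.
y_true = np.concatenate([np.ones(10**5), [10**4]])

# Model A: predicts 0 for x=a and 10^4 for x=b.
pred_a = np.concatenate([np.zeros(10**5), [10**4]])
# Model B: predicts 1 for x=a and 0 for x=b.
pred_b = np.concatenate([np.ones(10**5), [0]])

def report(name, y, p):
    err = y - p
    print(f"{name}: SE={np.sum(err**2):.0f}  AE={np.sum(np.abs(err)):.0f}  "
          f"ME={np.mean(err):.3f}  avg(pred)={np.mean(p):.3f}  avg(label)={np.mean(y):.3f}")

report("Model A", y_true, pred_a)  # SE=100000,    AE=100000, ME~1.0, avg(pred)~0.1
report("Model B", y_true, pred_b)  # SE=100000000, AE=10000,  ME~0.1, avg(pred)~1.0
```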

The paradox: if minimizing the squared error gives us a predictor that predicts the mean, how can the model with the lower squared error end up with the worse prediction average?

Resolution of the Paradox

First, the theorem holds only for the optimal predictor: the one that achieves the minimum possible squared error (in this example that minimum is zero, since each input has a single label value), and neither of the given models is that predictor. The theorem does not say that a model with a better squared error than another model will also have a better prediction average. If we bring in a third model that predicts \(1\) for the \(10^5\) samples with \(x=a\) and \(10^4\) for the single sample with \(x=b\), its squared error is \(0\), its prediction average is \(\approx 1.1\), and it is indeed the optimal predictor.
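
And the same check for that hypothetical third model (again a sketch, same setup as the snippet above):

```python
import numpy as np

# The conditional-mean ("third") model: predicts 1 for every x=a sample and 10^4 for x=b.
y_true = np.concatenate([np.ones(10**5), [10**4]])
pred_c = np.concatenate([np.ones(10**5), [10**4]])

err = y_true - pred_c
print(np.sum(err**2))   # SE = 0
print(np.mean(pred_c))  # ~1.1, matches the average of the true labels
```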

Second, a model trained according to the theorem predicts the (optimal) conditional mean for each input. If we assume that both models were trained on squared error, then the evaluation set clearly contains different input features (here \(a\) and \(b\)); otherwise each model's predictions would have been identical (constant) across the whole set. The theorem talks about predicting the mean given a specific input; it says nothing about averaging those means across different inputs. The quality of the prediction is per specific input, and it does not take into account the marginal distribution of the input (represented by the random variable \(X\)). There is no guarantee that the average of the conditional means \(E[Y \mid X = x_i]\) across our test inputs equals the overall mean \(E[Y]\), unless the test inputs are drawn i.i.d. from the same distribution as the training data and the model truly learned the Bayes predictor \(f^*\).
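
In symbols (assuming a discrete input for simplicity):

\[
E[Y] \;=\; \sum_{x} P(X = x)\, E[Y \mid X = x],
\]

which is a weighted average with weights \(P(X=x)\), whereas the average of the predictions over a test set weights each conditional mean by how often that input appears in the test set. The two coincide only when the test set's input frequencies reflect \(P(X=x)\).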

Third, if we evaluate a well-trained model on a set of samples that all share the same input, naturally all the predictions are the same constant, because the predictor is deterministic; and since a well-trained model predicts the conditional mean, that constant is (approximately) the mean of the labels for that input, which is exactly the constant that minimizes the squared error on this identical-input set.

Fourth, if our metric is to have the best sum of predictions (or average prediction), choose the model whose mean (signed) error is closest to zero, which is the second model in our case (ME of \(\approx 0.1\) versus \(\approx 1\)). Its mean absolute error also happens to be better here, but as example 2 below shows, that is not guaranteed in general.

A few more observations:

If a model can output only a constant prediction (not input-dependent), choosing the global mean of the labels as that constant minimizes the overall MSE (see the short decomposition after the next observation).

If we compare two constant-prediction (not input-dependent) models, the one with the lower MSE has its constant prediction closer to the true average of the labels on the whole test set (same decomposition).
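
Both observations follow from the same one-line identity: for a constant prediction \(c\) and labels \(y_1, \dots, y_n\) with empirical mean \(\bar{y}\),

\[
\frac{1}{n}\sum_{i=1}^{n}(y_i - c)^2 \;=\; \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2 \;+\; (\bar{y} - c)^2,
\]

so the MSE is minimized at \(c = \bar{y}\), and of two constants, the one with the lower MSE is the one closer to \(\bar{y}\).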

When a model's prediction is not constant but input-dependent and deterministic, and it predicts exactly the conditional mean for each specific input, and our eval set represents the true distribution, then this model has the lowest possible (optimal) empirical MSE among deterministic models (models that always predict the same output for the same input), and also the best global average among all possible deterministic models.
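
A sketch of why, under the assumption that the eval set's per-input label means coincide with the true conditional means: grouping the \(n\) eval samples by input value \(x\), with \(n_x\) samples per group and per-group label mean \(\bar{y}_x\), the identity above gives, for any deterministic predictor \(f\),

\[
\frac{1}{n}\sum_{i=1}^{n}\big(y_i - f(x_i)\big)^2 \;=\; \sum_{x} \frac{n_x}{n}\left[\frac{1}{n_x}\sum_{i:\,x_i = x}(y_i - \bar{y}_x)^2 + \big(\bar{y}_x - f(x)\big)^2\right],
\]

so the empirical MSE is minimized by \(f(x) = \bar{y}_x\), and that minimizer's global prediction average, \(\sum_x \frac{n_x}{n}\,\bar{y}_x\), is exactly the global average of the labels.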

The observation above does not mean that if model A has a lower MSE than model B it will also have a better global average; only if model A sits at the MSE optimum among deterministic models is it guaranteed to have the best average.

If an MSE-trained model competes against a non-deterministic model, then in theory the non-deterministic model could fit the labels exactly and beat the deterministic model, even though the latter has the optimal MSE among every possible deterministic model.

If we compare two models, one MSE-trained and one MAE-trained, the winner depends on the distribution of the data. In the example above (example 1), the MAE-trained model wins. In the following example (example 2), the MSE-trained model wins: the test set is \([\,\{x{=}a,\ y{=}1\},\ \{x{=}a,\ y{=}1\},\ \{x{=}a,\ y{=}4\}\,]\). The MAE-trained model always predicts \(1\) (the median), with an SE of \(9\), an AE of \(3\), and an ME of \(1\). The MSE-trained model always predicts \(2\) (the mean), with an SE of \(6\), an AE of \(4\), and an ME of \(0\).
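
A minimal sketch (same conventions as the earlier snippets) that reproduces these numbers:

```python
import numpy as np

# Example 2: three samples, all with x=a, labels 1, 1, 4.
y_true = np.array([1.0, 1.0, 4.0])

pred_mae = np.full(3, 1.0)  # MAE-trained: predicts the median (1)
pred_mse = np.full(3, 2.0)  # MSE-trained: predicts the mean (2)

for name, p in [("MAE-trained", pred_mae), ("MSE-trained", pred_mse)]:
    err = y_true - p
    print(f"{name}: SE={np.sum(err**2):.0f}  AE={np.sum(np.abs(err)):.0f}  ME={np.mean(err):.2f}")
# MAE-trained: SE=9, AE=3, ME=1.00
# MSE-trained: SE=6, AE=4, ME=0.00
```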

\[\square\]