Biased and Unbiased Estimators
Some definitions
A random variable is a function that maps the input domain (possible outcomes or events, which may have nothing to do with numbers) to numerical values (so we can apply math to them, like expectation and variance). Despite its name, a random variable is actually a deterministic function that has nothing to do with randomness, and would better be called a ‘measurable function’.
A single number drawn from a random variable (i.e., from that function) is called a sample or a realization of that random variable (usually with respect to some probability distribution).
In statistics, however, a sample usually refers to a set of (a few) observations drawn from a distribution; a single item in a sample is called an observation.
An estimator is a function that assigns a parameter estimate (a number) to each possible sample (set of observations) we might observe.
An estimator of a given parameter is said to be unbiased if its expected value (across many random samples) is equal to the true value of the parameter. In other words, an estimator is unbiased if it produces parameter estimates that are on average correct.
Two useful properties of estimators:
Unbiased estimator: an estimator whose expected value equals the true parameter, even for small sample sizes: \(\mathbb{E}[\hat{\theta}]=\theta\)
Consistent estimator: an estimator that converges to the true value as the size of a single sample grows
So in simple words: unbiasedness is about the average over many samples, regardless of their size, while consistency is about what happens to an estimator computed from a single sample as that sample's size grows. A small simulation contrasting the two is sketched below.
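To make the distinction concrete, here is a minimal simulation sketch in Python with NumPy, using the sample mean as the estimator (the distribution and the sample sizes are arbitrary choices): it is unbiased at every sample size, and the estimate from a single growing sample approaches the true value.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 5.0  # true population mean (arbitrary choice)

# Unbiasedness: average the estimator over many independent samples,
# even when each sample is tiny (n = 3).
small_sample_means = [rng.normal(mu, 2.0, size=3).mean() for _ in range(100_000)]
print("average of many small-sample means:", np.mean(small_sample_means))  # ~ 5.0

# Consistency: one large sample; the estimate computed from its first n
# observations approaches the true value as n grows.
big = rng.normal(mu, 2.0, size=100_000)
for n in (10, 1_000, 100_000):
    print(f"mean of first {n} observations:", big[:n].mean())
```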
Example: Sample MSD (mean squared deviation, from sample mean)
We have a population with mean $\mu$ and variance $\sigma^2$. We would like a statistic/estimator $s^2$ such that \(\mathbb{E}[s^2]=\sigma^2\) (meaning that it is unbiased); the expectation is across many samples. Let’s see what happens when we define it as the MSD (mean squared deviation) from the sample mean $\bar x$:
\[\begin{equation} \begin{aligned} \mathbb{E}[s^2]\equiv \mathbb{E}[\frac{\sum_{i}(x_i-\bar x)^2}{n}]=\frac{1}{n}\mathbb{E}[\sum_{i}(x_i-\bar x)^2] \end{aligned} \end{equation}\]And this expression is equal to:
\[\begin{equation} \begin{aligned} \frac{1}{n}\mathbb{E}[\sum_{i}((x_i-\mu)-(\bar x-\mu) )^2]= \frac{1}{n}\mathbb{E}[\sum_{i} x_i^2-2x_i\mu+\mu^2-2(x_i-\mu)(\bar x-\mu) +\bar x^2-2\bar x \mu+\mu^2] \end{aligned} \end{equation}\] \[\begin{equation} \begin{aligned} = \frac{1}{n}\mathbb{E}[\sum_{i} x_i^2-2x_i\mu+\mu^2-2( x_i \bar x -x_i \mu -\mu \bar x + \mu^2 ) +\bar x^2-2\bar x \mu+\mu^2] \end{aligned} \end{equation}\] \[\begin{equation} \begin{aligned} = \frac{1}{n}\mathbb{E}[\sum_{i} x_i^2 -2( x_i \bar x ) +\bar x^2 ] = \frac{1}{n}\mathbb{E}[\sum_{i} (x_i-\bar x)^2 ] \end{aligned} \end{equation}\]which is exactly the expression we started from, so adding and subtracting $\mu$ changes nothing. Going back to the first line above and expanding the square around $\mu$ without simplifying, we can write:
\[\begin{equation} \begin{aligned} = \frac{1}{n}\mathbb{E}[ \sum_{i} (x_i-\mu)^2 -2 \sum_{i} (x_i-\mu)(\bar x -\mu) + \sum_{i} (\bar x -\mu)^2 ] \end{aligned} \end{equation}\] \[\begin{equation} \begin{aligned} = \frac{1}{n}\mathbb{E}[ \sum_{i} (x_i-\mu)^2 -2 (\bar x -\mu) \sum_{i} (x_i-\mu) + n (\bar x -\mu)^2 ] \end{aligned} \end{equation}\] \[\begin{equation} \begin{aligned} = \frac{1}{n}\mathbb{E}[ \sum_{i} (x_i-\mu)^2 -2n (\bar x -\mu)^2 + n(\bar x -\mu)^2 ] \end{aligned} \end{equation}\] \[\begin{equation} \begin{aligned} = \frac{1}{n}[ \sum_{i} \mathbb{E}[(x_i-\mu)^2] - n\mathbb{E}[(\bar x -\mu)^2 ]] \end{aligned} \end{equation}\](The third line uses $\sum_{i}(x_i-\mu)=n(\bar x-\mu)$.) Now, we know that $ \mathbb{E}[(x_i-\mu)^2]=\sigma^2 $, since this is the definition of the population variance. Also, because the $x_i$ are i.i.d., $ \mathbb{E}[(\bar x -\mu)^2]=\mathrm{Var}(\bar x)=\mathrm{Var}(\frac{1}{n}\sum_{i} x_i)=\frac{1}{n^2}\cdot n\sigma^2=\frac{\sigma^2}{n}$, meaning that the variance of the sample mean is smaller than the population variance by a factor of $n$. Putting it together:
\[\begin{equation} \begin{aligned} = \frac{1}{n}[ n \sigma^2 - n\frac{\sigma^2}{n} ] = \frac{\sigma^2}{n}[ n - 1] \end{aligned} \end{equation}\]So we can see that the MSD is a biased estimator of $\sigma^2$: its expectation is $\frac{n-1}{n}\sigma^2$ rather than $\sigma^2$. A quick simulation of this bias (and of the correction derived next) is sketched below.
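As a sanity check, here is a minimal simulation sketch in Python with NumPy (the population parameters, the sample size $n=5$, and the number of repetitions are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 3.0, 2.0, 5, 100_000

msd = np.empty(reps)        # sum of squared deviations divided by n (the MSD above)
corrected = np.empty(reps)  # divided by n - 1 (the correction derived next)
for r in range(reps):
    x = rng.normal(mu, sigma, size=n)   # one sample of size n
    dev2 = (x - x.mean()) ** 2          # squared deviations from the sample mean
    msd[r] = dev2.sum() / n
    corrected[r] = dev2.sum() / (n - 1)

print("true variance          :", sigma**2)          # 4.0
print("average MSD            :", msd.mean())        # ~ sigma^2 * (n-1)/n = 3.2
print("average corrected s^2  :", corrected.mean())  # ~ 4.0
```

With $n=5$ the bias is large (a factor of $4/5$); as $n$ grows the factor $\frac{n-1}{n}$ approaches 1, which is why the MSD is still a consistent estimator of $\sigma^2$ even though it is biased at every finite $n$.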
If we want to correct it, we can rescale:
\[\begin{equation} \begin{aligned} \frac{n}{n-1} \mathbb{E}[s^2] = \sigma^2 \quad \square \end{aligned} \end{equation}\]Equivalently, dividing the sum of squared deviations by $n-1$ instead of $n$ (Bessel's correction) gives an unbiased estimator of the population variance.
Example: Sample MSE
Sample MSE: Proof that the sample MSE (of predictions against labels) on an independent test set is indeed an unbiased estimator of the true (population) mean squared error. We want to show that for a given (fixed) predictor $\hat f$, the sample MSE,
\[\begin{equation} \begin{aligned} R=\frac{1}{n} \sum_{i} (y_i-\hat f(x_i))^2 \end{aligned} \end{equation}\]has expected value equal to the true error, \(\mathbb{E}[R]=\mathbb{E}_{x,y\sim P}[(y-\hat f(x))^2]\), when the test set is sampled i.i.d. from $P$. So:
\[\begin{equation} \begin{aligned} \mathbb{E}[R]=\mathbb{E}[\frac{1}{n} \sum_{i} (y_i-\hat f(x_i))^2]=\frac{1}{n} \sum_{i} \mathbb{E}[ (y_i-\hat f(x_i))^2]=\frac{1}{n} \sum_{i} \mathbb{E}_{x,y\sim P}[ (y-\hat f(x))^2] \end{aligned} \end{equation}\] \[\begin{equation} \begin{aligned} =\frac{1}{n} n \mathbb{E}_{x,y\sim P}[ (y-\hat f(x))^2]= \mathbb{E}_{x,y\sim P}[ (y-\hat f(x))^2] \end{aligned} \end{equation}\](using linearity of expectation and the fact that each $(x_i,y_i)$ is an i.i.d. draw from $P$ while $\hat f$ is fixed). So the sample MSE, without any correction, is an unbiased estimator of the true error. $\quad \square$ A small simulation sketch follows.
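Here is a minimal simulation sketch of this fact in Python with NumPy. The synthetic population, the noise level, the fixed predictor `f_hat`, and the sizes are all arbitrary, hypothetical choices for the demo; the "true" MSE is approximated with one very large sample from the same population:

```python
import numpy as np

rng = np.random.default_rng(1)

def f_hat(x):
    # a fixed, deliberately imperfect predictor (arbitrary choice for the demo)
    return 1.8 * x + 0.9

def draw(size):
    # synthetic population P: y = 2x + 1 + Gaussian noise (arbitrary choice)
    x = rng.uniform(-1, 1, size)
    y = 2 * x + 1 + rng.normal(0, 0.5, size)
    return x, y

# "true" MSE of f_hat, approximated with one very large sample from P
x_big, y_big = draw(2_000_000)
true_mse = np.mean((y_big - f_hat(x_big)) ** 2)

# sample MSE on many small i.i.d. test sets of size n
n, reps = 20, 50_000
test_mse = np.empty(reps)
for r in range(reps):
    x, y = draw(n)
    test_mse[r] = np.mean((y - f_hat(x)) ** 2)

print("true MSE (approx)   :", true_mse)
print("mean of sample MSEs :", test_mse.mean())  # should match true_mse closely
```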
Example: Sample MAE
The same proof applies to the sample MAE (just replace the squared error with the absolute error; only linearity of expectation is used), which makes it unbiased as well. A quick numerical check, continuing the sketch above, is shown below.
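Continuing the previous sketch (this reuses the hypothetical `draw`, `f_hat`, `x_big`, `y_big`, `n`, and `reps` defined above), the same check for the MAE:

```python
# population MAE of f_hat, approximated with the same very large sample
true_mae = np.mean(np.abs(y_big - f_hat(x_big)))

# sample MAE on many small i.i.d. test sets of size n
test_mae = np.empty(reps)
for r in range(reps):
    x, y = draw(n)
    test_mae[r] = np.mean(np.abs(y - f_hat(x)))

print("true MAE (approx)   :", true_mae)
print("mean of sample MAEs :", test_mae.mean())  # should match true_mae closely
```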