Sampling

Sampling from a known distribution p refers to the process of generating random observations or data points according to the probability distribution of interest.

Usually, our computers can only sample from a uniform distribution, using a pseudorandom number generator (PRNG), and therefore we need to be able to use this simple building block to be able to generate numbers from more complex distributions, preferably in good complexity.

So let’s see methods in which we can sample from some complex distribution using only sampling from a uniform distribution.

Sample from categorical distribution == direct sampling == discrete sampling == Inverse transform sampling

Our random variable x can have k values: 1, …, k, each with probability $\pi_1, …, \pi_k$. Simply sample U(0,1), meaning sample uniformly a number between 0 to 1, and choose

\[x=max \{ i \; \lvert \; \pi_1 + ... + \pi_{i-1} < U \}\]

This is essentially dividing the unit interval into k regions, each having its size $\pi_i$, and return the value of the region in which our uniform sample falls.

In this process we only sampled one number U(0,1), so it is $\textit{sample efficient}$. The disadvantage is that we need a loop, to check where our uniform sample falls, so the complexity grows with the number of categories, and it’s impossible to perform in the continuous case.

This is basically the inverse transform sampling, because our uniform sample U(0,1) can be interpreted as probability, and we return the smallest number (or index) such that the $CDF(i) \geq U$. Inverse transform sampling can work in the continuous case: If we managed to calculate the CDF of p(x), then we just output the number x of which $CDF(x) = U$, or in other words we output $CDF^{-1}(U)$, that’s why it is called $\textit{inverse transform}$.

Rejection sampling

The idea is to sample a number uniformly, and keep it only in probability p(x), otherwise repeat the process. More specifically:

Sample uniformly a number x, from all possible values of the distribution (also called the ‘support’ of the distribution).
Sample uniformly a number y, from the range $ [0, \text{max } p(x) ] $.
If $y > p(x) \; $ reject the x value and return to step 1, otherwise output x.

The advantage is that this method is simple to code. The disadvantage is that there can be cases where we need to reject many samples, until we find the one we need.

Custom Methods

There are custom methods for sampling from some known distributions. For example, sampling from a gaussian (normal) distribution.

Importance ‘sampling’ $\rightarrow$ Importance ‘Estimation’

This is not a sampling method, but rather a method to estimate the expected value of a function under some probability, with a (potentially) lower variance , than a regular monte carlo estimation. I only included it in this article because it has the term ‘sampling’ in it, although I would personally change its name into ‘Importance Estimation’.

The definition of expectation of f(x) under probability p(x) is:

\[\mathbb{E}_{x \sim p}[f(x)] = \sum{}p(x)f(x)\]

Using Monte Carlo, we can estimate this expectation by taking samples (the more the merrier) :

\[\mathbb{E}_{x \sim p}[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i) \quad x_i \sim p(x) \equiv s\]

Now, this estimator that we call s, is actually a random variable because every time we calculate it, using Monte Carlo, we get a slightly different result, since Monte Carlo is stochastic. Of course we would prefer a lower variance of this result, since it will guarantee more accurate estimation of the true expectation we’re interested in. It turns out (proof needed), that the variance of this estimator is:

\[\mathbb{V}_{x \sim p} [s] = \frac{1}{N} \mathbb{V}_{x \sim p} [f(x)]\]

In words: the variance of the estimator we calculate using monte carlo is equal to the variance of f(x) under the probability distribution p(x), divided by the number of samples we used. Hence, increasing the number of samples, reduces the variance, which is a good thing. By the way, the mean of the estimator, which is the center of the estimator distribution, is centered exactly on the true estimation $\sum{}p(x)f(x)$ we’re trying to find, this is why it is called an unbiased estimator. Now, the question is if we can improve this variance, and that’s where importance sampling comes into play.

Maybe, we can find another distribution $q(x)$, that will help us reduce the variance:

\[\mathbb{E}_{x \sim p}[f(x)] = \sum{}p(x)f(x) = \sum{}q(x) \frac{p(x)}{q(x)}f(x) = \mathbb{E}_{x \sim q}[\frac{p(x)}{q(x)} f(x)]\]

And now the estimator variance is:

\[\mathbb{V}_{x \sim q} [s] = \frac{1}{N} \mathbb{V}_{x \sim q} [\frac{p(x)}{q(x)} f(x)]]\]

So, if we find q(x) which satisfies this condition

\[\mathbb{V}_{x \sim q} [\frac{p(x)}{q(x)} f(x)]] < \mathbb{V}_{x \sim p} [f(x)]\]

We improved (lowered) our estimation variance.