Find the optimal policy in an MDP using TD(0) learning and UCB
Here you can see a simple MDP with sparse rewards: there are only two terminal states, with rewards 1 and 2, and all other states have a reward of zero. Within each node you can see the value estimate V plus a confidence term, together comprising the Upper Confidence Bound (UCB1, see the paper), according to:
\[\begin{equation} \hat{v}+\sqrt{\frac{2\ln(N)}{n}} \label{eq:1} \end{equation}\]where $\hat{v}$ is our current value estimate, $N$ is the number of visits to the node's parent, and $n$ is the number of visits to the node itself. In addition, you can see the number of visits to each node, denoted by #. In each iteration, the MDP tree is traversed: at every decision point the child with the highest UCB is chosen, until a terminal leaf is reached. Then, each pair of adjacent nodes in the resulting trajectory is used to update the parent node according to:
\[V_{\text{parent}} \leftarrow V_{\text{parent}} + 0.1\,(V_{\text{child}} - V_{\text{parent}})\]No discounting $\gamma$ was used, since the MDP only has terminal rewards. After a few dozen iterations, the optimal trajectory is consistently favored, indicating convergence to the optimal policy.
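To make one iteration concrete, here is a minimal Python sketch (not taken from the original implementation): the `Node` class, the tiny two-branch tree, and the choice to apply the backups from the leaf upward are all illustrative assumptions.

```python
import math

class Node:
    """A node in a toy MDP tree (hypothetical stand-in, not the MDP shown above)."""
    def __init__(self, reward=0.0, children=None):
        self.reward = reward            # terminal reward (0 for non-terminal nodes)
        self.children = children or []
        self.v = 0.0                    # current value estimate V
        self.visits = 0                 # number of visits, shown as '#' above

def ucb1(child, parent):
    """UCB1 score from eq. (1): value estimate plus exploration bonus."""
    if child.visits == 0:
        return float('inf')             # unvisited children are always tried first
    return child.v + math.sqrt(2 * math.log(parent.visits) / child.visits)

def run_iteration(root, alpha=0.1):
    """One iteration: descend by highest UCB, then TD(0) backups along the path."""
    node = root
    node.visits += 1
    path = [node]
    while node.children:
        node = max(node.children, key=lambda c: ucb1(c, node))
        node.visits += 1
        path.append(node)
    path[-1].v = path[-1].reward        # terminal leaf: its value is its reward
    # V_parent += alpha * (V_child - V_parent), applied from the leaf upward
    for parent, child in reversed(list(zip(path[:-1], path[1:]))):
        parent.v += alpha * (child.v - parent.v)

# Toy tree with two terminal leaves of reward 1 and 2.
root = Node(children=[Node(children=[Node(reward=1.0)]),
                      Node(children=[Node(reward=2.0)])])
for _ in range(100):
    run_iteration(root)
print([round(c.v, 2) for c in root.children])  # the reward-2 branch ends with the higher estimate
```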
Let’s see where \eqref{eq:1} comes from.
Let’s take the Chernoff-Hoeffding bound: $\Pr[\hat{v} > v + \epsilon]\leq e^{-2n\epsilon^2}$, where $n$ is the number of samples used to compute the empirical average $\hat{v}$, $v$ is the true mean, and $\epsilon$ is any positive number we choose.
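To get a quick sense of scale (numbers chosen only for illustration), with $n = 100$ samples and $\epsilon = 0.1$ the bound gives

\[\Pr[\hat{v} > v + 0.1] \leq e^{-2 \cdot 100 \cdot 0.1^2} = e^{-2} \approx 0.14,\]and it keeps shrinking exponentially as $n$ grows.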
Now set $\epsilon = \sqrt{\frac{2\ln N}{n}}$, as in the UCB1 formula, mixing in $N$, a new variable that denotes the number of visits to the parent node. Then we get:
\[\Pr[\hat{v} > v + \epsilon]\leq e^{-2n\epsilon^2}=e^{-4\ln N} = N^{-4}\]So we can see that as $N$ (the number of times the parent node is visited) grows, the probability that the empirical average $\hat{v}$ exceeds the confidence bound around the true mean $v$ becomes very small, making the confidence bound very reliable.
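If you want to check this numerically, here is a small Monte Carlo sketch (my own, with arbitrary parameters): it draws Bernoulli samples as a stand-in for bounded rewards and compares the observed exceedance frequency to the $N^{-4}$ bound.

```python
import math
import random

def exceedance_frequency(v=0.5, n=50, N=10, trials=100_000, seed=0):
    """Monte Carlo estimate of Pr[v_hat > v + eps] for eps = sqrt(2 ln N / n),
    using Bernoulli(v) samples as an arbitrary bounded-reward stand-in."""
    rng = random.Random(seed)
    eps = math.sqrt(2 * math.log(N) / n)
    hits = 0
    for _ in range(trials):
        v_hat = sum(rng.random() < v for _ in range(n)) / n
        if v_hat > v + eps:
            hits += 1
    return hits / trials, N ** -4

freq, bound = exceedance_frequency()
print(f"simulated frequency: {freq:.6f}   bound N^-4: {bound:.6f}")
```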
$\epsilon$-greedy
Let’s see what happens when we apply $\epsilon$-greedy across 200 steps. At every step a random action is chosen with probability $\epsilon$; otherwise the max-value action is chosen. $\epsilon$ starts at 1.0 and decays linearly to 0.0 at step 200.
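For comparison, here is a minimal sketch of that schedule on a toy two-action problem; the deterministic rewards of 1 and 2 and the reuse of the 0.1 learning rate are illustrative assumptions, not the exact setup above.

```python
import random

def epsilon_greedy_run(rewards=(1.0, 2.0), steps=200, alpha=0.1, seed=0):
    """Epsilon-greedy on a toy 2-action problem (a stand-in for the MDP above).
    epsilon decays linearly from 1.0 at step 0 toward 0.0 at step `steps`."""
    rng = random.Random(seed)
    q = [0.0] * len(rewards)                   # value estimates per action
    for step in range(steps):
        epsilon = 1.0 - step / steps           # linear decay: 1.0 -> 0.0
        if rng.random() < epsilon:
            a = rng.randrange(len(q))          # explore: uniformly random action
        else:
            a = max(range(len(q)), key=q.__getitem__)  # exploit: max-value action
        q[a] += alpha * (rewards[a] - q[a])    # same TD-style update as above
    return q

print(epsilon_greedy_run())  # the estimate for the reward-2 action should end higher
```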