## Action-selection and learning-rates in Q-learning

Implementing a Q-table reinforcement-learner is in many ways simple and straight-forward and also somewhat tricky. The basic concept is easy to grasp; but, as many have mentioned, reinforcement-learners almost want to work, despite whatever bugs or sub-optimal math might be in the implementation.

Here are some quick notes about the approach I've come to use, specifically about action-selection (e.g. epsilon-greedy versus UCB) and managing learning-rates. They've helped my learners converge to good strategies faster and more reliably. Hopefully they can help you, too!

Many implement Q-learning using epsilon-greedy selection, because it's very easy to understand and implement. However, you soon run into its major limitations. For example, if $$\epsilon = 0.1$$, performance will always be at least 10% below optimal.

Instead of fiddling with $$\epsilon$$ (for example, by decreasing it over time), use a better selector.

Thompson Sampling is an excellent choice here, even over UCB. Although experimental evidence has long shown its excellent performance, it's only recently been proven to have optimal regret bounds1, 2, 3. Additionally, being non-deterministic, it can be used in situations with hidden information, such as bluffing.

Thompson Sampling works, essentially, by sampling at every time step from each action's reward distribution and selecting the action with the highest reward. In order to draw such a sample, you must have estimates of the parameters of each action's reward distribution.

Usually, this involves estimates of mean and variance, although specialized distributions may have other parameters that are more stable to estimate from observations.

It's important to note that, not only can two actions' rewards have different means, but also different variances. So, the Q-table needs to store all of the estimated distribution parameters for each action, not just the means.

Watkins4 gives the Q update as:

$Q_{t+1}(x_t, a_t) = (1 - \alpha) Q_t(x_t, a_t) + \alpha r_t$

This is recognizable as an exponential moving average (EMA).

Because EMAs don't converge, some will substitute an $$\alpha_t$$ that decreases over time, often as $$\alpha_t = 1/t$$. If there were only a single state/action pair, this would be equivalent to a normal average.

With multiple state/action pairs, however, the weights applied to any single state/action pair will have many skips in the sequence. This causes the estimate of the mean to converge more slowly.

As another remediation, it is common to set $$\alpha_t = a/t$$ with $$a \gg 1$$. This can often help, but if $$a$$ is too large, it will once again slow convergence.

It may be better to recognize that an average is being calculated, and track the appropriate values for each state/action pair directly. Doing so does increase memory requirements, but ensures that each state/action pair's statistics converge quickly and reliably.