Action selection and learning rates in Q-learning
Implementing a Q-table reinforcement learner is in many ways simple and straightforward, yet also somewhat tricky. The basic concept is easy to grasp; but, as many have observed, reinforcement learners almost want to work, despite whatever bugs or suboptimal math might lurk in the implementation.
Here are some quick notes about the approach I've come to use, specifically about action selection (e.g. epsilon-greedy versus UCB) and managing learning rates. They've helped my learners converge to good strategies faster and more reliably. Hopefully they can help you, too!
Many implement Q-learning using epsilon-greedy selection, because it's very easy to understand and implement. However, you soon run into its major limitations. For example, with \(\epsilon = 0.1\) the learner takes a random action 10% of the time, so even after it has found the best strategy, its performance stays roughly 10% below optimal.
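For reference, epsilon-greedy selection really is only a few lines. Here's a minimal sketch (the function name and the value list are my own, for illustration):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon, explore by picking a uniformly random
    action; otherwise exploit by picking the action with the highest
    estimated value."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Note that even a fully trained learner keeps paying the exploration cost: that `epsilon` fraction of random actions never goes away unless you decay it by hand.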
Instead of fiddling with \(\epsilon\) (for example, by decreasing it over time), use a better selector.
Thompson Sampling is an excellent choice here, even over UCB. Although experimental evidence has long shown its excellent performance, it's only recently been proven to have optimal regret bounds^{1, 2, 3}. Additionally, being nondeterministic, it can be used in situations with hidden information, such as bluffing.
Thompson Sampling works, essentially, by sampling at every time step from each action's reward distribution and selecting the action with the highest sampled reward. In order to draw such a sample, you must have estimates of the parameters of each action's reward distribution.
Usually, this involves estimates of mean and variance, although specialized distributions may have other parameters that are more stable to estimate from observations.
It's important to note that not only can two actions' rewards have different means, they can also have different variances. So the Q-table needs to store all of the estimated distribution parameters for each action, not just the means.
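A minimal sketch of the selection step, assuming you store a per-action mean, variance, and observation count as described above (the function name and the Gaussian posterior approximation are my choices here, not the only option):

```python
import random

def thompson_select(means, variances, counts):
    """Draw one plausible mean reward per action and pick the best.

    Approximates each action's posterior over its mean as a normal
    whose spread shrinks as 1/sqrt(n) with more observations, so
    well-explored actions get sampled close to their estimated mean
    while uncertain ones still get a chance to win.
    """
    samples = []
    for mu, var, n in zip(means, variances, counts):
        std = (var / max(n, 1)) ** 0.5
        samples.append(random.gauss(mu, std))
    return max(range(len(samples)), key=lambda a: samples[a])
```

Because the draw is random, the selection is nondeterministic, which is exactly the property that makes it usable in hidden-information settings.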
Watkins^{4} gives the Q update as:
\[Q_{t+1}(x_t, a_t) = (1 - \alpha) Q_t(x_t, a_t) + \alpha r_t\]
This is recognizable as an exponential moving average (EMA).
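In code, that update is a one-liner (a sketch for the immediate-reward case, matching the equation above):

```python
def q_update(q, r, alpha=0.1):
    """One Q-update step: blend the old estimate with the new reward.
    With fixed alpha this is exactly an exponential moving average."""
    return (1 - alpha) * q + alpha * r
```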
Because EMAs don't converge, some will substitute an \(\alpha_t\) that decreases over time, often as \(\alpha_t = 1/t\). If there were only a single state/action pair, this would be equivalent to a normal average.
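It's easy to verify the single-pair claim numerically: applying the update with \(\alpha_t = 1/t\) leaves you with exactly the arithmetic mean of the rewards seen so far (a small self-contained check, with made-up reward values):

```python
def running_q(rewards):
    """Apply Q <- (1 - 1/t) Q + (1/t) r for t = 1, 2, ...
    For a single state/action pair this reduces to the sample mean."""
    q = 0.0
    for t, r in enumerate(rewards, start=1):
        alpha = 1.0 / t
        q = (1 - alpha) * q + alpha * r
    return q

# running_q([1.0, 2.0, 3.0, 4.0]) gives 2.5, the arithmetic mean.
```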
With multiple state/action pairs, however, any single pair is updated on only some of the time steps, so the sequence of weights it sees has many skips. This causes its estimate of the mean to converge more slowly.
As another remediation, it is common to set \(\alpha_t = a/t\) with \(a \gg 1\). This can often help, but if \(a\) is too large, it will once again slow convergence.
It may be better to recognize that an average is being calculated, and track the appropriate values for each state/action pair directly. Doing so does increase memory requirements, but ensures that each state/action pair's statistics converge quickly and reliably.
Footnotes

^{1} Agrawal & Goyal. Analysis of Thompson Sampling for the multi-armed bandit problem. Proceedings of Machine Learning Research, vol. 23, 2012.

^{2} Kaufmann, Korda, & Munos. Thompson Sampling: An asymptotically optimal finite-time analysis. Algorithmic Learning Theory, Springer Berlin Heidelberg, 2012.

^{3} Agrawal & Goyal. Further Optimal Regret Bounds for Thompson Sampling. Proceedings of Machine Learning Research, vol. 31, 2013.

^{4} Watkins. Learning from Delayed Rewards. University of Cambridge, 1989.