Holger von Jouanne-Diedrich takes us through concepts in reinforcement learning:
At its core, this can be stated as the problem of a gambler who wants to play one-armed bandits: if there are several machines with different winning probabilities (the so-called multi-armed bandit problem), which machine should he play? He could “exploit” one machine or “explore” different machines. So what is the best strategy, given a limited amount of time… and money?
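To make the setup concrete, here is a minimal R sketch of such a set of machines; the pull function, the winning probabilities, and the seed are made up for illustration and are not from the original post:

set.seed(123)
probs <- c(0.3, 0.5, 0.7)   # hypothetical winning probabilities of three machines

# pull machine i: returns 1 (win) with probability probs[i], otherwise 0 (loss)
pull <- function(i) rbinom(1, size = 1, prob = probs[i])

pull(2)   # play the second machine once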
There are two extreme cases: no exploration, i.e. playing only one randomly chosen bandit, and no exploitation, i.e. playing all bandits randomly – so obviously we need some middle ground between those two extremes. We have to start with one randomly chosen bandit, try different ones after that, and compare the results. So in the simplest case, the first variable
e=0.1
is the probability of switching to a random bandit; with probability 1 - e we stick with the best bandit found so far.
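As a rough sketch of this epsilon-greedy idea (not the post's exact code; the winning probabilities, the number of plays, and all variable names besides e are assumptions), the loop below explores a random machine with probability e and otherwise exploits the machine with the best observed win rate:

set.seed(123)
probs <- c(0.3, 0.5, 0.7)    # hypothetical winning probabilities
n_bandits <- length(probs)
e <- 0.1                     # probability of exploring a random bandit
n_plays <- 1000

wins  <- numeric(n_bandits)  # accumulated wins per bandit
plays <- numeric(n_bandits)  # number of times each bandit was played

for (t in seq_len(n_plays)) {
  if (runif(1) < e || all(plays == 0)) {
    i <- sample(n_bandits, 1)              # explore: pick a random bandit
  } else {
    i <- which.max(wins / pmax(plays, 1))  # exploit: best observed win rate
  }
  reward   <- rbinom(1, size = 1, prob = probs[i])
  wins[i]  <- wins[i] + reward
  plays[i] <- plays[i] + 1
}

wins / pmax(plays, 1)   # estimated win rates per machine

With e = 0.1, most plays should end up on the machine with the highest true winning probability, while the occasional random pull keeps the estimates for the other machines from going stale.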
Click through for various cases and a pathfinding example in R. H/T R-Bloggers