Interactive Bandits

To help students get a better feel for three of the most popular “multi-armed bandit” exploration/exploitation balancing strategies (Epsilon Greedy, Thompson Sampling, and Upper Confidence Bound), I combined my R package “contextual” with the neat “animation” package to create some cool interactive animated simulations. The complete source code of all three animations can be found in this Gist.

The exploration/exploitation trade-off is a dilemma we frequently face in choosing between options. Should you choose what you know and get something close to what you expect (โ€˜exploitโ€™) or choose something you are not all that sure about and possibly learn more (โ€˜exploreโ€™)? This happens all the time in everyday life โ€” Choose your favourite restaurant, or the new one? Remain in your current job, or hunt around?

The best way to get a feel for each algorithm is probably to step through each simulation at a leisurely pace. Tip: you can also use the arrow keys on your keyboard to step through the animations.

Epsilon Greedy Policy

First, a code snippet that runs a “contextual” based Epsilon Greedy policy simulation with an epsilon of 0.1. For the full source code (including the part that compiles the animation), see here.

An Epsilon Greedy bandit policy simply choose an arm at random (explores) with probability epsilon, otherwise it greedily chooses (exploits) the arm with the highest estimated reward.

The best arm in the simulation below is arm two (with an average reward probability of 0.5), and epsilon has been set to 0.1 – that is, the policy will select the current “best” arm 90% of the time, but explore arms randomly for 10% of the time.


horizon            <- 100
weights            <- c(0.4, 0.5, 0.3)

policy             <- EpsilonGreedyPolicy$new(epsilon = 0.1)
bandit             <- BasicBernoulliBandit$new(weights = weights)
agent              <- Agent$new(policy,bandit, "EG")
simulator          <- Simulator$new(agents      = agent,
                                    horizon     = horizon,
                                    set_seed    = 8,
                                    save_theta  = TRUE,
                                    simulations = 1)
hist <- simulator$run()

UCB1 Policy

Second, a code snippet that runs a "contextual" based UCB1 policy simulation. The full source code can be found here.

An UCB1 bandit policy constructs an optimistic estimate of options with high uncertainty in the form of the Upper Confidence Bound (UCB) of the expected reward of each arm, picking the arm with the highest estimate for every consecutive time step. If the choice is wrong, the optimistic estimate quickly decreases, till another arm has the higher upper confidence estimate.

The best arm in this simulation is arm three (with an average reward probability of 0.8). The dashed line represents the currently highest upper-confidence bound. The dotted line represents the currently highest average mean. Importantly, the arm with the highest upper-confidence bound does not necessarily equal the arm with the highest average.


horizon            <- 50
weights            <- c(0.5, 0.1, 0.8, 0.2, 0.4)

policy             <- UCB1Policy$new()
bandit             <- BasicBernoulliBandit$new(weights = weights)
agent              <- Agent$new(policy,bandit, "UCB1")
simulator          <- Simulator$new(agents      = agent,
                                    horizon     = horizon,
                                    save_theta  = TRUE,
                                    simulations = 1)

hist               <- simulator$run()

Thompson Sampling Policy

And third, a code snippet that runs a "contextual" based Thompson Sampling policy simulation. Full source code again here.

A Thompson Sampling bandit policy works by maintaining a prior distribution on the the mean rewards of its arms. In this, it follows a beta-binomial model with parameters alpha and beta, sampling values for each arm based on their prior distributions, then choosing the arm with the highest sampled value. When an arm is pulled and its binary reward is received, the policy modifies the prior distribution of that particular arm accordingly. This procedure is repeated for the next arm pull.

The best arm in this last simulation is arm three (with an average reward probability of 0.6). The curves represent the current prior distribution on the mean rewards per arm, the vertical lines represent random samples for each arm from these distributions.


horizon            <- 50
weights            <- c(0.5, 0.4, 0.6)

policy             <- ThompsonSamplingPolicy$new()
bandit             <- BasicBernoulliBandit$new(weights = weights)
agent              <- Agent$new(policy,bandit, "TS")
simulator          <- Simulator$new(agents      = agent,
                                    horizon     = horizon,
                                    set_seed    = 22,
                                    save_theta  = TRUE,
                                    simulations = 1)

hist               <- simulator$run()

Leave a Reply