Lai & Robbins lower-bound for stochastic bandit with full restart points

First, make sure you are in the main folder, or that SMPyBandits is installed, then import Evaluator from the Environment package:

In [1]:
import numpy as np
import matplotlib.pyplot as plt
In [2]:
!pip install SMPyBandits watermark
%load_ext watermark
%watermark -v -m -p SMPyBandits -a "Lilian Besson"
Requirement already satisfied: SMPyBandits in ./venv3/lib/python3.6/site-packages (0.9.4)
Requirement already satisfied: watermark in ./venv3/lib/python3.6/site-packages (1.7.0)
Requirement already satisfied: seaborn in ./venv3/lib/python3.6/site-packages (from SMPyBandits) (0.9.0)
Requirement already satisfied: scikit-learn in ./venv3/lib/python3.6/site-packages (from SMPyBandits) (0.20.0)
Requirement already satisfied: numpy in ./venv3/lib/python3.6/site-packages (from SMPyBandits) (1.15.4)
Requirement already satisfied: scipy>0.9 in ./venv3/lib/python3.6/site-packages (from SMPyBandits) (1.1.0)
Requirement already satisfied: joblib in ./venv3/lib/python3.6/site-packages (from SMPyBandits) (0.13.0)
Requirement already satisfied: scikit-optimize in ./venv3/lib/python3.6/site-packages (from SMPyBandits) (0.5.2)
Requirement already satisfied: matplotlib>=2 in ./venv3/lib/python3.6/site-packages (from SMPyBandits) (3.0.2)
Requirement already satisfied: ipython in ./venv3/lib/python3.6/site-packages (from watermark) (7.1.1)
Requirement already satisfied: pandas>=0.15.2 in ./venv3/lib/python3.6/site-packages (from seaborn->SMPyBandits) (0.23.4)
Requirement already satisfied: python-dateutil>=2.1 in ./venv3/lib/python3.6/site-packages (from matplotlib>=2->SMPyBandits) (2.7.5)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in ./venv3/lib/python3.6/site-packages (from matplotlib>=2->SMPyBandits) (2.3.0)
Requirement already satisfied: cycler>=0.10 in ./venv3/lib/python3.6/site-packages (from matplotlib>=2->SMPyBandits) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in ./venv3/lib/python3.6/site-packages (from matplotlib>=2->SMPyBandits) (1.0.1)
Requirement already satisfied: decorator in ./venv3/lib/python3.6/site-packages (from ipython->watermark) (4.3.0)
Requirement already satisfied: setuptools>=18.5 in ./venv3/lib/python3.6/site-packages (from ipython->watermark) (40.6.2)
Requirement already satisfied: backcall in ./venv3/lib/python3.6/site-packages (from ipython->watermark) (0.1.0)
Requirement already satisfied: traitlets>=4.2 in ./venv3/lib/python3.6/site-packages (from ipython->watermark) (4.3.2)
Requirement already satisfied: jedi>=0.10 in ./venv3/lib/python3.6/site-packages (from ipython->watermark) (0.13.1)
Requirement already satisfied: pexpect; sys_platform != "win32" in ./venv3/lib/python3.6/site-packages (from ipython->watermark) (4.6.0)
Requirement already satisfied: pygments in ./venv3/lib/python3.6/site-packages (from ipython->watermark) (2.2.0)
Requirement already satisfied: prompt-toolkit<2.1.0,>=2.0.0 in ./venv3/lib/python3.6/site-packages (from ipython->watermark) (2.0.7)
Requirement already satisfied: pickleshare in ./venv3/lib/python3.6/site-packages (from ipython->watermark) (0.7.5)
Requirement already satisfied: pytz>=2011k in ./venv3/lib/python3.6/site-packages (from pandas>=0.15.2->seaborn->SMPyBandits) (2018.7)
Requirement already satisfied: six>=1.5 in ./venv3/lib/python3.6/site-packages (from python-dateutil>=2.1->matplotlib>=2->SMPyBandits) (1.11.0)
Requirement already satisfied: ipython-genutils in ./venv3/lib/python3.6/site-packages (from traitlets>=4.2->ipython->watermark) (0.2.0)
Requirement already satisfied: parso>=0.3.0 in ./venv3/lib/python3.6/site-packages (from jedi>=0.10->ipython->watermark) (0.3.1)
Requirement already satisfied: ptyprocess>=0.5 in ./venv3/lib/python3.6/site-packages (from pexpect; sys_platform != "win32"->ipython->watermark) (0.6.0)
Requirement already satisfied: wcwidth in ./venv3/lib/python3.6/site-packages (from prompt-toolkit<2.1.0,>=2.0.0->ipython->watermark) (0.1.7)
Info: Using the Jupyter notebook version of the tqdm() decorator, tqdm_notebook() ...
Lilian Besson 

CPython 3.6.6
IPython 7.1.1

SMPyBandits 0.9.4

compiler   : GCC 8.0.1 20180414 (experimental) [trunk revision 259383]
system     : Linux
release    : 4.15.0-38-generic
machine    : x86_64
processor  : x86_64
CPU cores  : 4
interpreter: 64bit
In [3]:
# Local imports
from SMPyBandits.Environment import Evaluator, tqdm
from SMPyBandits.Environment.plotsettings import legend, makemarkers

We also need arms, for instance Bernoulli-distributed arms:

In [4]:
# Import arms
from SMPyBandits.Arms import Bernoulli
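
As a quick illustration of what such an arm produces (using plain numpy here, not the SMPyBandits API), a Bernoulli arm of mean $0.9$ yields $0/1$ rewards whose empirical mean approaches $0.9$:

# Illustration only (plain numpy, not the SMPyBandits Bernoulli class):
samples = np.random.random(10000) < 0.9   # 0/1 rewards, P(reward = 1) = 0.9
print(samples.mean())                     # approximately 0.9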

And finally we need some single-player Reinforcement Learning algorithms:

In [7]:
# Import algorithms
from SMPyBandits.Policies import *
In [8]:
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = (12.4, 7)

Creating the problem

Parameters for the simulation

  • $T = 20000$ is the time horizon,
  • $N = 40$ is the number of repetitions,
  • N_JOBS = 4 is the number of cores used to parallelize the code.
In [9]:
HORIZON = 20000
REPETITIONS = 40
N_JOBS = 4

A MAB problem with Bernoulli arms

In this example, we consider a single problem with $9$ Bernoulli arms of different means.

In [10]:
ENVIRONMENTS = [  # 1)  Bernoulli arms
        {   # A very easy problem, but it is used in a lot of articles
            "arm_type": Bernoulli,
            "params": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
        }
    ]

Some RL algorithms

We compare several policies wrapped in the DoublingTrickWrapper policy, combining each base policy (UCBH, MOSSH, klUCBPlusPlus, ApproximatedFHGittins) with one of two horizon-growth schemes (geometric and generic exponential), always with a full restart at each breakpoint: $4 \times 2 = 8$ configurations in total.

In [11]:
NEXT_HORIZONS = [
    # next_horizon__arithmetic,
    next_horizon__geometric,
    # next_horizon__exponential,
    # next_horizon__exponential_slow,
    next_horizon__exponential_generic
]
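
For intuition, here is a minimal sketch of the geometric scheme, under the assumption that next_horizon(i, T_0) maps the phase index $i$ and the first horizon $T_0$ to the horizon used in phase $i$; the formula $T_i = T_0 \times 2^i$ can be read off the policy labels printed below, and the generic exponential scheme grows much faster, roughly like $2^{2^i}$ (this is a hypothetical re-implementation, not the library code):

def sketch_next_horizon_geometric(i, first_horizon):
    # Each restart doubles the guessed horizon: T_i = T_0 * 2^i
    return int(first_horizon * 2 ** i)

# e.g. starting from T_0 = 100, the successive guessed horizons are
print([sketch_next_horizon_geometric(i, 100) for i in range(9)])
# [100, 200, 400, 800, 1600, 3200, 6400, 12800, 25600]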
In [12]:
POLICIES = [
    # --- Doubling trick algorithm
    {
        "archtype": DoublingTrickWrapper,
        "params": {
            "next_horizon": next_horizon,
            "full_restart": full_restart,
            "policy": policy,
        }
    }
    for policy in [
        UCBH,
        MOSSH,
        klUCBPlusPlus,
        ApproximatedFHGittins,
    ]
    for full_restart in [
        True,
        # False,
    ]
    for next_horizon in NEXT_HORIZONS
]

Complete configuration for the problem:

In [13]:
configuration = {
    # --- Duration of the experiment
    "horizon": HORIZON,
    # --- Number of repetition of the experiment (to have an average)
    "repetitions": REPETITIONS,
    # --- Parameters for the use of joblib.Parallel
    "n_jobs": N_JOBS,    # = nb of CPU cores
    "verbosity": 6,      # Max joblib verbosity
    # --- Arms
    "environment": ENVIRONMENTS,
    # --- Algorithms
    "policies": POLICIES,
}
configuration
Out[13]:
{'horizon': 20000,
 'repetitions': 40,
 'n_jobs': 4,
 'verbosity': 6,
 'environment': [{'arm_type': SMPyBandits.Arms.Bernoulli.Bernoulli,
   'params': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]}],
 'policies': [{'archtype': SMPyBandits.Policies.DoublingTrickWrapper.DoublingTrickWrapper,
   'params': {'next_horizon': CPUDispatcher(<function next_horizon__geometric at 0x7f18c1a70620>),
    'full_restart': True,
    'policy': SMPyBandits.Policies.UCBH.UCBH}},
  {'archtype': SMPyBandits.Policies.DoublingTrickWrapper.DoublingTrickWrapper,
   'params': {'next_horizon': <function SMPyBandits.Policies.DoublingTrickWrapper.next_horizon__exponential_generic(i, horizon)>,
    'full_restart': True,
    'policy': SMPyBandits.Policies.UCBH.UCBH}},
  {'archtype': SMPyBandits.Policies.DoublingTrickWrapper.DoublingTrickWrapper,
   'params': {'next_horizon': CPUDispatcher(<function next_horizon__geometric at 0x7f18c1a70620>),
    'full_restart': True,
    'policy': SMPyBandits.Policies.MOSSH.MOSSH}},
  {'archtype': SMPyBandits.Policies.DoublingTrickWrapper.DoublingTrickWrapper,
   'params': {'next_horizon': <function SMPyBandits.Policies.DoublingTrickWrapper.next_horizon__exponential_generic(i, horizon)>,
    'full_restart': True,
    'policy': SMPyBandits.Policies.MOSSH.MOSSH}},
  {'archtype': SMPyBandits.Policies.DoublingTrickWrapper.DoublingTrickWrapper,
   'params': {'next_horizon': CPUDispatcher(<function next_horizon__geometric at 0x7f18c1a70620>),
    'full_restart': True,
    'policy': SMPyBandits.Policies.klUCBPlusPlus.klUCBPlusPlus}},
  {'archtype': SMPyBandits.Policies.DoublingTrickWrapper.DoublingTrickWrapper,
   'params': {'next_horizon': <function SMPyBandits.Policies.DoublingTrickWrapper.next_horizon__exponential_generic(i, horizon)>,
    'full_restart': True,
    'policy': SMPyBandits.Policies.klUCBPlusPlus.klUCBPlusPlus}},
  {'archtype': SMPyBandits.Policies.DoublingTrickWrapper.DoublingTrickWrapper,
   'params': {'next_horizon': CPUDispatcher(<function next_horizon__geometric at 0x7f18c1a70620>),
    'full_restart': True,
    'policy': SMPyBandits.Policies.ApproximatedFHGittins.ApproximatedFHGittins}},
  {'archtype': SMPyBandits.Policies.DoublingTrickWrapper.DoublingTrickWrapper,
   'params': {'next_horizon': <function SMPyBandits.Policies.DoublingTrickWrapper.next_horizon__exponential_generic(i, horizon)>,
    'full_restart': True,
    'policy': SMPyBandits.Policies.ApproximatedFHGittins.ApproximatedFHGittins}}]}

Creating the Evaluator object

In [14]:
evaluation = Evaluator(configuration)
Number of policies in this comparison: 8
Time horizon: 20000
Number of repetitions: 40
Sampling rate for plotting, delta_t_plot: 50
Number of jobs for parallelization: 4
Using this dictionary to create a new environment:
 {'arm_type': <class 'SMPyBandits.Arms.Bernoulli.Bernoulli'>, 'params': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]}


Creating a new MAB problem ...
  Reading arms of this MAB problem from a dictionnary 'configuration' = {'arm_type': <class 'SMPyBandits.Arms.Bernoulli.Bernoulli'>, 'params': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]} ...
 - with 'arm_type' = <class 'SMPyBandits.Arms.Bernoulli.Bernoulli'>
 - with 'params' = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
 - with 'arms' = [B(0.1), B(0.2), B(0.3), B(0.4), B(0.5), B(0.6), B(0.7), B(0.8), B(0.9)]
 - with 'means' = [0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9]
 - with 'nbArms' = 9
 - with 'maxArm' = 0.9
 - with 'minArm' = 0.1

This MAB problem has: 
 - a [Lai & Robbins] complexity constant C(mu) = 7.52 ... 
 - a Optimal Arm Identification factor H_OI(mu) = 48.89% ...
 - with 'arms' represented as: $[B(0.1), B(0.2), B(0.3), B(0.4), B(0.5), B(0.6), B(0.7), B(0.8), B(0.9)^*]$
Number of environments to try: 1
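
The Lai & Robbins complexity constant reported above can be checked by hand: for Bernoulli arms it is $C(\mu) = \sum_{k : \mu_k < \mu^*} (\mu^* - \mu_k) / \mathrm{kl}(\mu_k, \mu^*)$, where $\mathrm{kl}$ is the binary Kullback-Leibler divergence. A small sketch, using only numpy rather than the library's own kullback module:

def kl_bernoulli(p, q):
    """Binary KL divergence kl(p, q) = p*log(p/q) + (1-p)*log((1-p)/(1-q))."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

means = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
best = np.max(means)
C_mu = sum((best - mu) / kl_bernoulli(mu, best) for mu in means if mu < best)
print(round(C_mu, 2))  # 7.52, matching the value printed above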

Solving the problem

Now we can simulate the environment. This part can take some time.

In [15]:
for envId, env in tqdm(enumerate(evaluation.envs), desc="Problems"):
    # Evaluate just that env
    evaluation.startOneEnv(envId, env)

Evaluating environment: MAB(nbArms: 9, arms: [B(0.1), B(0.2), B(0.3), B(0.4), B(0.5), B(0.6), B(0.7), B(0.8), B(0.9)], minArm: 0.1, maxArm: 0.9)
- Adding policy #1 = {'archtype': <class 'SMPyBandits.Policies.DoublingTrickWrapper.DoublingTrickWrapper'>, 'params': {'next_horizon': CPUDispatcher(<function next_horizon__geometric at 0x7f18c1a70620>), 'full_restart': True, 'policy': <class 'SMPyBandits.Policies.UCBH.UCBH'>}} ...
  Creating this policy from a dictionnary 'self.cfg['policies'][0]' = {'archtype': <class 'SMPyBandits.Policies.DoublingTrickWrapper.DoublingTrickWrapper'>, 'params': {'next_horizon': CPUDispatcher(<function next_horizon__geometric at 0x7f18c1a70620>), 'full_restart': True, 'policy': <class 'SMPyBandits.Policies.UCBH.UCBH'>}} ...
- Adding policy #2 = {'archtype': <class 'SMPyBandits.Policies.DoublingTrickWrapper.DoublingTrickWrapper'>, 'params': {'next_horizon': <function next_horizon__exponential_generic at 0x7f18c1a70ea0>, 'full_restart': True, 'policy': <class 'SMPyBandits.Policies.UCBH.UCBH'>}} ...
  Creating this policy from a dictionnary 'self.cfg['policies'][1]' = {'archtype': <class 'SMPyBandits.Policies.DoublingTrickWrapper.DoublingTrickWrapper'>, 'params': {'next_horizon': <function next_horizon__exponential_generic at 0x7f18c1a70ea0>, 'full_restart': True, 'policy': <class 'SMPyBandits.Policies.UCBH.UCBH'>}} ...
- Adding policy #3 = {'archtype': <class 'SMPyBandits.Policies.DoublingTrickWrapper.DoublingTrickWrapper'>, 'params': {'next_horizon': CPUDispatcher(<function next_horizon__geometric at 0x7f18c1a70620>), 'full_restart': True, 'policy': <class 'SMPyBandits.Policies.MOSSH.MOSSH'>}} ...
  Creating this policy from a dictionnary 'self.cfg['policies'][2]' = {'archtype': <class 'SMPyBandits.Policies.DoublingTrickWrapper.DoublingTrickWrapper'>, 'params': {'next_horizon': CPUDispatcher(<function next_horizon__geometric at 0x7f18c1a70620>), 'full_restart': True, 'policy': <class 'SMPyBandits.Policies.MOSSH.MOSSH'>}} ...
- Adding policy #4 = {'archtype': <class 'SMPyBandits.Policies.DoublingTrickWrapper.DoublingTrickWrapper'>, 'params': {'next_horizon': <function next_horizon__exponential_generic at 0x7f18c1a70ea0>, 'full_restart': True, 'policy': <class 'SMPyBandits.Policies.MOSSH.MOSSH'>}} ...
  Creating this policy from a dictionnary 'self.cfg['policies'][3]' = {'archtype': <class 'SMPyBandits.Policies.DoublingTrickWrapper.DoublingTrickWrapper'>, 'params': {'next_horizon': <function next_horizon__exponential_generic at 0x7f18c1a70ea0>, 'full_restart': True, 'policy': <class 'SMPyBandits.Policies.MOSSH.MOSSH'>}} ...
- Adding policy #5 = {'archtype': <class 'SMPyBandits.Policies.DoublingTrickWrapper.DoublingTrickWrapper'>, 'params': {'next_horizon': CPUDispatcher(<function next_horizon__geometric at 0x7f18c1a70620>), 'full_restart': True, 'policy': <class 'SMPyBandits.Policies.klUCBPlusPlus.klUCBPlusPlus'>}} ...
  Creating this policy from a dictionnary 'self.cfg['policies'][4]' = {'archtype': <class 'SMPyBandits.Policies.DoublingTrickWrapper.DoublingTrickWrapper'>, 'params': {'next_horizon': CPUDispatcher(<function next_horizon__geometric at 0x7f18c1a70620>), 'full_restart': True, 'policy': <class 'SMPyBandits.Policies.klUCBPlusPlus.klUCBPlusPlus'>}} ...
- Adding policy #6 = {'archtype': <class 'SMPyBandits.Policies.DoublingTrickWrapper.DoublingTrickWrapper'>, 'params': {'next_horizon': <function next_horizon__exponential_generic at 0x7f18c1a70ea0>, 'full_restart': True, 'policy': <class 'SMPyBandits.Policies.klUCBPlusPlus.klUCBPlusPlus'>}} ...
  Creating this policy from a dictionnary 'self.cfg['policies'][5]' = {'archtype': <class 'SMPyBandits.Policies.DoublingTrickWrapper.DoublingTrickWrapper'>, 'params': {'next_horizon': <function next_horizon__exponential_generic at 0x7f18c1a70ea0>, 'full_restart': True, 'policy': <class 'SMPyBandits.Policies.klUCBPlusPlus.klUCBPlusPlus'>}} ...
- Adding policy #7 = {'archtype': <class 'SMPyBandits.Policies.DoublingTrickWrapper.DoublingTrickWrapper'>, 'params': {'next_horizon': CPUDispatcher(<function next_horizon__geometric at 0x7f18c1a70620>), 'full_restart': True, 'policy': <class 'SMPyBandits.Policies.ApproximatedFHGittins.ApproximatedFHGittins'>}} ...
  Creating this policy from a dictionnary 'self.cfg['policies'][6]' = {'archtype': <class 'SMPyBandits.Policies.DoublingTrickWrapper.DoublingTrickWrapper'>, 'params': {'next_horizon': CPUDispatcher(<function next_horizon__geometric at 0x7f18c1a70620>), 'full_restart': True, 'policy': <class 'SMPyBandits.Policies.ApproximatedFHGittins.ApproximatedFHGittins'>}} ...
- Adding policy #8 = {'archtype': <class 'SMPyBandits.Policies.DoublingTrickWrapper.DoublingTrickWrapper'>, 'params': {'next_horizon': <function next_horizon__exponential_generic at 0x7f18c1a70ea0>, 'full_restart': True, 'policy': <class 'SMPyBandits.Policies.ApproximatedFHGittins.ApproximatedFHGittins'>}} ...
  Creating this policy from a dictionnary 'self.cfg['policies'][7]' = {'archtype': <class 'SMPyBandits.Policies.DoublingTrickWrapper.DoublingTrickWrapper'>, 'params': {'next_horizon': <function next_horizon__exponential_generic at 0x7f18c1a70ea0>, 'full_restart': True, 'policy': <class 'SMPyBandits.Policies.ApproximatedFHGittins.ApproximatedFHGittins'>}} ...



- Evaluating policy #1/8: DT($T_i = 200 \times 2^i$)[UCB-H($\alpha=4$)] ...
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   5 tasks      | elapsed:    5.5s
[Parallel(n_jobs=4)]: Done  40 out of  40 | elapsed:   23.1s remaining:    0.0s
[Parallel(n_jobs=4)]: Done  40 out of  40 | elapsed:   23.1s finished


- Evaluating policy #2/8: DT($T_i = (200/2) 2^{2^i}$)[UCB-H($\alpha=4$)] ...
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   5 tasks      | elapsed:    5.0s
[Parallel(n_jobs=4)]: Done  40 out of  40 | elapsed:   21.6s remaining:    0.0s
[Parallel(n_jobs=4)]: Done  40 out of  40 | elapsed:   21.6s finished


- Evaluating policy #3/8: DT($T_i = 200 \times 2^i$)[MOSS-H] ...
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   5 tasks      | elapsed:    2.7s
[Parallel(n_jobs=4)]: Done  40 out of  40 | elapsed:   14.3s remaining:    0.0s
[Parallel(n_jobs=4)]: Done  40 out of  40 | elapsed:   14.3s finished


- Evaluating policy #4/8: DT($T_i = (200/2) 2^{2^i}$)[MOSS-H] ...
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   5 tasks      | elapsed:    2.6s
[Parallel(n_jobs=4)]: Done  40 out of  40 | elapsed:   14.0s remaining:    0.0s
[Parallel(n_jobs=4)]: Done  40 out of  40 | elapsed:   14.0s finished


- Evaluating policy #5/8: DT($T_i = 200 \times 2^i$)[kl-UCB$^{++}$] ...
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   5 tasks      | elapsed:    6.1s
[Parallel(n_jobs=4)]: Done  40 out of  40 | elapsed:   35.7s remaining:    0.0s
[Parallel(n_jobs=4)]: Done  40 out of  40 | elapsed:   35.7s finished


- Evaluating policy #6/8: DT($T_i = (200/2) 2^{2^i}$)[kl-UCB$^{++}$] ...
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   5 tasks      | elapsed:    6.9s
[Parallel(n_jobs=4)]: Done  40 out of  40 | elapsed:   33.1s remaining:    0.0s
[Parallel(n_jobs=4)]: Done  40 out of  40 | elapsed:   33.1s finished


- Evaluating policy #7/8: DT($T_i = 200 \times 2^i$)[ApprFHG] ...
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   5 tasks      | elapsed:    3.0s
[Parallel(n_jobs=4)]: Done  40 out of  40 | elapsed:   25.4s remaining:    0.0s
[Parallel(n_jobs=4)]: Done  40 out of  40 | elapsed:   25.4s finished


- Evaluating policy #8/8: DT($T_i = (200/2) 2^{2^i}$)[ApprFHG] ...
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   5 tasks      | elapsed:    3.4s
[Parallel(n_jobs=4)]: Done  40 out of  40 | elapsed:   18.1s remaining:    0.0s
[Parallel(n_jobs=4)]: Done  40 out of  40 | elapsed:   18.1s finished

Plotting the results

And finally, we visualize the results with the plotting methods of the Evaluator object:

In [16]:
def plotAll(evaluation, envId):
    evaluation.printFinalRanking(envId)
    fig = evaluation.plotRegrets(envId)
    # evaluation.plotRegrets(envId, semilogx=True)
    # evaluation.plotRegrets(envId, meanRegret=True)
    # evaluation.plotBestArmPulls(envId)
    return fig
In [17]:
fig = plotAll(evaluation, 0)
Final ranking for this environment #0 :
- Policy 'DT($T_i = (200/2) 2^{2^i}$)[kl-UCB$^{++}$]'	was ranked	1 / 8 for this simulation (last regret = 144.11).
- Policy 'DT($T_i = (200/2) 2^{2^i}$)[MOSS-H]'	was ranked	2 / 8 for this simulation (last regret = 213.66).
- Policy 'DT($T_i = (200/2) 2^{2^i}$)[ApprFHG]'	was ranked	3 / 8 for this simulation (last regret = 239.97).
- Policy 'DT($T_i = 200 \times 2^i$)[kl-UCB$^{++}$]'	was ranked	4 / 8 for this simulation (last regret = 323.72).
- Policy 'DT($T_i = 200 \times 2^i$)[MOSS-H]'	was ranked	5 / 8 for this simulation (last regret = 500.27).
- Policy 'DT($T_i = 200 \times 2^i$)[ApprFHG]'	was ranked	6 / 8 for this simulation (last regret = 539.85).
- Policy 'DT($T_i = (200/2) 2^{2^i}$)[UCB-H($\alpha=4$)]'	was ranked	7 / 8 for this simulation (last regret = 636.46).
- Policy 'DT($T_i = 200 \times 2^i$)[UCB-H($\alpha=4$)]'	was ranked	8 / 8 for this simulation (last regret = 1363.4).

This MAB problem has: 
 - a [Lai & Robbins] complexity constant C(mu) = 7.52 for 1-player problem... 
 - a Optimal Arm Identification factor H_OI(mu) = 48.89% ...
Warning: forcing to use putatright = True because there is 9 items in the legend.

Visualising the lower-bound for algorithms that restart at breakpoints
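
The idea behind the next cell is the following: an algorithm that does a full restart at each breakpoint $T_0 < T_1 < \dots$ throws away everything it has learned, so the Lai & Robbins bound applies afresh on every phase. Instead of $R_T \geq C(\mu) \log T$, the regret is then (heuristically) lower-bounded by a curve that restarts the logarithm at every breakpoint, roughly $R_t \gtrsim C(\mu) \big( \sum_{i : T_i \leq t} \log(T_i - T_{i-1}) + \log(t - T_{i(t)}) \big)$, where $T_{i(t)}$ is the last breakpoint before $t$ (with $T_{-1} = 0$). The function below computes and plots this crude estimate.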

In [18]:
DEFAULT_FIRST_HORIZON = 100

def lower_bound_with_breakpoints(next_horizon, horizon, env,
                                 first_horizon=DEFAULT_FIRST_HORIZON,
                                 fig=None, marker=None):
    points, gap = breakpoints(next_horizon, first_horizon, horizon)
    X = np.arange(1, horizon)
    Y = np.log(X)
    # Crude estimate: restart the log curve at each breakpoint
    for estimate_horizon in points:
        if estimate_horizon <= horizon:
            before_breakpoint = np.max(np.where(X == estimate_horizon - 1)[0])
            lower_bound_before_breakpoint = Y[before_breakpoint]
            print("At time {}, lowerbound was {}".format(estimate_horizon, lower_bound_before_breakpoint))
            after = np.where(X >= estimate_horizon)
            Y[after] = np.log(X[after] - X[before_breakpoint]) + lower_bound_before_breakpoint
    if fig is None:  # new figure if needed
        fig, ax = plt.subplots()
        ax.set_xlabel("Time steps t=1..T, $T = {}$".format(horizon))
        ax.set_ylabel("Regret lower-bound")
        ax.set_title("Lai & Robbins lower-bound for problem with $K={}$ arms and $C_K={:.3g}$\nAnd doubling trick with restart points ({})".format(env.nbArms, env.lowerbound(), next_horizon.__latex_name__))
    else:
        ax = fig.axes[0]
        # https://stackoverflow.com/a/26845924/
        ax_legend = ax.legend()
        ax_legend.remove()
    complexity = env.lowerbound()
    ax.plot(X, complexity * Y,
            'k--' if marker is None else '{}k--'.format(marker),
            markevery=(0.0, 0.1),
            label="LB, DT restart ({})".format(next_horizon.__latex_name__))
    legend(fig=fig)
    fig.show()
    return fig
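
The helper breakpoints(next_horizon, first_horizon, horizon) comes from SMPyBandits.Policies.DoublingTrickWrapper and is not reproduced here; a rough sketch of what it is assumed to return (the list of successive restart times, plus a second value that the function above ignores) could look like this hypothetical re-implementation:

def sketch_breakpoints(next_horizon, first_horizon, horizon):
    # Assumption: next_horizon(i, T_0) gives the horizon of phase i from the first horizon T_0.
    # Collect the successive doubling-trick horizons T_0, T_1, ... until the true horizon is passed.
    t, i, times = first_horizon, 0, [first_horizon]
    while t < horizon:
        i += 1
        t = next_horizon(i, first_horizon)
        times.append(t)
    return times, t - horizon  # second value (final overshoot) is unused above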
In [19]:
_ = lower_bound_with_breakpoints(next_horizon__exponential_generic, HORIZON, evaluation.envs[0])
At time 100, lowerbound was 4.59511985013459
At time 200, lowerbound was 9.20029003612268
At time 400, lowerbound was 14.498607402670718
At time 1600, lowerbound was 21.58868423844681
Warning: forcing to use putatright = False because there is 1 items in the legend.
/home/lilian/ownCloud/owncloud.crans.org/Crans/These_2016-17/src/SMPyBandits/notebooks/venv3/lib/python3.6/site-packages/matplotlib/legend.py:508: UserWarning: Automatic legend placement (loc="best") not implemented for figure legend. Falling back on "upper right".
  warnings.warn('Automatic legend placement (loc="best") not '
/home/lilian/ownCloud/owncloud.crans.org/Crans/These_2016-17/src/SMPyBandits/notebooks/venv3/lib/python3.6/site-packages/matplotlib/figure.py:445: UserWarning: Matplotlib is currently using module://ipykernel.pylab.backend_inline, which is a non-GUI backend, so cannot show the figure.
  % get_backend())

Seeing the lower-bound on the regret plot

In [20]:
fig = plotAll(evaluation, 0)
Final ranking for this environment #0 :
- Policy 'DT($T_i = (200/2) 2^{2^i}$)[kl-UCB$^{++}$]'	was ranked	1 / 8 for this simulation (last regret = 144.11).
- Policy 'DT($T_i = (200/2) 2^{2^i}$)[MOSS-H]'	was ranked	2 / 8 for this simulation (last regret = 213.66).
- Policy 'DT($T_i = (200/2) 2^{2^i}$)[ApprFHG]'	was ranked	3 / 8 for this simulation (last regret = 239.97).
- Policy 'DT($T_i = 200 \times 2^i$)[kl-UCB$^{++}$]'	was ranked	4 / 8 for this simulation (last regret = 323.72).
- Policy 'DT($T_i = 200 \times 2^i$)[MOSS-H]'	was ranked	5 / 8 for this simulation (last regret = 500.27).
- Policy 'DT($T_i = 200 \times 2^i$)[ApprFHG]'	was ranked	6 / 8 for this simulation (last regret = 539.85).
- Policy 'DT($T_i = (200/2) 2^{2^i}$)[UCB-H($\alpha=4$)]'	was ranked	7 / 8 for this simulation (last regret = 636.46).
- Policy 'DT($T_i = 200 \times 2^i$)[UCB-H($\alpha=4$)]'	was ranked	8 / 8 for this simulation (last regret = 1363.4).

This MAB problem has: 
 - a [Lai & Robbins] complexity constant C(mu) = 7.52 for 1-player problem... 
 - a Optimal Arm Identification factor H_OI(mu) = 48.89% ...
Warning: forcing to use putatright = True because there is 9 items in the legend.
In [21]:
markers = makemarkers(len(NEXT_HORIZONS))
In [22]:
for i, next_horizon in enumerate(NEXT_HORIZONS):
    fig = lower_bound_with_breakpoints(next_horizon, HORIZON, evaluation.envs[0], fig=fig, marker=markers[i])
At time 100, lowerbound was 4.59511985013459
At time 200, lowerbound was 9.20029003612268
At time 400, lowerbound was 14.498607402670718
At time 800, lowerbound was 20.4900719497787
At time 1600, lowerbound was 27.174683677446627
At time 3200, lowerbound was 34.5524425856745
At time 6400, lowerbound was 42.62334867446232
At time 12800, lowerbound was 51.38740194381008
Warning: forcing to use putatright = False because there is 0 items in the legend.
At time 100, lowerbound was 4.59511985013459
At time 200, lowerbound was 9.20029003612268
At time 400, lowerbound was 14.498607402670718
At time 1600, lowerbound was 21.58868423844681
Warning: forcing to use putatright = False because there is 0 items in the legend.
/home/lilian/ownCloud/owncloud.crans.org/Crans/These_2016-17/src/SMPyBandits/notebooks/venv3/lib/python3.6/site-packages/matplotlib/legend.py:508: UserWarning: Automatic legend placement (loc="best") not implemented for figure legend. Falling back on "upper right".
  warnings.warn('Automatic legend placement (loc="best") not '
/home/lilian/ownCloud/owncloud.crans.org/Crans/These_2016-17/src/SMPyBandits/notebooks/venv3/lib/python3.6/site-packages/matplotlib/figure.py:445: UserWarning: Matplotlib is currently using module://ipykernel.pylab.backend_inline, which is a non-GUI backend, so cannot show the figure.
  % get_backend())
In [23]:
fig
Out[23]:

Conclusion

That's it for today, folks!