Policies.Softmax module¶
The Boltzmann Exploration (Softmax) index policy.
Reference: [Algorithms for the multi-armed bandit problem, V.Kuleshov & D.Precup, JMLR, 2008, §2.1](http://www.cs.mcgill.ca/~vkules/bandits.pdf) and [Boltzmann Exploration Done Right, N.Cesa-Bianchi & C.Gentile & G.Lugosi & G.Neu, arXiv 2017](https://arxiv.org/pdf/1705.10257.pdf).
Very similar to Exp3 but uses a Boltzmann distribution. Reference: [Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, S.Bubeck & N.Cesa-Bianchi, §3.1](http://sbubeck.com/SurveyBCB12.pdf)
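For orientation, here is a minimal usage sketch of this policy. It assumes the usual SMPyBandits policy interface inherited from BasePolicy (startGame() and getReward(arm, reward), which are not documented in this section) and uses made-up arm means:

```python
import numpy as np
from Policies.Softmax import Softmax

K = 5                                  # number of arms
means = np.linspace(0.1, 0.9, K)       # hypothetical Bernoulli arm means
policy = Softmax(K, temperature=0.1)   # constant-temperature Boltzmann exploration

policy.startGame()
for t in range(1000):
    arm = policy.choice()                          # one draw with probabilities = trusts
    reward = float(np.random.rand() < means[arm])  # simulated Bernoulli reward
    policy.getReward(arm, reward)                  # feed the reward back to the policy

print(policy.estimatedOrder())  # permutation ordering arms by increasing trust
```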
Policies.Softmax.UNBIASED = False¶
self.unbiased is a flag to know if the rewards are used as biased estimators, i.e., just r_t, or as unbiased estimators, r_t / trusts_t.
class Policies.Softmax.Softmax(nbArms, temperature=None, unbiased=False, lower=0.0, amplitude=1.0)[source]¶
Bases: Policies.BasePolicy.BasePolicy
The Boltzmann Exploration (Softmax) index policy, with a constant temperature η_t.
Reference: [Algorithms for the multi-armed bandit problem, V.Kuleshov & D.Precup, JMLR, 2008, §2.1](http://www.cs.mcgill.ca/~vkules/bandits.pdf) and [Boltzmann Exploration Done Right, N.Cesa-Bianchi & C.Gentile & G.Lugosi & G.Neu, arXiv 2017](https://arxiv.org/pdf/1705.10257.pdf).
Very similar to Exp3 but uses a Boltzmann distribution. Reference: [Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, S.Bubeck & N.Cesa-Bianchi, §3.1](http://sbubeck.com/SurveyBCB12.pdf)
unbiased = None¶
Flag to know if the rewards are used as biased or as unbiased estimators (see UNBIASED).
property temperature¶
Constant temperature, η_t.
property trusts¶
Update the trusts probabilities according to the Softmax (i.e., Boltzmann) distribution on accumulated rewards, with the temperature η_t:

\[\mathrm{trusts}'_k(t+1) = \exp\left(\frac{X_k(t)}{\eta_t N_k(t)}\right), \qquad \mathrm{trusts}(t+1) = \frac{\mathrm{trusts}'(t+1)}{\sum_{k=1}^{K} \mathrm{trusts}'_k(t+1)},\]

where \(X_k(t) = \sum_{\sigma=1}^{t} \mathbb{1}(A(\sigma)=k)\, r_k(\sigma)\) is the sum of rewards from arm k.
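As a plain-numpy illustration of this formula (a sketch, not the library's code), with rewards standing for the cumulative rewards X_k(t) and pulls for the counts N_k(t):

```python
import numpy as np

def boltzmann_trusts(rewards, pulls, eta):
    """Softmax / Boltzmann probabilities exp(X_k / (eta * N_k)), normalized over the K arms."""
    pulls = np.maximum(pulls, 1)              # guard against division by zero for unplayed arms
    logits = rewards / (eta * pulls)          # X_k(t) / (eta_t * N_k(t))
    logits -= logits.max()                    # numerical stability; does not change the softmax
    unnormalized = np.exp(logits)             # trusts'_k(t+1)
    return unnormalized / unnormalized.sum()  # trusts_k(t+1)

trusts = boltzmann_trusts(np.array([3.0, 5.0, 1.0]), np.array([10, 12, 4]), eta=0.1)
print(trusts)                                  # higher empirical mean => higher probability
arm = np.random.choice(len(trusts), p=trusts)  # one random selection, as in choice()
```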
choice()[source]¶
One random selection, with probabilities = trusts, thanks to numpy.random.choice().
choiceWithRank(rank=1)[source]¶
Multiple (rank >= 1) random selections, with probabilities = trusts, thanks to numpy.random.choice(), and select the last one (the least probable one).
Note that if not enough entries in the trust vector are non-zero, then choice() is called instead (rank is ignored).
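A sketch of that behaviour with plain numpy (an illustration of the description above, not necessarily the library's exact code):

```python
import numpy as np

def choice_with_rank(trusts, rank=1, rng=np.random):
    """Draw `rank` distinct arms with probabilities = trusts and keep the last one drawn."""
    picks = rng.choice(len(trusts), size=rank, replace=False, p=trusts)
    return picks[-1]

trusts = np.array([0.6, 0.25, 0.1, 0.05])
print(choice_with_rank(trusts, rank=2))  # usually not arm 0, since the best arm tends to be drawn first
```

If fewer than rank entries of trusts are non-zero, sampling without replacement cannot produce rank distinct arms, which is why the documented fallback to choice() exists.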
choiceFromSubSet(availableArms='all')[source]¶
One random selection, from availableArms, with probabilities = trusts, thanks to numpy.random.choice().
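A sketch of the natural way to do this (the renormalization over the subset is an assumption for illustration, not verified against the source):

```python
import numpy as np

def choice_from_subset(trusts, available_arms, rng=np.random):
    """Pick one arm among available_arms, with probabilities proportional to their trusts."""
    available_arms = np.asarray(available_arms)
    p = trusts[available_arms]
    p = p / p.sum()                                            # renormalize over the subset
    return int(available_arms[rng.choice(len(available_arms), p=p)])

trusts = np.array([0.5, 0.3, 0.15, 0.05])
print(choice_from_subset(trusts, [1, 2, 3]))                   # arm 0 is excluded here
```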
choiceMultiple(nb=1)[source]¶
Multiple (nb >= 1) random selections, with probabilities = trusts, thanks to numpy.random.choice().
estimatedOrder()[source]¶
Return the estimated order of the arms, as a permutation on [0..K-1] that would order the arms by increasing trust probabilities.
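With numpy this ordering is simply an argsort of the trusts vector (a small sketch):

```python
import numpy as np

trusts = np.array([0.15, 0.5, 0.1, 0.25])
order = np.argsort(trusts)   # permutation of [0..K-1] by increasing trust
print(order)                 # [2 0 3 1]: arm 2 is trusted least, arm 1 the most
```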
__module__ = 'Policies.Softmax'¶
class Policies.Softmax.SoftmaxWithHorizon(nbArms, horizon, lower=0.0, amplitude=1.0)[source]¶
Bases: Policies.Softmax.Softmax
Softmax with a fixed temperature η_t = η_0, chosen with knowledge of the horizon.
horizon = None¶
Parameter T = known horizon of the experiment.
property temperature¶
Fixed temperature, small, knowing the horizon: \(\eta_t = \sqrt{\frac{2 \log K}{T K}}\) (heuristic).
Cf. Theorem 3.1, case #1 of [Bubeck & Cesa-Bianchi, 2012](http://sbubeck.com/SurveyBCB12.pdf).
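A worked example of this heuristic, with illustrative values of K and T:

```python
import numpy as np

K, T = 10, 10000                        # number of arms and known horizon
eta = np.sqrt(2 * np.log(K) / (T * K))  # eta_t = sqrt(2 log(K) / (T K)), constant over time
print(round(eta, 4))                    # 0.0068
```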
__module__ = 'Policies.Softmax'¶
class Policies.Softmax.SoftmaxDecreasing(nbArms, temperature=None, unbiased=False, lower=0.0, amplitude=1.0)[source]¶
Bases: Policies.Softmax.Softmax
Softmax with a decreasing temperature η_t.
property temperature¶
Decreasing temperature with time: \(\eta_t = \sqrt{\frac{\log K}{t K}}\) (heuristic).
Cf. Theorem 3.1, case #2 of [Bubeck & Cesa-Bianchi, 2012](http://sbubeck.com/SurveyBCB12.pdf).
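For illustration, how this anytime schedule shrinks with t (K = 10):

```python
import numpy as np

K = 10
for t in [10, 100, 1000, 10000]:
    eta_t = np.sqrt(np.log(K) / (t * K))  # eta_t = sqrt(log(K) / (t K))
    print(t, round(eta_t, 4))             # 0.1517, 0.048, 0.0152, 0.0048
```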
__module__ = 'Policies.Softmax'¶
class
Policies.Softmax.
SoftMix
(nbArms, temperature=None, unbiased=False, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.Softmax.Softmax
Another Softmax with decreasing temperature ηt.
__module__ = 'Policies.Softmax'¶
property temperature¶
Decreasing temperature with time: \(\eta_t = c \frac{\log t}{t}\) (heuristic).
Cf. [Cesa-Bianchi & Fischer, 1998](http://dl.acm.org/citation.cfm?id=657473).
Default value is \(c = \sqrt{\frac{\log K}{K}}\).
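A worked example with this default constant (K = 10):

```python
import numpy as np

K = 10
c = np.sqrt(np.log(K) / K)       # default c = sqrt(log(K) / K)
for t in [10, 100, 1000]:
    eta_t = c * np.log(t) / t    # eta_t = c log(t) / t
    print(t, round(eta_t, 4))    # 0.1105, 0.0221, 0.0033
```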