Policies.Softmax module¶
The Boltzmann Exploration (Softmax) index policy.
Reference: [Algorithms for the multi-armed bandit problem, V.Kuleshov & D.Precup, JMLR, 2008, §2.1](http://www.cs.mcgill.ca/~vkules/bandits.pdf) and [Boltzmann Exploration Done Right, N.Cesa-Bianchi & C.Gentile & G.Lugosi & G.Neu, arXiv 2017](https://arxiv.org/pdf/1705.10257.pdf).
Very similar to Exp3 but uses a Boltzmann distribution. Reference: [Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, S.Bubeck & N.Cesa-Bianchi, §3.1](http://sbubeck.com/SurveyBCB12.pdf)
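For orientation, here is a minimal usage sketch of this policy. It assumes the usual SMPyBandits policy interface inherited from BasePolicy (startGame() and getReward(arm, reward), which are not documented in this section) and uses made-up arm means:

```python
import numpy as np
from Policies.Softmax import Softmax

K = 5                                  # number of arms
means = np.linspace(0.1, 0.9, K)       # hypothetical Bernoulli arm means
policy = Softmax(K, temperature=0.1)   # constant-temperature Boltzmann exploration

policy.startGame()
for t in range(1000):
    arm = policy.choice()                          # one draw with probabilities = trusts
    reward = float(np.random.rand() < means[arm])  # simulated Bernoulli reward
    policy.getReward(arm, reward)                  # feed the reward back to the policy

print(policy.estimatedOrder())  # permutation ordering arms by increasing trust
```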
Policies.Softmax.UNBIASED = False¶
self.unbiased is a flag to know if the rewards are used as biased estimators, i.e., just r_t, or as unbiased estimators, r_t / trusts_t.
class Policies.Softmax.Softmax(nbArms, temperature=None, unbiased=False, lower=0.0, amplitude=1.0)[source]¶
Bases: Policies.BasePolicy.BasePolicy
The Boltzmann Exploration (Softmax) index policy, with a constant temperature η_t.
Reference: [Algorithms for the multi-armed bandit problem, V.Kuleshov & D.Precup, JMLR, 2008, §2.1](http://www.cs.mcgill.ca/~vkules/bandits.pdf) and [Boltzmann Exploration Done Right, N.Cesa-Bianchi & C.Gentile & G.Lugosi & G.Neu, arXiv 2017](https://arxiv.org/pdf/1705.10257.pdf).
Very similar to Exp3 but uses a Boltzmann distribution. Reference: [Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, S.Bubeck & N.Cesa-Bianchi, §3.1](http://sbubeck.com/SurveyBCB12.pdf)
unbiased = None¶
Flag to know if the rewards are used as biased or as unbiased estimators (see UNBIASED).
property temperature¶
Constant temperature, η_t.
property trusts¶
Update the trusts probabilities according to the Softmax (i.e., Boltzmann) distribution on accumulated rewards, with the temperature η_t:

\[\mathrm{trusts}'_k(t+1) = \exp\left(\frac{X_k(t)}{\eta_t N_k(t)}\right), \qquad \mathrm{trusts}(t+1) = \frac{\mathrm{trusts}'(t+1)}{\sum_{k=1}^{K} \mathrm{trusts}'_k(t+1)},\]

where \(X_k(t) = \sum_{\sigma=1}^{t} \mathbb{1}(A(\sigma)=k)\, r_k(\sigma)\) is the sum of rewards from arm k.
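As a plain-numpy illustration of this formula (a sketch, not the library's code), with rewards standing for the cumulative rewards X_k(t) and pulls for the counts N_k(t):

```python
import numpy as np

def boltzmann_trusts(rewards, pulls, eta):
    """Softmax / Boltzmann probabilities exp(X_k / (eta * N_k)), normalized over the K arms."""
    pulls = np.maximum(pulls, 1)              # guard against division by zero for unplayed arms
    logits = rewards / (eta * pulls)          # X_k(t) / (eta_t * N_k(t))
    logits -= logits.max()                    # numerical stability; does not change the softmax
    unnormalized = np.exp(logits)             # trusts'_k(t+1)
    return unnormalized / unnormalized.sum()  # trusts_k(t+1)

trusts = boltzmann_trusts(np.array([3.0, 5.0, 1.0]), np.array([10, 12, 4]), eta=0.1)
print(trusts)                                  # higher empirical mean => higher probability
arm = np.random.choice(len(trusts), p=trusts)  # one random selection, as in choice()
```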
choice()[source]¶
One random selection, with probabilities = trusts, thanks to numpy.random.choice().
choiceWithRank(rank=1)[source]¶
Multiple (rank >= 1) random selections, with probabilities = trusts, thanks to numpy.random.choice(), and select the last one (the least probable one).
Note that if not enough entries in the trust vector are non-zero, then choice() is called instead (rank is ignored).
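A sketch of that behaviour with plain numpy (an illustration of the description above, not necessarily the library's exact code):

```python
import numpy as np

def choice_with_rank(trusts, rank=1, rng=np.random):
    """Draw `rank` distinct arms with probabilities = trusts and keep the last one drawn."""
    picks = rng.choice(len(trusts), size=rank, replace=False, p=trusts)
    return picks[-1]

trusts = np.array([0.6, 0.25, 0.1, 0.05])
print(choice_with_rank(trusts, rank=2))  # usually not arm 0, since the best arm tends to be drawn first
```

If fewer than rank entries of trusts are non-zero, sampling without replacement cannot produce rank distinct arms, which is why the documented fallback to choice() exists.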
choiceFromSubSet(availableArms='all')[source]¶
One random selection, from availableArms, with probabilities = trusts, thanks to numpy.random.choice().
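A sketch of the natural way to do this (the renormalization over the subset is an assumption for illustration, not verified against the source):

```python
import numpy as np

def choice_from_subset(trusts, available_arms, rng=np.random):
    """Pick one arm among available_arms, with probabilities proportional to their trusts."""
    available_arms = np.asarray(available_arms)
    p = trusts[available_arms]
    p = p / p.sum()                                            # renormalize over the subset
    return int(available_arms[rng.choice(len(available_arms), p=p)])

trusts = np.array([0.5, 0.3, 0.15, 0.05])
print(choice_from_subset(trusts, [1, 2, 3]))                   # arm 0 is excluded here
```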
choiceMultiple(nb=1)[source]¶
Multiple (nb >= 1) random selections, with probabilities = trusts, thanks to numpy.random.choice().
estimatedOrder()[source]¶
Return the estimated order of the arms, as a permutation on [0..K-1] that would order the arms by increasing trust probabilities.
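With numpy this ordering is simply an argsort of the trusts vector (a small sketch):

```python
import numpy as np

trusts = np.array([0.15, 0.5, 0.1, 0.25])
order = np.argsort(trusts)   # permutation of [0..K-1] by increasing trust
print(order)                 # [2 0 3 1]: arm 2 is trusted least, arm 1 the most
```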
__module__ = 'Policies.Softmax'¶
class Policies.Softmax.SoftmaxWithHorizon(nbArms, horizon, lower=0.0, amplitude=1.0)[source]¶
Bases: Policies.Softmax.Softmax
Softmax with a fixed temperature η_t = η_0, chosen with knowledge of the horizon.
horizon = None¶
Parameter T = known horizon of the experiment.
property temperature¶
Fixed temperature, small, knowing the horizon: \(\eta_t = \sqrt{\frac{2 \log K}{T K}}\) (heuristic).
Cf. Theorem 3.1, case #1 of [Bubeck & Cesa-Bianchi, 2012](http://sbubeck.com/SurveyBCB12.pdf).
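A worked example of this heuristic, with illustrative values of K and T:

```python
import numpy as np

K, T = 10, 10000                        # number of arms and known horizon
eta = np.sqrt(2 * np.log(K) / (T * K))  # eta_t = sqrt(2 log(K) / (T K)), constant over time
print(round(eta, 4))                    # 0.0068
```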
__module__ = 'Policies.Softmax'¶
class Policies.Softmax.SoftmaxDecreasing(nbArms, temperature=None, unbiased=False, lower=0.0, amplitude=1.0)[source]¶
Bases: Policies.Softmax.Softmax
Softmax with a decreasing temperature η_t.
property temperature¶
Decreasing temperature with time: \(\eta_t = \sqrt{\frac{\log K}{t K}}\) (heuristic).
Cf. Theorem 3.1, case #2 of [Bubeck & Cesa-Bianchi, 2012](http://sbubeck.com/SurveyBCB12.pdf).
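For illustration, how this anytime schedule shrinks with t (K = 10):

```python
import numpy as np

K = 10
for t in [10, 100, 1000, 10000]:
    eta_t = np.sqrt(np.log(K) / (t * K))  # eta_t = sqrt(log(K) / (t K))
    print(t, round(eta_t, 4))             # 0.1517, 0.048, 0.0152, 0.0048
```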
__module__ = 'Policies.Softmax'¶
class
Policies.Softmax.
SoftMix
(nbArms, temperature=None, unbiased=False, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.Softmax.Softmax
Another Softmax with decreasing temperature ηt.
__module__ = 'Policies.Softmax'¶
property temperature¶
Decreasing temperature with time: \(\eta_t = c \frac{\log t}{t}\) (heuristic).
Cf. [Cesa-Bianchi & Fischer, 1998](http://dl.acm.org/citation.cfm?id=657473).
Default value is \(c = \sqrt{\frac{\log K}{K}}\).
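A worked example with this default constant (K = 10):

```python
import numpy as np

K = 10
c = np.sqrt(np.log(K) / K)       # default c = sqrt(log(K) / K)
for t in [10, 100, 1000]:
    eta_t = c * np.log(t) / t    # eta_t = c log(t) / t
    print(t, round(eta_t, 4))    # 0.1105, 0.0221, 0.0033
```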