Policies.Softmax module

The Boltzmann Exploration (Softmax) index policy.

Policies.Softmax.UNBIASED = False

self.unbiased is a flag to know if the rewards are used as a biased estimator, i.e., just $r_t$, or as unbiased estimators, $r_t / \mathrm{trusts}_t$.
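
A minimal sketch of the difference (the function and argument names are illustrative, not the module's internals; prob stands for trusts[arm], the probability with which the arm was just picked):

    def reward_estimate(reward, prob, unbiased=False):
        # Importance weighting: dividing by the probability of having picked
        # the arm makes the estimate unbiased in expectation.
        if unbiased:
            return reward / prob
        # Otherwise the raw reward, a biased estimator under Softmax sampling.
        return reward

    reward_estimate(0.7, 0.35, unbiased=True)  # -> 2.0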

class Policies.Softmax.Softmax(nbArms, temperature=None, unbiased=False, lower=0.0, amplitude=1.0)[source]

Bases: Policies.BasePolicy.BasePolicy

The Boltzmann Exploration (Softmax) index policy, with a constant temperature $\eta_t$.

__init__(nbArms, temperature=None, unbiased=False, lower=0.0, amplitude=1.0)[source]

New policy.

unbiased = None

Flag to know if the rewards are used as biased or unbiased estimators (see UNBIASED above).

startGame()[source]

Nothing special to do.

__str__() -> str[source]

property temperature

Constant temperature, $\eta_t$.

property trusts

Update the trusts probabilities according to the Softmax (i.e., Boltzmann) distribution on accumulated rewards, with the temperature $\eta_t$:

$$\mathrm{trusts}'_k(t+1) = \exp\left( \frac{X_k(t)}{\eta_t N_k(t)} \right), \qquad \mathrm{trusts}(t+1) = \frac{\mathrm{trusts}'(t+1)}{\sum_{k=1}^{K} \mathrm{trusts}'_k(t+1)},$$

where $X_k(t) = \sum_{\sigma=1}^{t} \mathbb{1}(A(\sigma) = k) \, r_k(\sigma)$ is the sum of rewards from arm $k$, and $N_k(t)$ is the number of times arm $k$ was pulled.
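
A rough numpy sketch of this update, not the module's exact code (the max-shift for numerical stability is an added assumption; it leaves the normalized distribution unchanged):

    import numpy as np

    def softmax_trusts(X, N, eta):
        """trusts'_k = exp(X_k / (eta * N_k)), then normalized to sum to 1."""
        scores = X / (eta * np.maximum(N, 1))  # guard against N_k = 0 on unpulled arms
        scores -= scores.max()                 # stability shift, cancels after normalization
        trusts = np.exp(scores)
        return trusts / trusts.sum()

    # Example: 3 arms, constant temperature eta = 0.5
    trusts = softmax_trusts(np.array([4.0, 2.0, 1.0]), np.array([10, 8, 5]), eta=0.5)
    arm = np.random.choice(len(trusts), p=trusts)  # one draw, as in choice() below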

choice()[source]

One random selection, with probabilities = trusts, thanks to numpy.random.choice().

choiceWithRank(rank=1)[source]

Multiple (rank >= 1) random selection, with probabilities = trusts, thanks to numpy.random.choice(), and select the last one (the least probable one).

  • Note that if not enough entries in the trust vector are non-zero, then choice() is called instead (rank is ignored).
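
A plausible sketch of this method, assuming the rank arms are drawn without replacement, weighted by trusts, and the last draw is kept:

    import numpy as np

    def choice_with_rank(trusts, rank=1):
        if np.count_nonzero(trusts) < rank:
            # Not enough non-zero entries in the trust vector: plain choice()
            return np.random.choice(len(trusts), p=trusts)
        # Draw `rank` distinct arms; the last one drawn is the least probable
        # of the sampled set.
        return np.random.choice(len(trusts), size=rank, replace=False, p=trusts)[-1]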

choiceFromSubSet(availableArms='all')[source]

One random selection, from availableArms, with probabilities = trusts, thanks to numpy.random.choice().

choiceMultiple(nb=1)[source]

Multiple (nb >= 1) random selection, with probabilities = trusts, thanks to numpy.random.choice().

estimatedOrder()[source]

Return the estimated order of the arms, as a permutation on [0..K-1] that would order the arms by increasing trust probabilities.
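
Presumably a single argsort of the trusts vector; a minimal sketch:

    import numpy as np

    trusts = np.array([0.62, 0.23, 0.15])   # e.g. output of softmax_trusts above
    order = np.argsort(trusts)              # array([2, 1, 0]): least to most trusted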

__module__ = 'Policies.Softmax'
class Policies.Softmax.SoftmaxWithHorizon(nbArms, horizon, lower=0.0, amplitude=1.0)[source]

Bases: Policies.Softmax.Softmax

Softmax with a fixed temperature $\eta_t = \eta_0$, chosen with knowledge of the horizon.

__init__(nbArms, horizon, lower=0.0, amplitude=1.0)[source]

New policy.

horizon = None

Parameter T = known horizon of the experiment.

__str__() -> str[source]

property temperature

Fixed temperature, small, knowing the horizon: $\eta_t = \sqrt{\frac{2 \log(K)}{T K}}$ (heuristic).
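
A short sketch of this heuristic (the argument names K and T are illustrative):

    import numpy as np

    def horizon_temperature(K, T):
        # eta = sqrt(2 log(K) / (T K)), fixed for the whole run of length T
        return np.sqrt(2.0 * np.log(K) / (T * K))

    eta = horizon_temperature(K=10, T=10000)  # about 0.0068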

__module__ = 'Policies.Softmax'
class Policies.Softmax.SoftmaxDecreasing(nbArms, temperature=None, unbiased=False, lower=0.0, amplitude=1.0)[source]

Bases: Policies.Softmax.Softmax

Softmax with decreasing temperature $\eta_t$.

__str__() -> str[source]

property temperature

Decreasing temperature with time: $\eta_t = \sqrt{\frac{\log(K)}{t K}}$ (heuristic).
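
A short sketch of this schedule (the max(t, 1) guard at t = 0 is an assumption, not part of the formula):

    import numpy as np

    def decreasing_temperature(K, t):
        # eta_t = sqrt(log(K) / (t K)), shrinking as t grows
        return np.sqrt(np.log(K) / (max(t, 1) * K))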

__module__ = 'Policies.Softmax'
class Policies.Softmax.SoftMix(nbArms, temperature=None, unbiased=False, lower=0.0, amplitude=1.0)[source]

Bases: Policies.Softmax.Softmax

Another Softmax with decreasing temperature $\eta_t$.

__str__() -> str[source]

__module__ = 'Policies.Softmax'
property temperature

Decreasing temperature with time: $\eta_t = c \, \frac{\log(t)}{t}$ (heuristic).
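
A short sketch of this schedule (the constant $c$ is not specified here, so the default c = 1.0 and the t >= 2 guard are assumptions):

    import numpy as np

    def softmix_temperature(t, c=1.0):
        # eta_t = c * log(t) / t; clamp t so that log(t) > 0 and t != 0
        t = max(t, 2)
        return c * np.log(t) / t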