Policies.AdSwitch module¶
The AdSwitch policy for non-stationary bandits, from ["Adaptively Tracking the Best Arm with an Unknown Number of Distribution Changes", Peter Auer, Pratik Gajane and Ronald Ortner](https://ewrl.files.wordpress.com/2018/09/ewrl_14_2018_paper_28.pdf).

It uses an additional \(\mathcal{O}(\tau_{\max})\) memory for a game of maximum stationary length \(\tau_{\max}\).
Warning
This implementation is still experimental!
class Policies.AdSwitch.Phase¶
Bases: enum.Enum

Different phases during the AdSwitch algorithm.

Checking = 2¶

Estimation = 1¶

Exploitation = 3¶

__module__ = 'Policies.AdSwitch'¶
Policies.AdSwitch.mymean(x)[source]¶
Simply numpy.mean() on x if x is non-empty, otherwise 0.0.

>>> np.mean([])
/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py:2957: RuntimeWarning: Mean of empty slice.
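A minimal sketch of such a helper, assuming x is a list or 1-D array (an illustration of the described behaviour, not the library's exact source):

import numpy as np

def mymean(x):
    # Mean of x, or 0.0 for an empty x (avoids numpy's RuntimeWarning and nan).
    return np.mean(x) if len(x) > 0 else 0.0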
Policies.AdSwitch.Constant_C1 = 1.0¶

Default value for the constant \(C_1\). Should be \(> 0\) and as large as possible, but not too large.

Policies.AdSwitch.Constant_C2 = 1.0¶

Default value for the constant \(C_2\). Should be \(> 0\) and as large as possible, but not too large.
class Policies.AdSwitch.AdSwitch(nbArms, horizon=None, C1=1.0, C2=1.0, *args, **kwargs)[source]¶
Bases: Policies.BasePolicy.BasePolicy

The AdSwitch policy for non-stationary bandits, from ["Adaptively Tracking the Best Arm with an Unknown Number of Distribution Changes", Peter Auer, Pratik Gajane and Ronald Ortner](https://ewrl.files.wordpress.com/2018/09/ewrl_14_2018_paper_28.pdf).
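A hypothetical interaction loop, as a sketch only: it assumes the usual BasePolicy interface (startGame(), choice(), getReward()) and a toy Bernoulli environment, neither of which is specified on this page.

import numpy as np
from Policies.AdSwitch import AdSwitch

horizon = 1000
means = [0.1, 0.5, 0.9]                 # toy Bernoulli arms (assumption)
policy = AdSwitch(nbArms=len(means), horizon=horizon)
policy.startGame()                      # assumed BasePolicy method

rng = np.random.default_rng(42)
for t in range(horizon):
    arm = policy.choice()                      # ask the policy for an arm
    reward = float(rng.random() < means[arm])  # draw a Bernoulli reward
    policy.getReward(arm, reward)              # feed the reward back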
horizon = None¶

Parameter \(T\) for the AdSwitch algorithm, the horizon of the experiment. TODO: try to use DoublingTrickWrapper to remove the dependency on \(T\)?

C1 = None¶

Parameter \(C_1\) for the AdSwitch algorithm.

C2 = None¶

Parameter \(C_2\) for the AdSwitch algorithm.
phase = None¶

Current phase, either exploration or exploitation.

current_exploration_arm = None¶

Currently explored arm. It cycles uniformly in step 2.

current_exploitation_arm = None¶

Currently exploited arm. It is \(\overline{a_k}\) in the algorithm.
batch_number = None¶

Number of the current batch.

last_restart_time = None¶

Time step of the last restart (beginning of an Estimation phase).

length_of_current_phase = None¶

Length of the current testing phase, computed as \(s_i\) with compute_di_pi_si().
step_of_current_phase = None¶

Timer inside the current phase.

current_best_arm = None¶

Current best arm, determined when finishing step 3. Denoted \(\overline{a_k}\) in the algorithm.

current_worst_arm = None¶

Current worst arm, determined when finishing step 3. Denoted \(\underline{a_k}\) in the algorithm.
current_estimated_gap = None¶

Gap between the current best and worst arms, i.e., the largest gap, determined when finishing step 3. Denoted \(\widehat{\Delta_k}\) in the algorithm.

last_used_di_pi_si = None¶

Memory of the currently used \((d_i, p_i, s_i)\).

all_rewards = None¶

Memory of all the rewards: one dictionary per arm, mapping time steps to rewards. Its size grows until that arm is restarted!
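To make that layout concrete, here is a hypothetical snapshot of the structure for a 2-armed instance (illustration only; the values are made up):

# all_rewards[arm][t] = reward obtained by pulling `arm` at time step t
all_rewards = {
    0: {0: 0.0, 2: 1.0, 3: 1.0},   # arm 0 was pulled at t = 0, 2, 3
    1: {1: 1.0, 4: 0.0},           # arm 1 was pulled at t = 1, 4
}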
read_range_of_rewards(arm, start, end)[source]¶
Read the all_rewards attribute to extract all the rewards for that arm, obtained between time start (included) and end (not included).
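A sketch of such a read over the dictionary layout shown above, written here as a free function taking all_rewards explicitly (the actual method presumably reads the attribute on self):

def read_range_of_rewards(all_rewards, arm, start, end):
    # Rewards of `arm` observed at times t with start <= t < end.
    return [r for t, r in all_rewards[arm].items() if start <= t < end]

>>> read_range_of_rewards(all_rewards, 0, 0, 3)   # uses the snapshot above
[0.0, 1.0]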
statistical_test(t, t0)[source]¶
Test whether, at time \(t\), there exist a \(\sigma\) with \(t_0 \leq \sigma < t\) and a pair of arms \(a, b\) satisfying

\[| \hat{\mu_a}[\sigma,t] - \hat{\mu_b}[\sigma,t] | > \sqrt{\frac{C_1 \log T}{t - \sigma}},\]

where \(\hat{\mu_a}[t_1,t_2]\) is the empirical mean of arm \(a\) over the samples obtained at times \(t \in [t_1,t_2)\).

Return True, sigma if the test is satisfied, with the smallest \(\sigma\) satisfying it, or False, None otherwise.
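A naive sketch of this test, scanning every candidate \(\sigma\) and every arm over the all_rewards layout from above. The function name, the explicit all_rewards and horizon parameters, and the brute-force scan are assumptions; the actual implementation may organise the computation differently.

import math

def statistical_test(all_rewards, t, t0, C1, horizon):
    # For each sigma in [t0, t), compare the empirical means of all arms on
    # the window [sigma, t) against the width sqrt(C1 * log(T) / (t - sigma)).
    # The largest pairwise difference equals max(means) - min(means).
    for sigma in range(t0, t):
        threshold = math.sqrt(C1 * math.log(horizon) / (t - sigma))
        means = []
        for arm in all_rewards:
            window = [r for s, r in all_rewards[arm].items() if sigma <= s < t]
            means.append(sum(window) / len(window) if window else 0.0)
        if max(means) - min(means) > threshold:
            return True, sigma   # smallest sigma passing the test
    return False, None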
find_Ik()[source]¶
Follow the algorithm and, with a gap estimate \(\widehat{\Delta_k}\), find \(I_k = \max\{ i : d_i \geq \widehat{\Delta_k} \}\), where \(d_i := 2^{-i}\). There is no need to do an exhaustive search:

\[I_k := \lfloor - \log_2(\widehat{\Delta_k}) \rfloor.\]
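This closed form can be checked directly; a small sketch with a hypothetical gap value of 0.3:

import math

def find_Ik(estimated_gap):
    # I_k = max{ i : 2**(-i) >= gap }, computed in closed form.
    return int(math.floor(-math.log2(estimated_gap)))

>>> find_Ik(0.3)          # d_1 = 0.5 >= 0.3 but d_2 = 0.25 < 0.3
1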
__module__ = 'Policies.AdSwitch'¶