What Doubling Tricks Can and Can’t Do for Multi-Armed Bandits

Lilian Besson^{1,2}, Emilie Kaufmann^{2}

^{1} CentraleSupélec, IETR–SCEE, Cesson-Sévigné, France, Lilian.Besson AT CentraleSupelec.fr

^{2} CNRS & CRIStAL, Univ. Lille 1, Inria SequeL, Lille, France, Emilie.Kaufmann AT Univ-Lille1.fr

This work is supported by the French National Research Agency (ANR), under the project BADASS (grant coded: N ANR-16-CE40-0002), by the French Ministry of Higher Education and Research (MENESR) and ENS Paris-Saclay. Thanks to Odalric-Ambrym Maillard and Florian Strub at Inria Lille for useful discussions.

This document was generated with the open-source software Engrafo. It is nothing but experimental, please refer to the PDF version of our article instead.

Abstract

An online reinforcement learning algorithm is anytime if it does not need to know in advance the horizon $T$ of the experiment. A well-known technique to obtain an anytime algorithm from any non-anytime algorithm is the “Doubling Trick”. In the contexts of adversarial or stochastic multi-armed bandits, the performance of an algorithm is measured by its regret, and we study two families of sequences of growing horizons (geometric and exponential) to generalize previously known results that certain doubling tricks can be used to conserve certain regret bounds. In a broad setting, we prove that a geometric doubling trick can be used to conserve (adversarial) bounds in $R_{T} = O (\sqrt{T})$ but cannot conserve (stochastic) bounds in $R_{T} = O (\log T)$. We give insights as to why exponential doubling tricks may be better, as they conserve bounds in $R_{T} = O (\log T)$, and are close to conserving bounds in $R_{T} = O (\sqrt{T})$.

Keywords:

Multi-Armed Bandits; Reinforcement Learning; Online Learning; Anytime Algorithms; Sequential Learning; Doubling Trick.

1 Introduction

Multi-Armed Bandit (MAB) problems are well-studied sequential decision making problems in which an agent repeatedly chooses an action (the “arm” of a one-armed bandit) in order to maximize some total reward [22]. Initial motivation for their study came from the modeling of clinical trials, as early as 1933 with the seminal work of [25]. In this example, arms correspond to different treatments with unknown, random effect. Since then, MAB models have been proved useful for many more applications, that range from cognitive radio [14] to online content optimization (news article recommendation [19], online advertising [10] or A/B Testing [16]), or portfolio optimization [23].

While the number of patients involved in a clinical study (and thus the number of treatments to select) is often decided in advance, in other contexts the total number of decisions to make (the horizon $T$) is unknown. It may correspond to the total number of visitors of a website optimizing its displays for a certain period of time, or to the number of attempted communications in a smart radio device. Hence in such cases, it is crucial to devise anytime algorithms, that is, algorithms that do not rely on the knowledge of this horizon $T$ to sequentially select arms. A general way to turn any base algorithm into an anytime algorithm is the use of the so-called Doubling Trick, first proposed by [4], that consists in repeatedly running the base algorithm with increasing horizons. Motivated by the frequent use of this technique and the absence of a generic study of its effect on the algorithm’s efficiency, this paper investigates in detail two families of doubling sequences (geometric and exponential), and shows that the former should be avoided for stochastic problems.

More formally, a MAB model is a set of $K$ arms, each arm $k$ being associated with an (unknown) reward stream $(Y_{k, t})_{t \in \mathbb{N}}$. Fix $T$ a finite (possibly unknown) horizon. At each time step $t \in \{1, \dots, T\}$ an agent selects an arm $A (t) \in \{1, \dots, K\}$ and receives as a reward the current value of the associated reward stream, $r (t) := Y_{A (t), t}$. The agent’s decision strategy (or bandit algorithm) $\mathcal{A}_{T} := (A (t), t \in \{1, \dots, T\})$ is such that $A (t)$ can only rely on the past observations $A (1), r (1), \dots, A (t - 1), r (t - 1)$, on external randomness and (possibly) on the knowledge of the horizon $T$. The objective of the agent is to find an algorithm $\mathcal{A}$ that maximizes the expected cumulated rewards, where the expectation is taken over the possible randomness used by the algorithm and the possible randomness in the generation of the reward streams. In the oblivious case, in which the reward streams are independent of the algorithm’s choices, this is equivalent to minimizing the regret, defined as

$$ R_{T} (\mathcal{A}_{T}) := \max_{k \in \{1, \dots, K\}} \mathbb{E} \left[ \sum_{t = 1}^{T} \left( Y_{k, t} - Y_{A (t), t} \right) \right] . $$

This quantity, referred to as pseudo-regret in [7], quantifies the difference between the expected cumulated reward of the best fixed action, and that of the strategy $\mathcal{A}_{T}$. For the general adversarial bandit problem [6], in which the reward streams are arbitrary (picked by an adversary), a worst-case lower bound has been given. It says that for every algorithm, there exist (stochastic) reward streams such that the regret is larger than $(1 / 20) \sqrt{K T}$ [6]. Besides, the $\mathrm{EXP3}$ algorithm has been shown to have a regret of order $\sqrt{K T \log (K)}$.

Much smaller regret may be obtained in stochastic MAB models, in which the reward stream from each arm $k$ is assumed to be i.i.d., from some (unknown) distribution $ν_{k}$, with mean $μ_{k}$. In that case, various algorithms have been proposed with problem-dependent regret upper bounds of the form $C (ν) \log (T)$, where $C (ν)$ is a constant that only depends on the arms distributions. Different assumptions on the arms distributions lead to different problem-dependent constants. In particular, under some parametric assumptions (e.g., Gaussian distributions, exponential families), asymptotically optimal algorithms have been proposed and analyzed (e.g., $\mathrm{kl}\text{-}\mathrm{UCB}$ [8] or Thompson sampling [1]), for which the constant $C (ν)$ obtained in the regret upper bound matches exactly that of the lower bound given by [17]. Under the non-parametric assumption that the $ν_{k}$ are bounded in $[0, 1]$, the regret of the $\mathrm{UCB1}$ algorithm [5] is of the above form with $C (ν) = 8 \times \sum_{k : μ_{k} < μ^{*}} (μ^{*} - μ_{k})^{- 1}$, where $μ^{*} = \max_{k} μ_{k}$ is the mean of the best arm. As in this last example, all the available constants $C (ν)$ become very large on “hard” instances, in which some arms are very close to the best arm. On such instances, $C (ν) \log (T)$ may be much larger than the worst-case $(1 / 20) \sqrt{K T}$, and distribution-independent guarantees may actually be preferred.

The $\mathrm{MOSS}$ algorithm, proposed by [2], is the first stochastic bandit algorithm to enjoy a problem-dependent logarithmic regret and to be optimal in a minimax sense, as its regret is proved to be upper bounded by $\sqrt{K T}$, for bandit models with rewards in $[0, 1]$. However the corresponding constant $C (ν)$ is proportional to $K / Δ_{\min}$, where $Δ_{\min} = \min_{k} (μ^{*} - μ_{k})$ is the minimal gap, which is worse than the constant of $\mathrm{UCB1}$. Another drawback of $\mathrm{MOSS}$ is that it is not anytime. These two shortcomings have been overcome recently in two different works. On the one hand, the $\mathrm{MOSS}\text{-}\mathrm{anytime}$ algorithm [11] is minimax optimal and anytime, but its problem-dependent regret does not improve that of $\mathrm{MOSS}$. On the other hand, the $\mathrm{kl}\text{-}\mathrm{UCB}^{++}$ algorithm [21] is simultaneously minimax optimal and asymptotically optimal (i.e., it has the best problem-dependent constant $C (ν)$), but it is not anytime. A natural question is thus to know whether a Doubling Trick could overcome this limitation.

This question is the starting point of our comprehensive study of the Doubling Trick: can a single Doubling Trick be used to preserve both problem-dependent (logarithmic) regret and minimax (square-root) regret? We answer this question in the negative, by showing that two different types of Doubling Trick may actually be needed. In this paper, we investigate how algorithms enjoying regret guarantees of the generic form

$$ \forall T \geq 1, \quad R_{T} (\mathcal{A}_{T}) \leq c T^{γ} (\log (T))^{δ} + o (T^{γ} (\log (T))^{δ}) \tag{2} $$

may be turned into anytime algorithms enjoying similar regret guarantees with an appropriate Doubling Trick. This does not come for free, and we exhibit a “price of Doubling Trick”, that is a constant factor larger than $1$, referred to as a constant multiplicative loss.

Outline. The rest of the paper is organized as follows. The Doubling Trick is formally defined in Section 2, along with a generic tool for its analysis. In Section 3, we present upper and lower bounds on the regret of algorithms to which a geometric Doubling Trick is applied. Section 4 investigates regret guarantees that can be obtained for a “faster” exponential Doubling Trick. Experimental results are then reported in Section 5. Complementary elements of proofs are deferred to the appendix.

2 Doubling Tricks

The Doubling Trick, denoted by $\mathcal{DT}$, is a general procedure to convert a (possibly non-anytime) algorithm into an anytime algorithm. It is formally stated below in Figure 1 and depends on a non-decreasing diverging doubling sequence $(T_{i})_{i \in \mathbb{N}}$ (i.e., $T_{i} \to \infty$ for $i \to \infty$). $\mathcal{DT}$ fully restarts the underlying algorithm $\mathcal{A}$ at the beginning of each new sequence (at $t = T_{i - 1} + 1$), and runs this algorithm on a sequence of length $T_{i} - T_{i - 1}$.

Figure 1: The Generic Doubling Trick Algorithm, $\mathcal{A}' = \mathcal{DT} (\mathcal{A}, (T_{i})_{i \in \mathbb{N}})$.
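The restart procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the interface (`choose`, `update`), the class names `DoublingTrick` and `RoundRobin`, and the toy base algorithm are all hypothetical. The wrapper starts a fresh instance of the base algorithm at the beginning of each block and gives it the block length $T_{i} - T_{i - 1}$ as its horizon.

```python
import itertools
from typing import Callable, Iterable, Iterator, List


class RoundRobin:
    """Toy horizon-dependent base algorithm, a hypothetical stand-in for A_T."""

    def __init__(self, horizon: int, n_arms: int = 3) -> None:
        self.horizon = horizon  # a real algorithm would tune itself with this
        self.n_arms = n_arms
        self.steps = 0

    def choose(self) -> int:
        arm = self.steps % self.n_arms
        self.steps += 1
        return arm

    def update(self, arm: int, reward: float) -> None:
        pass  # a real algorithm would update its statistics here


class DoublingTrick:
    """Anytime wrapper: restart the base algorithm on blocks of length T_i - T_{i-1}."""

    def __init__(self, make_algorithm: Callable[[int], RoundRobin],
                 horizons: Iterable[int]) -> None:
        self._make = make_algorithm
        self._horizons: Iterator[int] = iter(horizons)  # T_0, T_1, ...
        self._block_end = 0  # T_{i-1}, with the convention T_{-1} = 0
        self._t = 0
        self._algo = None
        self.block_lengths: List[int] = []  # horizons handed to the base algorithm

    def choose(self) -> int:
        self._t += 1
        if self._t > self._block_end:
            previous_end = self._block_end
            # Skip repeated terms of a non-decreasing sequence (empty blocks).
            while self._block_end < self._t:
                self._block_end = next(self._horizons)
            length = self._block_end - previous_end
            self._algo = self._make(length)  # full restart, no memory is kept
            self.block_lengths.append(length)
        return self._algo.choose()

    def update(self, arm: int, reward: float) -> None:
        self._algo.update(arm, reward)


# Geometric horizons T_i = 2 * 2**i give blocks of length 2, 2, 4, 8, 16, ...
anytime = DoublingTrick(RoundRobin, (2 * 2**i for i in itertools.count()))
arms = []
for _ in range(20):
    arm = anytime.choose()
    anytime.update(arm, reward=0.0)
    arms.append(arm)
```

The wrapper never needs the total horizon: it only consumes the next term of the sequence when the current block is exhausted.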

Related work. The Doubling Trick is a well known idea in online learning, that can be traced back to [4]. In the literature, the term Doubling Trick usually refers to the geometric sequence $T_{i} = 2^{i}$ , in which the horizon is actually doubling, that was popularized by [9] in the context of adversarial bandits. Specific doubling tricks have also been used for stochastic bandits, for example in the work of [3], which uses the doubling sequence $T_{i} = 2^{2^{i}}$ to turn the $U C B - R$ algorithm into an anytime algorithm.

Elements of regret analysis. For a sequence $(T_{i})_{i \in N}$ , with $T_{i} \in N$ for all $i$ , we denote $T_{- 1} = 0$ , and $T_{0}$ is always taken non-zero, $T_{0} > 0$ (i.e., $T_{0} \in N^{*}$ ). We only consider non-decreasing and diverging sequences (that is, $\forall i, T_{i + 1} \geq T_{i}$ , and $T_{i} \to \infty$ for $i \to \infty$ ).

Definition 1: Last Term $L_{T}$. For a non-decreasing diverging sequence $(T_{i})_{i \in \mathbb{N}}$ and $T \in \mathbb{N}$, we can define $L_{T} ((T_{i})_{i \in \mathbb{N}})$ by

$$ \forall T \geq 1, \quad L_{T} ((T_{i})_{i \in \mathbb{N}}) := \min \{ i \in \mathbb{N} : T_{i} > T \} . $$

It is simply denoted $L_{T}$ when there is no ambiguity (e.g., if the doubling sequence is chosen).

$\mathcal{DT} (\mathcal{A})$ reinitializes its underlying algorithm $\mathcal{A}$ at each time $T_{i}$, and in general the total regret is upper bounded by the sum of the regrets on each sequence $\{T_{i}, \dots, T_{i + 1} - 1\}$. By considering the last partial sequence $\{T_{L_{T} - 1}, \dots, T - 1\}$, this splitting can be used to get a generic upper bound by taking into account a larger last sequence (up to $T_{L_{T}} - 1$). For stochastic bandit models, the i.i.d. hypothesis on the reward streams makes the splitting on each sequence an equality, so we can also get a lower bound by excluding the last partial sequence. Lemma 1 is proved in Appendix A.1.

Lemma 1: Regret Lower and Upper Bounds for $\mathcal{DT}$. For any bandit model, algorithm $\mathcal{A}$ and horizon $T$, one has the generic upper bound

$$ R_{T} (\mathcal{DT} (\mathcal{A}, (T_{i})_{i \in \mathbb{N}})) \leq \sum_{i = 0}^{L_{T}} R_{T_{i} - T_{i - 1}} (\mathcal{A}_{T_{i} - T_{i - 1}}) . \quad \text{(UB)} \tag{3} $$

Under a stochastic bandit model, one has furthermore the lower bound

$$ R_{T} (\mathcal{DT} (\mathcal{A}, (T_{i})_{i \in \mathbb{N}})) \geq \sum_{i = 0}^{L_{T} - 1} R_{T_{i} - T_{i - 1}} (\mathcal{A}_{T_{i} - T_{i - 1}}) . \quad \text{(LB)} \tag{4} $$

As one can expect, the key to obtaining regret guarantees for a Doubling Trick algorithm is to choose the doubling sequence $(T_{i})_{i \in \mathbb{N}}$ correctly. Empirically, one can verify that sequences with slow growth give terrible results; for example, using an arithmetic progression typically gives a linear regret. Building on Lemma 1, we will prove that if $\mathcal{A}$ satisfies a certain regret bound ($R_{T} = O (T^{γ})$, $O ((\log T)^{δ})$, or $O (T^{γ} (\log T)^{δ})$) then an appropriate anytime version of $\mathcal{A}$ with a certain doubling trick can conserve the regret bound, with an explicit constant multiplicative loss $ℓ > 1$. In this paper, we study in depth two families of sequences: first geometric and then exponential growths.

3 What the Geometric Doubling Trick Can and Can’t Do

We define geometric doubling sequences, and prove that they can be used to conserve bounds in $O (T^{γ} (log T)^{δ})$ with $γ > 0$ but cannot be used to conserve bounds in $O ((log T)^{δ})$ .

Definition 2: Geometric Growth. For $b \in \mathbb{R}$, $b > 1$ and $T_{0} \in \mathbb{N}^{*}$, the sequence defined by $T_{i} = ⌊ T_{0} b^{i} ⌋$ is non-decreasing and diverging, and satisfies

$$ \begin{aligned} & \forall T < T_{0}, \ L_{T} = 0, \quad \text{and} \quad \forall T \geq T_{0}, \ L_{T} = \left\lceil \log_{b} \left( \frac{T}{T_{0}} \right) \right\rceil , \\ & \forall i > 0, \quad T_{0} (b - 1) b^{i - 1} - 1 \leq T_{i} - T_{i - 1} \leq 1 + T_{0} (b - 1) b^{i - 1} . \end{aligned} \tag{5} $$

Asymptotically for $i$ and $T \to \infty$ , $T_{i} = O (b^{i})$ and $L_{T} \sim {log}_{b} (T) = O (log T)$ .
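As a quick sanity check of Definition 2, the following sketch builds a geometric sequence with exact rational arithmetic, verifies the block-length bounds of Eq. (5) on the first terms, and computes $L_{T}$ by direct search. The function names and the parameter choice ($T_{0} = 10$, $b = 3/2$) are illustrative assumptions.

```python
import math
from fractions import Fraction


def geometric_horizon(i: int, t0: int, b: Fraction) -> int:
    """T_i = floor(T_0 * b**i), with the convention T_{-1} = 0."""
    return 0 if i < 0 else math.floor(t0 * b**i)


def last_term(horizon: int, t0: int, b: Fraction) -> int:
    """L_T = min{i : T_i > T}, found by direct search (Definition 1)."""
    i = 0
    while geometric_horizon(i, t0, b) <= horizon:
        i += 1
    return i


t0, b = 10, Fraction(3, 2)  # exact rationals avoid floating-point floor issues

# Block-length bounds of Eq. (5), checked for i = 1, ..., 39.
block_bounds_hold = all(
    t0 * (b - 1) * b ** (i - 1) - 1
    <= geometric_horizon(i, t0, b) - geometric_horizon(i - 1, t0, b)
    <= 1 + t0 * (b - 1) * b ** (i - 1)
    for i in range(1, 40)
)

# Number of restarts for T = 10^2, 10^4, 10^6: it grows like log_b(T).
restarts = [last_term(10**p, t0, b) for p in (2, 4, 6)]
```

Multiplying the horizon by $100$ adds roughly $\log_{b} (100) \approx 11.4$ restarts each time, which is the logarithmic growth of $L_{T}$ stated above.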

3.1 Conserving a Regret Upper Bound with Geometric Horizons

A geometric doubling sequence allows one to conserve a minimax bound (i.e., $R_{T} = O (\sqrt{T})$). It was suggested, for instance, in [9]. We generalize this result in the following theorem, proved in Appendix A.2, by extending it to bounds of the form $T^{γ} (\log T)^{δ}$ instead of just $\sqrt{T}$, for any $0 < γ < 1$ and $δ \geq 0$. Note that no distinction is made for the case $δ = 0$, neither in the expression of the constant loss nor in the proof.

Theorem 1. If an algorithm $\mathcal{A}$ satisfies $R_{T} (\mathcal{A}_{T}) \leq c T^{γ} (\log T)^{δ} + f (T)$, for $0 < γ < 1$, $δ \geq 0$ and for $c > 0$, and an increasing function $f (t) = o (t^{γ} (\log t)^{δ})$ (at $t \to \infty$), then the anytime version $\mathcal{A}' := \mathcal{DT} (\mathcal{A}, (T_{i})_{i \in \mathbb{N}})$ with the geometric sequence $(T_{i})_{i \in \mathbb{N}}$ of parameters $T_{0} \in \mathbb{N}^{*}, b > 1$ (i.e., $T_{i} = ⌊ T_{0} b^{i} ⌋$) with the condition $T_{0} (b - 1) > 1$ if $δ > 0$, satisfies

$$ R_{T} (\mathcal{A}') \leq ℓ (γ, δ, T_{0}, b) \, c \, T^{γ} (\log T)^{δ} + g (T), \tag{6} $$

with an increasing function $g (t) = o (t^{γ} (\log t)^{δ})$, and a constant loss $ℓ (γ, δ, T_{0}, b) > 1$,

$$ ℓ (γ, δ, T_{0}, b) := \left( \frac{\log (T_{0} (b - 1) + 1)}{\log (T_{0} (b - 1))} \right)^{δ} \times \frac{b^{γ} (b - 1)^{γ}}{b^{γ} - 1} . \tag{7} $$

For a fixed $γ$ and $δ$ , minimizing $ℓ (γ, δ, T_{0}, b)$ does not always give a unique solution. On the one hand, if $γ ≳ 0.384$ , there is a unique solution $b^{*} (γ) > 1$ minimizing the $\frac{b^{γ} (b - 1)^{γ}}{b^{γ} - 1}$ term, solution of $b^{γ + 1} - 2 b + 1 = 0$ , but without a closed form if $γ$ is unknown. On the other hand, for any $γ$ , the term depending on $δ$ tends quickly to $1$ when $T_{0}$ increases.

Practical considerations. Empirically, when $γ$ and $δ$ are fixed and known, there is no need to minimize $ℓ$ jointly. It can be minimized separately by first minimizing $\frac{b^{γ} (b - 1)^{γ}}{b^{γ} - 1}$ , that is by solving $b^{γ + 1} - 2 b + 1 = 0$ numerically (e.g., with Newton’s method), and then by taking $T_{0}$ large enough so that the other term is close enough to $1$ .

For the usual case of $γ = 1 / 2$ and $δ = 0$ (i.e., bounds in $\sqrt{T}$ ), the optimal choice of $b$ is $\frac{3 + \sqrt{5}}{2}$ leading to $ℓ ≃ 3.33$ , and the usual choice of $b = 2$ gives $ℓ ≃ 3.41$ (see Corollary 1 in appendix). Any large enough $T_{0}$ gives similar performance, and empirically $T_{0} ≫ K$ is preferred, as most algorithms explore each arm once in their first steps (e.g., $T_{0} = 200$ for $K = 9$ for our experiments).
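The numerical values quoted above can be reproduced with a short script. The sketch below solves $b^{γ + 1} - 2 b + 1 = 0$ by bisection (swapped in here for Newton's method because it needs no derivative and no starting guess) and evaluates the constant loss of Eq. (7). The function names are illustrative, and the bracket assumes $0 < γ < 1$ with $γ$ not too close to $1$.

```python
import math


def optimal_base(gamma: float, iterations: int = 200) -> float:
    """Root b > 1 of b**(gamma + 1) - 2*b + 1 = 0, found by bisection."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("this sketch assumes 0 < gamma < 1")

    def f(b: float) -> float:
        return b ** (gamma + 1.0) - 2.0 * b + 1.0

    lo, hi = 1.0 + 1e-6, 2.0  # f is negative just above the trivial root b = 1
    while f(hi) <= 0.0:        # enlarge the bracket until f changes sign
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if f(mid) <= 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def geometric_loss(gamma: float, delta: float, t0: int, b: float) -> float:
    """Constant loss of Eq. (7) for the geometric doubling trick."""
    growth = (b**gamma * (b - 1.0) ** gamma) / (b**gamma - 1.0)
    if delta == 0.0:
        return growth  # the T_0-dependent factor equals 1 when delta = 0
    x = t0 * (b - 1.0)  # Theorem 1 requires x > 1 when delta > 0
    return (math.log(x + 1.0) / math.log(x)) ** delta * growth


b_star = optimal_base(0.5)                          # close to (3 + sqrt(5)) / 2
loss_star = geometric_loss(0.5, 0.0, 200, b_star)   # about 3.33
loss_two = geometric_loss(0.5, 0.0, 200, 2.0)       # about 3.41
```

The gap between the optimal base and the usual $b = 2$ is small, which is consistent with the remark that $b = 2$ is a reasonable default for $\sqrt{T}$ bounds.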

3.2 Problem-Dependent Regret Lower Bound with Geometric Horizons

We observe that the constant loss $ℓ (γ, δ, T_{0}, b)$ from Theorem 1 blows up when $γ$ goes to zero, giving the intuition that no geometric doubling trick could be used to preserve a logarithmic bound (i.e., with $γ = 0$, $δ > 0$). This is confirmed by the lower bound given below.

Theorem 2. For stochastic models, if $\mathcal{A}$ satisfies $R_{T} (\mathcal{A}_{T}) \geq c (\log T)^{δ}$, for $c > 0$ and $δ > 0$, then the anytime version $\mathcal{A}' := \mathcal{DT} (\mathcal{A}, (T_{i})_{i \in \mathbb{N}})$ with the geometric sequence $(T_{i})_{i \in \mathbb{N}}$ of parameters $T_{0} \in \mathbb{N}^{*}, b > 1$ (i.e., $T_{i} = ⌊ T_{0} b^{i} ⌋$) satisfies this lower bound for a certain constant $c' > 0$,

$$ \forall T \geq 1, \quad L_{T} \geq 2 \implies R_{T} (\mathcal{A}') \geq c' (\log T)^{δ + 1} . \tag{8} $$

Moreover, this implies that $R_{T} (A^{'}) = Ω ((log T)^{δ + 1})$ , which proves that a geometric sequence cannot be used to conserve a logarithmic regret bound.

If the regret is lower bounded at finite horizon by $R_{T} (A_{T}) \geq c \sqrt{T}$ , without surprise we can prove that a geometric doubling sequence gives a comparable lower bound for the Doubling Trick algorithm $D T (A)$ (see Th. 5 in Appendix B). But more importantly, we show with Theorem 2 that a geometric sequence cannot be used to conserve a finite-horizon lower bound like $R_{T} (A_{T}) \geq c log (T)$ .

This special case ( $δ = 1$ ) is the most interesting, as efficient algorithms for stationary bandits must match the lower bound from [17]: if $R_{T} (A_{T})$ satisfies both (finite-time) logarithmic lower and upper bound (i.e., if $\frac{R_{T} (A_{T})}{log T}$ is bounded for $T$ large enough), then using a geometric sequence for the doubling trick algorithm is a bad idea as it guarantees a blow up in the regret lower bound, by implying $R_{T} (D T (A, (T_{i})_{i \in N})) = Ω ((log T)^{2})$ . This result is the reason why we need to study successive horizons growing faster than a geometric sequence (i.e., such that $log (T_{i}) ≫ i$ ), like the exponential sequence studied in next Section 4.
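The blow-up can be observed numerically on the lower bound of Lemma 1. The sketch below assumes a hypothetical base algorithm whose regret is exactly $\log T$ on every block (so $c = 1$, $δ = 1$) and uses the geometric sequence $T_{i} = 2^{i}$; it sums the per-block regrets over the completed blocks and compares the total with $\log T$ and $(\log T)^{2}$. Function names are illustrative.

```python
import math
from typing import List


def geometric_blocks(horizon: int, t0: int = 1, b: int = 2) -> List[int]:
    """Block lengths T_i - T_{i-1} for i = 0, ..., L_T - 1 (with T_{-1} = 0)."""
    lengths, previous, i = [], 0, 0
    while t0 * b**i <= horizon:  # stops at i = L_T, the first index with T_i > T
        lengths.append(t0 * b**i - previous)
        previous = t0 * b**i
        i += 1
    return lengths


def lower_bound_sum(horizon: int) -> float:
    """Right-hand side of the lower bound of Lemma 1 when R_T(A_T) = log(T)."""
    return sum(math.log(n) for n in geometric_blocks(horizon))


horizons = [10**3, 10**6, 10**9]
# Normalized by log(T), the lower bound keeps growing with T ...
ratio_log = [lower_bound_sum(T) / math.log(T) for T in horizons]
# ... while normalized by log(T)^2 it stays bounded away from 0.
ratio_log_squared = [lower_bound_sum(T) / math.log(T) ** 2 for T in horizons]
```

With these parameters the second ratio approaches $1 / (2 \log 2) \approx 0.72$ from below, in line with the $Ω ((\log T)^{2})$ behavior.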

3.3 Proof of Theorem 2

Let $\mathcal{A}' := \mathcal{DT} (\mathcal{A}, (T_{i})_{i \in \mathbb{N}})$ and consider a fixed stochastic bandit problem. The lower bound from Lemma 1 gives

$$ R_{T} (\mathcal{A}') \geq \sum_{i = 0}^{L_{T} - 1} R_{T_{i} - T_{i - 1}} (\mathcal{A}_{T_{i} - T_{i - 1}}) . $$

We bound $T_{i} - T_{i - 1} \geq T_{0} (b - 1) b^{i - 1} - 1$ for any $i > 0$, thanks to Definition 2, and we can use the hypothesis on $\mathcal{A}$ for each regret term:

$$ \begin{aligned} R_{T} (\mathcal{A}') & \geq \sum_{i = 0}^{L_{T} - 1} c \, (\log (T_{i} - T_{i - 1}))^{δ} \\ & \geq c \sum_{i = 1}^{L_{T} - 1} (\log (T_{0} (b - 1) b^{i - 1} - 1))^{δ} \\ & = c \sum_{i = 0}^{L_{T} - 2} (\log (T_{0} (b - 1) b^{i} - 1))^{δ} \quad (\text{with } i := i - 1) . \end{aligned} $$

Let $x_{i} := T_{0} (b - 1) b^{i} > 0$. If we have $T_{0} (b - 1) > 1$ (see below ($♣$) for a discussion on the other case), then Lemma 5 gives $\log (x_{i} - 1) \geq \frac{\log (T_{0} (b - 1) - 1)}{\log (T_{0} (b - 1))} \log (x_{i})$ as $x_{i} > 1$. For lower bounds, there is no need to handle the constants tightly, and we have $x_{i} \geq b^{i}$ by hypothesis, so let us call this constant $c' = c \left( \frac{\log (T_{0} (b - 1) - 1)}{\log (T_{0} (b - 1))} \right)^{δ} > 0$, and thus it simplifies to

$$ R_{T} (\mathcal{A}') \geq c' \sum_{i = 0}^{L_{T} - 2} (\log (b^{i}))^{δ} . $$

A sum-integral comparison for the increasing function $t \mapsto t^{δ}$ (as $δ > 0$) gives $\sum_{i = 0}^{L_{T} - 2} (\log (b^{i}))^{δ} = (\log b)^{δ} \sum_{i = 1}^{L_{T} - 2} i^{δ} \geq (\log b)^{δ} \int_{0}^{L_{T} - 2} t^{δ} d t = \frac{(\log b)^{δ}}{δ + 1} (L_{T} - 2)^{δ + 1}$ (if $L_{T} \geq 2$), and so

$$ R_{T} (\mathcal{A}') \geq c' \frac{(\log b)^{δ}}{δ + 1} (L_{T} - 2)^{δ + 1} . $$

For the geometric sequence, we know that $L_{T} \geq \log_{b} (\frac{T}{T_{0}}) = \log_{b} (T) - \log_{b} (T_{0})$, and $\log_{b} (T) - \log_{b} (T_{0}) - 2 \sim \log_{b} (T)$ at $T \to \infty$, so there exists a constant $0 < c'' < 1$ such that $L_{T} - 2 \geq c'' \log_{b} (T)$ for $T$ large enough ($\geq \frac{2}{b - 1}$), and such that $L_{T} \geq 2$. And thus we just proved that there is a constant $c''' > 0$ such that

$$ R_{T} (\mathcal{A}') \geq c''' (\log T)^{δ + 1} =: g (T) . $$

So this proves that for $T$ large enough, $R_{T} (\mathcal{A}') \geq g (T)$ with $g (T) = Θ ((\log T)^{δ + 1})$, and so $R_{T} (\mathcal{A}') = Ω ((\log T)^{δ + 1})$, which also implies that $R_{T} (\mathcal{A}')$ cannot be $O ((\log T)^{δ})$.

( $♣$ ) If we do not have the hypothesis $T_{0} (b - 1) > 1$ , the same proof could be done, by observing that from $i \geq i_{0}$ large enough, we have $x_{i} \geq b^{i - i_{0}}$ (as soon as $b^{i_{0}} \geq \frac{1}{T_{0} (b - 1)} > 0$ , i.e., $i_{0} \geq ⌈ - {log}_{b} (T_{0} (b - 1)) ⌉ \geq 1$ ), and so the same arguments can be used, to obtain a sum from $i = i_{0} + 1$ instead of from $i = 1$ . For a fixed $i_{0}$ , we also have $L_{T} - 2 - i_{0} \geq c^{''} log (T)$ for a (small enough) constant $c^{''}$ , and thus we obtain the same result. $■$

4 What Can the Exponential Doubling Trick Do?

We define exponential doubling sequences, and prove that they can be used to conserve bounds in $O ((log T)^{δ})$ , unlike the previously studied geometric sequences. Furthermore, we provide elements showing that they may also conserve bounds in $O (T^{γ})$ or $O (T^{γ} (log T)^{δ})$ .

Definition 3: Exponential Growth. For $a, b \in \mathbb{R}$, $a, b > 1$ and $T_{0} \in \mathbb{N}^{*}$, let $τ := \frac{T_{0}}{a} > 0$. Then the sequence defined by $T_{i} := ⌊ τ a^{b^{i}} ⌋$ is non-decreasing and diverging, and satisfies

$$ \forall T < T_{0}, \ L_{T} = 0, \quad \text{and} \quad \forall T \geq T_{0}, \ L_{T} = \left\lceil \log_{b} \left( \log_{a} \left( \frac{T}{τ} \right) \right) \right\rceil . \tag{9} $$

Asymptotically for $i$ and $T \to \infty$ , $T_{i} = O (a^{b^{i}})$ and $L_{T} \sim {log}_{b} ({log}_{a} (\frac{T}{τ})) = O (log log T)$ .
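To make the difference in growth concrete, the sketch below computes the first terms of an exponential sequence and the number of restarts needed to reach the horizon $T = 45678$ used in the experiments, then compares it with a geometric sequence. It assumes integer parameters $a$ and $b$ so that all terms are exact integers; the function names and the choice $T_{0} = a = b = 2$ are illustrative.

```python
def exponential_horizon(i: int, t0: int, a: int, b: int) -> int:
    """T_i = floor((T_0 / a) * a**(b**i)), exact for integer a and b."""
    return (t0 * a ** (b**i)) // a


def last_term(horizon: int, t0: int, a: int, b: int) -> int:
    """L_T = min{i : T_i > T}, found by direct search (Definition 1)."""
    i = 0
    while exponential_horizon(i, t0, a, b) <= horizon:
        i += 1
    return i


# With T_0 = a = b = 2 the sequence is T_i = 2**(2**i).
first_terms = [exponential_horizon(i, 2, 2, 2) for i in range(6)]
exponential_restarts = last_term(45678, 2, 2, 2)

# For comparison, the geometric sequence T_i = 2 * 2**i on the same horizon.
geometric_restarts = next(i for i in range(64) if 2 * 2**i > 45678)
```

On this horizon the exponential sequence restarts only a handful of times, against more than a dozen for the geometric one, which reflects $L_{T} = O (\log \log T)$ versus $L_{T} = O (\log T)$.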

4.1 Conserving Regret Upper Bound with Exponential Horizons

An exponential doubling sequence allows one to conserve a problem-dependent bound on regret (i.e., $R_{T} = O (\log T)$). This was already used by [3] in a particular case, and more recently by [20], for example. We generalize this result in the following theorem.

Theorem 3. If an algorithm $\mathcal{A}$ satisfies $R_{T} (\mathcal{A}_{T}) \leq c T^{γ} (\log T)^{δ} + f (T)$, for $0 \leq γ < 1$, $δ \geq 0$, and for $c > 0$, and an increasing function $f (t) = o (t^{γ} (\log t)^{δ})$ (at $t \to \infty$), then the anytime version $\mathcal{A}' := \mathcal{DT} (\mathcal{A}, (T_{i})_{i \in \mathbb{N}})$ with the exponential sequence $(T_{i})_{i \in \mathbb{N}}$ of parameters $T_{0} \in \mathbb{N}^{*}, a, b > 1$ (i.e., $T_{i} = ⌊ \frac{T_{0}}{a} a^{b^{i}} ⌋$), satisfies the following inequality,

$$ R_{T} (\mathcal{A}') \leq ℓ (γ, δ, T_{0}, a, b) \, c \, (T^{b})^{γ} (\log T)^{δ} + g (T), \tag{10} $$

with an increasing function $g (t) = o ((t^{b})^{γ} (\log t)^{δ})$, and a constant loss $ℓ (γ, δ, T_{0}, a, b) > 0$,

$$ ℓ (γ, δ, T_{0}, a, b) := \begin{cases} \left( \frac{a}{T_{0}} \right)^{(b - 1) γ} \frac{b^{2 δ}}{b^{δ} - 1} > 0 & \text{if } δ > 0, \\ 1 + \frac{1}{(\log (a)) (\log (b^{γ}))} > 1 & \text{if } δ = 0 . \end{cases} \tag{11} $$

This result is as rich as Theorem 1 but for logarithmic bounds (i.e., $γ = 0$), with no loss other than a constant factor $ℓ \geq 4$. It can also be applied to bounds of the generic form $R_{T} = O (T^{γ} (\log T)^{δ})$ (i.e., with $γ > 0$), but with a significant loss from $T^{γ}$ to $T^{b γ}$, and a constant loss $ℓ > 0$. It is important to notice that for $γ > 0$, the loss $ℓ$ can be made arbitrarily small (by taking a first step $T_{0}$ large enough). This observation is encouraging, and leads the authors to think that a tight upper bound could be proved.

Corollary 2. In particular, order-optimal and optimal algorithms for problem-dependent bounds have $γ = 0$ and $δ = 1$, as well as $f (t) = 0$, in which case Theorem 3 gives a simpler bound, as for any $b > 1$, $ℓ (γ = 0, b) = \frac{b^{2}}{b - 1} \geq 4$, and so

$$ R_{T} (\mathcal{A}_{T}) \leq c \log (T) \implies R_{T} (\mathcal{DT} (\mathcal{A}, (⌊ T_{0}^{b^{i}} ⌋)_{i \in \mathbb{N}})) \leq \frac{b^{2}}{b - 1} c \log (T) . $$

Remark: the optimal choice of $b = 2$ gives a constant loss of $ℓ (γ = 0, b) = 4$, which is about half the loss of $8.0625$ obtained¹ in [3].
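Corollary 2 can be checked numerically through the upper bound of Lemma 1. The sketch below assumes a hypothetical base algorithm whose regret is exactly $\log T$ on every block (so $c = 1$ and $f = 0$) and the exponential sequence $T_{i} = T_{0}^{b^{i}}$ with $T_{0} = b = 2$; it sums the per-block regrets, including the last partial block counted in full, and compares the total with $\frac{b^{2}}{b - 1} \log T$. Function names and test horizons are illustrative.

```python
import math
from typing import List


def exponential_blocks(horizon: int, t0: int = 2, b: int = 2) -> List[int]:
    """Block lengths T_i - T_{i-1} for i = 0, ..., L_T, with T_i = T_0**(b**i)."""
    lengths, previous, i = [], 0, 0
    while True:
        current = t0 ** (b**i)
        lengths.append(current - previous)
        if current > horizon:  # i = L_T: the last block is counted in full
            return lengths
        previous = current
        i += 1


def upper_bound_sum(horizon: int) -> float:
    """Right-hand side of the upper bound of Lemma 1 when R_T(A_T) = log(T)."""
    return sum(math.log(n) for n in exponential_blocks(horizon))


b = 2
loss = b**2 / (b - 1)  # constant loss of Corollary 2, equal to 4 for b = 2
horizons = [2, 100, 45678, 65536, 10**9]
within_bound = [upper_bound_sum(T) <= loss * math.log(T) for T in horizons]
```

The check includes $T = 65536 = T_{4}$, a horizon that falls exactly on a term of the sequence and therefore triggers one more (very long) block.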

We observe that the constant loss from Theorem 3 blows up for $δ = 0$ (as $b^{δ} - 1 \to 0$ ), and we give below in Theorem 4 another result toward a more precise lower bound.

4.2 Minimax Regret Lower Bound with Exponential Horizons

The lower bound from Theorem 4 below is a second step towards proving or disproving the converse of the lower bound of Theorem 2, for bounds of the form $R_{T} = O (T^{γ} (\log T)^{δ})$ (with or without $δ = 0$) and an exponential sequence. This result, along with the previous Theorem 3, starts to answer the question of whether an exponential sequence is sufficient to conserve both minimax and problem-dependent guarantees with Doubling Tricks. Theorem 4 is proved in Appendix A.3.

Theorem 4. For stochastic models, if $\mathcal{A}$ satisfies $R_{T} (\mathcal{A}_{T}) \geq c T^{γ}$, for $c > 0$ and $0 < γ \leq 1$, then the anytime version $\mathcal{A}' := \mathcal{DT} (\mathcal{A}, (T_{i})_{i \in \mathbb{N}})$ with the exponential sequence $(T_{i})_{i \in \mathbb{N}}$ of parameters $T_{0} \in \mathbb{N}^{*}$, $a > 1$, $b > 1$ (i.e., $T_{i} = ⌊ \frac{T_{0}}{a} a^{b^{i}} ⌋$), satisfies this lower bound for a certain constant $c' > 0$,

$$ \forall T \geq 1, \quad R_{T} (\mathcal{A}') \geq c' (T^{\frac{1}{b}})^{γ} . $$

If we could just take $b \to 1$ in the two previous Theorems 3 and 4, both results would match and prove that the exponential doubling trick can indeed be used to conserve minimax regret bounds. But $b$ obviously cannot depend on $T$, and even if it can be taken as close to $1$ as we want, it cannot be taken equal to $1$ (it has to be constant when $T \to \infty$).

4.3 Proof of Theorem 3

Let $\mathcal{A}' := \mathcal{DT} (\mathcal{A}, (T_{i})_{i \in \mathbb{N}})$, and consider a fixed bandit problem. We first consider the harder case of $δ > 0$, see below in $(♠)$ for the other case. The upper bound from Lemma 1 gives

$$ R_{T} (\mathcal{A}') \leq \sum_{i = 0}^{L_{T}} R_{T_{i} - T_{i - 1}} (\mathcal{A}_{T_{i} - T_{i - 1}}) . $$

We bound naively $T_{i} - T_{i - 1} \leq T_{i} \leq \frac{T_{0}}{a} a^{b^{i}}$, and we can use the hypothesis on $\mathcal{A}$ for each regret term, as $f$ and $t \mapsto c t^{γ} (\log t)^{δ}$ are non-decreasing for $t \geq 1$ (by hypothesis for $f$ and by Lemma 6 for the other function):

$$ \begin{aligned} R_{T} (\mathcal{A}') & \leq \sum_{i = 0}^{L_{T}} f (T_{i}) + c \sum_{i = 0}^{L_{T}} T_{i}^{γ} (\log (T_{i}))^{δ} \\ & \leq g_{1} (T) + c \sum_{i = 0}^{L_{T}} \left( \frac{T_{0}}{a} a^{b^{i}} \right)^{γ} \left( \log \left( \frac{T_{0}}{a} a^{b^{i}} \right) \right)^{δ} . \end{aligned} $$

The first part is denoted $g_{1} (T) := \sum_{i = 0}^{L_{T}} f (T_{i})$ and is handled with Lemma 7: the sum of $f (T_{i})$ is a $o (\sum_{i = 0}^{L_{T}} T_{i}^{γ} (\log (T_{i}))^{δ})$, as $f (t) = o (t^{γ} (\log t)^{δ})$ by hypothesis, and this sum of $T_{i}^{γ} (\log (T_{i}))^{δ}$ is proved below to be bounded by $T^{b γ} (\log (T))^{δ}$. So $g_{1} (T) = o (T^{b γ} (\log T)^{δ})$. The second part is $c (\frac{T_{0}}{a})^{γ} \sum_{i = 0}^{L_{T}} (a^{b^{i}})^{γ} (\log (\frac{T_{0}}{a} a^{b^{i}}))^{δ}$. Define $\log^{+} (x) := \max (\log (x), 0) \geq 0$, so whether $\frac{T_{0}}{a} \leq 1$ or $> 1$, we always have $\log (\frac{T_{0}}{a} a^{b^{i}}) \leq \log^{+} (\frac{T_{0}}{a}) + \log (a^{b^{i}})$. Then we can use Lemma 4 to distribute the power $δ$ (as it is $< 1$). So $(\log (\frac{T_{0}}{a} a^{b^{i}}))^{δ} \leq (\log^{+} (\frac{T_{0}}{a}))^{δ} + (\log (a))^{δ} (b^{i})^{δ}$ with the convention that $0^{δ} = 0$ (even if $δ = 0$), and so this gives

$$ R_{T} (\mathcal{A}') \leq g_{1} (T) + c \left( \frac{T_{0}}{a} \right)^{γ} \left[ \left( \log^{+} \left( \frac{T_{0}}{a} \right) \right)^{δ} \sum_{i = 0}^{L_{T}} (a^{b^{i}})^{γ} + (\log (a))^{δ} \sum_{i = 0}^{L_{T}} (a^{b^{i}})^{γ} (b^{i})^{δ} \right] . $$

If $γ = 0$ then the first sum is just $L_{T} + 1 = O (\log (\log (T)))$ which can be included in $g_{1} (T) = o ((\log T)^{δ})$ (still increasing), and so only the second sum has to be bounded, and a geometric sum gives $\sum_{i = 0}^{L_{T}} (b^{i})^{δ} \leq \frac{b^{δ}}{b^{δ} - 1} (b^{L_{T}})^{δ}$. But if $γ > 0$, we can naively bound the first sum by $\sum_{i = 0}^{L_{T}} (a^{b^{i}})^{γ} \leq (L_{T} + 1) (a^{b^{L_{T}}})^{γ}$. Observe that $a^{b^{L_{T}}} = (a^{b^{L_{T} - 1}})^{b} \leq (a \frac{T}{T_{0}})^{b}$. So $a^{b^{L_{T}}} = O (T^{b})$ and $L_{T} + 1 = O (\log (\log (T)))$, thus the first sum is a $O (T^{b γ} \log (\log (T))) = o (T^{b γ} (\log T)^{δ})$ (as $δ > 0$). In both cases, the first sum can be included, together with $g_{1} (T)$, in a function $g_{2} (T)$ which is still a $o (T^{b γ} (\log T)^{δ})$. Another geometric sum bounds the second sum by $\sum_{i = 0}^{L_{T}} (a^{b^{i}})^{γ} (b^{i})^{δ} \leq (a^{b^{L_{T}}})^{γ} \sum_{i = 0}^{L_{T}} (b^{i})^{δ} \leq \frac{b^{δ}}{b^{δ} - 1} (a^{b^{L_{T}}})^{γ} (b^{L_{T}})^{δ}$.

$$ R_{T} (\mathcal{A}') \leq g_{2} (T) + c_{1} (a^{b^{L_{T}}})^{γ} (b^{L_{T}})^{δ} . $$

We identify a constant multiplicative loss $c_{1} := c {(\frac{T_{0}}{a})}^{γ} \frac{b^{δ}}{b^{δ} - 1} {(log a)}^{δ} > 0$ . The only term left which depends on $L_{T}$ is $(a^{b^{L_{T}}})^{γ} (b^{L_{T}})^{δ}$ , and it can be bounded by using $b^{L_{T}} = b b^{L_{T} - 1} \leq b {log}_{a} (a \frac{T}{T_{0}}) = b + b {log}_{a}^{+} (\frac{T}{T_{0}}) \leq b + b {log}_{a} (T)$ (as $T \geq 1$ ), and again with $a^{b^{L_{T}}} \leq (a \frac{T}{T_{0}})^{b}$ . The constant part of $b^{L_{T}}$ also gives a $O (T^{b γ})$ term, that can be included in $g (T) := g_{2} (T) + (a \frac{T}{T_{0}})^{b γ}$ which is still a $o (T^{b γ} (log T)^{δ})$ , and is still increasing as sum of increasing functions. So we can focus on the last term, and we obtain

$$ \begin{aligned} R_{T} (\mathcal{A}') & \leq g (T) + c_{1} \left( \frac{b}{\log (a)} \right)^{δ} \left[ \left( \frac{a}{T_{0}} \right)^{b} T^{b} \right]^{γ} (\log T)^{δ} \\ \implies R_{T} (\mathcal{A}') & \leq g (T) + ℓ (γ, δ, T_{0}, a, b) \, c \, T^{b γ} (\log T)^{δ}, \quad \text{with an increasing } g (t) = o (t^{b γ} (\log t)^{δ}) . \end{aligned} $$

So the constant multiplicative loss $ℓ$ depends on $γ$ and $δ$ as well as on $T_{0}$ , $a$ and $b$ and is

$$ ℓ (γ, δ, T_{0}, a, b) := \left( \frac{a}{T_{0}} \right)^{(b - 1) γ} \times \frac{b^{2 δ}}{b^{δ} - 1} > 0, \quad \text{if } δ > 0 . $$

If $T_{0} = a$ , the loss $ℓ (γ, δ, T_{0}, a, b)$ is minimal at $b^{*} (δ) = 2^{1 / δ} > 1$ and for a minimal value of ${min}_{b > 1} ℓ (γ, δ, T_{0}, a, b) = 4$ (for any $δ$ and $γ$ ). Finally, the $a / T_{0}$ part tends to $0$ if $T_{0} \to \infty$ so the loss can be made as small as we want, simply by choosing a $T_{0}$ large enough (but constant w.r.t. $T$ ).
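The minimizer and the minimal value stated above follow from a one-line computation. The substitution $u := b^{δ}$ and the function $h$ are introduced here only for convenience and are not notation from the rest of the paper. With $T_{0} = a$ the loss reduces to $h (u) = u^{2} / (u - 1)$ for $u > 1$, and

$$ h' (u) = \frac{2 u (u - 1) - u^{2}}{(u - 1)^{2}} = \frac{u (u - 2)}{(u - 1)^{2}}, $$

so $h$ decreases on $(1, 2)$ and increases on $(2, \infty)$. Its minimum is therefore reached at $u = 2$, that is $b^{*} (δ) = 2^{1 / δ}$, with $h (2) = \frac{4}{2 - 1} = 4$, whatever the values of $δ > 0$ and $γ$.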

$(♠)$ Now for the other case of $δ = 0$ , we can start similarly, but instead of bounding naively $\sum_{i = 0}^{L_{T}} (a^{b^{i}})^{γ}$ by $(L_{T} + 1) (a^{b^{L_{T}}})^{γ}$ we use Lemma 3 to get a more subtle bound: $\sum_{i = 0}^{L_{T}} (a^{b^{i}})^{γ} \leq a^{γ} + (1 + \frac{1}{(log (a)) (log (b^{γ}))}) (a^{b^{L_{T}}})^{γ}$ . The constant term gets included in $g (T)$ , and for the non-constant part, $(a^{b^{L_{T}}})^{γ}$ is handled similarly. Finally we obtain the loss

$$ ℓ (γ, 0, T_{0}, a, b) := 1 + \frac{1}{(\log (a)) (\log (b^{γ}))} > 1 . \tag{12} $$

$■$

5 Numerical Experiments

We illustrate here the practical cost of using the Doubling Trick, for two interesting non-anytime algorithms that have recently been proposed in the literature: Approximated Finite-Horizon Gittins indexes, that we refer to as $\mathrm{AFHG}$, by [18] (for Gaussian bandits with known variance) and $\mathrm{kl}\text{-}\mathrm{UCB}^{++}$ by [21] (for Bernoulli bandits).

We first provide some details on these two algorithms, and then illustrate the behavior of Doubling Tricks applied to these algorithms with different doubling sequences.

5.1Two Index-Based Algorithms

We denote by $X_{k} (t) := \sum_{s < t} Y_{A (s), s} 1 (A (s) = k)$ the accumulated rewards from arm $k$ , and by $N_{k} (t) := \sum_{s < t} 1 (A (s) = k)$ the number of times arm $k$ was sampled. Both algorithms $A$ assume that the horizon $T$ is known. They compute an index $I_{k}^{A} (t) \in R$ for each arm $k \in \{1, \dots, K\}$ at each time step $t \in \{1, \dots, T\}$ , and play the arm with the highest index, i.e., $A (t) := arg {max}_{k \in \{1, \dots, K\}} I_{k}^{A} (t)$ (ties are broken uniformly at random).

The algorithm $A F H G$ can be applied for Gaussian bandits with variance $V$ ( $= 1$ for our experiments). Let $m (T, t) = T - t + 1 \geq 1$ , and let

$\begin{matrix} I_{k}^{A F H G} (t) := \frac{X_{k} (t)}{N_{k} (t)} + \sqrt{\frac{V}{N_{k} (t)} log \left( \frac{m (T, t)}{N_{k} (t) {log}^{1 / 2} (\frac{m (T, t)}{N_{k} (t)})} \right)} . \\ (13) \end{matrix}$

The algorithm $k l - {U C B}^{+ +}$ can be applied for bounded rewards in $[0, 1]$ , and in particular for Bernoulli bandits. The binary Kullback-Leibler divergence is $k l (x, y) := x log (x / y) + (1 - x) log ((1 - x) / (1 - y))$ (for $0 < x, y < 1$ ), and let ${log}^{+} (x) := max (0, log (x)) \geq 0$ . Let the function $g (n, T) := {log}^{+} (\frac{T}{K n} (1 + {({log}^{+} (\frac{T}{K n}))}^{2}))$ for $n \leq T$ , and finally let

$\begin{matrix} I_{k}^{k l - {U C B}^{+ +}} (t) := \sup_{q \in [0, 1]} \left\{ q : k l (\frac{X_{k} (t)}{N_{k} (t)}, q) \leq \frac{g (N_{k} (t), T)}{N_{k} (t)} \right\} . \\ (14) \end{matrix}$
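To make the definition of the $k l - {U C B}^{+ +}$ index concrete, here is a minimal sketch of how it could be computed. It is not the implementation used for the experiments: the bisection search, the clipping constant `eps`, the tolerance `tol` and all function names are assumptions made for this illustration. The bisection relies on the fact that $q \mapsto k l (x, q)$ is non-decreasing for $q \geq x$ .

```python
import math


def kl_bernoulli(x, y, eps=1e-15):
    """Binary Kullback-Leibler divergence kl(x, y); arguments are clipped away from 0 and 1."""
    x = min(max(x, eps), 1 - eps)
    y = min(max(y, eps), 1 - eps)
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))


def log_plus(x):
    """log^+(x) = max(0, log(x)), for x > 0."""
    return max(0.0, math.log(x))


def g(n, horizon, nb_arms):
    """Exploration function g(n, T); the number of arms K is passed explicitly here."""
    ratio = horizon / (nb_arms * n)
    return log_plus(ratio * (1 + log_plus(ratio) ** 2))


def klucbpp_index(sum_rewards, nb_pulls, horizon, nb_arms, tol=1e-9):
    """Largest q in [0, 1] such that kl(mean, q) <= g(N, T) / N, found by bisection."""
    mean = sum_rewards / nb_pulls
    threshold = g(nb_pulls, horizon, nb_arms) / nb_pulls
    low, high = mean, 1.0
    while high - low > tol:
        mid = (low + high) / 2
        if kl_bernoulli(mean, mid) <= threshold:
            low = mid   # mid is still feasible, search above it
        else:
            high = mid  # mid violates the constraint, search below it
    return low


if __name__ == "__main__":
    # Arm with 3 successes out of 10 pulls, K = 9 arms, horizon T = 45678.
    print(klucbpp_index(3.0, 10, 45678, 9))
```

When $N_{k} (t)$ is large compared to $T / K$ , the function $g$ vanishes and the index collapses to the empirical mean, which the sketch reproduces up to the tolerance.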

5.2Experiments

We present some results from numerical experiments on Bernoulli and Gaussian bandits. More results are presented in Appendix E. The results shown here are for $K = 9$ arms and horizon $T = 45678$ (chosen to ensure that no doubling sequence is lucky enough to have $T_{L_{T} - 1} = T$ or too close to it). We ran $n = 1000$ repetitions of the random experiment, either on the same “easy” bandit problem $μ$ (with evenly spaced means), or on $n$ different random instances $μ$ sampled uniformly in $[0, 1]^{K}$ , and we plot the regret averaged over the $n$ simulations. The black line without markers is the (asymptotic) lower bound in $\sum_{k \neq k^{*}} (k l (μ_{k}, μ^{*}))^{- 1} log T$ , from [17]. We consider $k l - {U C B}^{+ +}$ for Bernoulli bandits (Figures 2 and 3) or $A F H G$ for Gaussian bandits (Figures 4 and 6).

Each doubling trick algorithm uses the same $T_{0} = 200$ as a first guess for the horizon. We include both the non-anytime version that knows the horizon $T$ , and different anytime versions to compare the choice of doubling trick. To compare against an algorithm that does not need the horizon, we also include $k l - U C B$ by [8] as a baseline for Bernoulli problems, and $U C B$ by [5] (knowing the variance $V = 1$ ) for Gaussian problems. The doubling sequences we consider are a geometric one with $b = 2$ , and different exponential sequences: the “classical” one with $b = 2$ and the “slow” one with $b = 1.1$ , both with $a = T_{0} = 200$ , and a last one using $a = 2$ , $b = 2$ . Despite what was proved theoretically in Theorem 3, using $a = T_{0}$ and a large enough $T_{0}$ improves performance compared to using $a = 2$ and a leading $(T_{0} / a)$ factor.
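As an illustration, the doubling sequences used here can be listed explicitly. The sketch below assumes the definitions $T_{i} = ⌊ T_{0} b^{i} ⌋$ (geometric) and $T_{i} = ⌊ \frac{T_{0}}{a} a^{b^{i}} ⌋$ (exponential); the helper names are hypothetical, and floating-point arithmetic is used, which is exact for the integer parameters but only approximate in general.

```python
import math


def geometric_horizon(i, T0=200, b=2):
    """Geometric sequence T_i = floor(T0 * b^i)."""
    return math.floor(T0 * b ** i)


def exponential_horizon(i, T0=200, a=200, b=2):
    """Exponential sequence T_i = floor((T0 / a) * a^(b^i))."""
    return math.floor((T0 / a) * a ** (b ** i))


def horizons_up_to(T, horizon_of):
    """List T_0, T_1, ... up to and including the first T_i >= T."""
    horizons, i = [], 0
    while not horizons or horizons[-1] < T:
        horizons.append(horizon_of(i))
        i += 1
    return horizons


if __name__ == "__main__":
    T = 45678
    print("geometric, b = 2:", horizons_up_to(T, geometric_horizon))
    print("exponential, a = 200, b = 2:", horizons_up_to(T, exponential_horizon))
    print("exponential, a = 200, b = 1.1:",
          horizons_up_to(T, lambda i: exponential_horizon(i, b=1.1)))
    print("exponential, a = 2, b = 2:",
          horizons_up_to(T, lambda i: exponential_horizon(i, a=2)))
```

With $a = T_{0} = 200$ and $b = 2$ , the second term of the exponential sequence is $200^{2} = 40000$ , so this doubling trick restarts only once before $T = 45678$ , whereas the geometric sequence restarts eight times.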

Another version of the Doubling Trick with “no restart”, denoted ${D T}_{no-restart}$ , is presented in Appendix C, but it is only a heuristic and cannot be applied to every algorithm $A$ . The algorithm of Figure 5 can be applied to $k l - {U C B}^{+ +}$ or $A F H G$ for instance, as they use $T$ just as a numerical parameter (see Eqs. 13 and 14), but its first limitation is that it cannot be applied to $D M E D^{+}$ [13] or $E X P 3^{+ +}$ [24], or to any algorithm based on arm eliminations, for example. A second limitation is the difficulty of analyzing this “no restart” variant, due to the unpredictable effect on regret of giving non-uniform prior information to the underlying algorithm $A$ on each successive sequence. An interesting direction for future work would be to analyze it, either in general or for a specific algorithm like $k l - {U C B}^{+ +}$ . Despite its limitations, this heuristic exhibits, as expected, better empirical performance than $D T$ , as can be observed in Appendix E.

6Conclusion

We formalized and studied the well-known “Doubling Trick” for generic multi-armed bandit problems, which is used to automatically obtain an anytime algorithm from any non-anytime algorithm. Our results are summarized in the table below. We show that a geometric doubling can be used to conserve minimax regret bounds (in $\sqrt{T}$ ), with a constant loss (typically $\geq 3.33$ ), but cannot be used to conserve problem-dependent bounds (in $log T$ ), for which a faster doubling sequence is needed. An exponential doubling sequence can conserve logarithmic regret bounds, also with a constant loss, but it is still an open question whether minimax bounds can be obtained for this faster growing sequence. Partial results of both a lower and an upper bound, for bounds of the generic form $T^{γ} (log T)^{δ}$ , lead us to believe in a positive answer.

It is still an open problem whether an anytime algorithm can be both asymptotically optimal for the problem-dependent regret (i.e., with the exact constant) and optimal for the minimax regret (i.e., have a $\sqrt{K T}$ regret), but we believe that using a doubling trick on non-anytime algorithms like $k l - {U C B}^{+ +}$ cannot be the solution. We showed that it cannot work with a geometric doubling sequence, and we conjecture that an exponential doubling trick would never bring the right constant either.

Bound $∖$ Doubling	Geometric, $T_{i} = ⌊ T_{0} b^{i} ⌋$	Exponential, $T_{i} = ⌊ \frac{T_{0}}{a} a^{b^{i}} ⌋$
$(log T)^{δ}$	$\times$ Known to fail $R_{T} (D T) \geq c^{'} (log T)^{1 + δ}$ if $R_{T} (A_{T}) \geq c (log T)^{δ}$ . (Theorem 2)	$✓$ Known to work, with loss $ℓ (δ, b) = \frac{b^{2 δ}}{b^{δ} - 1} > 1$ . (Theorem 3)
$T^{γ}$	$✓$ Known to work, with loss $ℓ (γ, b) = \frac{b^{γ} (b - 1)^{γ}}{b^{γ} - 1} > 1$ . (Theorem 1)	? Partial, best known bound is $c_{0}^{'} (T^{\frac{1}{b}})^{γ} \leq R_{T} (D T) \leq ℓ c (T^{b})^{γ}$ with a loss $ℓ > 1$ , if $c_{0} T^{γ} \leq R_{T} (A_{T}) \leq c T^{γ}$ . (Theorems 3, 4)
$T^{γ} (log T)^{δ}$ for both $γ > 0$ , $δ > 0$	$✓$ Known to work, with loss $ℓ (γ, δ, T_{0}, b) = {(\frac{log (T_{0} (b - 1) + 1)}{log (T_{0} (b - 1))})}^{δ} \frac{b^{γ} (b - 1)^{γ}}{b^{γ} - 1} > 1$ . (Theorem 1)	? Partial, best known bound is $R_{T} (D T) \leq ℓ c (T^{b})^{γ}$ if $R_{T} (A_{T}) \leq c T^{γ}$ , with a loss $ℓ \to 0$ for $T_{0} \to \infty$ . (Theorem 3)

Summary of known positive ( $✓$ ), negative ( $\times$ ) and partial (?) results.

Figure 2: Regret for $D T$ , for $K = 9$ Bernoulli arms, horizon $T = 45678$ , $n = 1000$ repetitions and $μ$ taken uniformly in $[0, 1]^{K}$ . Geometric doubling ( $b = 2$ ) and slow exponential doubling ( $b = 1.1$ ) are too slow, and short first sequences make the regret blow up in the beginning of the experiment. At $t = 40000$ we see clearly the effect of a new sequence for the best doubling trick ( $T_{i} = 200^{2^{i}}$ ). As expected, $k l - {U C B}^{+ +}$ outperforms $k l - U C B$ , and if the doubling sequence is growing fast enough then $D T (k l - {U C B}^{+ +})$ can perform as well as $k l - {U C B}^{+ +}$ (see for $t < 40000$ ).

Figure 3: Similarly but for $μ$ evenly spaced in $[0, 1]^{K}$ ( $\{0.1, \dots, 0.9\}$ ). Both $k l - U C B$ and $k l - {U C B}^{+ +}$ are very efficient on “easy” problems like this one, and we can check visually that they match the lower bound from [17]. As before we check that slow doubling sequences are too slow to give reasonable performance.

Figure 4: Regret for $K = 9$ Gaussian arms $N (μ, 1)$ , horizon $T = 45678$ , $n = 1000$ repetitions and $μ$ taken uniformly in $[- 5, 5]^{K}$ and variance $V = 1$ . On “hard” problems like this one, both $U C B$ and $A F H G$ perform similarly and poorly w.r.t. the lower bound from [17]. As before we check that geometric doubling ( $b = 2$ ) and slow exponential doubling ( $b = 1.1$ ) are too slow, but a fast enough doubling sequence does give reasonable performance for the anytime $A F H G$ obtained by Doubling Trick.

Note The (Python 3) simulation code used for the experiments is open-sourced at https://GitHub.com/SMPyBandits/SMPyBandits, and documented at https://smpybandits.github.io/.

AOmitted Proofs

We include here the proofs omitted in the main document.

a.1Proof of Lemma 1, “Regret Lower and Upper Bounds for $D T$ ”

Let $A^{'}$ denote $D T (A, (T_{i})_{i \in N})$ . For every $k \in {1, \dots, K}$ ,

\begin{matrix} E \left[ \sum_{t = 1}^{T} (X_{k, t} - X_{A (t), t}) \right] & = \sum_{i = 0}^{L_{T} - 1} E \left[ \sum_{t = T_{i - 1}}^{T_{i}} (X_{k, t} - X_{A (t), t}) \right] + E \left[ \sum_{t = T_{L_{T} - 1}}^{T} (X_{k, t} - X_{A (t), t}) \right] \\ & \leq \sum_{i = 0}^{L_{T} - 1} \max_{k \in \{1, \dots, K\}} E \left[ \sum_{t = T_{i - 1}}^{T_{i}} (X_{k, t} - X_{A (t), t}) \right] + \max_{k \in \{1, \dots, K\}} E \left[ \sum_{t = T_{L_{T} - 1}}^{T} (X_{k, t} - X_{A (t), t}) \right] \\ & \leq \sum_{i = 0}^{L_{T} - 1} R_{T_{i} - T_{i - 1}} (A^{'}) + R_{T - T_{L_{T} - 1}} (A^{'}) . \end{matrix}

Thus, by definition of the regret

\begin{matrix} R_{T} (A^{'}) & \leq \sum_{i = 0}^{L_{T} - 1} R_{T_{i} - T_{i - 1}} (A^{'}) + R_{T - T_{L_{T} - 1}} (A^{'}) \\ & = \sum_{i = 0}^{L_{T} - 1} R_{T_{i} - T_{i - 1}} (A_{T_{i} - T_{i - 1}}) + R_{\underbrace{T - T_{L_{T} - 1}}_{\leq T_{L_{T}} - T_{L_{T} - 1}}} (A_{T_{L_{T}} - T_{L_{T} - 1}}) \\ & \leq \sum_{i = 0}^{L_{T}} R_{T_{i} - T_{i - 1}} (A_{T_{i} - T_{i - 1}}) . \end{matrix}

In the stochastic case, it is well known that the regret can be rewritten in the following way, introducing $μ_{k}$ the mean of arm $k$ and $μ^{*}$ the mean of the best arm:

\begin{matrix} R_{T} (A^{'}) & = E \left[ \sum_{t = 1}^{T} (μ^{*} - μ_{A (t)}) \right] = \sum_{i = 0}^{L_{T} - 1} E \left[ \sum_{t = T_{i - 1}}^{T_{i}} (μ^{*} - μ_{A (t)}) \right] + E \left[ \sum_{t = T_{L_{T} - 1}}^{T} (μ^{*} - μ_{A (t)}) \right] \\ & = \sum_{i = 0}^{L_{T} - 1} R_{T_{i} - T_{i - 1}} (A_{T_{i} - T_{i - 1}}) + \underbrace{R_{T - T_{L_{T} - 1}} (A^{'})}_{\geq 0} . \end{matrix}

and the lower bound follows. $■$

a.2Proof of Theorem 1, “Conserving Regret Upper Bound with Geometric Horizons”

It is interesting to note that the proof is valid for both the easiest case when $δ = 0$ (as it was known in [9] for $γ = 1 / 2$ ) and the generic case when $δ \geq 0$ , with no distinction.

As far as the authors know, this result in its generality with $δ \geq 0$ is new.

Proof Let $A^{'} := D T (A, (T_{i})_{i \in N})$ , and consider a fixed bandit problem. The upper bound from Lemma 1 gives

\begin{matrix} R_{T} (A^{'}) & \leq \sum_{i = 0}^{L_{T}} R_{T_{i} - T_{i - 1}} (A_{T = T_{i} - T_{i - 1}}) \end{matrix}

We can use the hypothesis on $A$ for each regret term, as $f$ and $t \mapsto c t^{γ} (log t)^{δ}$ are non-decreasing for $t \geq 1$ (by hypothesis for $f$ and by Lemma 6 for the second function).

\begin{matrix} \leq \sum_{i = 0}^{L_{T}} f (T_{i} - T_{i - 1}) + c T_{0}^{γ} (log T_{0})^{δ} + c \sum_{i = 1}^{L_{T}} {(T_{i} - T_{i - 1})}^{γ} {(log (T_{i} - T_{i - 1}))}^{δ} \end{matrix}

The first part is denoted $g_{1} (T) := \sum_{i = 0}^{L_{T}} f (T_{i} - T_{i - 1}) + c T_{0}^{γ} (log T_{0})^{δ}$ . It is an increasing function, as a sum of increasing functions, and it is dealt with by using Lemma 7: the sum of $f (T_{i} - T_{i - 1})$ is a $o (\sum_{i = 0}^{L_{T}} (T_{i} - T_{i - 1})^{γ} (log (T_{i} - T_{i - 1}))^{δ})$ , as $f (t) = o (t^{γ} (log t)^{δ})$ by hypothesis, and this sum of $(T_{i} - T_{i - 1})^{γ} (log (T_{i} - T_{i - 1}))^{δ}$ is proved below to be bounded by $c^{'} T^{γ} (log (T))^{δ}$ for a certain constant $c^{'} > 0$ , which gives $g_{1} (T) = o (T^{γ} (log T)^{δ})$ . For the second part, we bound $T_{i} - T_{i - 1} \leq T_{0} (b - 1) b^{i - 1} + 1$ thanks to Definition 2. Moreover, as $γ < 1$ we can use Lemma 4 (Eq. 16) to distribute the power $γ$ , so $(T_{0} (b - 1) b^{i - 1} + 1)^{γ} \leq (T_{0} (b - 1) b^{i - 1})^{γ} + 1 (γ \neq 0)$ (indeed if $γ = 0$ both sides are equal to $1$ ). This gives

\begin{matrix} \leq g_{1} (T) + c (T_{0} (b - 1))^{γ} \sum_{i = 1}^{L_{T}} (b^{i - 1})^{γ} {(log (T_{i} - T_{i - 1}))}^{δ} + c 1 (γ \neq 0) \sum_{i = 1}^{L_{T}} {(log (T_{i} - T_{i - 1}))}^{δ} \end{matrix}

If $γ \neq 0$ , the last sum is bounded by $\sum_{i}^{L_{T}} (log T_{i})^{δ} \leq (log T_{0})^{δ} (L_{T} + 1) + (log b)^{δ} \sum_{i}^{L_{T}} i^{δ}$ which is a $O (L_{T}^{δ + 1}) = O ((log T)^{δ + 1}) = o (T^{γ} (log T)^{δ})$ (as $γ > 0$ , thanks to a geometric sum), and so it can be included in $g_{2} (T) = o (T^{γ} (log T)^{δ})$ . If $γ = 0$ , there is only the first sum. We bound again $T_{i} - T_{i - 1} \leq T_{0} (b - 1) b^{i - 1} + 1$ and use Lemma 5 to bound $log (T_{0} (b - 1) b^{i - 1} + 1)$ by $\frac{log (T_{0} (b - 1) + 1)}{log (T_{0} (b - 1))} log (T_{0} (b - 1) b^{i - 1})$ term (as $T_{0} (b - 1) > 1$ by hypothesis).

\begin{matrix} \leq g_{2} (T) + c (T_{0} (b - 1))^{γ} \sum_{i = 1}^{L_{T}} (b^{i - 1})^{γ} {(\frac{log (T_{0} (b - 1) + 1)}{log (T_{0} (b - 1))} log (T_{0} (b - 1) b^{i - 1}))}^{δ} \end{matrix}

We split the $log (T_{0} (b - 1) b^{i - 1})$ term in two, and once again, the term with $log (T_{0} (b - 1))$ gives a $O (b^{L_{T} - 1})$ (by a geometric sum), which gets included in $g_{3} (T) = o (T^{γ} (log T)^{δ})$ . We focus on the fastest term, and we can now rewrite the sum from $i = 0$ to $L_{T} - 1$ ,

\begin{matrix} \leq g_{2} (T) + c (T_{0} (b - 1))^{γ} {(log (b) \frac{log (T_{0} (b - 1) + 1)}{log (T_{0} (b - 1))})}^{δ} \sum_{i = 0}^{L_{T} - 1} (b^{i})^{γ} i^{δ} \end{matrix}

We naively bound $i^{δ}$ by $(L_{T} - 1)^{δ}$ , and use a geometric sum to have

\begin{matrix} \leq g_{2} (T) + c (T_{0} (b - 1))^{γ} {(log (b) \frac{log (T_{0} (b - 1) + 1)}{log (T_{0} (b - 1))})}^{δ} (L_{T} - 1)^{δ} \frac{b^{γ}}{b^{γ} - 1} (b^{L_{T} - 1})^{γ} \end{matrix}

Finally, observe that $L_{T} - 1 \leq {log}_{b} (\frac{T}{T_{0}}) \leq {log}_{b} (T)$ , so the $(log b)^{δ}$ term simplifies, and observe that $b^{L_{T} - 1} \leq \frac{T}{T_{0}}$ so the $T_{0}^{γ}$ term also simplifies. Thus we get

\begin{matrix} \leq g_{2} (T) + c {(\frac{log (T_{0} (b - 1) + 1)}{log (T_{0} (b - 1))})}^{δ} \frac{b^{γ} (b - 1)^{γ}}{b^{γ} - 1} T^{γ} (log T)^{δ} . \end{matrix}

The constant multiplicative loss $ℓ$ depends on $γ$ and $δ$ as well as on $T_{0}$ and $b$ , and is $ℓ (γ, δ, T_{0}, b) := {(\frac{log (T_{0} (b - 1) + 1)}{log (T_{0} (b - 1))})}^{δ} \frac{b^{γ} (b - 1)^{γ}}{b^{γ} - 1} > 1$ . $■$

Minimizing the constant loss? This constant loss has two distinct parts, $ℓ (γ, δ, T_{0}, b) = ℓ_{1} (δ, T_{0}, b) \times ℓ_{2} (γ, b)$ , with $ℓ_{1}$ depending on $δ$ , $T_{0}$ and $b$ (equal to $1$ if $δ = 0$ ), and $ℓ_{2}$ depending on $γ$ and $b$ .

Minimizing the first factor $ℓ_{1} (δ, T_{0}, b) := {(\frac{log (T_{0} (b - 1) + 1)}{log (T_{0} (b - 1))})}^{δ} \geq 1$ does not depend on $δ$ (even if it is $0$ ). If we assume $b$ to be fixed, $ℓ_{1} (δ, T_{0}, b) \to 1$ when $T_{0} \to \infty$ . Moreover, for any $δ$ and $b > 1$ , $ℓ_{1} (δ, T_{0}, b)$ goes to $1$ very quickly when $T_{0}$ grows. For instance, for $γ = δ = \frac{1}{2}$ and $b = \frac{3 + \sqrt{5}}{2}$ (see Corollary 1), then $ℓ_{1} (δ, T_{0}, b) ≃ 1.109$ for $T_{0} = 2$ , $≃ 1.01$ for $T_{0} = 10$ and $≃ 1.0006$ for $T_{0} = 100$ .

To minimize the second factor $ℓ_{2} (γ, b) := \frac{b^{γ} (b - 1)^{γ}}{b^{γ} - 1} > 1$ , we fix $γ$ and study $h : b \mapsto ℓ_{2} (γ, b)$ . The function $h$ is of class $C^{1}$ on $(1, \infty)$ and $h (b) \to + \infty$ for $b \to 1^{+}$ and $b \to \infty$ , so $h$ has a (possibly non-unique) global minimum and attains it. Moreover $h^{'} (b) = \frac{γ b^{γ - 1} (b - 1)^{γ - 1}}{(b^{γ} - 1)^{2}} (b^{γ + 1} - 2 b + 1)$ has the sign of $b^{γ + 1} - 2 b + 1$ , which does not have a constant sign and does not have explicit root(s) for a generic $γ$ . However, it is easy to minimize $ℓ_{2} (γ, b)$ over $b$ numerically when $γ$ is known and fixed (with, e.g., Newton’s method).

The result from Theorem 1 of course implies the result from [9], in the special case of $δ = 0$ and $γ = \frac{1}{2}$ (for minimax bounds), as stated numerically in the following Corollary 1.

Corollary 1 If $γ = \frac{1}{2}$ and $δ = 0$ , the multiplicative loss $ℓ (\frac{1}{2}, 0, T_{0}, b)$ does not depend on $T_{0}$ . It is then minimal for $b^{*} (\frac{1}{2}) = \frac{3 + \sqrt{5}}{2} ≃ 2.62$ and its minimum is $\sqrt{\frac{11 + 5 \sqrt{5}}{2}} ≃ 3.33$ . Usually $b = 2$ is used, which gives a loss of $\frac{\sqrt{2}}{\sqrt{2} - 1} ≃ 3.41$ , close to the optimal value.
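The numerical minimization of $ℓ_{2} (γ, b)$ mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: it uses a plain bisection on the sign of $b^{γ + 1} - 2 b + 1$ instead of Newton's method, it assumes $0 < γ < 1$ , and the function names are hypothetical.

```python
import math


def loss_l2(gamma, b):
    """Geometric loss l_2(gamma, b) = b^gamma (b - 1)^gamma / (b^gamma - 1), for b > 1."""
    return (b * (b - 1)) ** gamma / (b ** gamma - 1)


def derivative_sign(gamma, b):
    """h'(b) has the sign of b^(gamma + 1) - 2 b + 1."""
    return b ** (gamma + 1) - 2 * b + 1


def best_b(gamma, iterations=200):
    """Root of derivative_sign on (1, infinity), by bisection (assumes 0 < gamma < 1)."""
    low, high = 1.0 + 1e-6, 2.0
    while derivative_sign(gamma, high) <= 0:
        high *= 2  # enlarge the bracket until the derivative becomes positive
    for _ in range(iterations):
        mid = (low + high) / 2
        if derivative_sign(gamma, mid) < 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2


if __name__ == "__main__":
    gamma = 0.5
    b_star = best_b(gamma)
    print(b_star, loss_l2(gamma, b_star), loss_l2(gamma, 2.0))
```

For $γ = \frac{1}{2}$ , the sign function factorizes in $x = \sqrt{b}$ as $(x - 1) (x^{2} - x - 1)$ , which is how the closed form $b^{*} (\frac{1}{2}) = \frac{3 + \sqrt{5}}{2}$ of Corollary 1 arises.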

In particular, order-optimal and optimal algorithms for the minimax bound have $γ = \frac{1}{2}$ and $f (t) = 0$ , for which Theorem 1 gives a simpler bound

\begin{matrix} R_{T} (A_{T}) \leq c \sqrt{T} ⟹ R_{T} (D T (A, (T_{0} 2^{i})_{i \in N})) \leq \frac{\sqrt{2}}{\sqrt{2} - 1} c \sqrt{T} . \\ (15) \end{matrix}

a.3Proof of Theorem 4, “Minimax Regret Lower Bound with Exponential Horizons”

Proof Let $A^{'} := D T (A, (T_{i})_{i \in N})$ , and consider a fixed stochastic bandit problem. The lower bound from Lemma 1 gives

\begin{matrix} R_{T} (A^{'}) & \geq \sum_{i = 0}^{L_{T} - 1} R_{T_{i} - T_{i - 1}} (A_{T = T_{i} - T_{i - 1}}) \end{matrix}

We can use the hypothesis on $A$ for each regret term, and as $0 < γ \leq 1$ , we can use Lemma 4 (Eq. 17) to distribute the power $γ$ to ease the proof and obtain

\begin{matrix} & \geq c T_{0}^{γ} + c \sum_{i = 1}^{L_{T} - 1} (T_{i} - T_{i - 1})^{γ} \\ & \geq c T_{0}^{γ} + c \sum_{i = 1}^{L_{T} - 1} (T_{i}^{γ} - T_{i - 1}^{γ}) (it is a telescopic sum and simplifies) \\ & \geq c T_{L_{T} - 1}^{γ} \end{matrix}

Observe that $T_{L_{T} - 1} \geq (T_{L_{T}})^{\frac{1}{b}}$ by definition of the exponential sequence (Definition 3), and $T_{L_{T}} \geq T$ (Definition 1). For the $log (T_{L_{T} - 1})$ term, we simply have $log (T_{L_{T} - 1}) \geq \frac{1}{b} log (T)$ so if $c^{'} = c / b$ , then we obtain what we want

\begin{matrix} R_{T} (A^{'}) & \geq c^{'} T^{\frac{γ}{b}} . \end{matrix}

This lower bound goes from $R_{T} (A_{T}) = Ω (T^{γ})$ to $R_{T} (D T (A)) = Ω (T^{\frac{γ}{b}})$ , and it looks very similar to the upper bound from Theorem 3 where $R_{T} (D T (A)) = O (T^{b γ})$ was obtained from $R_{T} (A_{T}) = O (T^{γ})$ . $■$

Remark It does seem sub-optimal to lower bound $T_{L_{T} - 1}$ like this ( $T_{L_{T} - 1} \geq (T_{L_{T}})^{\frac{1}{b}}$ ), but we remind that $T$ can be located anywhere in the discrete interval ${T_{L_{T} - 1}, \dots, T_{L_{T}} - 1}$ , so in the worst case when $T$ is very close to $T_{L_{T}}$ (and for large enough $T$ ), we indeed have $T_{L_{T} - 1}^{b} \sim T_{L_{T}}$ and $T_{L_{T}} \sim T$ , so with this approach, the lower bound $T_{L_{T} - 1} \geq T^{\frac{1}{b}}$ cannot be improved.

BMinimax Regret Lower Bound with Geometric Horizons

We include here a last result that partly answers Theorem 1. It is more subtle than the lower bound in Theorem 2 but still provides an interesting insight: if $b$ is not chosen carefully (i.e., if $ℓ_{0} (γ, b) > 1$ ), then the anytime version of $A_{T}$ using a geometric Doubling Trick suffers a non-improvable constant multiplicative loss compared to $A_{T}$ .

Theorem 5 For stochastic models, if $A$ satisfies $R_{T} (A_{T}) \geq c T^{γ}$ , for $0 < γ < 1$ and $c > 0$ , then the anytime version $A^{'} := D T (A, (T_{i})_{i \in N})$ with the geometric sequence $(T_{i})_{i \in N}$ of parameters $T_{0} \in N^{*}, b > 1$ (i.e., $T_{i} = ⌊ T_{0} b^{i} ⌋$ ) satisfies

L_{T} \geq 2 ⟹ R_{T} (A^{'}) \geq ℓ_{0} (γ, b) c T^{γ} + g_{0} (T) .

with $g_{0} (t) = O (log t) = o (t^{γ})$ , and a constant loss $ℓ_{0} (γ, b)$ depending only on $γ$ and $b$ ,

\begin{matrix} ℓ_{0} (γ, b) & = \frac{(b - 1)^{γ}}{b^{γ} (b^{γ} - 1)} > 0. \end{matrix}

$ℓ_{0} (γ, b)$ is always $> 0$ and tends to $0$ for $b \to \infty$ , and some choice of $b$ gives $ℓ_{0} (γ, b) > 1$ .

Proof Let $A^{'} := D T (A, (T_{i})_{i \in N})$ , and consider a fixed stochastic bandit problem. Assume $L_{T} \geq 2$ . The lower bound from Lemma 1 gives

\begin{matrix} R_{T} (A^{'}) & \geq \sum_{i = 0}^{L_{T} - 1} R_{T_{i} - T_{i - 1}} (A_{T = T_{i} - T_{i - 1}}) \end{matrix}

We bound $T_{i} - T_{i - 1} \geq T_{0} (b - 1) b^{i - 1} - 1$ for $i > 0$ , thanks to Definition 2, and we can use the hypothesis on $A$ for each regret term. Additionally, we have $(T_{i} - T_{i - 1})^{γ} \geq (T_{0} (b - 1) b^{i - 1} - 1)^{γ} \geq (T_{0} (b - 1))^{γ} (b^{i - 1})^{γ} - 1$ by Lemma 4 (Eq. 17, as $b > 1$ and $0 < γ < 1$ ), thus

\begin{matrix} & \geq \sum_{i = 0}^{L_{T} - 1} c (T_{i} - T_{i - 1})^{γ} \\ & \geq c T_{0}^{γ} + c T_{0}^{γ} (b - 1)^{γ} \sum_{i = 1}^{L_{T} - 1} (b^{i - 1})^{γ} - c \sum_{i = 1}^{L_{T} - 1} 1 \\ & \geq c T_{0}^{γ} + c T_{0}^{γ} (b - 1)^{γ} \sum_{i = 0}^{L_{T} - 2} (b^{i})^{γ} - c (L_{T} - 1) \end{matrix}

We have $\sum_{i = 0}^{L_{T} - 2} (b^{i})^{γ} = \frac{(b^{L_{T} - 1})^{γ} - 1}{b^{γ} - 1}$ thanks to a geometric sum (with $γ > 0$ ) and thus

\begin{matrix} \geq c T_{0}^{γ} + c T_{0}^{γ} (b - 1)^{γ} \frac{(b^{L_{T} - 1})^{γ} - 1}{b^{γ} - 1} + c (1 - L_{T}) \end{matrix}

Thanks to Definition 2, $b^{L_{T} - 1}$ satisfies $b^{L_{T} - 1} \geq \frac{1}{b} \frac{T}{T_{0}}$ . Let $g_{0} (T) := c T_{0}^{γ} (\frac{(b - 1)^{γ}}{b^{γ} - 1} - 1) + c (L_{T} - 1) = O (1) + O ({log}_{b} (\frac{T}{T_{0}})) = O (log T) = o (T^{γ})$ and $g_{0} (T) > 0$ , then we have

\begin{matrix} \geq c \frac{(b - 1)^{γ}}{b^{γ} (b^{γ} - 1)} T^{γ} - [c T_{0}^{γ} (\frac{(b - 1)^{γ}}{b^{γ} - 1} - 1) + c (L_{T} - 1)] . \end{matrix}

We obtain as announced, $R_{T} (A^{'}) \geq ℓ_{0} (b) c T^{γ} + g_{0} (T)$ with $g_{0} (T) = O (log T) = o (T^{γ})$ . $■$

Maximizing the constant loss? To maximize $ℓ_{0} (γ, b) := \frac{(b - 1)^{γ}}{b^{γ} (b^{γ} - 1)} > 0$ , we fix $γ$ and study the function $h : b \mapsto ℓ_{0} (γ, b)$ . The function $h$ is of class $C^{1}$ on $(1, \infty)$ and $h (b) \to + \infty$ for $b \to 1^{+}$ and $h (b) \to 0$ for $b \to \infty$ . Moreover $h^{'} (b) = - γ \frac{(b - 1)^{γ - 1}}{b^{γ + 1} {(b^{γ} - 1)}^{2}} (- 2 b^{γ} + b^{γ + 1} + 1)$ has the same sign as $f (b) := - (- 2 b^{γ} + b^{γ + 1} + 1)$ . The function $f$ is of class $C^{1}$ , with $f (1) = 0$ and $f^{'} (b) = - (γ + 1) (b - \frac{2 γ}{γ + 1}) b^{γ - 1}$ , and as $0 < γ < 1$ , $\frac{2 γ}{γ + 1} < 1$ so $f^{'} (b) < 0$ for all $b > 1$ . Thus $f$ is decreasing, and $\forall b > 1, f (b) < f (1) = 0$ . So $h^{'}$ is negative, which allows us to conclude that $h$ is decreasing, and so $b \mapsto ℓ_{0} (γ, b)$ has no global maximum at fixed $γ$ , and $ℓ_{0} \to \infty$ if $b \to 1^{+}$ .

Relationship with the upper bound. For any $b > 1$ , we compare $ℓ_{0} (γ, b)$ with $ℓ (γ, b)$ and we see that, interestingly, $ℓ_{0} (γ, b) = ℓ_{2} (γ, b) / b^{2 γ}$ , with $ℓ_{2} (γ, b)$ from Theorem 1. For the particular case of $γ = \frac{1}{2}$ , this lower bound also leads to another interesting remark: if $b$ is chosen to minimize the loss in the upper bound (Theorem 1, $b^{*} (\frac{1}{2}) = \frac{3 + \sqrt{5}}{2}$ ), then this lower bound gives $ℓ_{0} (\frac{1}{2}, b^{*} (\frac{1}{2})) = \sqrt{\frac{1 + \sqrt{5}}{2}} ≃ 1.27 > 1$ (and the usual choice $b = 2$ gives $ℓ_{0} (\frac{1}{2}, 2) = 1 + \frac{\sqrt{2}}{2} ≃ 1.71 > 1$ ), which proves that this choice of geometric doubling trick cannot be used to conserve an optimal algorithm, i.e., the constant loss cannot be made as close to $1$ as we want.
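The relationship between the two constants can be verified numerically. The sketch below simply evaluates the closed forms of $ℓ_{0} (γ, b)$ (Theorem 5) and $ℓ_{2} (γ, b)$ (Theorem 1); the function names are hypothetical and the values of $b$ are arbitrary illustrative choices.

```python
import math


def loss_upper(gamma, b):
    """l_2(gamma, b) = b^gamma (b - 1)^gamma / (b^gamma - 1), from the upper bound."""
    return (b * (b - 1)) ** gamma / (b ** gamma - 1)


def loss_lower(gamma, b):
    """l_0(gamma, b) = (b - 1)^gamma / (b^gamma (b^gamma - 1)), from the lower bound."""
    return (b - 1) ** gamma / (b ** gamma * (b ** gamma - 1))


if __name__ == "__main__":
    gamma = 0.5
    for b in (1.5, 2.0, (3 + math.sqrt(5)) / 2, 5.0, 20.0):
        # The last two columns should coincide: l_0 = l_2 / b^(2 gamma).
        print(b, loss_lower(gamma, b), loss_upper(gamma, b) / b ** (2 * gamma))
```

The printed values also illustrate that $b \mapsto ℓ_{0} (γ, b)$ is decreasing, so the lower-bound constant only exceeds $1$ for moderate values of $b$ .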

CAn Efficient Heuristic, the Doubling Trick with “No Restart”

Let $A^{'} := {D T}_{no-restart} (A, (T_{i})_{i \in N})$ denote the algorithm given in Figure 5. The only difference with $D T$ (the algorithm of Figure 1) is that the history from all the steps from $t = 1$ to $t = T_{i}$ is used to reinitialize the new algorithm $A^{(i)}$ . To be more precise, this means that a fresh algorithm $A^{(i)}$ is created, and then fed with the successive observations $(A^{'} (s), Y_{A^{'} (s), s})$ for all $1 \leq s < t$ , as if it had been playing from the beginning. Note that $A^{(i)}$ could have chosen a different path of actions, but we give it the observations from the previous plays of $A^{'}$ .
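The replay mechanism described above can be sketched in a few lines. This is only an illustration under several assumptions that are not taken from the article: the policy interface (`choice`, `update`), the toy index policy `ToyHorizonUCB`, the convention that a new sequence starts when $t > T_{i}$ , the choice of passing $T_{i}$ as the horizon parameter, and deterministic tie-breaking are all hypothetical.

```python
import math


class ToyHorizonUCB:
    """Hypothetical index policy using the horizon T only as a numerical parameter."""

    def __init__(self, nb_arms, horizon):
        self.horizon = horizon
        self.pulls = [0] * nb_arms
        self.sums = [0.0] * nb_arms

    def choice(self):
        for arm, n in enumerate(self.pulls):
            if n == 0:
                return arm  # play each arm once first
        indexes = [s / n + math.sqrt(math.log(self.horizon) / (2 * n))
                   for s, n in zip(self.sums, self.pulls)]
        return indexes.index(max(indexes))  # ties broken by lowest arm (a simplification)

    def update(self, arm, reward):
        self.pulls[arm] += 1
        self.sums[arm] += reward


def doubling_trick_no_restart(make_policy, horizon_of, draw_reward, T):
    """Play T steps; at each new sequence, create a fresh policy and replay the full history."""
    history = []
    i = 0
    policy = make_policy(horizon_of(i))
    for t in range(1, T + 1):
        if t > horizon_of(i):  # assumed convention for the start of a new sequence
            while t > horizon_of(i):
                i += 1
            policy = make_policy(horizon_of(i))
            for past_arm, past_reward in history:
                policy.update(past_arm, past_reward)  # feed past observations
        arm = policy.choice()
        reward = draw_reward(arm)
        policy.update(arm, reward)
        history.append((arm, reward))
    return history
```

Replacing the replay loop by nothing (i.e., keeping only the creation of a fresh policy) would give the restarting version $D T$ under the same assumed interface.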

This obviously cannot be applied to every kind of algorithm $A$ , and for instance any algorithm based on arm elimination (e.g., Explore-Then-Commit approaches like in [12]) will most surely fail with this approach. This second doubling trick algorithm ${D T}_{no-restart}$ can be applied in practice if $A$ is index-based and uses the horizon $T$ as a simple numerical parameter in its indexes, as is the case for instance for $A F H G$ (cf. Eq. 13) or $k l - {U C B}^{+ +}$ (cf. Eq. 14).

Figure 5: The Non-Restarting Doubling Trick Algorithm, $A^{'} = {D T}_{no-restart} (A, (T_{i})_{i \in N})$ .

However, it is much harder to state any theoretical result on this heuristic ${D T}_{no-restart}$ . We conjecture that a regret upper bound similar to the one from Lemma 1 could still be obtained, but it is still an open problem that the authors do not know how to tackle for a generic algorithm. The intuition is that starting $A^{(i)}$ with some previous observations from the (same) i.i.d. process $(Y_{k, s})_{k \in \{1, \dots, K\}, s \in N}$ can only improve the performance of $A^{(i)}$ , and lead to a smaller regret on the interval $\{T_{i}, \dots, T_{i + 1} - 1\}$ .

DBasic but Useful Results

All logarithms $log$ are taken in base $e = exp (1)$ (i.e., the natural logarithm $ln$ , but $log$ is preferred for readability). Logarithms in a base $b > 1$ are denoted ${log}_{b} (x) := \frac{log x}{log b}$ , for any $x \in R$ , $x > 0$ .

We recall that $⌊ x ⌋$ denotes the integer part of $x \in R$ , that is, for $x > 0$ , the unique integer $i$ such that $i \leq x < i + 1$ . The only properties we use are its definition and the fact that $⌊ x ⌋ \leq x$ . We also define $⌈ x ⌉ := 1 + ⌊ x ⌋$ for $x \geq 0$ , which is the unique integer $j$ such that $j - 1 \leq x < j$ .

d.1Weighted Geometric Inequality

Lemma 2: Weighted Geometric Inequality For any $n \in N^{*}$ , $b > 1$ and $δ > 0$ , and if $f$ is a function of class $C^{1}$ , non-decreasing and non-negative on $[0, \infty)$ , we have

\sum_{i = 0}^{n} f (i) (b^{i})^{δ} \leq \frac{b^{δ}}{b^{δ} - 1} f (n) (b^{n})^{δ} .

Proof By hypothesis, $f$ is non-decreasing, so $\forall i \in {0, \dots, n}, f (i) \leq f (n)$ , and so by using the sum of a geometric sequence, we have

\begin{matrix} \sum_{i = 0}^{n} f (i) (b^{i})^{δ} \leq f (n) (\sum_{i = 0}^{n} (b^{i})^{δ}) \leq f (n) \frac{1}{b^{δ} - 1} (b^{n + 1})^{δ} = \frac{b^{δ}}{b^{δ} - 1} (f (n) (b^{n})^{δ}) . \end{matrix}

Taking $f (i) = 1$ gives the geometric inequality. Note that if we let $δ \to 0$ , the left sum converges to $\sum_{i = 0}^{n} f (i)$ and the right term diverges to $+ \infty$ , making this inequality completely uninformative. $■$
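A quick numerical spot check of Lemma 2 can be written as follows. It is not a proof, only a sanity check on a few arbitrarily chosen non-decreasing, non-negative functions $f$ ; the helper name `lemma2_sides` is hypothetical.

```python
import math


def lemma2_sides(f, n, b, delta):
    """Return both sides of the weighted geometric inequality of Lemma 2."""
    lhs = sum(f(i) * (b ** i) ** delta for i in range(n + 1))
    rhs = b ** delta / (b ** delta - 1) * f(n) * (b ** n) ** delta
    return lhs, rhs


if __name__ == "__main__":
    for f in (lambda i: 1.0, math.sqrt, math.log1p):
        for (n, b, delta) in ((5, 2.0, 0.5), (20, 1.1, 1.0), (10, 3.0, 0.1)):
            print(n, b, delta, lemma2_sides(f, n, b, delta))
```

For small $δ$ , the right-hand side grows like $1 / (δ log b)$ , which illustrates why the bound becomes uninformative when $δ \to 0$ .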

d.2Another Sum Inequality

This second result is similar to the previous one but for a “doubly exponential” sequence, i.e., $a^{b^{i}}$ , as it also bounds a sum of increasing terms by a constant times its last term.

Lemma 3 For any $n \in N^{*}$ , $a > 1$ , $b > 1$ and $γ > 0$ , we have

\sum_{i = 0}^{n} (a^{b^{i}})^{γ} \leq a^{γ} + (1 + \frac{1}{(log (a)) (log (b^{γ}))}) (a^{b^{n}})^{γ} = O ((a^{b^{n}})^{γ}) .

Proof We first isolate both the first and last terms in the sum and focus on the sum from $i = 1$ up to $i = n - 1$ . As the function $t \mapsto (a^{b^{t}})^{γ}$ is increasing for $t \geq 1$ , we use a sum-integral inequality, and then the change of variable $u := γ b^{t}$ , of Jacobian $d t = \frac{1}{log b} \frac{d u}{u}$ , gives

\begin{matrix} \sum_{i = 1}^{n - 1} (a^{b^{i}})^{γ} & \leq \int_{1}^{n} a^{γ b^{t}} d t \leq \frac{1}{log (b^{γ})} \int_{γ b}^{γ b^{n}} \frac{a^{u}}{u} d u \end{matrix}

Now for $u \geq 1$ , observe that $\frac{a^{u}}{u} \leq a^{u}$ , and as $γ b > 1$ , we have

\begin{matrix} \leq \frac{1}{log (b^{γ})} \int_{γ b}^{γ b^{n}} a^{u} d u \leq \frac{1}{log (b^{γ})} \frac{1}{log (a)} a^{γ b^{n}} = \frac{1}{(log (a)) (log (b^{γ}))} {(a^{b^{n}})}^{γ} . \end{matrix}

Finally, we obtain as desired, $\sum_{i = 0}^{n} (a^{b^{i}})^{γ} \leq a^{γ} + (a^{b^{n}})^{γ} + \frac{1}{(log (a)) (log (b^{γ}))} {(a^{b^{n}})}^{γ}$ . $■$
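Similarly, the bound of Lemma 3 can be spot-checked numerically on a few parameter values. The sketch below is a sanity check only, and the helper name `lemma3_sides` and the chosen parameters are assumptions made for illustration.

```python
import math


def lemma3_sides(a, b, gamma, n):
    """Return both sides of the inequality of Lemma 3 for the sequence a^(b^i)."""
    lhs = sum((a ** (b ** i)) ** gamma for i in range(n + 1))
    constant = 1 + 1 / (math.log(a) * math.log(b ** gamma))
    rhs = a ** gamma + constant * (a ** (b ** n)) ** gamma
    return lhs, rhs


if __name__ == "__main__":
    for (a, b, gamma, n) in ((2, 2, 1.0, 3), (200, 2, 0.5, 2), (200, 1.1, 0.5, 8)):
        print(a, b, gamma, n, lemma3_sides(a, b, gamma, n))
```

In each case the sum is dominated by its last term, which is the behavior the lemma captures for doubly exponential sequences.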

d.3Basic Functional Inequalities

These functional inequalities are used in the proof of the main theorems.

Lemma 4: Generalized Square Root Inequalities For any $x, y \geq 0$ and $0 < δ < 1$ ,

\begin{matrix} (x + y)^{δ} \leq x^{δ} + y^{δ} . \\ (16) \end{matrix}

And conversely for any $0 < δ < 1$ , and $x, y \geq 0$ , if $x \geq y$ then

\begin{matrix} (x - y)^{δ} \geq x^{δ} - y^{δ} . \\ (17) \end{matrix}

Proof Fix $y \geq 0$ . Let $f (x) := (x + y)^{δ} - (x^{δ} + y^{δ})$ for $x \geq 0$ . First, $f (0) = y^{δ} - y^{δ} = 0$ , and as $δ > 0$ , $f$ is continuous on $[0, \infty)$ and differentiable on $(0, \infty)$ , with $f^{'} (x) = δ (x + y)^{δ - 1} - δ x^{δ - 1} = δ ((x + y)^{δ - 1} - x^{δ - 1})$ . As $δ < 1$ , the function $u \mapsto u^{δ - 1}$ is non-increasing on $(0, \infty)$ , so $(x + y)^{δ - 1} \leq x^{δ - 1}$ and $f^{'} (x) \leq 0$ for any $x > 0$ . Therefore, $f$ is non-increasing, and so $\forall x \geq 0, f (x) \leq f (0) = 0$ , so $f$ is non-positive, giving the desired inequality (for any $y \geq 0$ and any $x \geq 0$ ).

The second inequality is a direct application of the first one. Assume $x \geq y$ , and let $x^{'} = x - y \geq 0$ , then $(x^{'} + y)^{δ} \leq (x^{'})^{δ} + y^{δ}$ . This gives $(x - y)^{δ} = (x^{'})^{δ} \geq (x^{'} + y)^{δ} - y^{δ} = x^{δ} - y^{δ}$ . $■$

Lemma 5: Bounding $log (x - Δ)$ Let $x_{0} > 1$ and $0 < Δ < x_{0}$ (e.g., $Δ \leq 1$ ), then

\begin{matrix} \forall x \geq x_{0}, \frac{log (x_{0} - Δ)}{log (x_{0})} log (x) \leq log (x - Δ) \leq log (x) . \\ (18) \end{matrix}

With $Δ = 1$ , it implies that if $T_{0} > 1$ , $b > 1$ satisfy $T_{0} (b - 1) > 1$ , then for any $i \in N$ , we have

\begin{matrix} log (T_{0} (b - 1) b^{i} - 1) \geq \frac{log (T_{0} (b - 1) - 1)}{log (T_{0} (b - 1))} log (T_{0} (b - 1) b^{i}) . \\ (19) \end{matrix}

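Proof Let $f (x) := \frac{log (x - Δ)}{log (x)}$ , defined for $x \geq x_{0}$ . It is of class $C^{1}$ , and by differentiating, we have $f^{'} (x) = \frac{x log (x) - (x - Δ) log (x - Δ)}{x (x - Δ) (log x)^{2}} > 0$ , as $x log (x) > 0$ for $x > 1$ and $t \mapsto t log (t)$ is non-positive on $(0, 1]$ and increasing on $[1, \infty)$ . So $f$ is increasing, and its minimum is attained at $x = x_{0}$ , i.e., $\forall x \geq x_{0}, f (x) \geq f (x_{0}) = \frac{log (x_{0} - Δ)}{log (x_{0})}$ , which gives the left inequality of Eq. 18 (as $log (x) > 0$ ). The right inequality holds as $log$ is increasing.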

The corollary is immediate but stated explicitly for clarity when used in proofs. $■$

Lemma 6 For any $γ > 0$ and $δ > 0$ , the function $f : x \mapsto x^{γ} (log x)^{δ}$ is increasing on $[1, \infty)$ .

Proof $f$ is continuous on $[1, \infty)$ and of class $C^{1}$ on $(1, \infty)$ . First, if $γ > 0$ , we have $f^{'} (x) = x^{γ - 1} (log x)^{δ - 1} (δ + γ log (x))$ . So for $x > 1$ , $f^{'} (x) > 0$ if and only if $δ + γ log (x) > 0$ , that is $x > exp (- \frac{δ}{γ})$ . But $x > 1$ and $- \frac{δ}{γ} < 0$ , so $exp (- \frac{δ}{γ}) < 1 < x$ , hence $f^{'} (x)$ is always positive, and thus $f$ is increasing. Then, if $γ = 0$ , we have $f^{'} (x) = δ \frac{1}{x} (log x)^{δ - 1} > 0$ as $x > 1$ gives $log x > 0$ and so $(log x)^{δ - 1} > 0$ .

The result also holds if $γ \geq 0$ and $δ \geq 0$ , provided they are not both zero. $■$

D.4 Controlling an Unbounded Sum of Dominated Terms

This lemma is used in the proofs of our upper bounds (Theorems 1 and 3), to handle the sum of the $f (T_{i})$ terms. In particular, it can be applied to $(T_{i})$ the geometric sequence and $g (t) = h (t) = t^{γ}$ or $g (t) = h (t) = t^{γ} (log t)^{δ}$ (for Theorem 1) for $γ > 0$ and $δ \geq 0$ ; or $(T_{i})$ the exponential sequence, $g (t) = t^{γ} (log t)^{δ}$ and $h (t) = t^{b γ} (log t)^{δ}$ (for Theorem 3) for $γ \geq 0$ and $δ \geq 0$ . Note that the result would be obvious if $L_{T}$ were bounded for $T \to \infty$ , but a more careful analysis is needed as $L_{T} \to \infty$ .

Lemma 7 Let $f$ , $g$ and $h$ be three positive, diverging and non-decreasing functions on $[1, \infty)$ , such that $f (t) = o (g (t))$ for $t \to \infty$ . Let $(T_{i})_{i \in N}$ be a non-decreasing diverging sequence, and $(L_{T})_{T \in N}$ a diverging sequence (i.e., $T_{i} \to \infty$ for $i \to \infty$ and $L_{T} \to \infty$ for $T \to \infty$ ), such that there exists a constant $c \geq 0$ satisfying $\forall T \geq 1, \sum_{i = 0}^{L_{T}} g (T_{i}) \leq c \times h (T)$ . Then the (unbounded) sum of dominated terms $f (T_{i})$ is still dominated by $h (T)$ , i.e.,

\begin{matrix} f (t) \underset{t \to \infty}{=} o (g (t)) \text{ and } \exists c \geq 0, \forall T \geq 1, \sum_{i = 0}^{L_{T}} g (T_{i}) \leq c \times h (T) \implies \sum_{i = 0}^{L_{T}} f (T_{i}) \underset{T \to \infty}{=} o (h (T)) . \\ (20) \end{matrix}

Proof By hypothesis, $f$ is dominated by $g$ , formally $f (t) = o (g (t))$ , so there exists a positive function $ε (t)$ such that $f (t) = g (t) ε (t)$ and $ε (t) \to 0$ for $t \to \infty$ . Fix $η > 0$ , as small as we want, then there exists $T_{η} \geq 1$ such that $\forall t \geq T_{η}, ε (t) \leq η$ . Let $i_{η}$ be such that $T_{i_{η} - 1} < T_{η} \leq T_{i_{η}}$ . Now for any $T \geq T_{η}$ large enough so that $L_{T} \geq i_{η}$ , we can start to split the sum

\begin{matrix} \sum_{i = 0}^{L_{T}} f (T_{i}) & = \sum_{i = 0}^{i_{η} - 1} f (T_{i}) + \sum_{i = i_{η}}^{L_{T}} f (T_{i}) \end{matrix}

The first sum is naively bounded by $i_{η} \times f (T_{i_{η} - 1})$ as $f$ is non-decreasing, and for the second sum, for any $i \geq i_{η}$ , $T_{i} \geq T_{η}$ and so $f (T_{i}) = ε (T_{i}) g (T_{i}) \leq η g (T_{i})$ , thus

\begin{matrix} \leq i_{η} f (T_{i_{η} - 1}) + η \times \left( \sum_{i = i_{η}}^{L_{T}} g (T_{i}) \right) \end{matrix}

The second sum is smaller than the same sum over a larger range of indices, as $g (T_{i}) \geq 0$ for any $i$ , and $f (T_{i_{η} - 1}) \leq f (T_{η})$ as $f$ is non-decreasing and $T_{i_{η} - 1} < T_{η}$ , so

\begin{matrix} \leq i_{η} f (T_{η}) + η \left( \sum_{i = 0}^{L_{T}} g (T_{i}) \right) \end{matrix}

But now, $f (T_{η}) \leq η g (T_{η})$ by hypothesis, and this sum is smaller than $c \times h (T)$ also by hypothesis

\begin{matrix} \leq i_{η} η \times g (T_{η}) + η c \times h (T) = η (i_{η} g (T_{η}) + c \times h (T)) \end{matrix}

Finally, we use the hypothesis that $h (T)$ is diverging: as $η$ and $T_{η}$ are both fixed, there exists a $\widetilde{T}_{η} \geq T_{η}$ large enough so that $i_{η} g (T_{η}) \leq h (T)$ for all $T \geq \widetilde{T}_{η}$ . And so we have finally proved that

\begin{matrix} \forall η > 0, \exists \widetilde{T}_{η} \geq 1, \forall T \geq \widetilde{T}_{η}, \sum_{i = 0}^{L_{T}} f (T_{i}) \leq η (c + 1) \times h (T) . \end{matrix}

This concludes the proof and shows that $\sum_{i = 0}^{L_{T}} f (T_{i}) = o (h (T))$ as wanted. $■$
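To illustrate Lemma 7, the following minimal sketch takes a geometric sequence $T_{i} = T_{0} b^{i}$ with $g (t) = h (t) = \sqrt{t}$ and the dominated function $f (t) = log (t)$ . All concrete choices are assumptions made for the example: the values $T_{0} = 200$ and $b = 2$ , the function names, and in particular the definition of $L_{T}$ as the smallest index $i$ with $T_{i} \geq T$ , which may differ from the one used in the main text.

```python
import math
from typing import Callable


def last_index(T: float, T0: float = 200.0, b: float = 2.0) -> int:
    """Smallest i such that T0 * b**i >= T (a hypothetical choice of L_T)."""
    i = 0
    while T0 * b ** i < T:
        i += 1
    return i


def partial_sum(func: Callable[[float], float], T: float,
                T0: float = 200.0, b: float = 2.0) -> float:
    """Sum of func(T_i) for i = 0, ..., L_T, with T_i = T0 * b**i."""
    return sum(func(T0 * b ** i) for i in range(last_index(T, T0, b) + 1))


def ratios(T: float) -> tuple[float, float]:
    """Ratios (sum of g(T_i)) / h(T) and (sum of f(T_i)) / h(T)."""
    g = h = math.sqrt  # dominating function, g = h as in the geometric case
    f = math.log       # dominated function, log(t) = o(sqrt(t))
    return partial_sum(g, T) / h(T), partial_sum(f, T) / h(T)


if __name__ == "__main__":
    for T in [1e4, 1e6, 1e8, 1e10]:
        ratio_g, ratio_f = ratios(T)
        # ratio_g stays bounded by a constant, ratio_f shrinks towards 0.
        print(f"T = {T:.0e}: sum g / h = {ratio_g:.3f}, sum f / h = {ratio_f:.5f}")
```

In this example, a short computation on the geometric sum shows that for $T \geq T_{0}$ the first ratio is bounded by $b / (\sqrt{b} - 1)$ , which plays the role of the constant $c$ , while the second ratio decreases towards zero even though the number of terms $L_{T} + 1$ grows with $T$ .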

E Additional Experiments

We present additional experiments, for Gaussian bandits and for the heuristic $\mathcal{DT}_{\text{no-restart}}$ .

E.1 Experiments with Gaussian Bandits (with Known Variance)

We include here another figure for experiments on Gaussian bandits, see Figure 6.

Figure 6: Regret for $\mathcal{DT}$ , for $K = 9$ Gaussian arms $\mathcal{N} (μ, 1)$ , horizon $T = 45678$ , $n = 1000$ repetitions and $\boldsymbol{μ}$ uniformly spaced in $[- 5, 5]^{K}$ . On “easy” problems like this one, both $\mathrm{UCB}$ and $\mathrm{AFHG}$ perform similarly and attain near-constant regret (identifying the best Gaussian arm is very easy here, as the arms are sufficiently distinct). Each doubling trick also appears to attain near-constant regret, but geometric doubling ( $b = 2$ ) and slow exponential doubling ( $b = 1.1$ ) are slower to converge and thus less efficient.

E.2 Experiments with $\mathcal{DT}_{\text{no-restart}}$

As mentioned previously, the $\mathcal{DT}_{\text{no-restart}}$ algorithm (see Figure 5) is only a heuristic so far, as no theoretical guarantee has been proved for it. For the sake of completeness, we also include results from numerical experiments with it, to compare its performance against the “with restart” version $\mathcal{DT}$ .

As expected, $\mathcal{DT}_{\text{no-restart}}$ enjoys much better empirical performance than the version with restarts, and in Figures 7 and 8 we see that a geometric or a slow exponential doubling trick with no restart, applied to $\mathrm{kl}\text{-}\mathrm{UCB}^{++}$ , can outperform $\mathrm{kl}\text{-}\mathrm{UCB}$ and perform similarly to the non-anytime $\mathrm{kl}\text{-}\mathrm{UCB}^{++}$ . But as observed before, the regret blows up after the beginning of each new sequence if the doubling sequence increases too fast (e.g., exponential doubling). Despite what is proved theoretically in Theorem 2, here we observe that the geometric doubling is the only one to be slow enough to be efficient.

Figure 7: Regret for $K = 9$ Bernoulli arms, horizon $T = 45678$ , $n = 1000$ repetitions and $\boldsymbol{μ}$ taken uniformly in $[0, 1]^{K}$ , for $\mathcal{DT}_{\text{no-restart}}$ . Geometric doubling (e.g., $b = 2$ ) and slow exponential doubling (e.g., $b = 1.1$ ) are too slow, and short first sequences make the regret blow up at the beginning of the experiment. At $t = 40000$ we see clearly the effect of a new sequence for the best doubling trick ( $T_{i} = 200 \times 2^{i}$ ). As expected, $\mathrm{kl}\text{-}\mathrm{UCB}^{++}$ outperforms $\mathrm{kl}\text{-}\mathrm{UCB}$ , and if the doubling sequence is growing fast enough then $\mathcal{DT}_{\text{no-restart}} (\mathrm{kl}\text{-}\mathrm{UCB}^{++})$ can perform as well as $\mathrm{kl}\text{-}\mathrm{UCB}^{++}$ .

Figure 8: $K = 9$ Bernoulli arms with $\boldsymbol{μ}$ evenly spaced in $[0, 1]^{K}$ . On easy problems like this one, both $\mathrm{kl}\text{-}\mathrm{UCB}$ and $\mathrm{kl}\text{-}\mathrm{UCB}^{++}$ are very efficient, and here the geometric doubling trick allows the $\mathcal{DT}_{\text{no-restart}}$ anytime version of $\mathrm{kl}\text{-}\mathrm{UCB}^{++}$ to outperform both $\mathrm{kl}\text{-}\mathrm{UCB}$ and $\mathrm{kl}\text{-}\mathrm{UCB}^{++}$ .

Footnotes

In [3], the authors obtained a loss of $258 / 32 = 8.0625 \geq 8$ , as the ratio between the constants for the $log (T)$ terms, respectively $258$ in Th.4.1 and $32$ in Th.3.1.
For the largest possible lower bound, we try to maximize the constant loss in the lower bound.

References

[1] Analysis of Thompson sampling for the Multi-Armed Bandit problem.
S. Agrawal and N. Goyal. In Conference On Learning Theory. PMLR, 2012.
[2] Minimax policies for adversarial and stochastic bandits.
J-Y. Audibert and S. Bubeck. In Conference on Learning Theory, pages 217–226. PMLR, 2009.
[3] UCB Revisited: Improved Regret Bounds For The Stochastic Multi-Armed Bandit Problem.
P. Auer and R. Ortner. Periodica Mathematica Hungarica
[4] Gambling in a Rigged Casino: The Adversarial Multi-Armed Bandit Problem.
P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire. In Annual Symposium on Foundations of Computer Science, pages 322–331. IEEE, 1995.
[5] Finite-time Analysis of the Multi-armed Bandit Problem.
P. Auer, N. Cesa-Bianchi, and P. Fischer. Machine Learning
[6] The Nonstochastic Multiarmed Bandit Problem.
P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire. SIAM journal on computing
[7] Regret Analysis of Stochastic and Non-Stochastic Multi-Armed Bandit Problems.
S. Bubeck, N. Cesa-Bianchi, et al. Foundations and Trends in Machine Learning
[8] Kullback-Leibler upper confidence bounds for optimal sequential allocation.
O. Cappé, A. Garivier, O-A. Maillard, R. Munos, and G. Stoltz. Annals of Statistics
[9] Prediction, Learning, and Games.
N. Cesa-Bianchi and G. Lugosi. 2006.
[10] An Empirical Evaluation of Thompson Sampling.
O. Chapelle and L. Li. In Advances in Neural Information Processing Systems, pages 2249–2257. Curran Associates, Inc., 2011.
[11] Anytime Optimal Algorithms In Stochastic Multi Armed Bandits.
R. Degenne and V. Perchet. In International Conference on Machine Learning, pages 1587–1595, 2016.
[12] On Explore-Then-Commit Strategies.
A. Garivier, E. Kaufmann, and T. Lattimore. volume 29 of Advances in Neural Information Processing Systems (NIPS), Barcelona, Spain, 2016.
[13] An Asymptotically Optimal Bandit Algorithm for Bounded Support Models.
J. Honda and A. Takemura. In Conference on Learning Theory, pages 67–79. PMLR, 2010.
[14] Multi-Armed Bandit Based Policies for Cognitive Radio’s Decision Making Issues.
W. Jouini, D. Ernst, C. Moy, and J. Palicot. In International Conference Signals, Circuits and Systems. IEEE, 2009.
[15] Thompson Sampling: an Asymptotically Optimal Finite-Time Analysis.
E. Kaufmann, N. Korda, and R. Munos. pages 199–213. PMLR, 2012.
[16] On the Complexity of A/B Testing.
E. Kaufmann, O. Cappé, and A. Garivier. In Conference on Learning Theory, pages 461–481. PMLR, 2014.
[17] Asymptotically Efficient Adaptive Allocation Rules.
T. L. Lai and H. Robbins. Advances in Applied Mathematics
[18] Regret Analysis Of The Finite Horizon Gittins Index Strategy For Multi Armed Bandits.
T. Lattimore. In Conference on Learning Theory, pages 1214–1245. PMLR, 2016.
[19] A Contextual-Bandit Approach to Personalized News Article Recommendation.
L. Li, W. Chu, J. Langford, and R. E. Schapire. In International Conference on World Wide Web, pages 661–670. ACM, 2010.
[20] Stochastic Multi-Armed Bandits in Constant Space.
D. Liau, E. Price, Z. Song, and G. Yang. In International Conference on Artificial Intelligence and Statistics, 2018.
[21] A Minimax and Asymptotically Optimal Algorithm for Stochastic Bandits.
P. Ménard and A. Garivier. In Algorithmic Learning Theory, volume 76, pages 223–237. PMLR, 2017.
[22] Some Aspects of the Sequential Design of Experiments.
H. Robbins. Bulletin of the American Mathematical Society
[23] Risk-Aversion In Multi-Armed Bandits.
A. Sani, A. Lazaric, and R. Munos. In Advances in Neural Information Processing Systems, pages 3275–3283, 2012.
[24] An Improved Parametrization and Analysis of the EXP3++ Algorithm for Stochastic and Adversarial Bandits.
Y. Seldin and G. Lugosi. In Conference on Learning Theory, volume 65, pages 1–17. PMLR, 2017.
[25] On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples.
W. R. Thompson. Biometrika
[26] A framework for Multi-A(rmed)/B(andit) Testing with Online FDR Control.
F. Yang, A. Ramdas, K. Jamieson, and M. Wainwright. In Advances in Neural Information Processing Systems, pages 5957–5966. Curran Associates, Inc., 2017.
