Do for Multi-Armed Bandits

An online reinforcement learning algorithm is *anytime* if it does not need to know in advance the horizon of the experiment. A well-known technique to obtain an anytime algorithm from any non-anytime algorithm is the “Doubling Trick”. In the contexts of adversary or stochastic multi-armed bandits, the performance of an algorithm is measured by its regret, and we study two families of sequences of growing horizons (geometric and exponential) to generalize previously known results that certain doubling tricks can be used to conserve certain regret bounds. In a broad setting, we prove that a geometric doubling trick can be used to conserve (adversary) bounds in but *cannot* conserve (stochastic) bounds in . We give insights as to why exponential doubling tricks may be better, as they conserve bounds in , and are close to conserving bounds in .

Keywords:

Multi-Armed Bandits; Reinforcement Learning; Online Learning; Anytime Algorithms; Sequential Learning; Doubling Trick.

Multi-Armed Bandit (MAB) problems are well-studied sequential decision making problems in which an agent repeatedly chooses an action (the “arm” of a one-armed bandit) in order to maximize some total reward [22]. Initial motivation for their study came from the modeling of clinical trials, as early as 1933 with the seminal work of [25]. In this example, arms correspond to different treatments with unknown, random effect. Since then, MAB models have been proved useful for many more applications, that range from cognitive radio [14] to online content optimization (news article recommendation [19], online advertising [10] or A/B Testing [16]), or portfolio optimization [23].

While the number of patients involved in a clinical study (and thus the number of treatments to select) is often decided in advance, in other contexts the total number of decisions to make (the horizon ) is unknown. It may correspond to the total number of visitors of a website optimizing its displays for a certain period of time, or to the number of attempted communications in a smart radio device. Hence in such cases, it is crucial to devise *anytime algorithms*, that is algorithms that do no rely on the knowledge of this horizon to sequentially select arms. A general way to turn any base algorithm into an anytime algorithm is the use of the so-called Doubling Trick, first proposed by [4], that consists in repeatedly running the base algorithm with increasing horizons. Motivated by the frequent use of this technique and the absence of a generic study of its effect on the algorithm’s efficiency, this paper investigates in details two families of doubling sequences (geometric and exponential), and shows that the former should be avoided for stochastic problems.

More formally, a MAB model is a set of arms, each arm being associated to a (unknown) *reward stream* . Fix a finite (possibly unknown) horizon. At each time step an agent selects an arm and receives as a reward the current value of the associated reward stream, . The agent’s decision strategy (or *bandit algorithm*) is such that can only rely on the past observations , on external randomness and (possibly) on the knowledge of the horizon . The objective of the agent is to find an algorithm that maximizes the expected cumulated rewards, where the expectation is taken over the possible randomness used by the algorithm and the possible randomness in the generation of the rewards stream. In the oblivious case, in which the reward streams are independent of the algorithm’s choice, this is equivalent to minimizing the *regret*, defined as

This quantity, referred to as *pseudo-regret* in [7], quantifies the difference between the expected cumulated reward of the best fixed action, and that of the strategy . For the general adversarial bandit problem [6], in which the rewards streams are arbitrary (picked by an adversary), a *worst-case* lower bound has been given. It says that for every algorithm, there exists (stochastic) reward streams such that the regret is larger than [6]. Besides, the algorithm has been shown to have a regret of order .

Much smaller regret may be obtained in *stochastic* MAB models, in which the reward stream from each arm is assumed to be *i.i.d.*, from some (unknown) distribution , with mean . In that case, various algorithms have been proposed with *problem-dependent* regret upper bounds of the form where is a constant that only depend on the arms distributions. Different assumptions on the arms distributions lead to different problem-dependent constants. In particular, under some parametric assumptions (*e.g.*, Gaussian distributions, exponential families), *asymptotically optimal* algorithms have been proposed and analyzed (e.g., [8] or Thompson sampling [1]), for which the constant obtained in the regret upper bound matches exactly that of the lower bound given by [17]. Under the non-parametric assumption that the are bounded in , the regret of the algorithm [5] is of the above form with , where is the mean of the best arm. Like in this last example, all the available constants become very large on “hard” instances, in which some arms are very close to the best arm. On such instances, may be much larger than the worst-case , and distribution-independent guarantees may actually be preferred.

The algorithm, proposed by [2], is the first stochastic bandit algorithm to enjoy a problem-dependent logarithmic regret and to be optimal in a *minimax* sense, as its regret is proved to be upper bounded by , for bandit models with rewards in . However the corresponding constant is proportional to , where is the minimal gap, which worsen the constant of . Another drawback of is that it is *not* anytime. These two shortcoming have been overcame recently in two different works. On the one hand, the algorithm [11] is minimax optimal and anytime, but its problem-dependent regret does not improve that of . On the other hand, the algorithm [21] is simultaneously minimax optimal and asymptotically optimal (*i.e.*, it has the best problem-dependent constant ), but it is not anytime. A natural question is thus to know whether a Doubling Trick could overcome this limitation.

This question is the starting point of our comprehensive study of the Doubling Trick: can a single Doubling Trick be used to preserve both problem-dependent (logarithmic) regret and minimax (square-root) regret? We answer this question in the negative, by showing that two different types of Doubling Trick may actually be needed. In this paper, we investigate how algorithms enjoying regret guarantees of the generic form

may be turned into an anytime algorithm enjoying *similar* regret guarantees with an appropriate Doubling Trick. This does not come for free, and we exhibit a “price of Doubling Trick”, that is a constant factor larger than , referred to as a *constant manipulative loss*.

**Outline** The rest of the paper is organized as follows. The Doubling Trick is formally defined in Section 2, along with a generic tool for its analysis. In Section 3, we present upper and lower bounds on the regret of algorithms to which a geometric Doubling Trick is applied. Section 4 investigates regret guarantees that can be obtained for a “faster” exponential Doubling Trick. Experimental results are then reported in Section 5. Complementary elements of proofs are deferred to the appendix.

The Doubling Trick, denoted by , is a general procedure to convert a (possibly non-anytime) algorithm into an anytime algorithm. It is formally stated below as Algorithm Figure 1 and depends on a non-decreasing diverging *doubling sequence* (*i.e.*, for ). fully restarts the underlying algorithm at the beginning of each new sequence (at ), and run this algorithm on a sequence of length ().

**Related work.** The Doubling Trick is a well known idea in online learning, that can be traced back to [4]. In the literature, the term Doubling Trick usually refers to the geometric sequence , in which the horizon is actually *doubling*, that was popularized by [9] in the context of adversarial bandits. Specific doubling tricks have also been used for stochastic bandits, for example in the work of [3], which uses the doubling sequence to turn the algorithm into an anytime algorithm.

**Elements of regret analysis.** For a sequence , with for all , we denote , and is always taken non-zero, (*i.e.*, ). We only consider *non-decreasing* and *diverging* sequences (that is, , and for ).

**Definition 1: Last Term ** For a non-decreasing diverging sequence and , we can define by

It is simply denoted when there is no ambiguity (*e.g.*, if the doubling sequence is chosen).

reinitializes its underlying algorithm at each time , and in generality the total regret is upper bounded by the regret on each sequence . By considering the last partial sequence , this splitting can be used to get a generic upper bound by taking into account a larger last sequence (up to ). And for stochastic bandit models, the *i.i.d.* hypothesis on the rewards streams makes the splitting on each sequence an equality, so we can also get a lower bound by excluding the last partial sequence. Lemma 1 is proved in Appendix A.1.

**Lemma 1: Regret Lower and Upper Bounds for ** For *any bandit model* and algorithm and horizon , one has the generic upper bound

Under a *stochastic bandit model*, one has furthermore the lower bound

As one can expect, the key to obtain regret guarantees for a Doubling Trick algorithm is to choose correctly the doubling sequence . Empirically, one can verify that sequences with slow growth gives terrible results, and for example using an arithmetic progression typically gives a linear regret. Building on this result, we will prove that if satisfies a certain regret bound (, , or ) then an appropriate anytime version of with a certain doubling trick can conserve the regret bound, with an explicit constant multiplicative loss . In this paper, we study in depth two families of sequences: first geometric and then exponential growths.

We define geometric doubling sequences, and prove that they can be used to conserve bounds in with but cannot be used to conserve bounds in .

**Definition 2: Geometric Growth** For , and , the sequence defined by is non-decreasing and diverging, and satisfies

Asymptotically for and , and .

A geometric doubling sequence allows to conserve a minimax bound (*i.e.*, ). It was suggested, for instance, in [9]. We generalize this result in the following theorem, proved in Appendix A.2, by extending it to bounds of the form instead of just , for *any* and . Note that no distinction is done on the case neither in the expression of the constant loss, nor in the proof.

**Theorem 1** If an algorithm satisfies , for , and for , and an increasing function (at ), then the anytime version with the geometric sequence of parameters (*i.e.*, ) *with the condition* if , satisfies,

with a increasing function , and a *constant loss* ,

For a fixed and , minimizing does not always give a unique solution. On the one hand, if , there is a unique solution minimizing the term, solution of , but without a closed form if is unknown. On the other hand, for any , the term depending on tends quickly to when increases.

**Practical considerations.** Empirically, when and are fixed and known, there is no need to minimize jointly. It can be minimized separately by first minimizing , that is by solving numerically (*e.g.*, with Newton’s method), and then by taking large enough so that the other term is close enough to .

For the usual case of and (*i.e.*, bounds in ), the optimal choice of is leading to , and the usual choice of gives (see Corollary 1 in appendix). Any large enough gives similar performance, and empirically is preferred, as most algorithms explore each arm once in their first steps (*e.g.*, for for our experiments).

We observe that the constant loss in Eq. from the previous Theorem 1 blows up when goes to zero, giving the intuition that no geometric doubling trick could be used to preserve a logarithmic bound (*i.e.*, with , ). This is confirmed by the lower bound given below.

**Theorem 2** For *stochastic models*, if satisfies , for and , then the anytime version with the geometric sequence of parameters (*i.e.*, ) satisfies this lower bound for a certain constant ,

Moreover, this implies that , which proves that a geometric sequence *cannot* be used to conserve a logarithmic regret bound.

If the regret is lower bounded at finite horizon by , without surprise we can prove that a geometric doubling sequence gives a comparable lower bound for the Doubling Trick algorithm (see Th. 5 in Appendix B). But more importantly, we show with Theorem 2 that a geometric sequence *cannot* be used to conserve a finite-horizon lower bound like .

This special case () is the most interesting, as efficient algorithms for stationary bandits must match the lower bound from [17]: if satisfies both (finite-time) logarithmic lower and upper bound (*i.e.*, if is bounded for large enough), then using a geometric sequence for the doubling trick algorithm is a bad idea as it guarantees a blow up in the regret lower bound, by implying . This result is the reason why we need to study successive horizons growing faster than a geometric sequence (*i.e.*, such that ), like the exponential sequence studied in next Section 4.

Let and consider a fixed *stochastic* bandit problem. The lower bound from Lemma 1 gives

We bound for any , thanks to Definition 2, and we can use the hypothesis on for each regret term.

Let . If we have (see below () for a discussion on the other case), then Lemma 5 (Eq. ) gives as . For lower bounds, there is no need to handle the constants tightly, and we have by hypothesis, so let call this constant , and thus it simplifies to

A sum-integral minoration for the increasing function (as ) gives (if ), and so

For the geometric sequence, we know that , and at so there exists a constant such that for large enough (), and such that . And thus we just proved that there is a constant such that

So this proves that for large enough, with , and so , which also implies that *cannot* be a .

() If we do not have the hypothesis , the same proof could be done, by observing that from large enough, we have (as soon as , *i.e.*, ), and so the same arguments can be used, to obtain a sum from instead of from . For a fixed , we also have for a (small enough) constant , and thus we obtain the same result.

We define exponential doubling sequences, and prove that they can be used to conserve bounds in , unlike the previously studied geometric sequences. Furthermore, we provide elements showing that they may also conserve bounds in or .

**Definition 3: Exponential Growth** For , and , if , then the sequence defined by is non-decreasing and diverging, and satisfies

Asymptotically for and , and .

An exponential doubling sequence allows to conserve a problem-dependent bound on regret (*i.e.*, ). This was already used by [3] in a particular case, and for example more recently by [20]. We generalize this result in the following theorem.

**Theorem 3** If an algorithm satisfies , for , , and for , and an increasing function (at ), then the anytime version with the exponential sequence of parameters (*i.e.*, ), satisfies the following inequality,

with an increasing function , and a *constant loss* ,

This result is as rich as Theorem 1 but for logarithmic bounds (*i.e.*, ), with no loss but a constant . It can also be applied to bounds of the generic form (*i.e.*, with ), but with a *significant loss* from to , and a constant loss . It is important to notice that for , the loss can be arbitrarily made small (by taking a first step large enough). This observation is encouraging, and let the authors think that a tight upper bound could be proved.

**Corollary 2** In particular, order-optimal and optimal algorithms for problem-dependent bound have and , as well as , in which case Theorem 3 gives a simpler bound, as for any , , and so

Remark: the optimal choice of gives a constant loss of , which is twice better than the loss of obtained^{1}

We observe that the constant loss from Theorem 3 blows up for (as ), and we give below in Theorem 4 another result toward a more precise lower bound.

The lower bound from Theorem 4 below is a second step towards proving or disproving the converse of the lower bound of Theorem 2, for bounds of the form (with or without ). and an exponential sequence. This result, along with the previous one Theorem 3, starts to answer the question of whether an exponential sequence is sufficient to conserve both minimax and problem-dependent guarantees with Doubling Tricks. Theorem 4 is proved in Appendix A.3.

**Theorem 4** For *stochastic models*, if satisfies , for and , then the anytime version with the exponential sequence of parameters , , (*i.e.*, ), satisfies this lower bound for a certain constant ,

If we could just take in the two previous Theorems 3 and 4, both results would match and prove that the exponential doubling trick can indeed be used to conserve minimax regret bounds. But obviously cannot depend on , and even if it can be taken as small as we want, it cannot be taken to (it has to be constant when ).

Let , and consider a fixed bandit problem. We first consider the harder case of , see below in for the other case. The lower bound from Lemma 1 gives

We bound naively , and we can use the hypothesis on for each regret term, as and are non-decreasing for (by hypothesis for and by Lemma 6.

The first part is denoted and is dealt with Lemma 7: the sum of is a , as by hypothesis, and this sum of is proved below to be bounded by . So . The second part is . Define , so whether or , we always have . Then we can use Lemma 4 (Eq. ) to distribute the power on (as it is ). So with the convention that (even if ), and so this gives

If then the first sum is just which can be included in (still increasing), and so only the second sum has to be bounded, and a geometric sum gives . But if , we can naively bound the first sum by Observe that . So and , thus the first sum is a (as ). In both cases, the first sum can be included in which is still a Another geometric sum bounds the second sum by .

We identify a constant multiplicative loss . The only term left which depends on is , and it can be bounded by using (as ), and again with . The constant part of also gives a term, that can be included in which is still a , and is still increasing as sum of increasing functions. So we can focus on the last term, and we obtain

So the constant multiplicative loss depends on and as well as on , and and is

If , the loss is minimal at and for a minimal value of (**for any** and ). Finally, the part tends to if so the loss can be made as small as we want, simply by choosing a large enough (but constant *w.r.t.* ).

Now for the other case of , we can start similarly, but instead of bounding naively by we use Lemma 3 to get a more subtle bound: . The constant term gets included in , and for the non-constant part, is handled similarly. Finally we obtain the loss

We illustrate here the practical cost of using Doubling Trick, for two interesting non-anytime algorithms that have recently been proposed in the literature: *Approximated Finite-Horizon Gittins indexes*, that we refer to as , by [18] (for Gaussian bandits with known variance) and by [21] (for Bernoulli bandits).

We first provide some details on these two algorithms, and then illustrate the behavior of Doubling Tricks applied to these algorithms with different doubling sequences.

We denote by the accumulated rewards from arm , and the number of times arm was sampled. Both algorithms assume to know the horizon . They compute an index for each arm at each time step , and use the indexes to choose the arm with highest index, *i.e.*, (ties are broken uniformly at random).

The algorithm can be applied for Gaussian bandits with variance ( for our experiments). Let , and let

The algorithm can be applied for bounded rewards in , and in particular for Bernoulli bandits. The binary Kullback-Leibler divergence is (for ), and let . Let the function for , and finally let

We present some results from numerical experiments on Bernoulli and Gaussian bandits. More results are presented in Appendix E. We present in pages and results for arms and horizon (to ensure that no choice of sequence were lucky and had or too close to it). We ran repetitions of the random experiment, either on the same “easy” bandit problem (with evenly spaced means), or on different random instances sampled uniformly in , and we plot the average regret on simulations. The black line without markers is the (asymptotic) lower bound in , from [17]. We consider for Bernoulli bandits (Figures Figure 2, Figure 3) or for Gaussian bandits (Figures Figure 4, Figure 6),

Each doubling trick algorithm uses the same as a first guess for the horizon. We include both the non-anytime version that knows the horizon , and different anytime versions to compare the choice of doubling trick. To compare against an algorithm that does not need the horizon, we also include by [8] as a baseline for Bernoulli problems, and by [5] (knowing the variance ) for Gaussian problems. The doubling sequences we consider are a geometric one with , and different exponential sequences: the “classical” one with and the “slow” one with both with , and the last one is using , . Despite what was proved theoretically in Theorem 3, using and a large enough improves regarding to using and a leading factor.

Another version of the Doubling Trick with “no restart”, denoted , is presented in Appendix C, but it is only an heuristic and cannot be applied to any algorithm . Algorithm Figure 5 can be applied to or for instance, as they use just as a numerical parameter (see Eqs. Equation 13 and Equation 14), but its first limitation is that it cannot be applied to [13] or [24], or any algorithms based on arm eliminations, for example. A second limitation is the difficulty to analyze this “no restart” variant, due to the unpredictable effect on regret of giving non-uniform prior information to the underlying algorithm on each successive sequence. An interesting future work would be to analyze it, either in general or for a specific algorithm like . Despite its limitations, this heuristic exhibits as expected better empirical performance than , as can be observed in Appendix E.

We formalized and studied the well-known “Doubling Trick” for generic multi-armed bandit problems, that is used to automatically obtain an anytime algorithm from any non-anytime algorithm. Our results are summarized in Table ?. We show that a geometric doubling can be used to conserve minimax regret bounds (in ), with a constant loss (typically ), but cannot be used to conserve problem-dependent bounds (in ), for which a faster doubling sequence is needed. An exponential doubling sequence can conserve logarithmic regret bounds also with a constant loss, but it is still an open question to know if minimax bounds can be obtained for this faster growing sequence. Partial results of both a lower and an upper bound, for bounds of the generic form , let use believe in a positive answer.

It is still an open problem to know if an anytime algorithm can be both asymptotically optimal for the problem-dependent regret (*i.e.*, with the exact constant) and optimal in a minimax regret (*i.e.*, have a regret), but we believe that using a doubling trick on non-anytime algorithms like cannot be the solution. We showed that it cannot work with a geometric doubling sequence, and conjecture that exponential doubling trick would never bring the right constant either.

Bound Doubling | Geometric, | Exponential, |

Known to fail |