Multi-Player Bandits Revisited


Multi-player Multi-Armed Bandits (MAB) have been extensively studied in the literature, motivated by applications to Cognitive Radio systems. Driven by such applications as well, we motivate the introduction of several levels of feedback for multi-player MAB algorithms. Most existing work assume that sensing information is available to the algorithm. Under this assumption, we improve the state-of-the-art lower bound for the regret of any decentralized algorithms and introduce two algorithms, and , that are shown to empirically outperform existing algorithms. Moreover, we provide strong theoretical guarantees for these algorithms, including a notion of asymptotic optimality in terms of the number of selections of bad arms. We then introduce a promising heuristic, called , that can operate without sensing information, which is crucial for emerging applications to Internet of Things networks. We investigate the empirical performance of this algorithm and provide some first theoretical elements for the understanding of its behavior.


Multi-Armed Bandits; Decentralized algorithms; Reinforcement learning; Cognitive Radio; Opportunistic Spectrum Access.


Several sequential decision making problems under the constraint of partial information have been studied since the 1950s under the name of Multi-Armed Bandit (MAB) problems [25]. In a stochastic MAB model, an agent is facing unknown probability distributions, called arms in reference to the arms of a one-armed bandit (or slot machine) in a casino. Each time she selects (or draws) an arm, she receives a reward drawn from the associated distribution. Her goal is to build a sequential selection strategy that maximizes the total reward received. A class of algorithms to solve this problem is based on Upper Confidence Bounds (UCB), first proposed by [21] and further popularized by [6]. The field has been very active since then, with several algorithms proposed and analyzed, both theoretically and empirically, even beyond the stochastic assumption on arms, as explained in the survey by [11].

The initial motivation to study MAB problems arose from clinical trials (the first MAB model can be traced back to , by [28]), in which a doctor sequentially allocates treatments (arms) to patients and observes their efficacy (reward). More recently, applications of MAB have shifted towards sequential content recommendation, e.g. sequential display of advertising to customers or A/B testing [22]. In the mean time, MAB were found to be relevant to the field of Cognitive Radio (CR, [24]), and [15] first proposed to use for the Opportunistic Spectrum Access (OSA) problem, and successfully conducted experiments on real radio networks demonstrating its usefulness. For CR applications, each arm models the quality or availability of a radio channel (a frequency band) in which there is some background traffic (e.g., primary users paying to have a guaranteed access to the channel in the case of OSA). A smart radio device needs to insert itself in the background traffic, by sequentially choosing a channel to access and try to communicate on, seeking to optimize the quality of its global transmissions.

For the development of CR, a crucial step is to insert multiple smart devices in the same background traffic. With the presence of a central controller that can assign the devices to separate channels, this amounts to choosing at each time step several arms of a MAB in order to maximize the global rewards, and can thus be viewed as an application of the multiple-play bandit, introduced by [5] and recently studied by [20]. Due to the communication cost implied by a central controller, a more relevant model is the decentralized multi-player multi-armed bandit model, introduced by [23] and [3], in which players select arms individually and collisions may occur, that yield a loss of reward. Further algorithms were proposed in similar models by [27] and [17] (under the assumption that each arm is a Markov chain) and by [8] and [26] (for i.i.d. or piece-wise i.i.d. arms). The goal for every player is to select most of the time one of the best arms, without colliding too often with other players. A first difficulty relies in the well-known trade-off between exploration and exploitation: players need to explore all arms to estimate their means while trying to focus on the best arms to gain as much rewards as possible. The decentralized setting considers no exchange of information between players, that only know and , and to avoid collisions, players should furthermore find orthogonal configurations (i.e., the players use the best arms without any collision), without communicating. Hence, in that case the trade-off is to be found between exploration, exploitation and low collisions.

All these above-mentioned works are motivated by the OSA problem, in which it is assumed that sensing occurs, that is each smart device observes the availability of a channel (sample from the arm) before trying to transmit and possibly experiment a collision with other smart devices. However some real radio networks do not use sensing at all, e.g., emerging standards developed for Internet of Things (IoT) networks such as LoRaWAN. Thus, to take into account these new applications, algorithms with additional contraints on the available feedback have to be proposed within the multiple-player MAB model. Especially, the typical approach that combines a (single-player) bandit algorithm based on the sensing information –to learn the quality of the channels while targeting the best ones– with a low-complexity decentralized collision avoidance protocol, is no longer possible.

In this paper, we take a step back and present the different feedback levels possible for multi-player MAB algorithms. For each of them, we propose algorithmic solutions supported by both experimental and theoretical guarantees. In the presence of sensing information, our contributions are a new problem-dependent regret lower bound, tighter than previous work, and the introduction of two algorithms, and . Both are shown to achieve an asymptotically optimal number of selections of the sub-optimal arms, and for we furthermore establish a logarithmic upper bound on the regret, that follows from a careful control of the number of collisions. In the absence of sensing information, we propose the heuristic and investigate its performance. Our study of this algorithm is supported by (promising) empirical performance and some first (disappointing) theoretical elements.

The rest of the article is organized as follows. We introduce the multi-player bandit model with three feedback levels in Section 2, and give a new regret lower bound in Section 3. The , and algorithms are introduced in Section 4, with the result of our experimental study reported in Section 5. Theoretical elements are then presented in Section 6.

2Multi-Player Bandit Model with Different Feedback Levels

We consider a -armed Bernoulli bandit model, in which arm is a Bernoulli distribution with mean . We denote the i.i.d. (binary) reward stream for arm , that satisfies and that is independent from the other rewards streams. However we mention that our lower bound and all our algorithms (and their analysis) can be easily extended to one-dimensional exponential families (just like for the algorithm of [12]). For simplicity, we focus on the Bernoulli case, that is also the most relevant for Cognitive Radio, as it can model channel availabilities.

In the multi-player MAB setting, there are players (or agents), that have to make decisions at some pre-specified time instants. At time step , player selects an arm , independently from the other players’ selections. A collision occurs at time if at least two players choose the same arm. We introduce the two events, for and

that respectively indicate that a collision occurs at time for player and that a collision occurs at time on arm . Each player then receives (and observes) the binary rewards ,

In words, she receives the reward of the selected arm if she is the only one to select this arm, and a reward zero otherwise1. Other models for rewards loss have been proposed in the literature (e.g., the reward is randomly allocated to one of the players selecting it), but we focus on full reward occlusion in this article.

A multi-player MAB strategy is a tuple of arm selection strategies for each player, and the goal is to propose a strategy that maximizes the total reward of the system, under some constraints. First, each player should adopt a sequential strategy , that decides which arm to select at time based on previous observations. Previous observations for player at time always include the previously chosen arms and received rewards for , but may also include the sensing information or the collision information . More precisely, depending on the application, one may consider the following three observation models, (I), (II) and (III).

Under each of these three models, we define to be the filtration generated by the observations gathered by player up to time (which contains different information under models (I), (II) and (III)). While a centralized algorithm may select the vector of actions for all players based on all the observations from , under a decentralized algorithm the arm selected at time by player only depends on the past observation of this player. More formally, is assumed to be -measurable.

Definition 1 We denote by the best mean, the second best etc, and by the (non-sorted) set of the indices of the arms with largest mean (best arms): if then . Similarly, denotes the set of indices of the arms with smallest means (worst arms), . Note that they are both uniquely defined if .

Following a natural approach in the bandit literature, we evaluate the performance of a multi-player strategy using the expected regret, that measures the performance gap with respect to the best possible strategy. The regret of the strategy at horizon is the difference between the cumulated reward of an oracle strategy, assigning in this case the players to , and the cumulated reward of strategy :

Maximizing the expected sum of global reward of the system is indeed equivalent to minimizing the regret, and we now investigate the best possible regret rate of a decentralized multi-player algorithm.

3An Asymptotic Regret Lower Bound

In this section, we provide a useful decomposition of the regret (Lemma 1) that permits to establish a new problem-dependent lower bound on the regret (Theorem 1), and also provides key insights on the derivation of regret upper bounds (Lemma 3).

3.1A Useful Regret Decomposition

We introduce additional notations in the following definition.

Definition 2 Let , and denote the number of selections of arm by any player , up to time .

Let be the number of colliding players2 on arm up to horizon :

Letting be the set of bandit instances such that there is a strict gap between the best arms and the other arms, we now provide a regret decomposition for any .

Lemma 1 For any bandit instance such that , it holds that (Proved in Appendix A.1)

In this decomposition, term (a) counts the lost rewards due to sub-optimal arms selections (), term (b) counts the number of times the best arms were not selected (), and term (c) counts the weighted number of collisions, on all arms. It is valid for both centralized and decentralized algorithms. For centralized algorithms, due to the absence of collisions, (c) is obviously zero, and (b) is non-negative, as . For decentralized algorithms, (c) may be significantly large, and term (b) may be negative, as many collisions on arm may lead to . However, a careful manipulation of this decomposition (see Appendix A.2) shows that the regret is always lower bounded by term (a).

Lemma 2 For any strategy and , it holds that .

3.2An Improved Asymptotic Lower Bound on the Regret

To express our lower bound, we need to introduce as the Kullback-Leibler divergence between the Bernoulli distribution of mean and that of mean , so that We first introduce the assumption under which we derive a regret lower bound, that generalizes a classical assumption made by [21] in single-player bandit models.

Definition 3 A strategy is strongly uniformly efficient if for all and for all ,

Having a small regret on every problem instance, i.e., uniform efficiency, is a natural assumption for algorithms, that rules out algorithms tuned to perform well on specific instances only. From this assumption and the decomposition of Lemma 1 one can see3 that for every , , and so

The additional assumption in further implies some notion of fairness, as it suggests that each of the players spends on average the same amount of time on each of the best arms. Note that this assumption is satisfied by any strategy that is invariant under every permutation of the players, i.e., for which the distribution of the observations under is independent from the choice of permutation . In that case, it holds that for every arm and , hence and are equivalent, and strong uniform efficiency is equivalent to standard uniform efficiency. Note that all our proposed algorithms are permutation invariant and is thus an example of strongly uniformly efficient algorithm, as we prove in Section 6 that its regret is logarithmic on every instance .

We now state a problem-dependent asymptotic lower bound on the number of sub-optimal arms selections under a decentralized strategy that has access to the sensing information. This result, proved in Appendix B, yields an asymptotic logarithmic lower bound on the regret, also given in Theorem 1.

Theorem 1 Under observation models and , for any strongly uniformly efficient decentralized policy and ,

From Lemma 2, it follows that

Observe that the regret lower bound is tighter than the state-of-the-art lower bound in this setup given by [23], that states that

as for every and , (see Figure 7 in Appendix F.1). It is worth mentioning that [23] proved a lower bound under the more general assumption for that there exists some numbers such that whereas in Definition ? we make the choice . Our result could be extended to this case but we chose to keep the notation simple and focus on fair allocation of the optimal arms between players.

Interestingly, our lower bound is exactly a multiplicative constant factor away from the lower bound given by [5] for centralized algorithms (which is clearly a simpler setting). This intuitively suggests the number of players as the (multiplicative) “price of decentralized learning”. However, to establish our regret bound, we lower bounded the number of collisions by zero, which may be too optimistic. Indeed, for an algorithm to attain the lower bound , the number of selections of each sub-optimal arm should match the lower bound and term (b) and term (c) in the regret decomposition of Lemma 1 should be negligible compared to . To the best of our knowledge, no algorithm has been shown to experience only collisions so far, for every and .

A lower bound on the minimal number of collisions experienced by any strongly uniformly efficient decentralized algorithm would thus be a nice complement to our Theorem 1, and it is left as future work.

3.3Towards Regret Upper Bounds

A natural approach to obtain an upper bound on the regret of an algorithm is to upper bound separately each of the three terms defined in Lemma 1. The following result shows that term can be related to the number of sub-optimal selections and the number of collisions that occurs on the best arms.

Lemma 3 The term in Lemma 1 is upper bounded as (Proved in Appendix A.3)

This result can also be used to recover Proposition 1 from [4], giving an upper bound on the regret that only depends on the expected number of sub-optimal selections for – and the expected number of colliding players on the optimal arms for . Note that, in term (c) the number of colliding players on the sub-optimal arm may be upper bounded as .

In the next Section, we present an algorithm that has a logarithmic regret, while ensuring that the number of sub-optimal selections is matching the lower bound of Theorem 1.

4New Algorithms for Multi-Player Bandits

When sensing is possible, that is under observation models (I) and (II), most existing strategies build on a single-player bandit algorithm (usually an index policy) that relies on the sensing information, together with an orthogonalization strategy to deal with collisions. We present this approach in more details in Section 4.1 and introduce two new algorithms of this kind, and . Then, we suggest in Section 4.2 a completely different approach, called , that no longer requires an orthogonalization strategy as the collisions are directly accounted for in the indices that are used. can also be used under observation model (III) –without sensing–, and without the knowledge of .

4.1Two New Strategies Based on Indices and Orthogonalization: and

In a single-player setting, index policies are popular bandit algorithms: at each round one index is computed for each arm, that only depends on the history of plays of this arm and (possibly) some exogeneous randomness. Then, the arm with highest index is selected. This class of algorithms includes the UCB family, in which the index of each arm is an Upper Confidence Bound for its mean, but also some Bayesian algorithms like Bayes-UCB [18] or the randomized Thompson Sampling algorithm [28].

The approaches we now describe for multi-player bandits can be used in combination with any index policy, but we restrict our presentation to UCB algorithms, for which strong theoretical guarantees can be obtained. In particular, we focus on two types of indices: indices [6] and indices [12], that can be defined for each player in the following way. Letting the current sum of sensing information obtained by player for arm , (if ) is the empirical mean of arm for player and one can define the index

where is some exploration function. is usually taken to be in practice, and slightly larger in theory, which ensures that (see [12]). A classical (single-player) UCB algorithm aims at the arm with largest index. However, if each of the players selects the arm with largest UCB, all the players will end up colliding most of the time on the best arm. To circumvent this problem, several coordination mechanisms have emerged, that rely on ordering the indices and targeting one of the -best indices.

While the algorithm [23] relies on the player agreeing in advance on the time steps at which they will target each of the best indices (even though some alternative without pre-agreement are proposed), the algorithm [4] relies on randomly selected ranks. More formally, letting be the index of the -th largest entry in a vector , in each player maintains at time an internal rank and selects at time ,

If a collision occurs, a new rank is drawn uniformly at random: .

We now propose two alternatives to this strategy, that do not rely on ranks and rather randomly fix themselves on one arm in , that is defined as the set of arms that have the largest indices:

Our proposal is stated below as Algorithm Figure 1, while a simpler variant, called , is stated as Algorithm Figure 4 in Appendix C. We focus on as it is easier to analyze and performs better. Both algorithms ensure that player always selects at time an arm from . When a collision occurs randomly switches arm within , while uses a more sophisticated mechanism, that is reminiscent of “Musical Chair” (MC) and inspired by the work of [26]: players tend to fix themselves on arms (“chairs”) and ignore future collision when this happens.

Figure 1: The \mathrm{MCTopM} decentralized learning policy (for a fixed underlying index policy g^j).
Figure 1: The decentralized learning policy (for a fixed underlying index policy ).

More precisely, under , if player did not encounter a collision when using arm at time , then she marks her current arm as a “chair” (), and will keep using it even if collisions happen in the future (Lines -). As soon as this “chair” is no longer in , a new arm is sampled uniformly from a subset of , defined with the previous indices (Lines -). The subset enforces a certain inequality on indices, and , when switching from to . This helps to control the number of such changes of arm, as shown in Lemma 5. The considered subset is never empty as it contains at least the arm replacing the in . Collisions are dealt with only for non-fixed player , and when the previous arm is still in . In this case, a new arm is sampled uniformly from (Lines -). This stationary aspect helps to minimize the number of collisions, as well as the number of switches of arm. The five differents transitions ,,,, refer to the notations used in the analysis of (see Figure 5 in Appendix D.3).

4.2The Approach

Under observation model (III) no sensing information is available and the previous algorithms cannot be used, as the sum of sensing information and thus the empirical mean cannot be computed, hence neither the indices . However, one can still define a notion of empirical reward received from arm by player , by introducing

Note that is no longer meant to be an unbiased estimate of as it also takes into account the collision information, that is present in the reward. Based on this empirical reward, one can similarly defined modified indices as

Given any of these two index policies ( or ), the algorithm is defined by,

The name comes from the fact that each player is targeting, in a “selfish” way, the arm that has the highest index, instead of accepting to target only one of the best. The reason that this may work precisely comes from the fact that is no longer an upper-confidence on , but some hybrid index that simultaneously increases when a transmission occurs and decreases when a collision occurs.

This behavior is easier to be understood for the case of - in which, letting be the number of collisions on arm , one can show that the hybrid index induces a penalty proportional to the fraction of collision on this arm and the quality of the arm itself:

From a bandit perspective, it looks like each player is using a stochastic bandit algorithm ( or ) when interacting with arms that give a feedback (the reward, and not the sensing information) that is far from being i.i.d. from some distribution, due to the collisions. As such, the algorithm does not appear to be well justified, and one may rather want to use adversarial bandit algorithms like [7], that do not require a stochastic (i.i.d.) assumption on arms. However, we found out empirically that is doing surprisingly well, as already noted by [10], who did some experiments in the context of IoT applications. We show in Section 6 that does have a (very) small probability to fail (badly), for some problem with small , which precludes the possibility of a logarithmic regret for any problem. However, in most cases it empirically performs similarly to all the algorithms described before, and usually outperforms , even if it neither exploits the sensing information, nor the knowledge of the number of players . As such, practitioners may still be interested by the algorithm, especially for Cognitive Radio applications in which sensing is hard or not considered.

5Empirical performances

We illustrate here the empirical performances of the algorithms presented in Section 4, used with two index policies, and . Some plots are at pages and and most of them in Appendix F.2.

As expected, the same algorithm always performs better using rather than , as the theoretical guarantees for single player suggested, and the main goal of this work is not to optimize on the index policy but rather propose new ways of using indices, in a decentralized setting4, so we only consider in the experiments. We chose to use two example problems for the illustrations: the first one is with arms and means5 , for which two cases and are presented in Figure 9. The second problem is with and , for which three cases are presented, for in Figure 3, and for the two limit cases with and in Figure 14.

Performance is measured using the average regret, with repetitions on the same problem, from to the horizon (also unknown by the algorithms), but we also include histograms showing the distribution of regret at , as this allows to check if the regret is indeed small for each run of the simulation. For the plots showing the regret, our asymptotic lower bound from Theorem 1 is displayed.

Experiments with a different problem for each repetition (uniformly sampled ), are also considered, in Figure 2 and Figure 12. This helps to check that no matter the complexity of the considered problem (one measure of complexity being the constant in our lower bound), performs similarly or better than all the other algorithms, and outperforms in most cases. Figure 2 is a good example of outstanding performances of and in comparison to .

Figure 2: Regret, M=6 players, K=9 arms, horizon T=5000, against 500 problems \boldsymbol{\mu} uniformly sampled in [0,1]^K. \mathrm{Rho}\mathrm{Rand} (top blue curve) is outperformed by the other algorithms (and the gain increases with M). \mathrm{MCTopM} (bottom yellow) outperforms all the other algorithms is most cases.
Figure 2: Regret, players, arms, horizon , against problems uniformly sampled in . (top blue curve) is outperformed by the other algorithms (and the gain increases with ). (bottom yellow) outperforms all the other algorithms is most cases.

Empirically, our proposals were found to almost always outperform , and except for that can fail badly on problems with small , we verified that outperforms the state-of-the-art algorithms in many different problems, and is more and more efficient as and grows.

Two other multi-player multi-armed bandits algorithms are by [8] and by [26], that were proposed for the observation model (II). We chose to not include comparisons with any of them, as they were both found hard to use efficiently in practice. Indeed, needs a careful tuning of five parameters (, , , and ) to attain reasonable performances (and using cross validation, as the authors suggested, is usually considered out of the scope of online sequential learning). was shown to attain a constant regret (with high probability) only for very large horizons, and empirically it requires a fine tuning of its parameter , as the theoretical value requires to know the small gap between two arms, unavailable in practice.

Figure 3: Regret (in \log\log scale), for M=6 players for K=9 arms, horizon T=5000, for 1000 repetitions on problem \boldsymbol{\mu}=[0.1,\dots,0.9]. \mathrm{RandTopM} (yellow curve) outperforms \mathrm{Selfish} (green), both clearly outperform \mathrm{Rho}\mathrm{Rand}. The regret of \mathrm{MCTopM} is logarithmic, empirically with the same slope as the lower bound. The x axis on the regret histograms have different scale for each algorithm.
Regret (in \log\log scale), for M=6 players for K=9 arms, horizon T=5000, for 1000 repetitions on problem \boldsymbol{\mu}=[0.1,\dots,0.9]. \mathrm{RandTopM} (yellow curve) outperforms \mathrm{Selfish} (green), both clearly outperform \mathrm{Rho}\mathrm{Rand}. The regret of \mathrm{MCTopM} is logarithmic, empirically with the same slope as the lower bound. The x axis on the regret histograms have different scale for each algorithm.
Figure 3: Regret (in scale), for players for arms, horizon , for repetitions on problem . (yellow curve) outperforms (green), both clearly outperform . The regret of is logarithmic, empirically with the same slope as the lower bound. The axis on the regret histograms have different scale for each algorithm.

6Theoretical elements

Section 6.1 gives an asymptotically optimal analysis of the expected number of sub-optimal draws for , and combined with indices, and Section 6.2 proves that the number of collisions, hence the regret of are also logarithmic. Section 6.3 shortly discusses a disappoiting result regarding , with more insights provided in Appendix E.

6.1Common Analysis for - and -

Lemma 4 gives a finite-time upper bound on the expected number of draws of a sub-optimal arm for any player , that holds for both - and -. Our improved analysis also applies to the algorithm. Explicit formulas for , are in the proof in Appendix D.1.

Lemma 4 For any , if player uses the -, - or - decentralized policy with exploration function , then for any sub-optimal arm , there exists two problem depend constants , such that

It is important to notice that the leading constant in front of is the same as in the constant featured in Equation of Theorem 1. This result proves that the lower bound on sub-optimal selections is asymptotically matched for the three considered algorithms. This is a strong improvement in comparison to the previous state-of-the-art results [23].

As announced, Lemma 5 controls the number of switches of arm that are due to the current arm leaving , for both and . It essentially proves that Lines - in Algorithm Figure 1 (when a new arm is sampled from the non-empty subset of ) happen a logarithmic number of times. The proof of this result is given in Appendix D.2.

Lemma 5 For any , any player using - or -, and any arm , it holds that

6.2Regret Analysis of -

For , we are furthermore able to obtain a logarithmic regret upper bound, by proposing an original approach to control the number of collisions under this algorithm. First, we can bound the number of collisions by the number of collisions for players not yet “fixed on their arms” (), that we can then bound by the number of changes of arms (cf proof in Appendix D.3). An interesting consequence of the proof of this result is that it also bounds the number of switches of arms, , and this additional guarantee was never clearly stated for previous state-of-the-art works, like . Even though minimizing switching was not a goal6, this guarantee is interesting for Cognitive Radio applications, where switching arms means reconfiguring a radio hardware, an operation that costs energy.

Lemma 6 For any , if all players use the - decentralized policy, and , then the total average number of collisions (on all arms) is upper-bounded by (Proved in Appendix D.3)

Note that this bound is in , which is significantly better than proved by [4] for . But it is worse than for proved by [26], due to our trick of focusing on collisions for non-sitted players.

Now that the sub-optimal arms selections and the collisions are both proved to be at most logarithmic in Lemmas 4 and 5, it follows from our regret decomposition (Lemma 1) together with Lemma 3 that the regret of - is logarithmic. More precisely, one obtains a finite-time problem-depend upper bound on the regret of this algorithm.

Theorem 2 If all players use -, and , then for any problem , there exists a problem dependent constant , such that the regret satisfies:

6.3Discussion on

The analysis of is harder, but we tried our best to obtain some understanding of the behavior of this algorithm, that seems to be doing surpisingly well in many contexts, as in our experiments with arms and in extensive experiments not reported in this paper. However, a disappointing result is that we found simple problems, usually with small number of arms, for which the algorithm may fail. For example with or players competing for arms, with means , the histograms in Figure 9 suggests that with a small probability, the regret of - can be very large. We provide a discussion in Appendix E about when such situations may happen, including a conjectured (constant, but small) lower bound on the probability that experience collision almost at every round. This result would then prevent from having a logarithmic regret. However, it is to be noted that the lower bound of Theorem 1 does not apply to the censored observation model (III) under which operates, and it is not known yet whether logarithmic regret is at all possible.

7Conclusion and future work

To summarize, we presented three variants of Multi-Player Multi-Arm Bandits, with different level of feedback being available to the decentralized players, under which we proposed efficient algorithms. For the two easiest models –with sensing–, our theoretical contribution improves both the state-of-the-art upper and lower bounds on the regret. In the absence of sensing, we also provide some motivation for the practical use of the interesting heuristic, a simple index policy based on hybrid indices that are directly taking into account the collision information.

This work suggests several interesting further research directions. First, we want to investigate the notion of optimal algorithms in the decentralized multi-player model with sensing information. So far we provided the first matching upper and lower bound on the expected number of sub-optimal arms selections, which suggests some form of (asymptotic) optimality. However, sub-optimal draws turn out not be the dominant terms in the regret, both in our upper bounds and in practice, thus an interesting future work is to identify some notion of minimal number of collisions. Second, it remains an open question to know if a simple decentralized algorithm can be as efficient as without knowing in advance, or in dynamic settings (when can change in time). We shall start by proposing variants of our algorithm that are inspired by the variant of proposed by [4]. Finally, we want to strengthen the guarantees obtained in the absence of sensing, that is to know whether logarithmic regret is achievable and to have a better analysis of the approach. Indeed, in most cases, it performs comparably to even with limited feedback and without knowing the number of players , which makes it a good candidate for applications to Internet of Things networks.

ARegret Decomposisions

a.1Proof of Lemma 1

Using the definition of regret from , and this collision indicator ,

The last equality comes from the linearity of expectations, and the fact that (for all , from the i.i.d. hypothesis), and the independence from , and (observed after playing ). So . And so

For the first term, we have , and if we denote the average mean of the -best arms, then,

Let be the gap between the mean of the arm and the -best average mean, and if denotes the index of the worst of the -best arms (i.e., ), then by splitting into three disjoint sets , we get

But for , , so by recombining the terms, we obtain,

The term simplifies to , and so by definition of . And for , , so the first sum can be written for only, so

And so we obtain the decomposition with three terms (a), (b) and (c).

a.2Proof of Lemma 2

Note that term (c) is clearly lower bounded by but it is not obvious for (b) as there is no reason for to be upper bounded by . Let , where the notation stands for “there exists a unique”. Then can be decomposed as