Multi-Player Bandits Revisited

Lilian Besson^{1,2}, Emilie Kaufmann^{2}

^{1} CentraleSupélec, IETR–SCEE, Cesson-Sévigné, France, Lilian.Besson AT CentraleSupelec.fr

^{2} CNRS & CRIStAL, Univ. Lille 1, Inria SequeL, Lille, France, Emilie.Kaufmann AT Univ-Lille1.fr

This work is supported by the French National Research Agency (ANR), under the project BADASS (grant code: ANR-16-CE40-0002), by the CNRS, under the PEPS project BIO, by the French Ministry of Higher Education and Research (MENESR) and ENS Paris-Saclay. Thanks to Christophe Moy, Rémi Bonnefoi and Vincent Gouldieff for useful discussions.


Abstract

Multi-player Multi-Armed Bandits (MAB) have been extensively studied in the literature, motivated by applications to Cognitive Radio systems. Driven by such applications as well, we motivate the introduction of several levels of feedback for multi-player MAB algorithms. Most existing works assume that sensing information is available to the algorithm. Under this assumption, we improve the state-of-the-art lower bound for the regret of any decentralized algorithm and introduce two algorithms, $\mathrm{RandTopM}$ and $\mathrm{MCTopM}$, that are shown to empirically outperform existing algorithms. Moreover, we provide strong theoretical guarantees for these algorithms, including a notion of asymptotic optimality in terms of the number of selections of bad arms. We then introduce a promising heuristic, called $\mathrm{Selfish}$, that can operate without sensing information, which is crucial for emerging applications to Internet of Things networks. We investigate the empirical performance of this algorithm and provide some first theoretical elements for the understanding of its behavior.

Keywords:

Multi-Armed Bandits; Decentralized algorithms; Reinforcement learning; Cognitive Radio; Opportunistic Spectrum Access.

1 Introduction

Several sequential decision making problems under the constraint of partial information have been studied since the 1950s under the name of Multi-Armed Bandit (MAB) problems [25]. In a stochastic MAB model, an agent is facing $K$ unknown probability distributions, called arms in reference to the arms of a one-armed bandit (or slot machine) in a casino. Each time she selects (or draws) an arm, she receives a reward drawn from the associated distribution. Her goal is to build a sequential selection strategy that maximizes the total reward received. A class of algorithms to solve this problem is based on Upper Confidence Bounds (UCB), first proposed by [21] and further popularized by [6]. The field has been very active since then, with several algorithms proposed and analyzed, both theoretically and empirically, even beyond the stochastic assumption on arms, as explained in the survey by [11].

The initial motivation to study MAB problems arose from clinical trials (the first MAB model can be traced back to 1933, by [28]), in which a doctor sequentially allocates treatments (arms) to patients and observes their efficacy (reward). More recently, applications of MAB have shifted towards sequential content recommendation, e.g. sequential display of advertising to customers or A/B testing [22]. In the meantime, MAB were found to be relevant to the field of Cognitive Radio (CR, [24]), and [15] first proposed to use $\mathrm{UCB}_{1}$ for the Opportunistic Spectrum Access (OSA) problem, and successfully conducted experiments on real radio networks demonstrating its usefulness. For CR applications, each arm models the quality or availability of a radio channel (a frequency band) in which there is some background traffic (e.g., primary users paying to have a guaranteed access to the channel in the case of OSA). A smart radio device needs to insert itself in the background traffic, by sequentially choosing a channel to access and trying to communicate on it, seeking to optimize the quality of its global transmissions.

For the development of CR, a crucial step is to insert multiple $M \geq 2$ smart devices in the same background traffic. With the presence of a central controller that can assign the devices to separate channels, this amounts to choosing at each time step several arms of a MAB in order to maximize the global rewards, and can thus be viewed as an application of the multiple-play bandit, introduced by [5] and recently studied by [20]. Due to the communication cost implied by a central controller, a more relevant model is the decentralized multi-player multi-armed bandit model, introduced by [23] and [3], in which players select arms individually and collisions may occur, yielding a loss of reward. Further algorithms were proposed in similar models by [27] and [17] (under the assumption that each arm is a Markov chain) and by [8] and [26] (for i.i.d. or piece-wise i.i.d. arms). The goal for every player is to select most of the time one of the $M$ best arms, without colliding too often with other players. A first difficulty lies in the well-known trade-off between exploration and exploitation: players need to explore all arms to estimate their means while trying to focus on the best arms to gain as much reward as possible. The decentralized setting considers no exchange of information between players, who only know $K$ and $M$. To avoid collisions, players should furthermore find orthogonal configurations (i.e., the $M$ players use the $M$ best arms without any collision), without communicating. Hence, in that case the trade-off is to be found between exploration, exploitation and low collisions.

All the above-mentioned works are motivated by the OSA problem, in which it is assumed that sensing occurs, that is, each smart device observes the availability of a channel (sample from the arm) before trying to transmit and possibly experiencing a collision with other smart devices. However some real radio networks do not use sensing at all, e.g., emerging standards developed for Internet of Things (IoT) networks such as LoRaWAN. Thus, to take into account these new applications, algorithms with additional constraints on the available feedback have to be proposed within the multiple-player MAB model. In particular, the typical approach that combines a (single-player) bandit algorithm based on the sensing information (to learn the quality of the channels while targeting the best ones) with a low-complexity decentralized collision avoidance protocol, is no longer possible.

In this paper, we take a step back and present the different feedback levels possible for multi-player MAB algorithms. For each of them, we propose algorithmic solutions supported by both experimental and theoretical guarantees. In the presence of sensing information, our contributions are a new problem-dependent regret lower bound, tighter than previous work, and the introduction of two algorithms, $\mathrm{RandTopM}$ and $\mathrm{MCTopM}$. Both are shown to achieve an asymptotically optimal number of selections of the sub-optimal arms, and for $\mathrm{MCTopM}$ we furthermore establish a logarithmic upper bound on the regret, that follows from a careful control of the number of collisions. In the absence of sensing information, we propose the $\mathrm{Selfish}$ heuristic and investigate its performance. Our study of this algorithm is supported by (promising) empirical performance and some first (disappointing) theoretical elements.

The rest of the article is organized as follows. We introduce the multi-player bandit model with three feedback levels in Section 2, and give a new regret lower bound in Section 3. The $\mathrm{RandTopM}$, $\mathrm{MCTopM}$ and $\mathrm{Selfish}$ algorithms are introduced in Section 4, with the result of our experimental study reported in Section 5. Theoretical elements are then presented in Section 6.

2 Multi-Player Bandit Model with Different Feedback Levels

We consider a $K$-armed Bernoulli bandit model, in which arm $k$ is a Bernoulli distribution with mean $μ_{k} \in [0, 1]$. We denote $(Y_{k, t})_{t \in \mathbb{N}}$ the i.i.d. (binary) reward stream for arm $k$, that satisfies $\mathbb{P} (Y_{k, t} = 1) = μ_{k}$ and that is independent from the other reward streams. However we mention that our lower bound and all our algorithms (and their analysis) can be easily extended to one-dimensional exponential families (just like for the $\mathrm{kl}\text{-}\mathrm{UCB}$ algorithm of [12]). For simplicity, we focus on the Bernoulli case, that is also the most relevant for Cognitive Radio, as it can model channel availabilities.

In the multi-player MAB setting, there are $M \in \{1, \dots, K\}$ players (or agents), that have to make decisions at some pre-specified time instants. At time step $t \in \mathbb{N}, t \geq 1$, player $j$ selects an arm $A^{j} (t)$, independently from the other players’ selections. A collision occurs at time $t$ if at least two players choose the same arm. We introduce the two events, for $j \in \{1, \dots, M\}$ and $k \in \{1, \dots, K\}$,

$$C^{j} (t) := \{\exists j^{'} \neq j : A^{j^{'}} (t) = A^{j} (t)\} \quad \text{and} \quad C_{k} (t) := \{\# \{j : A^{j} (t) = k\} > 1\},$$

that respectively indicate that a collision occurs at time $t$ for player $j$ and that a collision occurs at time $t$ on arm $k$. Each player $j$ then receives (and observes) the binary reward $r^{j} (t) \in \{0, 1\}$,

$$r^{j} (t) := Y_{A^{j} (t), t} \, \mathbb{1} (\overline{C^{j} (t)}) .$$

In words, she receives the reward of the selected arm if she is the only one to select this arm, and a reward zero otherwise¹. Other models for rewards loss have been proposed in the literature (e.g., the reward is randomly allocated to one of the players selecting it), but we focus on full reward occlusion in this article.
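To make this reward model concrete, the following minimal Python sketch computes the rewards of one round from the players' choices and the arm samples. The function name `collision_rewards`, the 0-based arm indices and the list encoding are illustrative assumptions, not notation from the article.

```python
from collections import Counter

def collision_rewards(choices, samples):
    """Rewards of one round under full reward occlusion.

    choices[j] is the arm (0-based, an assumption of this sketch) chosen by
    player j, and samples[k] is the binary sample Y_{k,t} of arm k.
    """
    counts = Counter(choices)
    # Player j collides when at least one other player chose the same arm.
    collided = [counts[arm] > 1 for arm in choices]
    # The sample of the chosen arm is received only without collision.
    rewards = [0 if hit else samples[arm] for arm, hit in zip(choices, collided)]
    return rewards, collided

# Three players, three arms: players 1 and 2 both choose arm 2.
rewards, collided = collision_rewards([0, 2, 2], [1, 0, 1])
```

In this example, the first player is alone on an available arm and gets its sample, while the two other players collide and receive a zero reward even though the arm itself was available.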

A multi-player MAB strategy is a tuple $ρ = (ρ^{1}, \dots, ρ^{M})$ of arm selection strategies for each player, and the goal is to propose a strategy that maximizes the total reward of the system, under some constraints. First, each player $j$ should adopt a sequential strategy $ρ^{j}$ , that decides which arm to select at time $t$ based on previous observations. Previous observations for player $j$ at time $t$ always include the previously chosen arms $A^{j} (s)$ and received rewards $r^{j} (s)$ for $s < t$ , but may also include the sensing information $Y_{A^{j} (t), t}$ or the collision information $C^{j} (t)$ . More precisely, depending on the application, one may consider the following three observation models, (I), (II) and (III).

(I) Simultaneous sensing and collision

Player $j$ observes $Y_{A^{j} (t), t}$ and $C^{j} (t)$ (not previously studied).
(II) Sensing, then collision

Player $j$ observes $Y_{A^{j} (t), t}$, then observes the reward, and thus also $C^{j} (t)$ only if $Y_{A^{j} (t), t} = 1$. This common setup, studied for example by [4], is relevant to model the OSA problem: the device first checks for the presence of primary users in the chosen channel. If this channel is free ($Y_{A^{j} (t), t} = 1$), the transmission is successful ($r^{j} (t) = 1$) if no collision occurs with other smart devices ($\overline{C^{j} (t)}$).
(III) No sensing

Player $j$ only observes the reward $r^{j} (t)$. For IoT networks, this reward can be interpreted as an acknowledgement from a Base Station, received when a communication was successful. A lack of acknowledgement may be due to a collision with a device from the background traffic ($Y_{A^{j} (t), t} = 0$), or to a collision with one of the other players ($C^{j} (t)$). However, the sensing and collision information are censored. Recently, [10] presented the first (bandit-based) algorithmic solutions under this (harder) feedback model, in a slightly different setup, more suited to large scale IoT applications.

Under each of these three models, we define $F_{t}^{j}$ to be the filtration generated by the observations gathered by player $j$ up to time $t$ (which contains different information under models (I), (II) and (III)). While a centralized algorithm may select the vector of actions for all players $(A^{1} (t), \dots, A^{M} (t))$ based on all the observations from $⋃_{j} F_{t - 1}^{j}$ , under a decentralized algorithm the arm selected at time $t$ by player $j$ only depends on the past observation of this player. More formally, $A^{j} (t)$ is assumed to be $F_{t - 1}^{j}$ -measurable.

Definition 1 We denote by $μ_{1}^{*}$ the best mean, $μ_{2}^{*}$ the second best etc, and by $M\text{-best}$ the (non-sorted) set of the indices of the $M$ arms with largest mean (best arms): if $μ_{1}^{*} = μ_{k_{1}}, \dots, μ_{M}^{*} = μ_{k_{M}}$ then $M\text{-best} = \{k_{1}, \dots, k_{M}\}$. Similarly, $M\text{-worst}$ denotes the set of indices of the $K - M$ arms with smallest means (worst arms), $\{1, \dots, K\} \setminus M\text{-best}$. Note that they are both uniquely defined if $μ_{M}^{*} > μ_{M + 1}^{*}$.

Following a natural approach in the bandit literature, we evaluate the performance of a multi-player strategy using the expected regret, that measures the performance gap with respect to the best possible strategy. The regret of the strategy $ρ$ at horizon $T$ is the difference between the cumulated reward of an oracle strategy, assigning in this case the $M$ players to $M\text{-best}$, and the cumulated reward of strategy $ρ$:

$$R_{T} (μ, M, ρ) := \Big( \sum_{k = 1}^{M} μ_{k}^{*} \Big) T - \mathbb{E}_{μ} \Big[ \sum_{t = 1}^{T} \sum_{j = 1}^{M} r^{j} (t) \Big] . \tag{1}$$

Maximizing the expected sum of global reward of the system is indeed equivalent to minimizing the regret, and we now investigate the best possible regret rate of a decentralized multi-player algorithm.

3 An Asymptotic Regret Lower Bound

In this section, we provide a useful decomposition of the regret (Lemma 1) that allows us to establish a new problem-dependent lower bound on the regret (Theorem 1), and also provides key insights on the derivation of regret upper bounds (Lemma 3).

3.1 A Useful Regret Decomposition

We introduce additional notations in the following definition.

Definition 2 Let $T_{k}^{j} (T) := \sum_{t = 1}^{T} \mathbb{1} (A^{j} (t) = k)$, and denote $T_{k} (T) := \sum_{j = 1}^{M} T_{k}^{j} (T)$ the number of selections of arm $k \in \{1, \dots, K\}$ by any player $j \in \{1, \dots, M\}$, up to time $T$.

Let $C_{k} (T)$ be the number of colliding players² on arm $k \in \{1, \dots, K\}$ up to horizon $T$:

$$C_{k} (T) := \sum_{t = 1}^{T} \sum_{j = 1}^{M} \mathbb{1} (C^{j} (t)) \, \mathbb{1} (A^{j} (t) = k) .$$

Letting $P_{M} = \{μ \in [0, 1]^{K} : μ_{M}^{*} > μ_{M + 1}^{*}\}$ be the set of bandit instances such that there is a strict gap between the $M$ best arms and the other arms, we now provide a regret decomposition for any $μ \in P_{M}$.

Lemma 1 For any bandit instance $μ \in P_{M}$ such that $μ_{M}^{*} > μ_{M + 1}^{*}$, it holds that (proved in Appendix A.1)

$$R_{T} (μ, M, ρ) = \underbrace{\sum_{k \in M\text{-worst}} (μ_{M}^{*} - μ_{k}) \, \mathbb{E}_{μ} [T_{k} (T)]}_{(a)} + \underbrace{\sum_{k \in M\text{-best}} (μ_{k} - μ_{M}^{*}) \, (T - \mathbb{E}_{μ} [T_{k} (T)])}_{(b)} + \underbrace{\sum_{k = 1}^{K} μ_{k} \, \mathbb{E}_{μ} [C_{k} (T)]}_{(c)} . \tag{2}$$

In this decomposition, term (a) counts the lost rewards due to sub-optimal arms selections ($k \in M\text{-worst}$), term (b) counts the number of times the best arms were not selected ($k \in M\text{-best}$), and term (c) counts the weighted number of collisions, on all arms. It is valid for both centralized and decentralized algorithms. For centralized algorithms, due to the absence of collisions, (c) is obviously zero, and (b) is non-negative, as $T_{k} (T) \leq T$. For decentralized algorithms, (c) may be large, and term (b) may be negative, as many collisions on arm $k$ may lead to $T_{k} (T) > T$. However, a careful manipulation of this decomposition (see Appendix A.2) shows that the regret is always lower bounded by term (a).

Lemma 2 For any strategy $ρ$ and $μ \in P_{M}$, it holds that $R_{T} (μ, M, ρ) \geq \sum_{k \in M\text{-worst}} (μ_{M}^{*} - μ_{k}) \, \mathbb{E}_{μ} [T_{k} (T)]$.
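The decomposition of Lemma 1 and the bound of Lemma 2 can be checked numerically on a single trajectory. The sketch below is only an illustration under stated assumptions: expectations are replaced by the counts observed on one run, each non-colliding selection of arm $k$ is credited with its mean $μ_{k}$ rather than a random sample, and the function name `regret_terms` and the 0-based arm indices are hypothetical.

```python
import random

def regret_terms(mu, M, choices):
    """Return (regret, a, b, c) for one trajectory.

    choices[t][j] is the arm (0-based) selected by player j at step t.
    """
    K, T = len(mu), len(choices)
    order = sorted(range(K), key=lambda k: mu[k], reverse=True)
    best, worst = order[:M], order[M:]
    mu_M = mu[order[M - 1]]            # M-th best mean
    T_k = [0] * K                      # selections of each arm, over all players
    C_k = [0] * K                      # colliding players on each arm
    for row in choices:
        for arm in row:
            T_k[arm] += 1
            if row.count(arm) > 1:
                C_k[arm] += 1
    a = sum((mu_M - mu[k]) * T_k[k] for k in worst)
    b = sum((mu[k] - mu_M) * (T - T_k[k]) for k in best)
    c = sum(mu[k] * C_k[k] for k in range(K))
    # Mean reward collected: mu_k for every non-colliding selection of arm k.
    collected = sum(mu[k] * (T_k[k] - C_k[k]) for k in range(K))
    regret = T * sum(mu[k] for k in best) - collected
    return regret, a, b, c

rng = random.Random(0)
mu, M = [0.1, 0.5, 0.9], 2
# Uniformly random selections: a poor strategy, used only to test the identity.
choices = [[rng.randrange(len(mu)) for _ in range(M)] for _ in range(200)]
regret, a, b, c = regret_terms(mu, M, choices)
```

The identity holds here because the total number of selections $\sum_{k} T_{k} (T)$ always equals $M T$, which is the key step of the decomposition.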

3.2 An Improved Asymptotic Lower Bound on the Regret

To express our lower bound, we need to introduce $\mathrm{kl} (x, y)$ as the Kullback-Leibler divergence between the Bernoulli distribution of mean $x \neq 0, 1$ and that of mean $y \neq 0, 1$, so that $\mathrm{kl} (x, y) := x \log (x / y) + (1 - x) \log ((1 - x) / (1 - y))$. We first introduce the assumption under which we derive a regret lower bound, that generalizes a classical assumption made by [21] in single-player bandit models.

Definition 3 A strategy $ρ$ is strongly uniformly efficient if for all $μ \in P_{M}$ and for all $α \in (0, 1)$,

$$R_{T} (μ, M, ρ) \underset{T \to + \infty}{=} o (T^{α}) \quad \text{and} \quad \forall j \in \{1, \dots, M\}, k \in M\text{-best}, \; \frac{T}{M} - \mathbb{E}_{μ} [T_{k}^{j} (T)] \underset{T \to + \infty}{=} o (T^{α}) . \tag{3}$$

Having a small regret on every problem instance, i.e., uniform efficiency, is a natural assumption for algorithms, that rules out algorithms tuned to perform well on specific instances only. From this assumption ($R_{T} (μ, M, ρ) = o (T^{α})$) and the decomposition of Lemma 1 one can see³ that for every $k \in M\text{-best}$, $T - \mathbb{E}_{μ} [T_{k} (T)] = o (T^{α})$, and so

$$\sum_{j = 1}^{M} \Big( \frac{T}{M} - \mathbb{E}_{μ} [T_{k}^{j} (T)] \Big) = o (T^{α}) . \tag{4}$$

The additional assumption in (3) further implies some notion of fairness, as it suggests that each of the $M$ players spends on average the same amount of time on each of the $M$ best arms. Note that this assumption is satisfied by any strategy that is invariant under every permutation of the players, i.e., for which the distribution of the observations under $ρ^{γ} = (ρ^{γ (1)}, \dots, ρ^{γ (M)})$ is independent from the choice of permutation $γ \in Σ_{M}$. In that case, it holds that $\mathbb{E}_{μ} [T_{k}^{j} (T)] = \mathbb{E}_{μ} [T_{k}^{j^{'}} (T)]$ for every arm $k$ and every pair of players $j, j^{'} \in \{1, \dots, M\}$, hence (3) and (4) are equivalent, and strong uniform efficiency is equivalent to standard uniform efficiency. Note that all our proposed algorithms are permutation invariant and $\mathrm{MCTopM}$ is thus an example of a strongly uniformly efficient algorithm, as we prove in Section 6 that its regret is logarithmic on every instance $μ \in P_{M}$.

We now state a problem-dependent asymptotic lower bound on the number of sub-optimal arms selections under a decentralized strategy that has access to the sensing information. This result, proved in Appendix B, yields an asymptotic logarithmic lower bound on the regret, also given in Theorem 1.

Theorem 1 Under observation models (I) and (II), for any strongly uniformly efficient decentralized policy $ρ$ and $μ \in P_{M}$,

$$\forall j \in \{1, \dots, M\}, \forall k \in M\text{-worst}, \quad \liminf_{T \to \infty} \frac{\mathbb{E}_{μ} [T_{k}^{j} (T)]}{\log (T)} \geq \frac{1}{\mathrm{kl} (μ_{k}, μ_{M}^{*})} . \tag{5}$$

From Lemma 2, it follows that

$$\liminf_{T \to + \infty} \frac{R_{T} (μ, M, ρ)}{\log (T)} \geq M \times \left( \sum_{k \in M\text{-worst}} \frac{(μ_{M}^{*} - μ_{k})}{\mathrm{kl} (μ_{k}, μ_{M}^{*})} \right) . \tag{6}$$

Observe that the regret lower bound is tighter than the state-of-the-art lower bound in this setup given by [23], that states that

$$\liminf_{T \to + \infty} \frac{R_{T} (μ, M, ρ)}{\log (T)} \geq \sum_{k \in M\text{-worst}} \left( \sum_{j = 1}^{M} \frac{(μ_{M}^{*} - μ_{k})}{\mathrm{kl} (μ_{k}, μ_{j}^{*})} \right), \tag{7}$$

as for every $k \in M\text{-worst}$ and $j \in \{1, \dots, M\}$, $\mathrm{kl} (μ_{k}, μ_{j}^{*}) \geq \mathrm{kl} (μ_{k}, μ_{M}^{*})$ (see Figure 7 in Appendix F.1). It is worth mentioning that [23] proved a lower bound under the more general assumption for $ρ$ that there exist some numbers $(a_{k}^{j})$ such that $a_{k}^{j} T - \mathbb{E}_{μ} [T_{k}^{j} (T)] = o (T^{α})$, whereas in Definition 3 we make the choice $a_{k}^{j} = 1 / M$. Our result could be extended to this case but we chose to keep the notation simple and focus on fair allocation of the optimal arms between players.
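The comparison between the two lower bounds can be illustrated numerically. The following sketch evaluates the constant in front of $\log (T)$ in our bound and in the bound of [23] for a given vector of means. The helper names `kl` and `lower_bound_constants` are illustrative, and the instance used is the second example problem considered later in the experiments, with $M = 6$.

```python
import math

def kl(x, y):
    """Bernoulli Kullback-Leibler divergence, for x and y in (0, 1)."""
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

def lower_bound_constants(mu, M):
    """Constants multiplying log(T): (our bound, bound of [23])."""
    sorted_mu = sorted(mu, reverse=True)
    best, worst = sorted_mu[:M], sorted_mu[M:]
    mu_M = best[-1]                    # M-th best mean
    ours = M * sum((mu_M - m) / kl(m, mu_M) for m in worst)
    previous = sum((mu_M - m) / kl(m, mu_j) for m in worst for mu_j in best)
    return ours, previous

mu = [i / 10 for i in range(1, 10)]    # means 0.1, 0.2, ..., 0.9
ours, previous = lower_bound_constants(mu, M=6)
```

Each term of the double sum in the previous bound uses a divergence that is at least as large as $\mathrm{kl} (μ_{k}, μ_{M}^{*})$, so the first returned constant is never smaller than the second one.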

Interestingly, our lower bound is exactly a multiplicative constant factor $M$ away from the lower bound given by [5] for centralized algorithms (which is clearly a simpler setting). This intuitively suggests the number of players $M$ as the (multiplicative) “price of decentralized learning”. However, to establish our regret bound, we lower bounded the number of collisions by zero, which may be too optimistic. Indeed, for an algorithm to attain the lower bound (6), the number of selections of each sub-optimal arm should match the lower bound (5) and term (b) and term (c) in the regret decomposition of Lemma 1 should be negligible compared to $\log (T)$. To the best of our knowledge, no algorithm has been shown to experience only $o (\log (T))$ collisions so far, for every $M \in \{2, \dots, K\}$ and $μ \in P_{M}$.

A lower bound on the minimal number of collisions experienced by any strongly uniformly efficient decentralized algorithm would thus be a nice complement to our Theorem 1, and it is left as future work.

3.3 Towards Regret Upper Bounds

A natural approach to obtain an upper bound on the regret of an algorithm is to upper bound separately each of the three terms defined in Lemma 1. The following result shows that term $(b)$ can be related to the number of sub-optimal selections and the number of collisions that occur on the $M$ best arms.

Lemma 3 The term $(b)$ in Lemma 1 is upper bounded as (proved in Appendix A.3)

$$(b) \leq (μ_{1}^{*} - μ_{M}^{*}) \left( \sum_{k \in M\text{-worst}} \mathbb{E}_{μ} [T_{k} (T)] + \sum_{k \in M\text{-best}} \mathbb{E}_{μ} [C_{k} (T)] \right) . \tag{8}$$

This result can also be used to recover Proposition 1 from [4], giving an upper bound on the regret that only depends on the expected number of sub-optimal selections, $\mathbb{E}_{μ} [T_{k} (T)]$ for $k \in M\text{-worst}$, and the expected number of colliding players on the optimal arms, $\mathbb{E}_{μ} [C_{k} (T)]$ for $k \in M\text{-best}$. Note that, in term (c) the number of colliding players on the sub-optimal arm $k$ may be upper bounded as $\mathbb{E}_{μ} [C_{k} (T)] \leq M \, \mathbb{E}_{μ} [T_{k} (T)]$.

In the next Section, we present an algorithm that has a logarithmic regret, while ensuring that the number of sub-optimal selections is matching the lower bound of Theorem 1.

4 New Algorithms for Multi-Player Bandits

When sensing is possible, that is under observation models (I) and (II), most existing strategies build on a single-player bandit algorithm (usually an index policy) that relies on the sensing information, together with an orthogonalization strategy to deal with collisions. We present this approach in more detail in Section 4.1 and introduce two new algorithms of this kind, $\mathrm{RandTopM}$ and $\mathrm{MCTopM}$. Then, we suggest in Section 4.2 a completely different approach, called $\mathrm{Selfish}$, that no longer requires an orthogonalization strategy as the collisions are directly accounted for in the indices that are used. $\mathrm{Selfish}$ can also be used under observation model (III), that is without sensing, and without the knowledge of $M$.

4.1 Two New Strategies Based on Indices and Orthogonalization: $\mathrm{RandTopM}$ and $\mathrm{MCTopM}$

In a single-player setting, index policies are popular bandit algorithms: at each round one index is computed for each arm, that only depends on the history of plays of this arm and (possibly) some exogenous randomness. Then, the arm with highest index is selected. This class of algorithms includes the UCB family, in which the index of each arm is an Upper Confidence Bound for its mean, but also some Bayesian algorithms like Bayes-UCB [18] or the randomized Thompson Sampling algorithm [28].

The approaches we now describe for multi-player bandits can be used in combination with any index policy, but we restrict our presentation to UCB algorithms, for which strong theoretical guarantees can be obtained. In particular, we focus on two types of indices: $\mathrm{UCB}_{1}$ indices [6] and $\mathrm{kl}\text{-}\mathrm{UCB}$ indices [12], that can be defined for each player $j$ in the following way. Letting $S_{k}^{j} (t) := \sum_{s = 1}^{t} Y_{k, s} \mathbb{1} (A^{j} (s) = k)$ be the current sum of sensing information obtained by player $j$ for arm $k$, $\widehat{μ}_{k}^{j} (t) = S_{k}^{j} (t) / T_{k}^{j} (t)$ (if $T_{k}^{j} (t) \neq 0$) is the empirical mean of arm $k$ for player $j$ and one can define the index

$$g_{k}^{j} (t) := \begin{cases} \widehat{μ}_{k}^{j} (t) + \sqrt{f (t) / (2 T_{k}^{j} (t))} & \text{for } \mathrm{UCB}_{1}, \\ \sup \{q \in [0, 1] : T_{k}^{j} (t) \times \mathrm{kl} (\widehat{μ}_{k}^{j} (t), q) \leq f (t)\} & \text{for } \mathrm{kl}\text{-}\mathrm{UCB}, \end{cases} \tag{9}$$

where $f (t)$ is some exploration function. $f (t)$ is usually taken to be $log (t)$ in practice, and slightly larger in theory, which ensures that $P (g_{k}^{j} (t) \geq μ_{k}) ≳ 1 - 1 / t$ (see [12]). A classical (single-player) UCB algorithm aims at the arm with largest index. However, if each of the $M$ players selects the arm with largest UCB, all the players will end up colliding most of the time on the best arm. To circumvent this problem, several coordination mechanisms have emerged, that rely on ordering the indices and targeting one of the $M$ -best indices.

While the $\mathrm{TDFS}$ algorithm [23] relies on the players agreeing in advance on the time steps at which they will target each of the $M$ best indices (even though some alternatives without pre-agreement are proposed), the $\mathrm{RhoRand}$ algorithm [4] relies on randomly selected ranks. More formally, letting $π (k, g)$ be the index of the $k$-th largest entry in a vector $g$, in $\mathrm{RhoRand}$ each player maintains at time $t$ an internal rank $R^{j} (t) \in \{1, \dots, M\}$ and selects at time $t$,

$$A^{j} (t) := π \big( R^{j} (t), [g_{ℓ}^{j} (t)]_{ℓ = 1, \dots, K} \big) .$$

If a collision occurs, a new rank is drawn uniformly at random: $R^{j} (t + 1) \sim U (\{1, \dots, M\})$.
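The rank-based selection rule described above can be sketched in a few lines. This is only an illustration of the two steps (selecting the arm of a given rank, and redrawing the rank after a collision), with hypothetical function names, 0-based arm indices and arbitrary tie-breaking between equal indices.

```python
import random

def kth_largest_arm(rank, indices):
    """pi(rank, g): arm (0-based) holding the rank-th largest index, rank >= 1."""
    order = sorted(range(len(indices)), key=lambda k: indices[k], reverse=True)
    return order[rank - 1]

def next_rank(rank, collided, M, rng=random):
    """Keep the current rank, unless a collision occurred at this step."""
    return rng.randint(1, M) if collided else rank

rng = random.Random(1)
indices = [0.3, 0.9, 0.5]
arm = kth_largest_arm(2, indices)      # arm with the second largest index
kept = next_rank(2, collided=False, M=2, rng=rng)
redrawn = next_rank(2, collided=True, M=2, rng=rng)
```

Note that a redrawn rank may be equal to the previous one, since the new rank is uniform on all $M$ values.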

We now propose two alternatives to this strategy, that do not rely on ranks and rather randomly fix themselves on one arm in $\widehat{M}^{j} (t)$, that is defined as the set of arms that have the $M$ largest indices:

$$\widehat{M}^{j} (t) := \big\{ π (k, \{g_{ℓ}^{j} (t)\}_{ℓ = 1, \dots, K}), \; k = 1, \dots, M \big\} .$$

Our proposal $\mathrm{MCTopM}$ is stated below as Algorithm Figure 1, while a simpler variant, called $\mathrm{RandTopM}$, is stated as Algorithm Figure 4 in Appendix C. We focus on $\mathrm{MCTopM}$ as it is easier to analyze and performs better. Both algorithms ensure that player $j$ always selects at time $t + 1$ an arm from $\widehat{M}^{j} (t)$. When a collision occurs, $\mathrm{RandTopM}$ randomly switches arm within $\widehat{M}^{j} (t)$, while $\mathrm{MCTopM}$ uses a more sophisticated mechanism, that is reminiscent of “Musical Chair” (MC) and inspired by the work of [26]: players tend to fix themselves on arms (“chairs”) and ignore future collisions when this happens.

Figure 1: The $\mathrm{MCTopM}$ decentralized learning policy (for a fixed underlying index policy $g^{j}$).

More precisely, under $\mathrm{MCTopM}$, if player $j$ did not encounter a collision when using arm $k$ at time $t$, then she marks her current arm as a “chair” ($s^{j} (t + 1) = \mathrm{True}$), and will keep using it even if collisions happen in the future (Lines 9-11). As soon as this “chair” $k$ is no longer in $\widehat{M}^{j} (t)$ (the set of the $M$ arms with largest indices), a new arm is sampled uniformly from a subset of $\widehat{M}^{j} (t)$, defined with the previous indices $g^{j} (t - 1)$ (Lines 3-5). The subset enforces a certain inequality on indices, $g_{k^{'}}^{j} (t - 1) \leq g_{k}^{j} (t - 1)$ and $g_{k^{'}}^{j} (t) \geq g_{k}^{j} (t)$, when switching from $k = A^{j} (t)$ to $k^{'} = A^{j} (t + 1)$. This helps to control the number of such changes of arm, as shown in Lemma 5. The considered subset is never empty as it contains at least the arm replacing $k \in \widehat{M}^{j} (t - 1)$ in $\widehat{M}^{j} (t)$. Collisions are dealt with only for a non-fixed player $j$, and when the previous arm is still in $\widehat{M}^{j} (t)$. In this case, a new arm is sampled uniformly from $\widehat{M}^{j} (t)$ (Lines 6-8). This stationary aspect helps to minimize the number of collisions, as well as the number of switches of arm. The five different transitions $(1)$, $(2)$, $(3)$, $(4)$, $(5)$ refer to the notations used in the analysis of $\mathrm{MCTopM}$ (see Figure 5 in Appendix D.3).
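The three cases of this description can be summarized by a short update function for one player. The sketch below is written from the prose above and is not a transcription of the algorithm of Figure 1: the function names, the 0-based arm indices, the tie-breaking in the sort and the fallback used when the candidate subset is empty (which can only matter in the very first round) are assumptions of this example.

```python
import random

def top_m(indices, M):
    """Set of the M arms (0-based) with the largest indices."""
    order = sorted(range(len(indices)), key=lambda k: indices[k], reverse=True)
    return set(order[:M])

def mctopm_step(arm, fixed, collided, g_prev, g_now, M, rng=random):
    """One update for player j, returning (next arm, next 'fixed' flag).

    arm is A^j(t), fixed is s^j(t), collided is C^j(t), g_prev holds the
    indices g^j(t-1) and g_now holds the indices g^j(t).
    """
    top = top_m(g_now, M)
    if arm not in top:
        # The current arm left the top-M set: move to an arm of the new
        # set whose previous index was not larger, and lose the chair.
        candidates = [k for k in top if g_prev[k] <= g_prev[arm]]
        if not candidates:             # assumption: fallback of this sketch
            candidates = list(top)
        return rng.choice(sorted(candidates)), False
    if collided and not fixed:
        # Collision for a non-fixed player: resample uniformly in the set.
        return rng.choice(sorted(top)), False
    # No collision, or already sitting on a chair: stay and be fixed.
    return arm, True

rng = random.Random(0)
stay = mctopm_step(0, False, False, [0.9, 0.5, 0.1], [0.9, 0.5, 0.1], 2, rng)
ignore = mctopm_step(0, True, True, [0.9, 0.5, 0.1], [0.9, 0.5, 0.1], 2, rng)
leave = mctopm_step(1, True, False, [0.9, 0.5, 0.4], [0.9, 0.3, 0.6], 2, rng)
resample = mctopm_step(0, False, True, [0.9, 0.5, 0.1], [0.9, 0.5, 0.1], 2, rng)
```

In the third call, arm 1 drops out of the two best indices and only arm 2 had a smaller previous index, so the player moves to arm 2 and is no longer fixed.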

4.2 The $\mathrm{Selfish}$ Approach

Under observation model (III) no sensing information is available and the previous algorithms cannot be used, as the sum of sensing information $S_{k}^{j} (t)$ and thus the empirical mean $\widehat{μ}_{k}^{j} (t)$ cannot be computed, hence neither can the indices $g_{k}^{j} (t)$. However, one can still define a notion of empirical reward received from arm $k$ by player $j$, by introducing

$$\widetilde{S}_{k}^{j} (t) := \sum_{s = 1}^{t} r^{j} (s) \, \mathbb{1} (A^{j} (s) = k) \quad \text{and letting} \quad \widetilde{μ}_{k}^{j} (t) := \widetilde{S}_{k}^{j} (t) / T_{k}^{j} (t) .$$

Note that $\widetilde{μ}_{k}^{j} (t)$ is no longer meant to be an unbiased estimate of $μ_{k}$ as it also takes into account the collision information, that is present in the reward. Based on this empirical reward, one can similarly define modified indices as

$$\widetilde{g}_{k}^{j} (t) := \begin{cases} \widetilde{μ}_{k}^{j} (t) + \sqrt{f (t) / (2 T_{k}^{j} (t))} & \text{for } \mathrm{UCB}_{1}, \\ \sup \{q \in [0, 1] : T_{k}^{j} (t) \times \mathrm{kl} (\widetilde{μ}_{k}^{j} (t), q) \leq f (t)\} & \text{for } \mathrm{kl}\text{-}\mathrm{UCB} . \end{cases} \tag{10}$$

Given any of these two index policies ($\mathrm{UCB}_{1}$ or $\mathrm{kl}\text{-}\mathrm{UCB}$), the $\mathrm{Selfish}$ algorithm is defined by

$$A^{j} (t) = \mathop{\arg\max}_{k \in \{1, \dots, K\}} \widetilde{g}_{k}^{j} (t - 1) . \tag{11}$$

The name comes from the fact that each player is targeting, in a “selfish” way, the arm that has the highest index, instead of accepting to target only one of the $M$ best. The reason that this may work precisely comes from the fact that $\widetilde{g}_{k}^{j} (t)$ is no longer an upper-confidence bound on $μ_{k}$, but some hybrid index that simultaneously increases when a transmission occurs and decreases when a collision occurs.

This behavior is easier to understand in the case of $\mathrm{Selfish}$-$\mathrm{UCB}_{1}$ in which, letting $N_{k}^{j, C} (t) = \sum_{s = 1}^{t} \mathbb{1} (C^{j} (s)) \, \mathbb{1} (A^{j} (s) = k)$ be the number of collisions of player $j$ on arm $k$, one can show that the hybrid $\mathrm{Selfish}$ index induces a penalty proportional to the fraction of collisions on this arm and the quality of the arm itself:

$$\widetilde{g}_{k}^{j} (t) = g_{k}^{j} (t) - \underbrace{\left( \frac{N_{k}^{j, C} (t)}{T_{k}^{j} (t)} \right)}_{\text{fraction of collisions}} \underbrace{\left( \frac{1}{N_{k}^{j, C} (t)} \sum_{s = 1}^{t} Y_{A^{j} (s), s} \, \mathbb{1} (C^{j} (s)) \, \mathbb{1} (A^{j} (s) = k) \right)}_{\text{estimate of } μ_{k}} .$$
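This identity is easy to verify on a small example. The sketch below computes, for one player and one arm, the sensing-based $\mathrm{UCB}_{1}$ index, the reward-based index used by the heuristic without sensing, and the penalty term, assuming $f (t) = \log (t)$. The function name and the encoding of the history as a list of `(sample, collided)` pairs are illustrative assumptions.

```python
import math

def selfish_ucb1_decomposition(history, t):
    """Return (g, g_tilde, penalty) for one player and one arm.

    history lists (y, collided) for the rounds where the player chose the
    arm: y is the sensed sample and collided the collision indicator.
    """
    pulls = len(history)
    bonus = math.sqrt(math.log(t) / (2 * pulls))
    # Index based on the sensing information.
    g = sum(y for y, _ in history) / pulls + bonus
    # Index based on the rewards only: a sample counts without collision.
    g_tilde = sum(y for y, hit in history if not hit) / pulls + bonus
    n_coll = sum(1 for _, hit in history if hit)
    if n_coll == 0:
        return g, g_tilde, 0.0
    fraction = n_coll / pulls
    mean_when_colliding = sum(y for y, hit in history if hit) / n_coll
    return g, g_tilde, fraction * mean_when_colliding

history = [(1, False), (1, True), (0, False), (1, True), (0, True)]
g, g_tilde, penalty = selfish_ucb1_decomposition(history, t=20)
```

Here three of the five selections collided and two of these three samples were equal to one, so the penalty is $(3 / 5) \times (2 / 3) = 0.4$, which is exactly the gap between the two indices.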

From a bandit perspective, it looks like each player is using a stochastic bandit algorithm ( ${U C B}_{1}$ or $k l - U C B$ ) when interacting with $K$ arms that give a feedback (the reward, and not the sensing information) that is far from being i.i.d. from some distribution, due to the collisions. As such, the algorithm does not appear to be well justified, and one may rather want to use adversarial bandit algorithms like $E X P 3$ [7], that do not require a stochastic (i.i.d.) assumption on arms. However, we found out empirically that $S e l f i s h$ is doing surprisingly well, as already noted by [10], who did some experiments in the context of IoT applications. We show in Section 6 that $S e l f i s h$ does have a (very) small probability to fail (badly), for some problem with small $K$ , which precludes the possibility of a logarithmic regret for any problem. However, in most cases it empirically performs similarly to all the algorithms described before, and usually outperforms $R h o R a n d$ , even if it neither exploits the sensing information, nor the knowledge of the number of players $M$ . As such, practitioners may still be interested by the algorithm, especially for Cognitive Radio applications in which sensing is hard or not considered.

5Empirical performances

We illustrate here the empirical performances of the algorithms presented in Section 4, used with two index policies, ${U C B}_{1}$ and $k l - U C B$ . Some plots are at pages and and most of them in Appendix F.2.

As expected, the same algorithm always performs better with $\mathrm{kl}\text{-}\mathrm{UCB}$ than with $\mathrm{UCB}_{1}$, as the theoretical guarantees for the single-player case suggest. Since the main goal of this work is not to optimize the index policy but rather to propose new ways of using indices in a decentralized setting⁴, we only consider $\mathrm{kl}\text{-}\mathrm{UCB}$ in the experiments. We use two example problems for the illustrations. The first one has $K = 3$ arms and means⁵ $μ = [0.1, 0.5, 0.9]$, for which the two cases $M = 2$ and $M = 3$ are presented in Figure 9. The second one has $K = 9$ arms and means $μ = [0.1, 0.2, \dots, 0.9]$, for which three cases are presented: $M = 6$ in Figure 3, and the two limit cases $M = 2$ and $M = 9 = K$ in Figure 14.

Performance is measured using the average regret over $1000$ repetitions on the same problem, from $t = 1$ to the horizon $T = 10000$ (also unknown to the algorithms). We also include histograms showing the distribution of the regret at $t = T$, as this allows us to check whether the regret is indeed small for each run of the simulation. On the plots showing the regret, our asymptotic lower bound from Theorem 1 is displayed.

Experiments with a different problem for each repetition (uniformly sampled $μ \sim U([0, 1]^{K})$) are also considered, in Figure 2 and Figure 12. This helps to check that, whatever the complexity of the considered problem (one measure of complexity being the constant in our lower bound), $\mathrm{MCTopM}$ performs similarly to or better than all the other algorithms, and $\mathrm{Selfish}$ outperforms $\mathrm{RhoRand}$ in most cases. Figure 2 is a good example of the outstanding performance of $\mathrm{MCTopM}$ and $\mathrm{Selfish}$ in comparison to $\mathrm{RhoRand}$.

Figure 2: Regret, $M = 6$ players, $K = 9$ arms, horizon $T = 5000$, against $500$ problems $μ$ uniformly sampled in $[0, 1]^{K}$. $\mathrm{RhoRand}$ (top blue curve) is outperformed by the other algorithms (and the gain increases with $M$). $\mathrm{MCTopM}$ (bottom yellow) outperforms all the other algorithms in most cases.

Empirically, our proposals were found to almost always outperform $\mathrm{RhoRand}$. Except for $\mathrm{Selfish}$, which can fail badly on problems with small $K$, we verified that $\mathrm{MCTopM}$ outperforms the state-of-the-art algorithms on many different problems, and is more and more efficient as $M$ and $K$ grow.

Two other multi-player multi-armed bandit algorithms are $\mathrm{MEGA}$ by [8] and $\mathrm{MusicalChair}$ by [26], which were proposed for the observation model (II). We chose not to include comparisons with either of them, as both were found hard to use efficiently in practice. Indeed, $\mathrm{MEGA}$ needs a careful tuning of five parameters ($c$, $d$, $p_{0}$, $α$ and $β$) to attain reasonable performance (and using cross validation, as the authors suggested, is usually considered out of the scope of online sequential learning). $\mathrm{MusicalChair}$ was shown to attain a constant regret (with high probability) only for very large horizons, and empirically it requires a fine tuning of its parameter $T_{0}$, as the theoretical value requires knowing the smallest gap between two arms, which is unavailable in practice.

Figure 3: Regret (in $\log\log$ scale), for $M = 6$ players and $K = 9$ arms, horizon $T = 5000$, for $1000$ repetitions on problem $μ = [0.1, \dots, 0.9]$. $\mathrm{RandTopM}$ (yellow curve) outperforms $\mathrm{Selfish}$ (green), and both clearly outperform $\mathrm{RhoRand}$. The regret of $\mathrm{MCTopM}$ is logarithmic, empirically with the same slope as the lower bound. The $x$ axis of the regret histograms has a different scale for each algorithm.

6 Theoretical elements

Section 6.1 gives an asymptotically optimal analysis of the expected number of sub-optimal draws for $\mathrm{RandTopM}$, $\mathrm{MCTopM}$ and $\mathrm{RhoRand}$ combined with $\mathrm{kl}\text{-}\mathrm{UCB}$ indices, and Section 6.2 proves that the number of collisions, and hence the regret, of $\mathrm{MCTopM}$ is also logarithmic. Section 6.3 briefly discusses a disappointing result regarding $\mathrm{Selfish}$, with more insights provided in Appendix E.

6.1 Common Analysis for $\mathrm{RandTopM}$- and $\mathrm{MCTopM}$-$\mathrm{kl}\text{-}\mathrm{UCB}$

Lemma 4 gives a finite-time upper bound on the expected number of draws of a sub-optimal arm $k$ for any player $j$, which holds for both $\mathrm{RandTopM}$-$\mathrm{kl}\text{-}\mathrm{UCB}$ and $\mathrm{MCTopM}$-$\mathrm{kl}\text{-}\mathrm{UCB}$. Our improved analysis also applies to the $\mathrm{RhoRand}$ algorithm. Explicit formulas for $C_{μ}$ and $D_{μ}$ are given in the proof in Appendix D.1.

Lemma 4 For any $μ \in P_{M}$, if player $j \in \{1, \dots, M\}$ uses the $\mathrm{RandTopM}$-, $\mathrm{MCTopM}$- or $\mathrm{RhoRand}$-$\mathrm{kl}\text{-}\mathrm{UCB}$ decentralized policy with exploration function $f(t) = \log(t) + 3 \log\log(t)$, then for any sub-optimal arm $k \in M\text{-worst}$, there exist two problem-dependent constants $C_{μ}$, $D_{μ}$ such that

$$E_{\mu}[T_{k}^{j}(T)] \leq \frac{\log(T)}{\mathrm{kl}(\mu_{k}, \mu_{M}^{*})} + C_{\mu} \sqrt{\log(T)} + D_{\mu} \log\log(T) + 3M + 1. \tag{12}$$

It is important to notice that the leading constant in front of $\log(T)$ is the same as the constant featured in the lower bound of Theorem 1. This result proves that the lower bound on sub-optimal selections is asymptotically matched by the three considered algorithms. This is a strong improvement over the previous state-of-the-art results [23].
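To make the role of the exploration function concrete, here is a minimal sketch of a $\mathrm{kl}\text{-}\mathrm{UCB}$ index for Bernoulli rewards, assuming the usual definition of the index as the largest $q$ such that $N \times \mathrm{kl}(\hat{\mu}, q) \leq f(t)$. The bisection solver, the clamping constant `eps` and the choice to set the exploration budget to zero while $f(t)$ would be negative (very small $t$) are illustrative assumptions, not details taken from the paper.

```python
import math


def kl_bernoulli(x, y, eps=1e-15):
    """Binary relative entropy kl(x, y), with arguments clamped away from 0 and 1."""
    x = min(max(x, eps), 1 - eps)
    y = min(max(y, eps), 1 - eps)
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))


def exploration(t):
    """f(t) = log(t) + 3 log log(t), clamped at 0 for small t (an assumption)."""
    if t < 2:
        return 0.0
    return max(0.0, math.log(t) + 3 * math.log(math.log(t)))


def klucb_index(mean, pulls, t, iterations=50):
    """Largest q in [mean, 1] with pulls * kl(mean, q) <= f(t), found by bisection."""
    budget = exploration(t) / pulls
    low, high = mean, 1.0
    for _ in range(iterations):
        middle = (low + high) / 2
        if kl_bernoulli(mean, middle) <= budget:
            low = middle   # constraint satisfied: the index is at least `middle`
        else:
            high = middle  # constraint violated: the index is below `middle`
    return low


if __name__ == "__main__":
    # The index shrinks towards the empirical mean as the arm is pulled more.
    for pulls in (10, 100, 1000):
        print(pulls, klucb_index(0.5, pulls, t=1000))
```

Bisection is valid here because $q \mapsto \mathrm{kl}(\hat{\mu}, q)$ is increasing on $[\hat{\mu}, 1]$.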

As announced, Lemma 5 controls the number of switches of arm that are due to the current arm leaving $\hat{M}^{j}(t)$, for both $\mathrm{RandTopM}$ and $\mathrm{MCTopM}$. It essentially proves that Lines $3$-$5$ of the algorithm in Figure 1 (when a new arm is sampled from the non-empty subset of $\hat{M}^{j}(t)$) happen a logarithmic number of times. The proof of this result is given in Appendix D.2.

Lemma 5 For any $μ \in P_{M}$, any player $j \in \{1, \dots, M\}$ using $\mathrm{RandTopM}$- or $\mathrm{MCTopM}$-$\mathrm{kl}\text{-}\mathrm{UCB}$, and any arm $k$, it holds that

$$\sum_{t = 1}^{T} P\left(A^{j}(t) = k, k \notin \hat{M}^{j}(t)\right) = \left(\sum_{k' : \mu_{k'} < \mu_{k}} \frac{1}{\mathrm{kl}(\mu_{k}, \mu_{k'})} + \sum_{k' : \mu_{k'} > \mu_{k}} \frac{1}{\mathrm{kl}(\mu_{k'}, \mu_{k})}\right) \log(T) + o(\log(T)).$$

6.2 Regret Analysis of $\mathrm{MCTopM}$-$\mathrm{kl}\text{-}\mathrm{UCB}$

For $\mathrm{MCTopM}$, we are furthermore able to obtain a logarithmic regret upper bound, by proposing an original approach to control the number of collisions under this algorithm. First, we bound the number of collisions by the number of collisions for players not yet “fixed on their arms” ($\overline{s^{j}(t)}$), which we then bound by the number of changes of arms (cf. the proof in Appendix D.3). An interesting consequence of the proof of this result is that it also bounds the number of switches of arms, $\sum_{t = 1}^{T} P(A^{j}(t + 1) \neq A^{j}(t))$, an additional guarantee that was never clearly stated for previous state-of-the-art works such as $\mathrm{RhoRand}$. Even though minimizing switching was not a goal⁶, this guarantee is interesting for Cognitive Radio applications, where switching arms means reconfiguring radio hardware, an operation that costs energy.

Lemma 6 For any $μ \in P_{M}$, if all players use the $\mathrm{MCTopM}$-$\mathrm{kl}\text{-}\mathrm{UCB}$ decentralized policy, and $M \leq K$, then the total average number of collisions (on all arms) is upper-bounded by (proved in Appendix D.3)

$$E_{\mu}\left[\sum_{k = 1}^{K} C_{k}(T)\right] \leq M^{2} (2M + 1) \left(\sum_{a, b = 1, \dots, K : \mu_{a} < \mu_{b}} \frac{1}{\mathrm{kl}(\mu_{a}, \mu_{b})}\right) \log(T) + o(\log T).$$

Note that this bound is in $O(M^{3})$, which is significantly better than the $O\left(M \binom{2M - 1}{M}\right)$ bound proved by [4] for $\mathrm{RhoRand}$. It is however worse than the $O(M^{2})$ bound proved by [26] for $\mathrm{MusicalChair}$, because our proof only focuses on collisions for players that are not yet fixed on their arms.

Now that the sub-optimal arm selections and the collisions are both proved to be at most logarithmic, in Lemmas 4 and 6 respectively, it follows from our regret decomposition (Lemma 1) together with Lemma 3 that the regret of $\mathrm{MCTopM}$-$\mathrm{kl}\text{-}\mathrm{UCB}$ is logarithmic. More precisely, one obtains a finite-time problem-dependent upper bound on the regret of this algorithm.

Theorem 2 If all $M$ players use $\mathrm{MCTopM}$-$\mathrm{kl}\text{-}\mathrm{UCB}$, and $M \leq K$, then for any problem $μ \in P_{M}$, there exists a problem-dependent constant $G_{M, μ}$ such that the regret satisfies:

$$R_{T}(\mu, M, \rho) \leq G_{M, \mu} \log(T) + o(\log T). \tag{13}$$

6.3 Discussion on $\mathrm{Selfish}$

The analysis of $\mathrm{Selfish}$ is harder, but we tried to obtain some understanding of the behavior of this algorithm, which seems to be doing surprisingly well in many contexts, as in our experiments with $K = 9$ arms and in extensive experiments not reported in this paper. However, a disappointing result is that we found simple problems, usually with a small number of arms, for which the algorithm may fail. For example, with $M = 2$ or $M = 3$ players competing for $K = 3$ arms with means $μ = [0.1, 0.5, 0.9]$, the histograms in Figure 9 suggest that, with a small probability, the regret $R_{T}$ of $\mathrm{Selfish}$-$\mathrm{kl}\text{-}\mathrm{UCB}$ can be very large. We provide a discussion in Appendix E about when such situations may happen, including a conjectured (constant, but small) lower bound on the probability that $\mathrm{Selfish}$ experiences collisions at almost every round. This result would then prevent $\mathrm{Selfish}$ from having a logarithmic regret. However, it should be noted that the lower bound of Theorem 1 does not apply to the censored observation model (III) under which $\mathrm{Selfish}$ operates, and it is not yet known whether logarithmic regret is possible at all.

7 Conclusion and future work

To summarize, we presented three variants of Multi-Player Multi-Armed Bandits, with different levels of feedback available to the decentralized players, under which we proposed efficient algorithms. For the two easiest models (with sensing), our theoretical contribution improves both the state-of-the-art upper and lower bounds on the regret. In the absence of sensing, we also provide some motivation for the practical use of the interesting $\mathrm{Selfish}$ heuristic, a simple index policy based on hybrid indices that directly take into account the collision information.

This work suggests several interesting further research directions. First, we want to investigate the notion of optimal algorithms in the decentralized multi-player model with sensing information. So far we provided the first matching upper and lower bounds on the expected number of sub-optimal arm selections, which suggests some form of (asymptotic) optimality. However, sub-optimal draws turn out not to be the dominant term in the regret, both in our upper bounds and in practice, so an interesting future work is to identify some notion of minimal number of collisions. Second, it remains an open question whether a simple decentralized algorithm can be as efficient as $\mathrm{MCTopM}$ without knowing $M$ in advance, or in dynamic settings (when $M$ can change over time). We shall start by proposing variants of our algorithm that are inspired by the $\mathrm{RhoEst}$ variant of $\mathrm{RhoRand}$ proposed by [4]. Finally, we want to strengthen the guarantees obtained in the absence of sensing, that is, to know whether logarithmic regret is achievable and to have a better analysis of the $\mathrm{Selfish}$ approach. Indeed, in most cases, it performs comparably to $\mathrm{RandTopM}$ even with limited feedback and without knowing the number of players $M$, which makes it a good candidate for applications to Internet of Things networks.

A Regret Decompositions

A.1 Proof of Lemma 1

Using the definition of the regret $R_{T}$, and the indicator $η^{j}(t) := 1(\overline{C^{j}(t)})$ that player $j$ does not experience a collision at time $t$,

$$R(T) = \left(\sum_{k = 1}^{M} \mu_{k}^{*}\right) T - E_{\mu}\left[\sum_{t = 1}^{T} \sum_{j = 1}^{M} Y_{A^{j}(t), t}\, \eta^{j}(t)\right] = \left(\sum_{k = 1}^{M} \mu_{k}^{*}\right) T - E_{\mu}\left[\sum_{t = 1}^{T} \sum_{j = 1}^{M} \mu_{A^{j}(t)}\, \eta^{j}(t)\right].$$

The last equality comes from the linearity of expectation, the fact that $E_{μ}[Y_{k, t}] = μ_{k}$ (for all $t$, from the i.i.d. hypothesis), and the independence of $Y_{k, t}$ from $A^{j}(t)$ and $η^{j}(t)$ (the reward is observed after playing $A^{j}(t)$). So $E_{μ}[Y_{A^{j}(t), t}\, η^{j}(t)] = \sum_{k} E_{μ}[μ_{k} 1(A^{j}(t) = k)\, η^{j}(t)] = E_{μ}[μ_{A^{j}(t)}\, η^{j}(t)]$. And so

$$R(T) = E_{\mu}\left[\sum_{t = 1}^{T} \sum_{j \in M\text{-best}} \mu_{j} - \sum_{t = 1}^{T} \sum_{j = 1}^{M} \mu_{A^{j}(t)}\, \eta^{j}(t)\right] = \left(\frac{1}{M} \sum_{j \in M\text{-best}} \mu_{j}\right) T M - \sum_{k = 1}^{K} \sum_{j = 1}^{M} \mu_{k} E_{\mu}[T_{k}^{j}(T)] + \sum_{k = 1}^{K} \mu_{k} E_{\mu}[C_{k}(T)].$$

For the first term, we have $T M = \sum_{k = 1}^{K} \sum_{j = 1}^{M} E_{μ}[T_{k}^{j}(T)]$, and if we denote by $\bar{μ}^{*} := \frac{1}{M} \sum_{j \in M\text{-best}} μ_{j}$ the average mean of the $M$ best arms, then

$$R(T) = \sum_{k = 1}^{K} \sum_{j = 1}^{M} (\bar{\mu}^{*} - \mu_{k}) E_{\mu}[T_{k}^{j}(T)] + \sum_{k = 1}^{K} \mu_{k} E_{\mu}[C_{k}(T)].$$

Let $\Delta_{k} := \bar{μ}^{*} - μ_{k}$ be the gap between the mean of arm $k$ and the average mean of the $M$ best arms, and let $M^{*}$ denote the index of the worst of the $M$ best arms (i.e., $M^{*} = \arg\min_{k \in M\text{-best}} μ_{k}$). Then, by splitting $\{1, \dots, K\}$ into three disjoint sets, $M\text{-best} \cup M\text{-worst} = (M\text{-best} \setminus \{M^{*}\}) \cup \{M^{*}\} \cup M\text{-worst}$, we get

$$R(T) = \sum_{k \in M\text{-best} \setminus \{M^{*}\}} \Delta_{k} E_{\mu}[T_{k}(T)] + \Delta_{M^{*}} E_{\mu}[T_{M^{*}}(T)] + \sum_{k \in M\text{-worst}} \Delta_{k} E_{\mu}[T_{k}(T)] + \sum_{k = 1}^{K} \mu_{k} E_{\mu}[C_{k}(T)].$$

But for $k = M^{*}$, $E_{μ}[T_{M^{*}}(T)] = T M - \sum_{k \in M\text{-best} \setminus \{M^{*}\}} E_{μ}[T_{k}(T)] - \sum_{k \in M\text{-worst}} E_{μ}[T_{k}(T)]$, so by recombining the terms, we obtain

$$R(T) = \sum_{k \in M\text{-best} \setminus \{M^{*}\}} (\Delta_{k} - \Delta_{M^{*}}) E_{\mu}[T_{k}(T)] + \Delta_{M^{*}} T M + \sum_{k \in M\text{-worst}} (\Delta_{k} - \Delta_{M^{*}}) E_{\mu}[T_{k}(T)] + \sum_{k = 1}^{K} \mu_{k} E_{\mu}[C_{k}(T)].$$

The term $\Delta_{k} - \Delta_{M^{*}}$ simplifies to $μ_{M^{*}} - μ_{k}$, and $\Delta_{M^{*}} = \frac{1}{M} \sum_{k = 1}^{M} μ_{k}^{*} - μ_{M^{*}}$ by definition of $\bar{μ}^{*}$, so that $\Delta_{M^{*}} T M = \sum_{k \in M\text{-best}} (μ_{k} - μ_{M^{*}}) T$. Moreover, for $k = M^{*}$, $μ_{M^{*}} - μ_{k} = 0$, so the first sum can be extended to all $k \in M\text{-best}$, and

$$R(T) = \sum_{k \in M\text{-best}} (\mu_{M^{*}} - \mu_{k}) E_{\mu}[T_{k}(T)] + \sum_{k \in M\text{-best}} (\mu_{k} - \mu_{M^{*}}) T + \sum_{k \in M\text{-worst}} (\mu_{M^{*}} - \mu_{k}) E_{\mu}[T_{k}(T)] + \sum_{k = 1}^{K} \mu_{k} E_{\mu}[C_{k}(T)].$$

And so we obtain the decomposition with three terms (a), (b) and (c).

$$R(T) = \sum_{k \in M\text{-best}} (\mu_{k} - \mu_{M^{*}}) (T - E_{\mu}[T_{k}(T)]) + \sum_{k \in M\text{-worst}} (\mu_{M^{*}} - \mu_{k}) E_{\mu}[T_{k}(T)] + \sum_{k = 1}^{K} \mu_{k} E_{\mu}[C_{k}(T)].$$
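The final decomposition holds along any sequence of plays once the rewards are replaced by their means, which makes it easy to check numerically. The sketch below is only an illustration: the players pick arms uniformly at random (this is not one of the algorithms of the paper), the function name is hypothetical, and the labels follow the appendices, with (a) the term on $M\text{-worst}$, (b) the term on $M\text{-best}$ and (c) the collision term, where $C_{k}(T)$ counts the colliding players on arm $k$.

```python
import random


def regret_decomposition(mu, M, T, seed=0):
    """Compare the regret computed directly with the sum of terms (a), (b), (c).

    Players choose arms uniformly at random, an assumption made only to
    generate arbitrary sequences of selections and collisions.
    """
    rng = random.Random(seed)
    K = len(mu)
    order = sorted(range(K), key=lambda k: mu[k], reverse=True)
    best, worst = order[:M], order[M:]
    mu_M_star = mu[best[-1]]   # mean of the worst of the M best arms
    pulls = [0] * K            # T_k(T), summed over players
    collisions = [0] * K       # C_k(T), colliding players on arm k
    collected = 0.0            # sum of mu_{A^j(t)} * eta^j(t)
    for _ in range(T):
        arms = [rng.randrange(K) for _ in range(M)]
        for arm in arms:
            pulls[arm] += 1
            if arms.count(arm) > 1:
                collisions[arm] += 1
            else:
                collected += mu[arm]
    direct = T * sum(mu[k] for k in best) - collected
    term_a = sum((mu_M_star - mu[k]) * pulls[k] for k in worst)
    term_b = sum((mu[k] - mu_M_star) * (T - pulls[k]) for k in best)
    term_c = sum(mu[k] * collisions[k] for k in range(K))
    return direct, term_a, term_b, term_c


if __name__ == "__main__":
    direct, a, b, c = regret_decomposition([0.1, 0.5, 0.9], M=2, T=1000)
    print(direct, a + b + c)
```

Note that term (b) taken alone can be negative along a given run, since the total number of selections of an arm by all players can exceed $T$.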

A.2 Proof of Lemma 2

Note that term (c) is clearly lower bounded by $0$ but it is not obvious for (b) as there is no reason for $T_{k} (T)$ to be upper bounded by $T$ . Let $T_{k}^{!} (T) := \sum_{t = 1}^{T} 1 (\exists! j, A^{j} (t) = k)$ , where the notation $\exists!$ stands for “there exists a unique”. Then $T_{k} (T) = \sum_{t = 1}^{T} \sum_{j = 1}^{M} 1 (A^{j} (t) = k)$ can be decomposed as

$$T_{k}(T) = \sum_{t = 1}^{T} 1(\exists! j, A^{j}(t) = k) + \sum_{t = 1}^{T} \sum_{j = 1}^{M} c_{k, t} 1(A^{j}(t) = k) = T_{k}^{!}(T) + C_{k}(T).$$

By focusing on the two terms $(b) + (c)$ from the decomposition of $R_{T}(μ, M, ρ)$ from Lemma 1, we have

$$\begin{aligned} (b) + (c) &= \sum_{k \in M\text{-best}} (\mu_{k} - \mu_{M}^{*}) (T - E_{\mu}[T_{k}^{!}(T)]) + \sum_{k \in M\text{-best}} \mu_{M}^{*} E_{\mu}[C_{k}(T)] + \sum_{k = 1}^{K} \mu_{k} E_{\mu}[C_{k}(T)] - \sum_{k \in M\text{-best}} \mu_{k} E_{\mu}[C_{k}(T)] \\ &= \sum_{k \in M\text{-best}} (\mu_{k} - \mu_{M}^{*}) (T - E_{\mu}[T_{k}^{!}(T)]) + \sum_{k \in M\text{-best}} \mu_{M}^{*} E_{\mu}[C_{k}(T)] + \sum_{k \in M\text{-worst}} \mu_{k} E_{\mu}[C_{k}(T)] \\ &= \sum_{k \in M\text{-best}} (\mu_{k} - \mu_{M}^{*}) (T - E_{\mu}[T_{k}^{!}(T)]) + \sum_{k = 1}^{K} \min(\mu_{M}^{*}, \mu_{k}) E_{\mu}[C_{k}(T)]. \end{aligned}$$

And now both terms are non-negative, as $T_{k}^{!} (T) \leq T$ , $min (μ_{M}^{*}, μ_{k}) \geq 0$ , and $C_{k} (T) \geq 0$ , so $(b) + (c) \geq 0$ which proves that $R_{T} (μ, M, ρ) \geq (a)$ , as wanted.

A.3 Proof of Lemma 3

Recall that we want to upper bound $(b) := \sum_{k \in M\text{-best}} (μ_{k} - μ_{M^{*}}) (T - E_{μ}[T_{k}(T)])$. First, we observe that, for all $k \in M\text{-best}$,

$$T - E_{\mu}[T_{k}(T)] \leq T - E_{\mu}\left[\sum_{t = 1}^{T} 1(\exists j : A^{j}(t) = k)\right] = E_{\mu}\left[\sum_{t = 1}^{T} 1(\forall j, A^{j}(t) \neq k)\right] = E_{\mu}\left[\sum_{t = 1}^{T} 1(k \notin \hat{S}_{t})\right],$$

where we denote by $\hat{S}_{t} = \{A^{j}(t), j \in \{1, \dots, M\}\}$ the set of selected arms at time $t$ (with no repetition). With this notation one can write

$$(b) \leq (\mu_{1} - \mu_{M^{*}}) \sum_{k \in M\text{-best}} (T - E_{\mu}[T_{k}(T)]) \leq (\mu_{1} - \mu_{M^{*}}) E_{\mu}\left[\sum_{k \in M\text{-best}} \sum_{t = 1}^{T} 1(k \notin \hat{S}_{t})\right] = (\mu_{1} - \mu_{M^{*}}) E_{\mu}\left[\sum_{t = 1}^{T} \sum_{k \in M\text{-best}} 1(k \notin \hat{S}_{t})\right].$$

The quantity $\sum_{k \in M\text{-best}} 1(k \notin \hat{S}_{t})$ counts the number of optimal arms that have not been selected at time $t$. For each mis-selection of an optimal arm, there either exists a sub-optimal arm that has been selected, or an arm in $M\text{-best}$ on which a collision occurs. Hence

$$\sum_{k \in M\text{-best}} 1(k \notin \hat{S}_{t}) = \sum_{k \in M\text{-best}} 1(C_{k}(t)) + \sum_{k \in M\text{-worst}} 1(\exists j : A^{j}(t) = k),$$

which yields

$$E_{\mu}\left[\sum_{t = 1}^{T} \sum_{k \in M\text{-best}} 1(k \notin \hat{S}_{t})\right] \leq \sum_{k \in M\text{-best}} E_{\mu}[C_{k}(T)] + \sum_{k \in M\text{-worst}} E_{\mu}[T_{k}(T)]$$

and Lemma 3 follows.
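The counting argument behind the last inequality can be verified exhaustively on small instances. The sketch below enumerates every joint selection of $M$ players among $K$ arms and checks the per-round version of the bound, in which collisions and sub-optimal selections are counted per player, as in $C_{k}(T)$ and $T_{k}(T)$. Taking arms $0, \dots, M - 1$ as the $M$ best arms is a labeling convention of this sketch, and the function name is hypothetical.

```python
from itertools import product


def missed_optimal_bound_holds(K, M):
    """Check, for every joint selection, that the number of unselected optimal
    arms is at most (colliding players on optimal arms) + (players on
    sub-optimal arms)."""
    best = range(M)       # arms 0..M-1 play the role of M-best (a convention)
    worst = range(M, K)   # the remaining arms play the role of M-worst
    for arms in product(range(K), repeat=M):
        counts = [arms.count(k) for k in range(K)]
        missed = sum(1 for k in best if counts[k] == 0)
        colliding_on_best = sum(counts[k] for k in best if counts[k] >= 2)
        players_on_worst = sum(counts[k] for k in worst)
        if missed > colliding_on_best + players_on_worst:
            return False
    return True


if __name__ == "__main__":
    for K, M in [(3, 2), (3, 3), (4, 3), (5, 3)]:
        print(K, M, missed_optimal_bound_holds(K, M))
```

Summing this per-round inequality over $t = 1, \dots, T$ and taking expectations gives the bound used to conclude the proof.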

B Lower Bound: Proof of Theorem 1

B.1 Proof of Theorem 1

The lower bound that we present relies on the following change-of-distribution lemma that we prove in the next section, following recent arguments from [14] that have to be adapted to incorporate the collision information.

Lemma 6 Under observation model (I) and (II), for every event $A$ that is $F_{T}^{j}$ -measurable, considering two multi-player bandit models denoted by $μ$ and $λ$ respectively, it holds that

$$\sum_{k = 1}^{K} E_{\mu}[T_{k}^{j}(T)]\, \mathrm{kl}(\mu_{k}, \lambda_{k}) \geq \mathrm{kl}(P_{\mu}(A), P_{\lambda}(A)).$$

Let $k$ be a sub-optimal arm under $μ$ , fix $ε \in (0, μ_{M - 1}^{*} - μ_{M}^{*})$ , and let $λ$ be the bandit instance such that

$$\begin{cases} \lambda_{\ell} = \mu_{\ell} & \text{for all } \ell \neq k, \\ \lambda_{k} = \mu_{M^{*}} + \varepsilon. & \end{cases}$$

Clearly, $λ \in P_{M}$ also, and the sets of $M$ best arms under $μ$ and $λ$ differ by one arm: if $M\text{-best}_{μ} = \{1^{*}, \dots, M^{*}\}$ then $M\text{-best}_{λ} = \{1^{*}, \dots, (M - 1)^{*}, k\}$. Thus, one expects the ($F_{T}^{j}$-measurable) event

$$A_{T} = \left(T_{k}^{j}(T) > \frac{T}{2M}\right)$$

to have a small probability under $μ$ (under which $k$ is sub-optimal) and a large probability under $λ$ (under which $k$ is one of the optimal arms, and is likely to be drawn a lot).

Applying the inequality of the change-of-distribution lemma above, and noting that the sum in the left-hand side reduces to one term as there is a single arm whose distribution is changed, one obtains

$$\begin{aligned} E_{\mu}[T_{k}^{j}(T)]\, \mathrm{kl}(\mu_{k}, \mu_{M^{*}} + \varepsilon) &\geq \mathrm{kl}(P_{\mu}(A_{T}), P_{\lambda}(A_{T})), \\ E_{\mu}[T_{k}^{j}(T)]\, \mathrm{kl}(\mu_{k}, \mu_{M^{*}} + \varepsilon) &\geq (1 - P_{\mu}(A_{T})) \log\left(\frac{1}{P_{\lambda}(\overline{A_{T}})}\right) - \log(2), \end{aligned}$$

using the fact that the binary KL-divergence satisfies $\mathrm{kl}(x, y) = \mathrm{kl}(1 - x, 1 - y)$ as well as the inequality $\mathrm{kl}(x, y) \geq x \log(1 / y) - \log(2)$, proved by [14]. Now, using Markov's inequality yields

$$\begin{aligned} P_{\mu}(A_{T}) &\leq 2M \frac{E_{\mu}[T_{k}^{j}(T)]}{T} =: x_{T}, \\ P_{\lambda}(\overline{A_{T}}) &= P_{\lambda}\left(\frac{T}{M} - T_{k}^{j}(T) > \frac{T}{2M}\right) \leq 2M \frac{E_{\lambda}\left[\frac{T}{M} - T_{k}^{j}(T)\right]}{T} =: \frac{y_{T}}{T}, \end{aligned}$$

which defines two sequences $x_{T}$ and $y_{T}$, such that

$$E_{\mu}[T_{k}^{j}(T)]\, \mathrm{kl}(\mu_{k}, \mu_{M^{*}} + \varepsilon) \geq (1 - x_{T}) \log\left(\frac{T}{y_{T}}\right) - \log(2). \tag{14}$$

The strong uniform efficiency assumption further tells us that $x_{T} \to 0$ (as $E_{μ}[T_{k}^{j}(T)] = o(T^{α})$ for all $α$) and $y_{T} = o(T^{α})$ when $T \to \infty$, for all $α \in (0, 1)$. As a consequence, observe that $\log(y_{T}) / \log(T) \to 0$ when $T$ tends to infinity and

$$\frac{(1 - x_{T}) \log(T / y_{T}) - \log(2)}{\log(T)} = 1 - x_{T} - (1 - x_{T}) \frac{\log(y_{T})}{\log(T)} - \frac{\log(2)}{\log(T)}$$

tends to one when $T$ tends to infinity. From Equation (14), this yields

$$\liminf_{T \to \infty} \frac{E_{\mu}[T_{k}^{j}(T)]}{\log(T)} \geq \frac{1}{\mathrm{kl}(\mu_{k}, \mu_{M^{*}} + \varepsilon)}, \tag{15}$$

for all $ε \in (0, μ_{M - 1}^{*} - μ_{M}^{*})$ . Letting $ε$ go to zero gives the conclusion (as $k l$ is continuous).
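As a small worked example, the limiting constants $1 / \mathrm{kl}(\mu_{k}, \mu_{M^{*}})$ obtained when $ε$ goes to zero can be evaluated on the problems used in the experiments, and the inequality $\mathrm{kl}(x, y) \geq x \log(1 / y) - \log(2)$ used in the proof can be checked on a grid. The function names below are hypothetical, and the sketch assumes Bernoulli arms whose means are distinct and lie strictly between 0 and 1.

```python
import math


def kl_bernoulli(x, y):
    """Binary relative entropy kl(x, y), for x in [0, 1] and y in (0, 1)."""
    total = 0.0
    if x > 0:
        total += x * math.log(x / y)
    if x < 1:
        total += (1 - x) * math.log((1 - x) / (1 - y))
    return total


def lower_bound_constants(mu, M):
    """Pairs (mu_k, 1 / kl(mu_k, mu_{M*})) for every sub-optimal arm k.

    Assumes the sub-optimal means are strictly below mu_{M*}, otherwise the
    divergence is zero and the constant is undefined.
    """
    ordered = sorted(mu, reverse=True)
    mu_M_star = ordered[M - 1]
    return [(m, 1.0 / kl_bernoulli(m, mu_M_star)) for m in ordered[M:]]


def kl_inequality_holds(grid):
    """Check kl(x, y) >= x log(1/y) - log(2) for all x in grid, y in grid and (0, 1)."""
    return all(
        kl_bernoulli(x, y) >= x * math.log(1 / y) - math.log(2) - 1e-12
        for x in grid
        for y in grid
        if 0 < y < 1
    )


if __name__ == "__main__":
    mu = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
    for mean, constant in lower_bound_constants(mu, M=6):
        print(mean, constant)
    print(kl_inequality_holds([i / 20 for i in range(21)]))
```

Arms whose mean is close to $\mu_{M^{*}}$ have the largest constants, so they are the ones that any strongly uniformly efficient algorithm has to sample most often.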

B.2 Proof of the change-of-distribution lemma

Under observation model (I) and (II), the strategy $A^{j} (t)$ decides which arm to play based on the information contained in $I_{t - 1}$ , where $I_{0} = U_{0}$ and

$$\forall t > 0, \quad I_{t} = (U_{0}, Y_{1}, C_{1}, U_{1}, \dots, Y_{t}, C_{t}, U_{t})$$

where $Y_{t} := Y_{A^{j} (t - 1), t}$ denotes the sensing information, $C_{t} := C^{j} (t)$ denotes the collision information (not always completely exploited under observation model (II)) and $U_{t}$ denotes some external source of randomness⁷ useful to select $A^{j} (t)$ . Formally, one can say that $A^{j} (t)$ is $σ (I_{t - 1})$ measurable⁸ (as $F^{j} (t) \subseteq σ (I_{t})$ , with an equality under observation model (I)).

Under two bandit models $μ$ and $λ$ , we let $P_{μ}^{I_{t}}$ (resp. $P_{λ}^{I_{t}}$ ) be the distribution of the observations under model $μ$ (resp. $λ$ ), given a fixed algorithm. Using the exact same technique as [14] (the contraction of entropy principle), one can establish that for any event $A$ that is $σ (I_{t})$ -measurable⁹,

$$\mathrm{KL}(P_{\mu}^{I_{t}}, P_{\lambda}^{I_{t}}) \geq \mathrm{kl}(P_{\mu}(A), P_{\lambda}(A)).$$

The next step is to relate the complicated KL-divergence $K L (P_{μ}^{I_{t}}, P_{λ}^{I_{t}})$ to the number of arm selections. Proceeding similarly as [14], one can write, using the chain rule for KL-divergence, that

$$\mathrm{KL}(P_{\mu}^{I_{t}}, P_{\lambda}^{I_{t}}) = \mathrm{KL}(P_{\mu}^{I_{t - 1}}, P_{\lambda}^{I_{t - 1}}) + \mathrm{KL}(P_{\mu}^{Y_{t}, C_{t}, U_{t} | I_{t - 1}}, P_{\lambda}^{Y_{t}, C_{t}, U_{t} | I_{t - 1}}). \tag{16}$$

Now observe that conditionally to $I_{t - 1}$ , $U_{t}$ , $Y_{t}$ and $C_{t}$ are independent, as once the selected arm is known, the value of the sensing $Y_{t}$ does not influence the other players selecting that arm, and $U_{t}$ is some exogenous randomness. Using further that the distribution of $U_{t}$ is the same under $μ$ and $λ$ , one obtains

$$\mathrm{KL}(P_{\mu}^{Y_{t}, C_{t} | I_{t - 1}}, P_{\lambda}^{Y_{t}, C_{t} | I_{t - 1}}) = \mathrm{KL}(P_{\mu}^{Y_{t} | I_{t - 1}}, P_{\lambda}^{Y_{t} | I_{t - 1}}) + \mathrm{KL}(P_{\mu}^{C_{t} | I_{t - 1}}, P_{\lambda}^{C_{t} | I_{t - 1}}). \tag{17}$$

The first term in (17) can be rewritten using the same argument as [14], which relies on the fact that conditionally to $I_{t - 1}$, $Y_{t}$ follows a Bernoulli distribution with mean $μ_{A^{j}(t)}$ under the instance $μ$ and $λ_{A^{j}(t)}$ under the instance $λ$:

$$\begin{aligned} \mathrm{KL}(P_{\mu}^{Y_{t} | I_{t - 1}}, P_{\lambda}^{Y_{t} | I_{t - 1}}) &= E_{\mu}\left[E_{\mu}\left[\mathrm{KL}(P_{\mu}^{Y_{t} | I_{t - 1}}, P_{\lambda}^{Y_{t} | I_{t - 1}}) \,\middle|\, I_{t - 1}\right]\right] \\ &= E_{\mu}\left[\mathrm{kl}(\mu_{A^{j}(t)}, \lambda_{A^{j}(t)})\right] \\ &= E_{\mu}\left[\sum_{k = 1}^{K} 1(A^{j}(t) = k)\, \mathrm{kl}(\mu_{k}, \lambda_{k})\right]. \end{aligned}$$

We now show that the second term in (17) is zero:

$$\begin{aligned} \mathrm{KL}(P_{\mu}^{C_{t} | I_{t - 1}}, P_{\lambda}^{C_{t} | I_{t - 1}}) &= E_{\mu}\left[E_{\mu}\left[\mathrm{KL}(P_{\mu}^{C_{t} | I_{t - 1}}, P_{\lambda}^{C_{t} | I_{t - 1}}) \,\middle|\, I_{t - 1}\right]\right] \\ &= E_{\mu}\left[E_{\mu}\left[E_{\mu}\left[\mathrm{KL}(P_{\mu}^{C_{t} | I_{t - 1}}, P_{\lambda}^{C_{t} | I_{t - 1}}) \,\middle|\, \bigcup_{j' \neq j} I_{t - 1}^{j'}\right] \,\middle|\, I_{t - 1}\right]\right], \end{aligned}$$

where $I_{t - 1}^{j'}$ denotes the information available to player $j' \neq j$. Knowing the information available to all other players, $C_{t}$ is an almost surely constant random variable, whose distribution is the same under $μ$ and $λ$. Hence the inner expectation is zero and so is $\mathrm{KL}(P_{μ}^{C_{t} | I_{t - 1}}, P_{λ}^{C_{t} | I_{t - 1}})$.

Putting things together, we showed that

$$\mathrm{KL}(P_{\mu}^{I_{t}}, P_{\lambda}^{I_{t}}) = \mathrm{KL}(P_{\mu}^{I_{t - 1}}, P_{\lambda}^{I_{t - 1}}) + E_{\mu}\left[\sum_{k = 1}^{K} 1(A^{j}(t) = k)\, \mathrm{kl}(\mu_{k}, \lambda_{k})\right].$$

Iterating this equality and using that $\mathrm{KL}(P_{μ}^{I_{0}}, P_{λ}^{I_{0}}) = 0$ yields that

$$\sum_{k = 1}^{K} E_{\mu}[T_{k}^{j}(T)]\, \mathrm{kl}(\mu_{k}, \lambda_{k}) \geq \mathrm{kl}(P_{\mu}(A), P_{\lambda}(A)),$$

for all $A \in σ (I_{T})$ , in particular for all $A \in F_{T}^{j}$ .

C The $\mathrm{RandTopM}$ algorithm

We now state precisely the $\mathrm{RandTopM}$ algorithm in Figure 4. It is essentially the same algorithm as $\mathrm{MCTopM}$, but in a simpler version as the “Chair” aspect is removed, that is, there is no notion of state $s^{j}(t)$ (cf. Figure 1). Player $j$ is always considered “not fixed”, and a collision always forces a uniform sampling of the next arm from $\hat{M}^{j}(t)$ in the case of $\mathrm{RandTopM}$.

Figure 4: The $\mathrm{RandTopM}$ decentralized learning policy (for a fixed underlying index policy $g^{j}$).
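Since the pseudo-code of the figure is not reproduced in this text version, the following sketch shows one decision step as it can be inferred from the description above and from the proofs in Appendix D. It is only an assumed reading, not the authors' exact algorithm: the function names are hypothetical, ties between indices are broken by arm number, and the fallback to the whole top-$M$ set when the candidate subset is empty is an implementation choice of this sketch.

```python
import random


def top_m(indices, M):
    """Set of the M arms with the largest indices (ties broken by arm number)."""
    order = sorted(range(len(indices)), key=lambda k: (-indices[k], k))
    return set(order[:M])


def rand_top_m_step(current, collided, idx_prev, idx_now, M, rng):
    """Choose the next arm of one player, in the spirit of RandTopM.

    current  : arm played at time t
    collided : whether the player experienced a collision at time t
    idx_prev : indices g(t-1) of all arms
    idx_now  : indices g(t) of all arms
    """
    top = top_m(idx_now, M)
    if current not in top:
        # The current arm left the top-M set: sample among the arms of the set
        # whose previous index was not larger than that of the current arm.
        candidates = [k for k in top if idx_prev[k] <= idx_prev[current]]
        return rng.choice(sorted(candidates) if candidates else sorted(top))
    if collided:
        # A collision always forces a uniform resampling in the top-M set.
        return rng.choice(sorted(top))
    return current  # no collision and still in the top-M set: keep the arm


if __name__ == "__main__":
    rng = random.Random(0)
    idx_prev = [0.9, 0.8, 0.3, 0.2]
    idx_now = [0.9, 0.1, 0.85, 0.2]
    print(rand_top_m_step(1, False, idx_prev, idx_now, M=2, rng=rng))
```

In this reading, the only difference with $\mathrm{MCTopM}$ is the absence of a "fixed" state that would protect a player from resampling after a collision.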

D Proof Elements Related to Regret Upper Bounds

This appendix includes the main proofs, omitted from the main text, that yield the regret upper bound. We start by controlling the sub-optimal draws when the $\mathrm{kl}\text{-}\mathrm{UCB}$ indices are used (instead of $\mathrm{UCB}_{1}$), with any of our proposed algorithms ($\mathrm{MCTopM}$, $\mathrm{RandTopM}$) or $\mathrm{RhoRand}$. Then we focus on controlling collisions for $\mathrm{MCTopM}$-$\mathrm{kl}\text{-}\mathrm{UCB}$.

D.1 Control of the Sub-optimal Draws for $\mathrm{kl}\text{-}\mathrm{UCB}$: Proof of Lemma 4

Fix $k \in M\text{-worst}$ and a player $j \in \{1, \dots, M\}$. The key observation is that for $\mathrm{MCTopM}$, $\mathrm{RandTopM}$ as well as the $\mathrm{RhoRand}$ algorithm, it holds that

$$(A^{j}(t) = k) = \left(A^{j}(t) = k, \exists m \in M\text{-best} : g_{m}^{j}(t) < g_{k}^{j}(t)\right). \tag{18}$$

Indeed, for the three algorithms, an arm selected at time $t + 1$ belongs to the set $\hat{M}^{j}(t)$ of arms with the $M$ largest indices. If the sub-optimal arm $k$ is selected at time $t$, it implies that $k \in \hat{M}^{j}(t)$, and, because there are $M$ arms in both $M\text{-best}$ and $\hat{M}^{j}(t)$, one of the arms in $M\text{-best}$ must be excluded from $\hat{M}^{j}(t)$. In particular, the index of arm $k$ must be larger than the index of this particular arm $m$.

Using (18), one can then upper bound the number of selections of arm $k$ by user $j$ up to round $T$ as

$$E_{\mu}[T_{k}^{j}(T)] = E_{\mu}\left[\sum_{t = 1}^{T} 1(A^{j}(t) = k)\right] = \sum_{t = 1}^{T} P(A^{j}(t) = k) = \sum_{t = 1}^{T} P\left(A^{j}(t) = k, \exists m \in \{1, \dots, M\} : g_{m^{*}}^{j}(t) < g_{k}^{j}(t)\right).$$

Considering the relative position of the upper-confidence bound $g_{m^{*}}^{j}(t)$ and the corresponding mean $μ_{m}^{*} = μ_{m^{*}}$, one can write the decomposition

$$\begin{aligned} E_{\mu}[T_{k}^{j}(T)] &\leq \sum_{t = 1}^{T} P\left(A^{j}(t) = k, \exists m \in \{1, \dots, M\} : g_{m^{*}}(t) \leq g_{k}(t), \forall m \in \{1, \dots, M\} : g_{m^{*}}(t) \geq \mu_{m}^{*}\right) + \sum_{t = 1}^{T} P\left(\exists m \in \{1, \dots, M\} : g_{m^{*}}(t) < \mu_{m}^{*}\right) \\ &\leq \sum_{t = 1}^{T} P\left(A^{j}(t) = k, \exists m \in \{1, \dots, M\} : \mu_{m}^{*} \leq g_{k}(t)\right) + \sum_{m = 1}^{M} \sum_{t = 1}^{T} P\left(g_{m^{*}}(t) < \mu_{m}^{*}\right) \\ &\leq \sum_{t = 1}^{T} P\left(A^{j}(t) = k, \mu_{M^{*}} \leq g_{k}(t)\right) + \sum_{m = 1}^{M} \sum_{t = 1}^{T} P\left(g_{m^{*}}(t) < \mu_{m}^{*}\right), \end{aligned}$$

where the last inequality (for the first term) comes from the fact that $μ_{M^{*}}$ is the smallest of the $μ_{m^{*}}$ for $m \in {1, \dots, M}$ .

Now each of the two terms in the right hand side can directly be upper bounded using tools developed by [12] for the analysis of kl-UCB. The rightmost term can be controlled using Lemma 7 below that relies on a self-normalized deviation inequality, whose proof exactly follows from the proof of Fact 1 in Appendix A of [12]. The leftmost term can be controlled using Lemma 8 stated below, that is a direct consequence of the proof of Fact 2 in Appendix A of [12].

Lemma 7 For any arm $k$, if $g_{k}^{j}(t)$ is the $\mathrm{kl}\text{-}\mathrm{UCB}$ index with exploration function $f(t) = \log(t) + 3 \log\log(t)$,

$$\sum_{t = 1}^{T} P\left(g_{k}^{j}(t) < \mu_{k}\right) \leq 3 + 4 e \log\log(T).$$

Denote by $\mathrm{kl}'(x, y)$ the derivative of the function $x \mapsto \mathrm{kl}(x, y)$ (for any fixed $y \neq 0, 1$).

Lemma 8 For any arms $k$ and $k'$ such that $μ_{k'} > μ_{k}$, if $g_{k}^{j}(t)$ is the $\mathrm{kl}\text{-}\mathrm{UCB}$ index with exploration function $f(t)$,

$$\sum_{t = 1}^{T} P\left(A^{j}(t) = k, \mu_{k'} \leq g_{k}^{j}(t)\right) \leq \frac{f(T)}{\mathrm{kl}(\mu_{k}, \mu_{k'})} + \sqrt{2\pi} \sqrt{\frac{\mathrm{kl}'(\mu_{k}, \mu_{k'})^{2}}{\mathrm{kl}(\mu_{k}, \mu_{k'})^{3}}} \sqrt{f(T)} + 2 \left(\frac{\mathrm{kl}'(\mu_{k}, \mu_{k'})}{\mathrm{kl}(\mu_{k}, \mu_{k'})}\right)^{2} + 1.$$

Putting things together, one obtains the non-asymptotic upper bound

$$E_{\mu}[T_{k}^{j}(T)] \leq \frac{\log(T) + 3 \log\log(T)}{\mathrm{kl}(\mu_{k}, \mu_{M^{*}})} + \sqrt{2\pi} \sqrt{\frac{\mathrm{kl}'(\mu_{k}, \mu_{M^{*}})^{2}}{\mathrm{kl}(\mu_{k}, \mu_{M^{*}})^{3}}} \sqrt{\log(T) + 3 \log\log(T)} + 2 \left(\frac{\mathrm{kl}'(\mu_{k}, \mu_{M^{*}})}{\mathrm{kl}(\mu_{k}, \mu_{M^{*}})}\right)^{2} + 4 M e \log\log(T) + 3M + 1, \tag{19}$$

which yields Lemma 4, with explicit constants $C_{μ}$ and $D_{μ}$ .
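To get a feeling for the size of the terms in the non-asymptotic bound (19), it can be evaluated numerically for Bernoulli arms. The sketch below is an illustration with hypothetical function names. It assumes means strictly between 0 and 1, a horizon $T \geq 3$ so that $\log\log(T)$ is positive, and uses the closed form $\mathrm{kl}'(x, y) = \log\left(\frac{x (1 - y)}{y (1 - x)}\right)$ of the derivative in the first argument for Bernoulli distributions.

```python
import math


def kl_bernoulli(x, y):
    """Binary relative entropy kl(x, y) for x, y strictly between 0 and 1."""
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))


def kl_prime(x, y):
    """Derivative of x -> kl(x, y) for Bernoulli distributions."""
    return math.log(x * (1 - y) / (y * (1 - x)))


def lemma4_bound(mu_k, mu_M_star, M, T):
    """Right-hand side of the bound (19) for a sub-optimal arm of mean mu_k."""
    f_T = math.log(T) + 3 * math.log(math.log(T))
    divergence = kl_bernoulli(mu_k, mu_M_star)
    derivative = kl_prime(mu_k, mu_M_star)
    leading = f_T / divergence
    square_root = (
        math.sqrt(2 * math.pi)
        * math.sqrt(derivative ** 2 / divergence ** 3)
        * math.sqrt(f_T)
    )
    constant = 2 * (derivative / divergence) ** 2
    # Lemma 7 applied to each of the M best arms: M * (3 + 4 e log log T).
    deviations = 4 * M * math.e * math.log(math.log(T)) + 3 * M
    return leading + square_root + constant + deviations + 1


if __name__ == "__main__":
    for horizon in (10 ** 4, 10 ** 6, 10 ** 8):
        print(horizon, lemma4_bound(0.3, 0.4, M=6, T=horizon))
```

For small gaps the lower-order terms can be much larger than the leading term at moderate horizons, which is consistent with the bound being informative mostly asymptotically.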

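D.2 Proof of Lemma 5

Using the behavior of the algorithm when the current arm leaves the set $\hat{M}^{j}$ (Line 4), one has

$$\begin{aligned} \sum_{t = 1}^{T} P\left(A^{j}(t) = k, k \notin \hat{M}^{j}(t)\right) &\leq \sum_{t = 1}^{T} P\left(A^{j}(t) = k, k \notin \hat{M}^{j}(t), A^{j}(t + 1) \in \hat{M}^{j}(t) \cap \{k' : g_{k'}^{j}(t - 1) \leq g_{k}^{j}(t - 1)\}\right) \\ &\leq \sum_{t = 1}^{T} \sum_{k' \neq k} P\left(A^{j}(t) = k, A^{j}(t + 1) = k', g_{k'}^{j}(t) \geq g_{k}^{j}(t), g_{k'}^{j}(t - 1) \leq g_{k}^{j}(t - 1)\right) \\ &= \sum_{k' \neq k} \underbrace{\sum_{t = 1}^{T} P\left(A^{j}(t) = k, A^{j}(t + 1) = k', g_{k'}^{j}(t) \geq g_{k}^{j}(t), g_{k'}^{j}(t - 1) \leq g_{k}^{j}(t - 1)\right)}_{:= T_{k'}}. \end{aligned}$$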

Now, to control $T_{k^{'}}$ , we distinguish two cases. If $μ_{k} < μ_{k^{'}}$ , one can write

$$T_{k'} \leq \sum_{t = 1}^{T} P\left(g_{k'}^{j}(t) \leq \mu_{k'}\right) + \sum_{t = 1}^{T} P\left(A^{j}(t) = k, g_{k}^{j}(t - 1) \geq \mu_{k'}\right)$$

The first term in the right hand side is $o (\log (T))$ by Lemma 7. To control the second term, we apply the same trick that led to the proof of Lemma 8 in [12]. Letting ${k l}^{+} (x, y) := k l (x, y) 1 (x \geq y)$ , and $\hat{μ}_{k, s}^{j}$ be the empirical mean of the first $s$ observations from arm $k$ by player $j$ , one has

$$\begin{aligned} \sum_{t = 1}^{T} P (A^{j} (t) = k, g_{k}^{j} (t - 1) \geq μ_{k'}) & = E \left[\sum_{t = 1}^{T} \sum_{s = 1}^{t - 1} \mathbb{1} (A^{j} (t) = k, N_{k}^{j} (t - 1) = s) \, \mathbb{1} (s \times \mathrm{kl}^{+} (\hat{μ}_{k, s}^{j}, μ_{k}) \leq f (t))\right] \\ & \leq E \left[\sum_{s = 1}^{T} \mathbb{1} (s \times \mathrm{kl}^{+} (\hat{μ}_{k, s}^{j}, μ_{k}) \leq f (T)) \sum_{t = s - 1}^{T} \mathbb{1} (A^{j} (t) = k, N_{k}^{j} (t - 1) = s)\right] \\ & \leq \sum_{s = 1}^{T} P (s \times \mathrm{kl}^{+} (\hat{μ}_{k, s}^{j}, μ_{k}) \leq f (T)), \end{aligned} \tag{20}$$

where the last inequality uses that for all $s$ ,

$$\sum_{t = s - 1}^{T} \mathbb{1} (A^{j} (t) = k, N_{k}^{j} (t - 1) = s) = \sum_{t = s - 1}^{T} \mathbb{1} (A^{j} (t) = k, N_{k}^{j} (t) = s + 1) \leq 1.$$

From (20), the same upper bound as that of Lemma 8 can be obtained using the tools from [12], which proves that for $T \to \infty$ ,

$$T_{k'} = \frac{\log (T)}{\mathrm{kl} (μ_{k}, μ_{k'})} + o (\log (T)) .$$

If $μ_{k} > μ_{k^{'}}$ , we rather use that

$$T_{k'} \leq \sum_{t = 1}^{T} P (g_{k}^{j} (t) \leq μ_{k}) + \sum_{t = 1}^{T} P (A^{j} (t + 1) = k', g_{k'}^{j} (t) \geq μ_{k})$$

and similarly Lemma 7 and a slight variant of Lemma 8, adapted to the modified time indices, yield

$$T_{k'} = \frac{\log (T)}{\mathrm{kl} (μ_{k'}, μ_{k})} + o (\log (T)) .$$

Summing over $k^{'}$ yields the result.

D.3 Controlling Collisions for $M C T o p M$ : Proof of Lemma 5

A key feature of both the $R a n d T o p M$ and $M C T o p M$ algorithms is Lemma 5, that states that the probability of switching from some arm because this arm leaves $\hat{M}^{j} (t)$ is small. Its proof is postponed to the end of this section.

Figure 5 below provides a schematic representation of the execution of the $M C T o p M$ algorithm, that has to be exploited in order to properly control the number of collisions. The sketch of the proof is the following: by focusing only on collisions in the “not fixed” state, bounding the number of transitions $(2)$ and $(3)$ is enough. Then, we show that both the number of transitions $(3)$ and $(5)$ are small: as a consequence of Lemma 5, the average number of these transitions is $O (log T)$ . Finally, we use that the length of a sequence of consecutive transitions $(2)$ is also small (on average smaller than $M$ ), and except for possibly the first one, starting a new sequence implies a previous transition $(3)$ or $(5)$ to arrive in the state “not fixed”. This gives a logarithmic number of transitions $(2)$ and $(3)$ , and so gives $E_{μ} [\sum_{k} C_{k} (T)] = O (log T)$ , with explicit constants depending on $μ$ and $M$ .

Figure 5: Player $j$ using $M C T o p M$ , represented as “state machine” with $5$ transitions. Taking one of the five transitions means playing one round of the algorithm, to decide $A^{j} (t + 1)$ using information of previous steps.

As in Algorithm Figure 1, $s^{j} (t)$ is the event that player $j$ decided to fix herself on an arm at the end of round $t - 1$ . Formally, $s^{j} (0)$ is false, and $s^{j} (t + 1)$ is defined inductively from $s^{j} (t)$ as

$$s^{j} (t + 1) = \left(s^{j} (t) \cup \left(\overline{s^{j} (t)} \cap \overline{C^{j} (t)}\right)\right) \cap \left(A^{j} (t) \in \hat{M}^{j} (t)\right) .$$

For the sake of clarity, we now explain Figure 5 in words. At step $t$ , if player $j$ is not fixed ( $\overline{s^{j} (t)}$ ), she can have three behaviors when executing $M C T o p M$ . She keeps the same arm and goes to the other state $s^{j} (t)$ with transition $(1)$ , or she stays in state $\overline{s^{j} (t)}$ , with two cases. Either she sampled $A^{j} (t + 1)$ uniformly from $\hat{M}^{j} (t) \cap \{m : g_{m}^{j} (t) \leq g_{k}^{j} (t)\}$ with transition $(3)$ , if $A^{j} (t) \notin \hat{M}^{j} (t)$ , or she sampled $A^{j} (t + 1)$ uniformly from $\hat{M}^{j} (t)$ with transition $(2)$ , in case of collision and if $A^{j} (t) \in \hat{M}^{j} (t)$ . In particular, note that if $\overline{C^{j} (t)}$ , transition $(3)$ is executed and not $(2)$ . Transition $(2)$ is a uniform sampling from $\hat{M}^{j} (t)$ (the “Musical Chair” step).

For player $j$ and round $t$ , we now introduce a few events that are useful in the proof. First, for every $x = 1, 2, 3, 4, 5$ , we denote $I_{x}^{j} (t)$ the event that a transition of type $(x)$ occurs for player $j$ after the first $t$ observations (i.e., between round $t$ and round $t + 1$ , to decide $A^{j} (t + 1)$ ). Formally they are defined by

$$\begin{aligned} I_{1}^{j} (t) & := \left(\overline{s^{j} (t)}, \overline{C^{j} (t)}, A^{j} (t) \in \hat{M}^{j} (t)\right), \\ I_{2}^{j} (t) & := \left(\overline{s^{j} (t)}, C^{j} (t), A^{j} (t) \in \hat{M}^{j} (t)\right), \\ I_{3}^{j} (t) & := \left(\overline{s^{j} (t)}, A^{j} (t) \notin \hat{M}^{j} (t)\right), \\ I_{4}^{j} (t) & := \left(s^{j} (t), A^{j} (t) \in \hat{M}^{j} (t)\right), \\ I_{5}^{j} (t) & := \left(s^{j} (t), A^{j} (t) \notin \hat{M}^{j} (t)\right) . \end{aligned}$$
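The five transitions can be sketched as a single decision function for one player. This is only an illustration of the state machine described above, under stated assumptions: the function name, its arguments and the fallback used when the filtered candidate set is empty are hypothetical, and the indices passed in `prev_index` stand for the index values used to restrict the draw in transitions $(3)$ and $(5)$.

```python
import random

def mctopm_step(fixed, collided, arm, top_m, prev_index, rng=random):
    """One round for one player: returns (transition id, next arm, next fixed flag)."""
    if arm not in top_m:
        # Transitions (3) (not fixed) and (5) (fixed): the arm left the estimated top-M set.
        candidates = [k for k in top_m if prev_index[k] <= prev_index[arm]]
        # Assumption of this sketch: fall back to the whole set if the filter is empty.
        next_arm = rng.choice(candidates or sorted(top_m))
        return (5 if fixed else 3), next_arm, False
    if collided and not fixed:
        # Transition (2): "Musical Chair" step, uniform draw in the estimated top-M set.
        return 2, rng.choice(sorted(top_m)), False
    # Transitions (1) (not fixed, no collision) and (4) (already fixed): keep the arm.
    return (4 if fixed else 1), arm, True

# Example: 3 arms, estimated top-2 set {0, 1}, illustrative index values.
top = {0, 1}
indices = [0.9, 0.6, 0.7]
kept = mctopm_step(False, False, 0, top, indices)
left_set = mctopm_step(True, False, 2, top, indices)
```

The returned flag follows the inductive definition of $s^{j} (t + 1)$: the player is fixed at the next round only if her arm is in the estimated set and she was either already fixed or did not collide.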

Then, we introduce $\widetilde{C}^{j} (t)$ as the event that a collision occurs for player $j$ at round $t$ if she is not yet fixed on her arm, that is

$$\widetilde{C}^{j} (t) := \left(C^{j} (t), \overline{s^{j} (t)}\right) .$$

A key observation is that $C^{j} (t)$ implies $\bigcup_{j' = 1}^{M} \widetilde{C}^{j'} (t)$ , as a collision necessarily involves at least one player not yet fixed on her arm ( $\widetilde{C}^{j'} (t)$ ). Otherwise, if they are all fixed, i.e., for all $j$ , $s^{j} (t)$ , then by definition of $s^{j} (t)$ , none of the players changed their arm from $t - 1$ to $t$ , and none experienced any collision at time $t - 1$ , so by induction there is no collision at time $t$ . Thus, $\sum_{j = 1}^{M} P (C^{j} (t))$ can be upper bounded by $M \sum_{j = 1}^{M} P (\widetilde{C}^{j} (t))$ (union bound), and it follows that if $C (T) := \sum_{k = 1}^{K} C_{k} (T)$ then

$$E_{μ} [C (T)] \leq M \sum_{j = 1}^{M} \sum_{t = 1}^{T} P (\widetilde{C}^{j} (t)) .$$

We can further observe that $\widetilde{C}^{j} (t)$ implies a transition $(2)$ or $(3)$ , as a transition $(1)$ cannot happen in case of collision. Thus another union bound gives

$$\sum_{t = 1}^{T} P (\widetilde{C}^{j} (t)) \leq \sum_{t = 1}^{T} P (I_{2}^{j} (t)) + \sum_{t = 1}^{T} P (I_{3}^{j} (t)) . \tag{21}$$

In the rest of the proof we focus on bounding the number of transitions $(2)$ and $(3)$ .

Let $N_{x}^{j} (T)$ be the random variable denoting the number of transitions of type $(x)$ . Neglecting the event $\overline{s^{j} (t)}$ for $x = 3$ and $s^{j} (t)$ for $x = 5$ , one has

$$E_{μ} [N_{x}^{j} (T)] = \sum_{t = 1}^{T} P (I_{x}^{j} (t)) \leq \sum_{t = 1}^{T} P (A^{j} (t) \notin \hat{M}^{j} (t)) \leq \sum_{t = 1}^{T} \sum_{k = 1}^{K} P (A^{j} (t) = k, k \notin \hat{M}^{j} (t)),$$

which is $O (\log T)$ (with known constants) by Lemma 5. In particular, this controls the second term in the right hand side of (21).

To control the first term $\sum_{t = 1}^{T} P (I_{2}^{j} (t))$ we introduce two sequences of random variables, the starting times $(θ_{i})_{i \geq 1}$ and the ending times $(τ_{i})_{i \geq 1}$ (possibly larger than $T$ ), of sequences during which $I_{2}^{j} (s)$ is true for all $s = θ_{i}, \dots, τ_{i} - 1$ but not before and after, that is $\forall i \in \{1, \dots, n (T)\}, \overline{I_{2}^{j} (θ_{i} - 1)} \cap \bigcap_{t = θ_{i}}^{τ_{i} - 1} I_{2}^{j} (t) \cap \overline{I_{2}^{j} (τ_{i})}$ with $n (T)$ the number of such sequences, i.e., $n (T) := \inf \{i \geq 1 : \min (θ_{i}, τ_{i}) \geq T\}$ (or $0$ if $θ_{1}$ does not exist). If $θ_{1} = 1$ , the first sequence does not have term $\overline{I_{2}^{j} (θ_{1} - 1)}$ .

Now we can decompose the sum on $t = 1, \dots, T$ with the use of consecutive sequences,

$$E_{μ} [N_{2}^{j} (T)] = E_{μ} \left[\sum_{t = 1}^{T} \mathbb{1} (I_{2}^{j} (t))\right] = E_{μ} \left[\sum_{i = 1}^{n (T)} \left(\sum_{t = θ_{i}}^{τ_{i} - 1} 1 + \sum_{t = τ_{i}}^{θ_{i + 1} - 1} 0\right)\right] = E_{μ} \left[\sum_{i = 1}^{n (T)} (τ_{i} - θ_{i})\right] .$$
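The decomposition into consecutive sequences can be checked on a toy example. In the sketch below, a boolean list plays the role of the indicators $\mathbb{1} (I_{2}^{j} (t))$ for $t = 1, \dots, T$, and the (hypothetical) helper `runs_of_true` extracts the pairs $(θ_{i}, τ_{i})$, with $τ_{i}$ exclusive as in the text. The sequence of flags is arbitrary and only serves as an illustration.

```python
def runs_of_true(flags):
    """Return the (theta_i, tau_i) pairs of maximal runs of True; tau_i is exclusive."""
    runs, start = [], None
    for t, flag in enumerate(flags, start=1):
        if flag and start is None:
            start = t                      # a new sequence begins at time t
        elif not flag and start is not None:
            runs.append((start, t))        # the sequence ended at time t - 1
            start = None
    if start is not None:
        runs.append((start, len(flags) + 1))  # sequence still running at the horizon
    return runs

flags = [True, True, False, False, True, False, True, True, True]
runs = runs_of_true(flags)
total_length = sum(tau - theta for theta, tau in runs)
```

Here the total length of the sequences equals the number of rounds where the flag is true, which is the identity used in the last equality above.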

Both $n (T)$ and $τ_{i} - θ_{i} \geq 0$ have finite averages for any $i$ (as $τ_{i} - θ_{i} \leq T$ ), and $n (T)$ is a stopping time with respect to the past events (that is, $F_{T}^{j}$ ), and so we can obtain

$$E_{μ} [N_{2}^{j} (T)] \leq E_{μ} [n (T)] \times \max_{i \in \mathbb{N}} E_{μ} [τ_{i} - θ_{i}] = (α) \times (β) .$$

$(α)$ To control $E_{μ} [n (T)]$ , we can observe that the number of sequences $n (T)$ is smaller than $1$ plus the number of times when a sequence begins ( $1$ plus because the game may start in a sequence). Beginning a sequence at time $θ_{i}$ implies $\overline{I_{2}^{j} (θ_{i} - 1)} \cap I_{2}^{j} (θ_{i})$ , which implies a transition of type $(3)$ or $(5)$ at time $θ_{i} - 1$ , as player $j$ is in state “not fixed” at time $θ_{i}$ (transitions $(1)$ and $(4)$ are impossible). As stated above, $E_{μ} [N_{x}^{j} (T)] = O (\log T)$ for both $x = 3$ and $x = 5$ , and so $E_{μ} [n (T)] = O (\log T)$ also.

$(β)$ To control $E_{μ} [τ_{i} - θ_{i}]$ , a simple argument can be used. $\bigcap_{t = θ_{i}}^{τ_{i} - 1} I_{2}^{j} (t)$ implies $C^{j} (t)$ for $τ_{i} - θ_{i}$ consecutive times. The very structure of $R a n d T o p M$ gives that in this sequence of transitions $(2)$ , the successive collisions (i.e., $C^{j} (t - 1) \cap C^{j} (t)$ ) imply that each new arm $A^{j} (t + 1)$ for $t \in \{θ_{i}, \dots, τ_{i} - 1\}$ is selected uniformly from $\hat{M}^{j} (t + 1)$ , a set of size $M$ with at least one available arm. Indeed, as there are $M - 1$ other players, at time $t + 1$ at least one arm in $\hat{M}^{j} (t + 1)$ is not selected by any player $j' \neq j$ , and so player $j$ has at least a probability $1 / M$ to select a free arm, which implies $\overline{C^{j} (t + 1)}$ , and so implies the end of the sequence. In other words, the average length of sequences of transitions $(2)$ , $E_{μ} [τ_{i} - θ_{i}]$ , is bounded by the expected number of trials of a repeated Bernoulli experiment until the first success, with probability of success larger than $1 / M$ (by the uniform choice of $A^{j} (t + 1)$ in a set of size $M$ with at least one available arm). We recognize the mean of a geometric random variable, of parameter $λ \geq 1 / M$ , and so $E_{μ} [τ_{i} - θ_{i}] = \frac{1}{λ} \leq \frac{1}{1 / M} = M$ .
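The geometric bound can be checked numerically. The sketch below evaluates a truncated version of the series $\sum_{n \geq 1} n λ (1 - λ)^{n - 1} = 1 / λ$ for a fixed success probability; the function name, the truncation level and the value $M = 4$ are arbitrary choices made for illustration.

```python
def expected_sequence_length(success_prob, terms=10_000):
    """Truncated series sum_{n>=1} n * p * (1-p)^(n-1), the mean of a geometric variable."""
    return sum(n * success_prob * (1 - success_prob) ** (n - 1)
               for n in range(1, terms + 1))

M = 4
worst_case = expected_sequence_length(1 / M)   # exactly one free arm among M
easier_case = expected_sequence_length(2 / M)  # two free arms among M
```

In the worst case the mean length is $M$, and any larger success probability gives a shorter sequence on average, consistent with the bound $E_{μ} [τ_{i} - θ_{i}] \leq M$.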

This finishes the proof as $E_{μ} [N_{2}^{j} (T)] = \sum_{t = 1}^{T} P (I_{2}^{j} (t)) = O (\log T)$ and so $\sum_{t = 1}^{T} P (\widetilde{C}^{j} (t) \cap (A^{j} (t) = k)) = O (\log T)$ and finally $E_{μ} [C (T)] = \sum_{k = 1}^{K} E_{μ} [C_{k} (T)] = O (\log T)$ also.

We can be more precise about the constants by applying the previous arguments successively:

$$\begin{aligned} E_{μ} [C (T)] & \leq M \sum_{j = 1}^{M} \left(\sum_{t = 1}^{T} P (I_{2}^{j} (t)) + \sum_{t = 1}^{T} P (I_{3}^{j} (t))\right) = M \sum_{j = 1}^{M} \left(E_{μ} [N_{2}^{j} (T)] + E_{μ} [N_{3}^{j} (T)]\right) \\ & \leq M^{2} \left(E_{μ} [n (T)] \, E_{μ} [τ_{i} - θ_{i}]\right) + M^{2} E_{μ} [N_{3}^{1} (T)] \\ & \leq M^{2} \left(1 + E_{μ} [N_{3}^{1} (T)] + E_{μ} [N_{5}^{1} (T)]\right) M + M^{2} E_{μ} [N_{3}^{1} (T)] \\ & \leq 2 M^{3} E_{μ} [N_{3}^{1} (T)] + o (\log T) + M^{2} \left(\sum_{\substack{a, b = 1, \dots, K \\ μ_{a} < μ_{b}}} \frac{1}{\mathrm{kl} (μ_{a}, μ_{b})}\right) \log (T) + o (\log T) \\ & \leq (2 M^{3} + M^{2}) \left(\sum_{\substack{a, b = 1, \dots, K \\ μ_{a} < μ_{b}}} \frac{1}{\mathrm{kl} (μ_{a}, μ_{b})}\right) \log (T) + o (\log T) . \end{aligned}$$

And so we obtain the desired inequality, with explicit constants, that depend only on $μ$ and $M$ .

$$\sum_{k = 1}^{K} E_{μ} [C_{k} (T)] = E_{μ} [C (T)] \leq M^{2} (2 M + 1) \left(\sum_{\substack{a, b = 1, \dots, K \\ μ_{a} < μ_{b}}} \frac{1}{\mathrm{kl} (μ_{a}, μ_{b})}\right) \log (T) + o (\log T) .$$
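To get a feeling for the size of this bound, the constant in front of $\log (T)$ can be evaluated for a given Bernoulli problem. The sketch below is illustrative only: the function names are hypothetical, the means are assumed to lie strictly between $0$ and $1$, and the example values of $μ$ and $M$ are picked from the problems discussed in the experiments.

```python
import math

def kl_bernoulli(x, y):
    """Binary relative entropy kl(x, y) for x, y strictly inside (0, 1)."""
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

def collision_constant(mu, M):
    """M^2 (2M + 1) times the sum of 1 / kl(mu_a, mu_b) over pairs with mu_a < mu_b."""
    total = sum(1 / kl_bernoulli(a, b) for a in mu for b in mu if a < b)
    return M ** 2 * (2 * M + 1) * total

constant_two_players = collision_constant([0.1, 0.5, 0.9], M=2)
constant_three_players = collision_constant([0.1, 0.5, 0.9], M=3)
```

The constant grows as $M^{3}$ and as the means get closer to each other, so the bound is mostly informative about the logarithmic rate rather than about the number of collisions at moderate horizons.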

Number of switches. Note that we controlled the total number of transitions $(2)$ , $(3)$ and $(5)$ , which are the only transitions when a player can switch from arm $k$ to arm $k^{'} \neq k$ . Thus, the total number of arm switches is also proved to be logarithmic, if all players use the $M C T o p M$ - $k l - U C B$ algorithm.

Strong uniform efficiency. As soon as $R_{T} = O (\log T)$ for all problems, $M C T o p M$ is clearly proved to be uniformly efficient, as $\log T$ is $o (T^{α})$ for any $α \in (0, 1)$ . And as justified after Definition ?, uniform efficiency and invariance under permutations of the users imply strong uniform efficiency, and so $M C T o p M$ satisfies Definition ?. This is a sanity check: the lower bound of Theorem 1 indeed applies to our algorithm $M C T o p M$ , and finally this highlights that it is order-optimal for the regret, in the sense that it matches the lower bound up to a multiplicative constant, and optimal for the term (a).

E Additional Discussions on $S e l f i s h$

As said before, analyzing $S e l f i s h$ is harder, but for instance one can prove that it yields constant collisions and regret for the trivial case of $μ_{1} = μ_{2} = 1$ and $M = 2$ . Empirically, when $S e l f i s h$ is compared to the other algorithms, it is hard to find a case when $S e l f i s h$ performs badly, as its (empirical average) regret always appeared logarithmic. But an issue with only visualizing the empirical average regret for a certain number of repetitions is that if a certain “bad” run happens only with small probability, it is possible that it never happened in a simulation, or that it happened a few times but not enough to make the average regret look non-logarithmic¹⁰. This is why the distribution of regret, at the end of the simulations, $R_{T}$ , is also displayed (in Appendix F). In a simple problem, with $M = 2$ or $M = 3$ players competing for $K = 3$ arms, for instance with means $μ = [0.1, 0.5, 0.9]$ , the histogram in Figure 9 shows that with a small probability, the regret $R_{T}$ of $S e l f i s h$ - $k l - U C B$ is not small (and appears linear). Additionally, Figure 10 shows that $S e l f i s h$ - $k l - U C B$ also has bad performance against random uniform problems $μ \in [0, 1]^{K}$ , for $M = 2$ or $3$ and $K = 3$ , in a lot of cases. In comparison, the other algorithms seem to have a logarithmic regret (for $M = 2$ and all algorithms) or even a constant regret (for $M = 3$ and $R a n d T o p M$ and $M C T o p M$ ).

The intuition behind these configurations where $S e l f i s h$ performs poorly is the following, if all players use the same algorithm and the same indices. If two players $i$ and $j$ have exactly the same vectors $[\widetilde{S}_{k}^{i} (t)] = [\widetilde{S}_{k}^{j} (t)]$ and $[T_{k}^{i} (t)] = [T_{k}^{j} (t)]$ at some time step $t \geq 1$ , with different values for each $k$ and so that the index vectors $g^{j} (t) = g^{i} (t)$ have different values for each arm $k$ , then both players will take the same decisions at time $t$ , and collide. Colliding does not change $[\widetilde{S}_{k}^{j} (t + 1)]$ but increases one value in $[T_{k}^{j} (t + 1)]$ by $1$ . Then at the next step the same conditions on $\widetilde{S}^{j}$ and $N^{j}$ are preserved, and if the same conditions on $g^{j} (t + 1)$ are also preserved, the two players will continue to collide. We did not succeed in proving mathematically that the preservation of the first hypothesis on $\widetilde{S}^{j}$ and $N^{j}$ implies the preservation of the hypothesis on index $g^{j}$ , but numerically it turns out to be always the case, and so two players colliding in such a setting will continue to do so indefinitely: we denote such configurations as absorbing.
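This mechanism can be illustrated with a small deterministic sketch. Two players start from the same statistics and use the same index policy; here we assume a ${U C B}_{1}$-style index $\widetilde{S}_{k} / T_{k} + \sqrt{\log (t) / (2 T_{k})}$ and a deterministic tie-breaking rule (first maximizer), both of which are assumptions of this sketch and not necessarily those of the simulation code. With random tie-breaking, a tie between indices could let the players escape.

```python
import math

def ucb1_indices(sums, pulls, t):
    """UCB1-style indices S_k / T_k + sqrt(log(t) / (2 T_k)) from a player's statistics."""
    return [s / n + math.sqrt(math.log(t) / (2 * n)) for s, n in zip(sums, pulls)]

def run_identical_players(sums, pulls, t0, steps):
    """Two players start from the same (sums, pulls); returns the number of collisions."""
    players = [(list(sums), list(pulls)) for _ in range(2)]
    collisions = 0
    for t in range(t0, t0 + steps):
        choices = []
        for s, n in players:
            idx = ucb1_indices(s, n, t)
            # Deterministic tie-breaking: first arm with the largest index.
            choices.append(max(range(len(idx)), key=idx.__getitem__))
        if choices[0] != choices[1]:
            break  # the players separated: the configuration was not absorbing
        collisions += 1
        for (s, n), k in zip(players, choices):
            n[k] += 1  # a collision yields reward 0: only the count increases
    return collisions

# Both players have seen one reward 1 on arm 0 and one reward 0 on arm 1.
n_collisions = run_identical_players([1.0, 0.0], [1, 1], t0=3, steps=200)
```

In this run the two players collide at every one of the $200$ rounds, since identical statistics and identical decision rules produce identical choices, and a collision keeps the statistics identical.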

We wrote a script ¹¹ that explores formally all the possible runs, up to a certain small time horizon, by exploring the complete game tree of possible (random) rewards from the $K$ arms and (random) actions from the $M$ players, up to a small depth of, let us say, $T = 8$ . Such a game tree quickly becomes very large, but this was enough to confirm that with a certain small probability, a function of $μ_{1}, \dots, μ_{K}$ , two players can arrive in just a few steps in a “bad” absorbing configuration. For instance, for only $K = 2$ arms, the following game tree in Figure 6 illustrates the first $3$ steps that can lead to $2$ absorbing configurations. Using symbolic computations, the probability of reaching any of them was found to be $\geq μ_{1}^{2} (1 - μ_{2})^{2} / 2 + μ_{2}^{2} (1 - μ_{1})^{2} / 2$ for $S e l f i s h$ - ${U C B}_{1}$ . That is far from being negligible, as it evaluates to $0.328$ for $μ = [0.1, 0.5, 0.9]$ , and a numerical simulation on $1000$ runs found $325$ cases of bad performance. The same game tree exploration can be made for $S e l f i s h$ - $k l - U C B$ , but so far we were not able to justify why it experiences fewer cases of bad performance even though our software found the same (lower bound on) failure probability.
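The two-arm expression above is easy to evaluate directly. In the sketch below, the function name is hypothetical, and the pair of means $(0.1, 0.9)$ is an assumption on our part: the expression involves only two means, and this pair is the one for which the value rounds to $0.328$.

```python
def absorbing_lower_bound(mu1, mu2):
    """mu1^2 (1 - mu2)^2 / 2 + mu2^2 (1 - mu1)^2 / 2, for K = 2 arms."""
    return (mu1 ** 2 * (1 - mu2) ** 2 + mu2 ** 2 * (1 - mu1) ** 2) / 2

value = absorbing_lower_bound(0.1, 0.9)
print(round(value, 3))  # 0.328
```

The expression is symmetric in the two means and is largest when one arm is almost always rewarding and the other almost never, which matches the intuition that early identical observations drive both players toward the same arm.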

Figure 6: For $K = 2$ arms and $M = 2$ players using $S e l f i s h$ - ${U C B}_{1}$ , for depth $= 3$ : $2$ *absorbing configurations*. Each rectangle represents a configuration, as the matrix $[[\widetilde{S}_{k}^{j} (t) / T_{k}^{j} (t)]_{j}]_{k}$ . Absorbing configurations from depth $2$ are cases of equality of the two vectors and the $S e l f i s h$ indices $\widetilde{g}_{k}^{j} (t)$ . Transitions are labeled with their probabilities.

From the structure of such game trees, we conjecture that the probability of reaching absorbing configurations (before a certain time $t$ ) is always lower-bounded by a polynomial function of $μ_{1}, \dots, μ_{K}$ and $1 - μ_{1}, \dots, 1 - μ_{K}$ , of degree at most $t$ in each variable. As such, the lower bound on the probability of failure should decrease when $K$ and $M$ increase, and this is coherent with the experiments for $K = 9$ or $K = 17$ (see Figures 14 and 15), where $S e l f i s h$ is shown to be uniformly more efficient than $R h o R a n d$ . Of course, one cannot run an infinite number of simulations, and the smaller the probability of failure, the less likely it is to observe a failure in a finite number of runs.

Ideas to fix $S e l f i s h$ ? It could be possible to change the $S e l f i s h$ algorithm to add a way to escape such absorbing trajectories. For instance one could imagine that after seeing, e.g., $x = 10$ collisions in a row, a certain random action could be taken by the players. These tricks can work empirically in some cases, but they are harder to analyze formally, and it is hard to tune the parameters (here $x$ , but possibly more), and we do not find such tricks to be promising from a theoretical point of view.

F Additional Figures

The plots missing from Section 5 are included here, as well as some additional numerical results.

F.1 Illustration of the lower bound

We proved in Theorem 1 that the normalized regret, i.e., $R_{T}$ divided by $log T$ , is asymptotically lower bounded by a constant $L B (μ, M)$ depending on the problem $μ$ and the number of players $M$ , for any $ρ$ .

$$\liminf_{T \to + \infty} \frac{R_{T} (μ, M, ρ)}{\log T} \geq L B (μ, M) .$$

For an example problem with $K = 9$ arms, Figure 7 displays the value of this constant $L B (μ, M)$ (on the $y$ axis), from the initial theorem and from our theorem, as a function of the number of players (on the $x$ axis), from $1$ player to $9$ players. We chose a simple problem, with Bernoulli distributed arms, with $μ = [0.1, 0.2, \dots, 0.9]$ . Figure 7 clearly shows that our improved lower bound is indeed larger than the initial one by [23], and both become uninformative when $M = K$ (i.e., null).

Figure 7: Comparison of our lower bound against the one from [23], on a simple problem with $9$ Bernoulli arms, of means $μ = [0.1, 0.2, \dots, 0.9]$ , as a function of the number of players $M$ .

Figure 8: Regret with its three terms (a), (b), (c), and lower bounds in *black*, for $S e l f i s h$ - $k l - U C B$ : $M = 6$ and $M = 9$ players, $K = 9$ arms, horizon $T = 10000$ (for $1000$ runs).

Figure 8 shows the regret $R_{T} (μ, M, ρ)$ on the same example problem $μ$ , with $K = 9$ arms and respectively $M = 6$ or $9$ players, for $S e l f i s h$ - $k l - U C B$ . It is just a simple way to check that the two lower bounds on the regret indeed appear as valid lower bounds empirically, and are moreover lower bounds on the count of selections ((a), displayed in cyan). The lower bounds (in black) are $C (μ, M) \log t$ , the dashed line for [23]’s lower bound, and the continuous line for our lower bound. These plots show the regret (in red), and the three terms (a), (b), (c) in the decomposition of the regret. As explained in Lemma 1, term (b) is not always non-negative. For $M = 9$ and $S e l f i s h$ , (c) is actually larger than the regret, and term (a) is zero, as well as the lower bounds.

F.2 Figures from Section 5

This last Appendix includes the figures used in Section 5, with additional comments.

Figure 9: Regret for $M = 2$ players, $K = 3$ arms, horizon $T = 5000$ , $1000$ repetitions and $μ = [0.1, 0.5, 0.9]$ . Axis $x$ is for regret (different scale for each part), and the green curve for $S e l f i s h$ shows a small probability of having a linear regret ( $17$ cases of $R_{T} \geq T$ , out of $1000$ ). The regret for the three other algorithms is very small for this problem, always smaller than $100$ here.

Figure 10: Regret, $M = 2$ and $M = 3$ players, $K = 3$ arms, horizon $T = 5000$ , against $1000$ problems $μ$ uniformly sampled in $[0, 1]^{K}$ . $S e l f i s h$ (top curve in green) clearly fails in such setting with small $K$ .

Figure 11: Regret for $M = 3$ players, $K = 3$ arms, horizon $T = 5000$ , $1000$ repetitions and $μ = [0.1, 0.5, 0.9]$ . Axis $x$ is for regret (different scale for each), and the top green curve for $S e l f i s h$ shows a small probability of having a linear regret ( $11$ cases of $R_{T} \geq T$ , out of $1000$ ). The regret for the three other algorithms is very small for this problem, and even appears constant.

Figure 12: Regret, $M = 2$ players, $K = 9$ arms, horizon $T = 5000$ , against $500$ problems $μ$ uniformly sampled in $[0, 1]^{K}$ . $R h o R a n d$ (top blue) is outperformed by the other algorithms (and the gain increases when $M$ increases), which all perform similarly in such configurations. Note that the (small) tails of the histograms come from complicated problems $μ$ and not failure cases.

Figure 13: Regret, $M = 9$ players for $K = 9$ arms, horizon $T = 5000$ , against $500$ problems $μ$ uniformly sampled in $[0, 1]^{K}$ . This extreme case $M = K$ shows the drastic difference of behavior between $R a n d T o p M$ and $M C T o p M$ , having constant regret, and $R h o R a n d$ and $S e l f i s h$ , having large regret.

Figure 14: Regret (in $\log \log$ scale), for $M = 2$ and $9$ players for $K = 9$ arms, horizon $T = 5000$ , for problem $μ = [0.1, \dots, 0.9]$ . In different settings, $R a n d T o p M$ (yellow curve) and $S e l f i s h$ (green) can outperform each other, and always outperform $R h o R a n d$ . $M C T o p M$ is always among the best algorithms, and for $M$ not too small, its regret seems logarithmic with a constant matching the lower bound.

Figure 15: Regret (in $\log \log$ scale), for $M = 6, 12, 17$ players for a “difficult” problem with $K = 17$ , and $T = 5000$ . The same observation as in Figure 14 can be made. $S e l f i s h$ outperforms $M C T o p M$ for $M = 2$ here. Additionally, $M C T o p M$ is the only algorithm to not fail dramatically when $M = K$ here.

Note that the simulation code used for the experiments is written in Python 3. It is open-sourced at https://GitHub.com/SMPyBandits/SMPyBandits, and documented at https://smpybandits.github.io/.

Footnotes

This provides another reason to focus on the Bernoulli model. It is the hardest model, in the sense that receiving a reward zero is not enough to detect collisions. For other models, the data streams $(Y_{k, s})_{s}$ are usually continuously distributed, with no mass at zero. Hence receiving $r^{j} (t) = 0$ directly gives $1 (C^{j} (t)) = 1$ .
When $n$ players choose arm $k$ at time $t$ , this counts as $n$ collisions, not just one. So $C_{k} (T)$ counts the total number of colliding players rather than the number of collision events. There is a small abuse of notation when calling it a number of collisions.
With some arguments used in the proof of Lemma 2 to circumvent the fact that (b) may be negative.
The decentralized algorithms were also compared to a centralized multiple-play $\mathrm{UCB}_{1}$ or $\mathrm{kl}\text{-}\mathrm{UCB}$, as defined by [5], essentially to check that the “price of decentralized learning” is not too large, as our lower bound proved.
But of course the means are unknown to the algorithms, and their order is not important.
Introducing switching costs, as was done in previous works, e.g., [29], could be interesting future work.
For instance, $\mathrm{MCTopM}$, $\mathrm{RandTopM}$ and $\mathrm{RhoRand}$ draw from a uniform distribution on $\{1, \dots, M\}$ for new ranks or arms.
$\sigma(I_{t})$ denotes the sigma-algebra generated by the observations $I_{t}$.
In the work of [14], the statement is more general: the probability of an event $A$ is replaced by the expectation of any $F_{T}$-measurable random variable $Z$ bounded in $[0, 1]$.
For instance, $0.999 \log(T) + 0.001 T$ looks more like $\log(T)$ than $T$, so an event yielding linear regret with “small” probability $10^{-3}$ cannot be observed from a plot showing the average regret.
Available at http://banditslilian.gforge.inria.fr/docs/complete_tree_exploration_for_MP_bandits.html
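The collision-counting convention stated in the footnotes above (when $n$ players choose the same arm, this counts as $n$ collisions) can be illustrated with a minimal Python sketch. The function name `count_collisions` and its parameters `choices` and `num_arms` are hypothetical, chosen for illustration only; this is not code from the SMPyBandits package.

```python
from collections import Counter


def count_collisions(choices, num_arms):
    """Per-arm collision counts for a single time step.

    Hypothetical helper (not part of SMPyBandits).
    choices[j] is the 0-based index of the arm chosen by player j.
    An arm chosen by n >= 2 players contributes n collisions, one per
    colliding player; an arm chosen by 0 or 1 player contributes 0.
    """
    occupancy = Counter(choices)
    return [occupancy[k] if occupancy[k] >= 2 else 0 for k in range(num_arms)]


# Three players pick arm 0 and one player picks arm 2:
# arm 0 counts 3 collisions, not 1.
print(count_collisions([0, 0, 0, 2], num_arms=3))  # [3, 0, 0]
```

Summing these per-step counts over $t = 1, \dots, T$ for a fixed arm $k$ would give a quantity of the same kind as $C_{k}(T)$, that is, a count of colliding players rather than of collision events.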

References

[1] R. Agrawal. Sample mean based index policies with $O(\log n)$ regret for the Multi-Armed Bandit problem. Advances in Applied Probability.
[2] S. Agrawal and N. Goyal. Analysis of Thompson sampling for the Multi-Armed Bandit problem. In JMLR, Conference On Learning Theory, 2012.
[3] A. Anandkumar, N. Michael, and A. K. Tang. Opportunistic Spectrum Access with multiple users: Learning under competition. In IEEE INFOCOM, 2010.
[4] A. Anandkumar, N. Michael, A. K. Tang, and S. Agrawal. Distributed algorithms for learning and cognitive medium access with logarithmic regret. IEEE Journal on Selected Areas in Communications.
[5] V. Anantharam, P. Varaiya, and J. Walrand. Asymptotically efficient allocation rules for the Multi-Armed Bandit problem with multiple plays - Part I: IID rewards. IEEE Transactions on Automatic Control.
[6] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time Analysis of the Multi-armed Bandit Problem. Machine Learning.
[7] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire. The Non-Stochastic Multi Armed Bandit Problem. SIAM Journal on Computing.
[8] O. Avner and S. Mannor. Learning to Coordinate Without Communication in Multi-User Multi-Armed Bandit Problems. arXiv preprint arXiv:1504.08167.
[9] O. Avner and S. Mannor. Multi-user lax communications: a Multi-Armed Bandit approach. In IEEE INFOCOM. IEEE, 2016.
[10] R. Bonnefoi, L. Besson, C. Moy, E. Kaufmann, and J. Palicot. Multi-Armed Bandit Learning in IoT Networks: Learning helps even in non-stationary settings. In 12th EAI Conference on Cognitive Radio Oriented Wireless Network and Communication, CROWNCOM Proceedings, 2017.
[11] S. Bubeck, N. Cesa-Bianchi, et al. Regret Analysis of Stochastic and Non-Stochastic Multi-Armed Bandit Problems. Foundations and Trends in Machine Learning.
[12] O. Cappé, A. Garivier, O-A. Maillard, R. Munos, and G. Stoltz. Kullback-Leibler upper confidence bounds for optimal sequential allocation. Annals of Statistics.
[13] O. Chapelle, E. Manavoglu, and R. Rosales. Simple and scalable response prediction for display advertising. Transactions on Intelligent Systems and Technology.
[14] A. Garivier, P. Ménard, and G. Stoltz. Explore First, Exploit Next: The True Shape of Regret in Bandit Problems. arXiv preprint arXiv:1602.07182.
[15] W. Jouini, D. Ernst, C. Moy, and J. Palicot. Multi-Armed Bandit based policies for Cognitive Radio’s decision making issues. In International Conference Signals, Circuits and Systems. IEEE, 2009.
[16] W. Jouini, D. Ernst, C. Moy, and J. Palicot. Upper Confidence Bound Based Decision Making Strategies and Dynamic Spectrum Access. In IEEE International Conference on Communications, 2010.
[17] D. Kalathil, N. Nayyar, and R. Jain. Decentralized Learning for Multi-Player Multi-Armed Bandits. In IEEE Conference on Decision and Control, 2012.
[18] E. Kaufmann, O. Cappé, and A. Garivier. On Bayesian Upper Confidence Bounds for Bandit Problems. In AISTATS, pages 592–600, 2012.
[19] E. Kaufmann, N. Korda, and R. Munos. Thompson sampling: an asymptotically optimal finite-time analysis. 2012.
[20] J. Komiyama, J. Honda, and H. Nakagawa. Optimal Regret Analysis of Thompson Sampling in Stochastic Multi-Armed Bandit Problem with Multiple Plays. In International Conference on Machine Learning, volume 37, pages 1152–1161, 2015.
[21] T. L. Lai and H. Robbins. Asymptotically Efficient Adaptive Allocation Rules. Advances in Applied Mathematics.
[22] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World Wide Web, pages 661–670. ACM, 2010.
[23] K. Liu and Q. Zhao. Distributed learning in Multi-Armed Bandit with multiple players. IEEE Transactions on Signal Processing.
[24] J. Mitola and G. Q. Maguire. Cognitive Radio: making software radios more personal. IEEE Personal Communications.
[25] H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society.
[26] J. Rosenski, O. Shamir, and L. Szlak. Multi-Player Bandits – A Musical Chairs Approach. In International Conference on Machine Learning, pages 155–163, 2016.
[27] C. Tekin and M. Liu. Online Learning in Decentralized Multi-User Spectrum Access with Synchronized Explorations. In IEEE Military Communications Conference, 2012.
[28] W. R. Thompson. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples. Biometrika.
[29] T. Koren, R. Livni, and Y. Mansour. Bandits with Movement Costs and Adaptive Pricing. In 30th Annual Conference on Learning Theory (COLT), volume 65 of JMLR Workshop and Conference Proceedings, pages 1242–1268, 2017.
