Index of /publis/slides/2018_01__ENSAI_Seminar_BrownBag__Article_ALT2018

Name	Last modified	Size
Parent Directory		-
.gitignore	2021-03-04 16:22	7
Makefile	2021-03-04 16:22	1.6K
README.md	2021-03-04 16:22	16K
slides_169.pdf	2021-03-04 16:22	1.0M
slides_169.pdfpc	2021-03-04 16:22	36
slides_169.synctex.gz	2021-03-04 16:22	127K
slides_169.tex	2021-03-04 16:22	45K
---
author:
  __Lilian Besson__ *Advised by Christophe Moy Émilie Kaufmann
title: Multi-Player Bandits Models Revisited
subtitle: Decentralized Multi-Player Multi-Arm Bandits
institute:
  PhD Student,
  Team SCEE, IETR, CentraleSupélec, Rennes
  & Team SequeL, CRIStAL, Inria, Lille
date: ENSAI Seminar  -  24 January 2017
lang: english
---

# Motivation
We control some communicating devices,
they want to access to a single base station.

- Insert them in a **crowded wireless network**.
- With a protocol **slotted in both time and frequency**.

## Goal
- Maintain a **good Quality of Service**.
- **With no centralized control** as it costs network overhead.

## How?
- Devices can choose a different radio channel at each time
  → learn the best one with sequential algorithm!

----

# Outline and reference

1. Introduction
2. Our model: 3 different feedback levels
3. Decomposition and lower bound on regret
4. Quick reminder on single-player MAB algorithms
5. Two new multi-player decentralized algorithms
6. Upper bounds on regret for MCTopM
7. Experimental results
8. An heuristic (Selfish), and disappointing results
9. Conclusion

This is based on our latest article:

- "Multi-Player Bandits Models Revisited", Besson & Kaufmann. [arXiv:1711.02317](https://arXiv.org/abs/1711.02317)

----

# Our model
- K radio channels (*e.g.*, 10) (*known*)
- Discrete and synchronized time t >= 1. Every time frame t is:

![Protocol in time and frequency, with an Acknowledgement](figures/protocol.png)

## Dynamic device = dynamic radio reconfiguration
- It decides **each time** the channel it uses to send **each packet**.
- It can implement a simple **decision algorithm**.

----

# Our model
## "Easy" case
- M <= K devices **always communicate** and try to access the network,
  *independently* without centralized supervision,
- Background traffic is *i.i.d.*.

## Two variants : with or without *sensing*
1. *With sensing*:
    Device first senses for presence of Primary Users (background traffic), then use `Ack` to detect collisions.
    Model the "classical" Opportunistic Spectrum Access problem.
    Not exactly suited for *Internet of Things*, but can model ZigBee, and can be analyzed mathematically...

2. *Without sensing*: same background traffic, but cannot sense, so only `Ack` is used.
    More suited for "IoT" networks like LoRa or SigFox.
    (Harder to analyze mathematically.)

----

# Background traffic, and rewards
## *i.i.d.* background traffic
- K channels, modeled as Bernoulli (0/1) distributions of mean µ_k
  = background traffic from *Primary Users*, bothering the dynamic devices,
- M devices, each uses channel A^j(t) in {1,...,K} at time t.

## Rewards
r^j(t) := Y_{A^j(t),t} × 1(not C^j(t)) = 1(uplink & Ack).

- with sensing information for all k, Y_{k,t} ~ Bern(µ_k) in {0, 1},
- collision for device j :
  C^j(t) = 1(*alone on arm A^j(t)*).
  → joint binary reward **but not** from two Bernoulli!

----

# 3 feedback levels
r^j(t)} := Y_{A^j(t),t} × 1(not C^j(t))

1. "Full feedback": observe both Y_{A^j(t),t} *and* C^j(t) separately,
    → Not realistic enough, we don't focus on it.

2. "Sensing": first observe Y_{A^j(t),t}, *then* C^j(t) only if Y_{A^j(t),t} != 0,
    → Models licensed protocols (ex. ZigBee), our main focus.

3. "No sensing": observe only the joint Y_{A^j(t),t} × 1(not C^j(t)),
    → Unlicensed protocols (ex. LoRaWAN), harder to analyze !

But all consider the same instantaneous reward r^j(t).

----

# Goal
## Problem
- *Goal* : *minimize packet loss ratio* (= maximize nb of received `Ack`)
  in a *finite-space discrete-time Decision Making Problem*.
- *Solution ?* **Multi-Armed Bandit algorithms**,
  **decentralized** and used **independently** by each dynamic device.

## *Decentralized* reinforcement learning optimization!
- Max transmission rate = **max cumulated rewards**
  max_{text{algorithm} A}    sum_{t=1}^{T} sum_{j=1}^M r^j_{A(t)}.
- Each player wants to **maximize its cumulated reward**,
- With no central control, and no exchange of information,
- Only possible if : each player converges to one of the M best arms,
  orthogonally (without collisions).

----

# Centralized regret
## A measure of success
- Not the network throughput or collision probability,
- We study the **centralized** (expected) **regret**:

  R_T(µ, M, rho)
  := E_µ[ sum_{t=1}^T sum_{j=1}^M µ_j^* -  r^j(t)]
  = (sum_{k=1}^{M}µ_k^*) T - E_µ[sum_{t=1}^T sum_{j=1}^M r^j(t)]

## Two directions of analysis
- Clearly R_T = O(T), but we want a sub-linear regret, as small as possible!
- *How good a decentralized algorithm can be in this setting?*
  → **Lower Bound** on regret, for **any** algorithm !
- *How good is my decentralized algorithm in this setting?*
  → **Upper Bound** on regret, for **one** algorithm !

----

# Lower bound

1. Decomposition of regret in 3 terms,
2. Asymptotic lower bound of one term,
3. And for regret,
4. Sketch of proof,
5. Illustration.

----

# Decomposition on regret

## Decomposition
For any algorithm, decentralized or not, we have

R_T(µ, M, rho) = sum_{k  in  M-worst} (µ_M^* -  µ_k) E_µ[T_k(T)] + sum_{k  in  M-best} (µ_k -  µ_M^*) (T - E_µ[T_k(T)]) + sum_{k=1}^{K} µ_k E_µ[C_k(T)].


## Small regret can be attained if...


1. Devices can quickly identify the bad arms M-worst, and not play them too much
   (*number of sub-optimal selections*),
2. Devices can quickly identify the best arms, and most surely play them
   (*number of optimal non-selections*),
3. Devices can use orthogonal channels
   (*number of collisions*).

# Asymptotic Lower Bound on regret

## 3 terms to lower bound...
- The first term for sub-optimal arms selections
  is lower bounded asymptotically,
  for all  player  j, bad arm k,  lim inf (T → oo) E_µ[T_k^j(T)] / \log T >= 1 / kl(µ_k, µ_M^*),
  using technical information theory tools
  (Kullback-Leibler divergence, entropy),
- And we lower bound the rest (including collisions) by... 0
  T - E_µ[T_k(T)] >= 0   and E_µ[C_k(T)] >= 0,
  😭 we should be able to do better!

# Asymptotic Lower Bound on regret

## Theorem 1  [Besson & Kaufmann, 2017]
- For any uniformly efficient decentralized policy, and any non-degenerated problem  µ,

 lim inf (T → oo) R_T(µ, M, rho) / log(T) >= M × ( sum_{k  in  M-worst} (µ_M^* -  µ_k) / kl(µ_k, µ_M^*) ) .

Where kl(x,y) := x log(x / y) + (1 - x) log(1-x / 1-y) is the *binary* Kullback-Leibler divergence.


## Remarks
- The centralized *multiple-play* lower bound is the same without the M multiplicative factor...
  Ref: [Anantharam et al, 1987]
  → "price of non-coordination" = M = nb of player?
- Improved state-of-the-art lower bound, but still not perfect: collisions should also be controlled!

----

# Illustration of the Lower Bound on regret

![Any such lower bound is very asymptotic, usually not satisfied for small horizons. We can see the importance of the collisions!](figures/main_RegretCentralized____env3-4_2092905764868974160.png)

----

# Sketch of the proof

- Like for single-player bandit, focus on E_µ[T_k^j(T)] expected number of selections of any sub-optimal arm k.
- Same information-theoretic tools, using a "change of law" lemma. Ref: [Garivier et al, 2016]
- It improved the state-of-the-art because of our decomposition, not because of new tools.

> → See our paper for details!

----

# Single-player MAB algorithms

1. Index-based MAB deterministic policies,
2. Upper Confidence Bound algorithm : UCB,
3. Kullback-Leibler UCB algorithm : klUCB.

----

# Upper Confidence Bound algorithm UCB1
The device keep t number of sent packets, T_k(t) selections of channel k, X_k(t) successful transmissions in channel k.

1. For the first K steps (t=1,...,K), try each channel *once*.
2. Then for the next steps t > K :
    - Compute the index g_k(t) := X_k(t) / T_k(t) + sqrt{log(t) / 2 T_k(t)}
    - Choose channel A(t) =  arg max_k   g_k(t),
    - Update T_k(t+1) and X_k(t+1).

References: [Lai & Robbins, 1985], [Auer et al, 2002], [Bubeck & Cesa-Bianchi, 2012]

----

# Kullback-Leibler UCB algorithm klUCB
The device keep t number of sent packets, T_k(t) selections of channel k, X_k(t) successful transmissions in channel k.

1. For the first K steps (t=1,...,K), try each channel *once*.
2. Then for the next steps t > K :
    - Compute the index g_k(t) := sup_{q  in  [a, b]} { q : kl(X_k(t) / T_k(t), q) <= log(t) / T_k(t) }
    - Choose channel A(t) =  arg max_k   g_k(t),
    - Update T_k(t+1) and X_k(t+1).

*Why bother?* klUCB is proved to be more efficient than UCB,
and asymptotically optimal for single-player stochastic bandit.

References: [Garivier & Cappé, 2011], [Cappé & Garivier & Maillard & Munos & Stoltz, 2013]

----

# Multi-player decentralized algorithms

1. Common building blocks of previous algorithms,
2. First proposal: RandTopM,
3. Second proposal: MCTopM,
4. Algorithm and illustration.

----

# Algorithms for this easier model

## Building blocks : separate the two aspects
1. **MAB policy** to learn the best arms (use sensing Y_{A^j(t),t}),
2. **Orthogonalization scheme** to avoid collisions (use C^j(t)).

## Many different proposals for *decentralized* learning policies
- Recent: MEGA and MusicalChair, [Avner & Mannor, 2015], [Shamir et al, 2016]
- State-of-the-art: rhoRand policy and variants. [Anandkumar et al, 2011]

## **Our proposals**: [Besson & Kaufmann, 2017]
- *With sensing*: RandTopM and MCTopM are sort of mixes between rhoRand and MusicalChair, using UCB indexes or more efficient index policies (klUCB),
- *Without sensing*: Selfish use a UCB index directly on the reward r^j(t).

----

# A first decentralized algorithm

![The MCTopM algorithm](figures/algo1.png)

----

# The RandTopM algorithm

![The MCTopM algorithm](figures/algo2.png)

----

# The MCTopM algorithm

![The MCTopM algorithm](figures/algo3.png)

----

# The MCTopM algorithm

![The MCTopM algorithm](figures/algo4.png)

----

# Regret upper bound

1. Theorem,
2. Remarks,
3. Idea of the proof.

----

# Regret upper bound for MCTopM

## Theorem 2  [Besson & Kaufmann, 2017]
- If all M players use MCTopM with klUCB,
  then for any non-degenerated problem µ,
  there exists a problem dependent constant G_{M,µ}
  , such that the regret satisfies:

    R_T(µ, M, rho) \leq G_{M,µ} log(T) + \smallO{\log T}.


## How?
- Decomposition of regret controlled with two terms,
- Control both terms, both are logarithmic:
    + Suboptimal selections with the "classical analysis" on klUCB indexes
    + Collisions are harder to control...

----

# Regret upper bound for MCTopM

## Remarks
- Hard to prove, we had to carefully design the MCTopM algorithm to conclude the proof,
- The constant G_{M,µ} scales as M^3, way better than rhoRand's constant scaling as M{2M-1 \choose M},
- We also *minimize the number of channel switching*: interesting as changing arm costs energy in radio systems,
- For the suboptimal selections, we *match our lower bound* !
- Not yet possible to know what is the best possible control of collisions...

----

# Sketch of the proof

1. Bound the expected number of collisions by M times the number of collisions for non-sitted players,
2. Bound the expected number of transitions of type (3) and (5), by \bigO{\log T} using the klUCB indexes and the forced choice of the algorithm:
    g_k^j(t-1) \leq g^j_{k'}(t-1),   and}   g_k^j(t) > g^j_{k'}(t)
    when switching from k' to k,
3. Bound the expected length of a sequence in the non-sitted state by a constant,
4. So most of the times (O(T - log T)), players are sitted, and no collision happens when they are all sitted!

> → See our paper for details!

----

# Experimental results

> Experiments on Bernoulli problems µ in [0,1]^K.

1. Illustration of regret for a single problem and M = K,
2. Regret for uniformly sampled problems and M < K,
3. Logarithmic number of collisions,
4. Logarithmic number of arm switches,
5. Fairness?

----

# Constant regret if M = K

Regret, M=9 players, K=9 arms, horizon T=10000, 200 repetitions. Only RandTopM and MCTopM achieve constant regret in this saturated case (proved).

![](figures/MP__K9_M9_T10000_N200__4_algos/all_RegretCentralized____env1-1_2306423191427933958.png)

# Illustration of regret of different algorithms

Regret, M=6 players, K=9 arms, horizon T=5000, against 500 problems µ uniformly sampled in [0,1]^K. Conclusion : rhoRand < RandTopM < Selfish < MCTopM in most cases.

![](figures/MP__K9_M6_T5000_N500__4_algos/all_RegretCentralized____env1-1_8318947830261751207.png)

# Logarithmic number of collisions

Cumulated number of collisions. Also rhoRand < RandTopM < Selfish < MCTopM in most cases.

![](figures/MP__K9_M6_T5000_N500__4_algos/all_CumNbCollisions____env1-1_8318947830261751207.png)

# Logarithmic number of arm switches

Cumulated number of arm switches. Again rhoRand < RandTopM < Selfish < MCTopM, but no guarantee for rhoRand.

![](figures/MP__K9_M6_T5000_N500__4_algos/all_CumNbSwitchs____env1-1_8318947830261751207.png)

# Fairness

Measure of fairness among player. All 4 algorithms seem fair in average, but none is fair on a single run It's quite hard to achieve both efficiency and single-run fairness!

![](figures/MP__K9_M6_T5000_N500__4_algos/all_FairnessSTD____env1-1_8318947830261751207.png)


----

# An heuristic, Selfish

For the harder feedback model, without sensing.

1. Just an heuristic,
2. Problems with Selfish,
5. Illustration of failure cases.

----

# The Selfish heuristic
The Selfish decentralized approach = device don't use sensing, just learn on the reward (acknowledgement or not, r^j(t)).

Reference: [Bonnefoi & Besson et al, 2017]

## Works fine...
- More suited to model IoT networks,
- Use less information, and don't know the value of M: we expect Selfish to not have stronger guarantees.
- It works fine in practice!

## *But why would it work?*
- Sensing was *i.i.d.* so using UCB to learn the µ_k makes sense,
- But collisions are not *i.i.d.*,
- Adversarial algorithms are more appropriate here,
- But empirically, Selfish with UCB or klUCB works much better than, *e.g.*, Exp3...

## Works fine...
- Except... when it fails drastically! 😭
- In small problems with M and K = 2 or 3, we found small probability of failures (*i.e.*, linear regret), and this prevents from having a generic upper bound on regret for Selfish.

----

# Illustration of failing cases for Selfish

![Regret for M=2, K=3, T=5000, 1000 repetitions and µ = [0.1, 0.5, 0.9]. Axis x is for regret (different scale for each), and Selfish have a small probability of failure (17/1000 cases of R_T \gg \log T). The regret for the three other algorithms is very small for this "easy" problem.](figures/MP__K3_M2_T5000_N1000__4_algos/all_HistogramsRegret____env1-1_5016720151160452442.png)

----

# Sum-up
## *Wait, what was the problem ?*
- MAB algorithms have guarantees for *i.i.d. settings*,
- But here the collisions cancel the *i.i.d.* hypothesis...
- Not easy to obtain guarantees in this mixed setting
  (*i.i.d.* emissions process, "game theoretic" collisions).

## Theoretical results
- With sensing ("OSA"), we obtained strong results: a lower bound, and an order-optimal algorithm,
- But without sensing ("IoT"), it is harder... our heuristic Selfish usually works but can fail!

----

# Other directions of future work

## Conclude the Multi-Player OSA analysis
- Remove hypothesis that objects know M,
- Allow arrival/departure of objects,
- Non-stationarity of background traffic etc

- *More realistic emission model*:
  maybe driven by number of packets in a whole day,
  instead of emission probability.

## Extend to more objects M > K
- Extend the theoretical analysis to the large-scale IoT model,
  first with sensing (*e.g.*, models ZigBee networks),
  then without sensing (*e.g.*, LoRaWAN networks).

----

# Conclusion

- In a wireless network with an *i.i.d.* background traffic in K channels,
- M devices can use both sensing and acknowledgement feedback, to learn the most free channels and to find orthogonal configurations.

## We showed 😀

- Decentralized bandit algorithms can solve this problem,
- We have a lower bound for any decentralized algorithm,
- And we proposed an order-optimal algorithm, based on klUCB and an improved Musical Chair scheme, MCTopM

## But more work is still needed... 😕

- **Theoretical guarantees** are still missing for the "IoT" model (without sensing), and can be improved (slightly) for the "OSA" model (with sensing).
- Maybe study **other emission models**...
- Implement and test this on **real-world radio devices**
  → demo (in progress) for the ICT 2018 conference!

## **Thanks!** 😀

*Any question or idea ?*