### Motivation: an *Internet of Things* problem

A *lot* of IoT devices want to access a single base station.

- They are deployed in a possibly **crowded wireless network**.
- With a protocol **slotted in both time and frequency**.
- Each device has a **low duty cycle** (a few messages per day).

#### Goal

- Maintain a **good Quality of Service**.
- **Without** centralized supervision!

#### How?

- Use **learning algorithms**: devices will learn on which frequency they should talk!

---

### Outline and references

1. Introduction and motivation
2. Model and hypotheses
3. Baseline algorithms: naive and efficient centralized approaches to compare against
4. Two Multi-Armed Bandit algorithms: UCB, TS
5. Experimental results
6. An easier model with theoretical results
7. Perspectives and future work

Main references are my recent articles (on HAL):

- [*Multi-Armed Bandit Learning in IoT Networks and non-stationary settings*, Bonnefoi et al., CrownCom 2017](https://hal.inria.fr/hal-01575419),
- [*Multi-Player Bandits Models Revisited*, Besson & Kaufmann, arXiv:1711.02317](https://hal.inria.fr/hal-01629733).

---

### First model

- Discrete time $t \geq 1$ and $K$ radio channels (*e.g.*, $K = 10$) (*known*)
- $D$ **dynamic** devices try to access the network *independently*
- $S = S_1 + \dots + S_K$ **static** devices occupy the network: $S_i$ devices in channel $i$ (*unknown*)

---

### Hypotheses (1/2)

#### Emission model

- Each device has the same *low* emission probability: at each step, each device sends a packet with probability $p$ (this gives a duty cycle proportional to $p$).

#### Background traffic bothering the dynamic devices

- Each static device uses only one channel.
- Their distribution across channels is fixed in time.

---

### Hypotheses (2/2)

#### Dynamic radio reconfiguration

- Each **dynamic device decides the channel it uses to send every packet**.
- It has the memory and computational capacity to implement a simple **decision algorithm**.

#### Problem

- *Goal*: *minimize the packet loss ratio* (= maximize the number of received `Ack`s) in a *finite-space discrete-time Decision Making Problem*.
- *Solution?* **Multi-Armed Bandit algorithms**, **decentralized** and used **independently** by each device.

---

### A naive strategy: uniformly random access

- **Uniformly random access**: dynamic devices choose their channel uniformly at random in the pool of $K$ channels.
- Natural strategy, dead simple to implement.
- Simple analysis, in terms of **successful transmission probability** (for every message from dynamic devices):

$$\mathbb{P}(\text{success} \mid \text{sent}) = \sum_{i=1}^{K} \left(1 - \frac{p}{K}\right)^{D-1} (1-p)^{S_i} \, \frac{1}{K}.$$

#### No learning

- Works fine only if all channels are similarly occupied, but **it cannot learn** to exploit the best (least occupied) channels.

---

### Optimal centralized strategy

- If an oracle can decide to affect $D_i$ dynamic devices to channel $i$, the **successful transmission probability** is:

$$\mathbb{P}(\text{success} \mid \text{sent}) = \sum_{i=1}^{K} (1 - p)^{D_i - 1} (1-p)^{S_i} \, \frac{D_i}{D}.$$

- The oracle has to solve this **optimization problem**:
  + $\arg\max_{D_1,\dots,D_K} \sum_{i=1}^{K} D_i (1-p)^{S_i + D_i - 1}$,
  + such that $\sum_{i=1}^{K} D_i = D$,
  + and $D_i \geq 0$ for all $1 \leq i \leq K$.
- We solved this quasi-convex optimization problem with *Lagrange multipliers*, only numerically.
- → Very good performance, maximizing the transmission rate of all the $D$ dynamic devices.

#### But unrealistic

But **not achievable in practice**: no centralized control and no oracle!

#### Greedy approximation

- Still very efficient,
- More reasonable,
- But still unachievable in practice: we use it as a *baseline*.

#### Now let's see *realistic decentralized approaches*

- → Machine Learning?
- → Reinforcement Learning?
- → *Multi-Armed Bandits*!

---

### Multi-Armed Bandit formulation

A dynamic device tries to collect *rewards* when transmitting:

- it transmits following a Bernoulli process (probability $p$ of transmitting at each time step $t$),
- it chooses a channel $A(\tau) \in \{1,\dots,K\}$:
  + if `Ack` (no collision) → reward $r(A(\tau)) = 1$,
  + if collision (no `Ack`) → reward $r(A(\tau)) = 0$.

#### Reinforcement Learning interpretation

Maximizing the transmission rate = **maximizing the cumulated reward**:

$$\max_{\text{algorithm } \mathcal{A}} \; \sum_{\tau=1}^{\text{horizon}} r(\mathcal{A}(\tau)).$$

---

### Upper Confidence Bound algorithm (UCB1)

The dynamic device keeps $\tau$, the number of sent packets, $T_k(\tau)$, the number of selections of channel $k$, and $X_k(\tau)$, the number of successful transmissions in channel $k$. (A minimal code sketch follows below.)

1. For the first $K$ steps ($\tau = 1,\dots,K$), try each channel *once*.
2. Then for the next steps $\tau > K$:
   - Compute the index $g_k(\tau) := \frac{X_k(\tau)}{T_k(\tau)} + \sqrt{\frac{\log(\tau)}{2\, T_k(\tau)}}$,
   - Choose channel $A(\tau) = \arg\max_k \; g_k(\tau)$,
   - Update $T_k(\tau+1)$ and $X_k(\tau+1)$.

> References: [Lai & Robbins, 1985], [Auer et al., 2002], [Bubeck & Cesa-Bianchi, 2012].
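To make the procedure concrete, here is a minimal, self-contained Python sketch of UCB1 channel selection as described above (the class and method names are illustrative assumptions, not taken from the cited articles):

```python
import math

class UCB1ChannelSelector:
    """Minimal sketch of UCB1 for channel selection (illustrative only)."""

    def __init__(self, K):
        self.K = K            # number of channels
        self.tau = 0          # number of sent packets so far
        self.T = [0] * K      # T[k]: number of selections of channel k
        self.X = [0] * K      # X[k]: number of successful transmissions in channel k

    def choose_channel(self):
        # First K steps: try each channel once.
        if self.tau < self.K:
            return self.tau
        # Then play the channel maximizing the UCB index g_k(tau).
        def g(k):
            return self.X[k] / self.T[k] + math.sqrt(math.log(self.tau) / (2 * self.T[k]))
        return max(range(self.K), key=g)

    def update(self, k, ack):
        # Called after each transmission on channel k; ack is True iff an Ack was received.
        self.tau += 1
        self.T[k] += 1
        self.X[k] += int(ack)
```

Each dynamic device would run one such object independently, calling `choose_channel()` before each transmission and `update()` once the `Ack` (or its absence) is observed.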
---

### Thompson Sampling: Bayesian approach

A dynamic device assumes a stochastic hypothesis on the background traffic, modeled as Bernoulli distributions.

- Rewards $r_k(\tau)$ are assumed to be *i.i.d.* samples from a Bernoulli distribution $\mathrm{Bern}(\mu_k)$.
- A **Beta Bayesian posterior** is kept on the mean availability $\mu_k$: $\mathrm{Beta}(1 + X_k(\tau),\, 1 + T_k(\tau) - X_k(\tau))$.
- It starts with a *uniform prior*: $\mathrm{Beta}(1, 1) \sim \mathcal{U}([0,1])$.

1. At each step $\tau \geq 1$, draw a sample from each posterior, $i_k(\tau) \sim \mathrm{Beta}(a_k(\tau), b_k(\tau))$,
2. Choose channel $A(\tau) = \arg\max_k \; i_k(\tau)$,
3. Update the posterior after receiving an `Ack` or a collision.

(A minimal code sketch is given after the experimental setting below.)

> References: [Thompson, 1933], [Kaufmann et al., 2012].

---

### Experimental setting

#### Simulation parameters

- $K = 10$ channels,
- $S + D = 10000$ devices **in total**. The proportion of dynamic devices $D/(S+D)$ varies,
- $p = 10^{-3}$ probability of emission, for all devices,
- Horizon $= 10^6$ time slots (about 1000 messages per device),
- Various settings for the $(S_1,\dots,S_K)$ static device distribution.

#### What do we show (for static $S_i$)

- After a short learning time, MAB algorithms are almost as efficient as the oracle solution!
- Never worse than the naive solution.
- Thompson sampling is more efficient than UCB.
- Stationary algorithms outperform adversarial ones (UCB >> Exp3).
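As announced above, here is a minimal Python sketch of the Thompson sampling procedure, in the same illustrative style as the UCB1 sketch (names are assumptions, not from the articles):

```python
import random

class ThompsonSamplingChannelSelector:
    """Minimal sketch of Thompson sampling for channel selection."""

    def __init__(self, K):
        self.K = K
        self.a = [1] * K   # posterior Beta(a_k, b_k), starting from the uniform prior Beta(1, 1)
        self.b = [1] * K

    def choose_channel(self):
        # Draw one sample from each posterior and play the argmax.
        samples = [random.betavariate(self.a[k], self.b[k]) for k in range(self.K)]
        return max(range(self.K), key=lambda k: samples[k])

    def update(self, k, ack):
        # Bayesian update: a success (Ack) increments a_k, a collision increments b_k.
        if ack:
            self.a[k] += 1
        else:
            self.b[k] += 1
```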
---

### 10% of dynamic devices

---

### 30% of dynamic devices
---

### Dependence on $D/(S+D)$
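For intuition about this dependence, the closed-form success probability of the naive uniformly-random policy (given earlier) can be evaluated for varying proportions of dynamic devices. A minimal sketch, assuming a uniform static occupancy $S_i = S/K$ (one possible setting, not necessarily the one used for the figures):

```python
K = 10          # channels
N = 10_000      # total number of devices S + D
p = 1e-3        # emission probability

def naive_success_probability(D, S_per_channel):
    # P(success | sent) for uniformly random access, from the formula given earlier.
    return sum((1 - p / K) ** (D - 1) * (1 - p) ** S_i * (1 / K)
               for S_i in S_per_channel)

for frac in (0.1, 0.3, 0.5, 0.9):
    D = int(frac * N)
    S_per_channel = [(N - D) // K] * K   # uniform static occupancy (assumption)
    print(f"D/(S+D) = {frac:.0%} -> P(success | sent) = "
          f"{naive_success_probability(D, S_per_channel):.4f}")
```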
---

### Section 6

A brief presentation of a different approach...

Theoretical results for an easier model.

---

### An easier model

#### Easy case

- $M \leq K$ dynamic devices, **always communicating** ($p = 1$).
- Still interesting: many mathematical and experimental results!

#### Two variants

- *With sensing*: a device first senses for the presence of Primary Users (background traffic), then uses the `Ack` to detect collisions. This models the "classical" Opportunistic Spectrum Access problem. Not exactly suited for IoT networks like LoRa or SigFox, but it can model ZigBee, and it can be analyzed mathematically... (*cf.* Wassim's and Navik's theses, 2012, 2017)
- *Without sensing*: like our IoT model but at a smaller scale. Still very hard to analyze mathematically.

---

### Notations for this second model

#### Notations

- $K$ channels, modeled as Bernoulli ($0/1$) distributions of mean $\mu_k$, representing the background traffic from *Primary Users*,
- $M$ devices; device $j$ uses channel $A^j(t) \in \{1,\dots,K\}$ at each time step,
- Reward: $r^j(t) := Y_{A^j(t),\,t} \times \mathbb{1}(C^j(t)) = \mathbb{1}(\text{uplink \& Ack})$,
  + with sensing information $Y_{k,t} \sim \mathrm{Bern}(\mu_k)$,
  + and the no-collision indicator for device $j$: $C^j(t) = \mathbb{1}(\text{alone on arm } A^j(t))$.

#### Goal: *decentralized* reinforcement learning optimization!

- Each player wants to **maximize its cumulated reward**,
- With no central control, and no exchange of information,
- Only possible if: each player converges to one of the $M$ best arms, orthogonally (without collisions).

---

### Centralized regret

#### New measure of success

- Not the network throughput or collision probability,
- Now we study the **centralized regret**:

$$R_T(\mu, M, \rho) := \left(\sum_{k=1}^{M} \mu_k^*\right) T - \mathbb{E}_{\mu}\left[\sum_{t=1}^{T} \sum_{j=1}^{M} r^j(t)\right],$$

where $\mu_1^* \geq \dots \geq \mu_K^*$ denote the sorted means.

#### Two directions of analysis

- Clearly $R_T = O(T)$, but we want a sub-linear regret.
- *What is the best possible performance of a decentralized algorithm in this setting?* → **Lower bound** on the regret of **any** algorithm!
- *Is a given algorithm efficient in this setting?* → **Upper bound** on the regret of **one** algorithm!

---

### Asymptotic Lower Bound on regret (1/2)

For any algorithm, decentralized or not, we have

$$R_T(\mu, M, \rho) = \sum_{k \in M\text{-worst}} (\mu_M^* - \mu_k)\, \mathbb{E}_{\mu}[T_k(T)] + \sum_{k \in M\text{-best}} (\mu_k - \mu_M^*)\, \mathbb{E}_{\mu}[T - T_k(T)] + \sum_{k=1}^{K} \mu_k\, \mathbb{E}_{\mu}[\mathcal{C}_k(T)].$$

#### Small regret can be attained if...

1. Devices can quickly identify the bad arms ($M$-worst) and not play them too much (*number of sub-optimal selections*),
2. Devices can quickly identify the best arms and play them most of the time (*number of optimal non-selections*),
3. Devices can use orthogonal channels (*number of collisions*).

---

### Asymptotic Lower Bound on regret (2/2)

#### Lower bounds

- The first term, $\mathbb{E}_{\mu}[T_k(T)]$ for sub-optimal arm selections, is lower-bounded using technical information-theoretic tools (Kullback-Leibler divergence, entropy),
- And we lower-bound the collisions by... 0: hard to do better!

#### Theorem 1 [Besson & Kaufmann, 2017]

- For any uniformly efficient decentralized policy, and any non-degenerate problem $\mu$,

$$\liminf_{T \to \infty} \frac{R_T(\mu, M, \rho)}{\log(T)} \geq M \sum_{k \in M\text{-worst}} \frac{\mu_M^* - \mu_k}{\mathrm{kl}(\mu_k, \mu_M^*)}.$$

> Where $\mathrm{kl}(x, y) := x \log(x/y) + (1 - x) \log((1-x)/(1-y))$ is the binary Kullback-Leibler divergence.

(A small numerical sketch of this bound is given below.)

---

### Illustration of the Lower Bound on regret
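The constant in front of $\log(T)$ in Theorem 1 is easy to compute numerically from the means; a minimal Python sketch (the example means are hypothetical, and the problem is assumed non-degenerate, i.e., all means strictly between 0 and 1):

```python
import math

def kl_bernoulli(x, y):
    """Binary Kullback-Leibler divergence kl(x, y), assuming 0 < x, y < 1."""
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

def regret_lower_bound_constant(mu, M):
    """Constant in front of log(T) in Theorem 1:
    M * sum over the M-worst arms of (mu_M^* - mu_k) / kl(mu_k, mu_M^*)."""
    mu_sorted = sorted(mu, reverse=True)
    mu_M_star = mu_sorted[M - 1]   # M-th largest mean
    worst = mu_sorted[M:]          # the K - M worst arms
    return M * sum((mu_M_star - mu_k) / kl_bernoulli(mu_k, mu_M_star)
                   for mu_k in worst)

# Example: K = 3 channels, M = 2 devices (hypothetical means).
print(regret_lower_bound_constant([0.9, 0.6, 0.3], M=2))
```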
---

### Algorithms for this easier model

#### Building blocks: separate the two aspects

1. A **MAB policy** to learn the best arms (using the sensing information $Y_{A^j(t),\,t}$),
2. An **orthogonalization scheme** to avoid collisions (using the collision indicators $C^j(t)$).

#### Many different proposals for *decentralized* learning policies

- Recent: MEGA and MusicalChair, [Avner & Mannor, 2015], [Shamir et al., 2016],
- State-of-the-art: the **RhoRand policy** and variants, [Anandkumar et al., 2011],
- **Our proposals**: [Besson & Kaufmann, 2017]
  + With sensing: **RandTopM** and **MCTopM** are mixes between RhoRand and MusicalChair, using UCB indexes or a more efficient index policy (KL-UCB),
  + Without sensing: **Selfish** uses a UCB index directly on the rewards $r^j(t)$: like the first IoT model!

(A simplified sketch of the rank-based orthogonalization idea is given below.)

---

### Illustration of different algorithms
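As referenced above, here is a deliberately simplified Python sketch of the rank-based orthogonalization idea behind RhoRand (this is *not* the exact RhoRand nor MCTopM algorithm, only an illustration of the two building blocks): each device targets the arm whose rank, among its estimated $M$ best arms, matches its own current rank, and redraws its rank after each collision.

```python
import math
import random

class RankBasedPlayer:
    """Simplified rank-based orthogonalization (RhoRand-like idea):
    UCB to learn the top-M arms + random rank redrawn upon collision."""

    def __init__(self, K, M):
        self.K, self.M = K, M
        self.rank = random.randint(1, M)   # current target rank in {1, ..., M}
        self.t = 0
        self.T = [0] * K                   # selections per arm
        self.X = [0] * K                   # sensed availabilities per arm

    def ucb_indexes(self):
        return [float('inf') if self.T[k] == 0 else
                self.X[k] / self.T[k] + math.sqrt(math.log(max(self.t, 2)) / (2 * self.T[k]))
                for k in range(self.K)]

    def choose_arm(self):
        g = self.ucb_indexes()
        top_M = sorted(range(self.K), key=lambda k: g[k], reverse=True)[:self.M]
        return top_M[self.rank - 1]        # arm of my rank among the estimated top M

    def update(self, arm, sensed_free, collided):
        self.t += 1
        self.T[arm] += 1
        self.X[arm] += int(sensed_free)    # sensing information Y_{arm, t}
        if collided:
            self.rank = random.randint(1, self.M)
```

MCTopM refines this kind of scheme with a more careful "musical chair" behavior inside the estimated top-$M$ arms, which is what makes the collision count provably controllable.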
---

### Regret upper-bound for MCTopM-KL-UCB

#### Theorem 2 [Besson & Kaufmann, 2017]

If all $M$ players use MCTopM with KL-UCB, then for any non-degenerate problem $\mu$, there exists a problem-dependent constant $G_{M,\mu}$ such that the regret satisfies:

$$R_T(\mu, M, \rho) \leq G_{M,\mu} \log(T) + o(\log T).$$

#### Remarks

- Hard to prove; we had to carefully design the MCTopM algorithm to conclude the proof,
- For the sub-optimal selections, we *match our lower bound*!
- We also *minimize the number of channel switches*: interesting, as switching costs energy,
- It is not yet possible to know what the best possible control of collisions is...

---

### In this model

The **Selfish** decentralized approach = devices don't use sensing, and learn only from the received acknowledgements.

- Like our first IoT model,
- It works fine in practice!
- Except... when it fails drastically!
- In small problems with $M$ and $K = 2$ or $3$, we found a small probability of failure (*i.e.*, linear regret), and this prevents us from having a generic upper bound on the regret of Selfish. Sadly...

(A crude Monte Carlo sketch of this failure-hunting experiment is given below.)

---

### Illustration of failing cases for Selfish
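Such failing cases can be hunted for with a crude Monte Carlo experiment: run the Selfish approach (UCB1 on raw rewards, no sensing) many times on a tiny problem and compare each run's cumulated reward with the oracle value. A self-contained Python sketch (all parameter values are hypothetical):

```python
import math
import random

def selfish_run(mu, M, horizon, seed):
    """One run of the Selfish approach: M players, each running UCB1 on its raw
    rewards; a transmission succeeds iff the player is alone AND the channel is free."""
    rng = random.Random(seed)
    K = len(mu)
    T = [[0] * K for _ in range(M)]   # selections, per player and channel
    X = [[0] * K for _ in range(M)]   # cumulated rewards, per player and channel
    total = 0
    for t in range(1, horizon + 1):
        choices = []
        for j in range(M):
            untried = [k for k in range(K) if T[j][k] == 0]
            if untried:
                choices.append(rng.choice(untried))
            else:
                g = [X[j][k] / T[j][k] + math.sqrt(math.log(t) / (2 * T[j][k]))
                     for k in range(K)]
                choices.append(max(range(K), key=lambda k: g[k]))
        for j, k in enumerate(choices):
            alone = choices.count(k) == 1
            reward = int(alone and rng.random() < mu[k])
            T[j][k] += 1
            X[j][k] += reward
            total += reward
    return total

# Small problem of the kind where occasional failures are reported (M = K = 2):
mu, M, horizon = [0.9, 0.5], 2, 5000
oracle = sum(sorted(mu, reverse=True)[:M]) * horizon
for seed in range(10):
    print(f"seed {seed}: Selfish reward = {selfish_run(mu, M, horizon, seed)}, "
          f"oracle = {oracle:.0f}")
```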
---

### Perspectives

#### Theoretical results

- MAB algorithms have guarantees for *i.i.d.* settings,
- But here the collisions break the *i.i.d.* hypothesis,
- It is not easy to obtain guarantees in this mixed setting (*i.i.d.* emission process, "game-theoretic" collisions).
- For OSA devices (always emitting), we obtained strong theoretical results,
- But it is harder for IoT devices with a low duty cycle...

#### Real-world experimental validation?

- Radio experiments will help to validate this. Hard!

---

### Other directions of future work

- A *more realistic emission model*: maybe driven by the number of packets in a whole day, instead of an emission probability.
- Validate this at a *larger experimental scale*.
- Extend the theoretical analysis to the large-scale IoT model, first with sensing (*e.g.*, modeling ZigBee networks), then without sensing (*e.g.*, LoRaWAN networks).
- And also conclude the multi-player OSA analysis (remove the hypothesis that devices know $M$, allow arrivals/departures of devices, non-stationary background traffic, etc.).

---

### Conclusion (1/2)

#### We showed

- Simple Multi-Armed Bandit algorithms, used in a Selfish approach by IoT devices in a crowded network, help to quickly learn the best possible allocation of dynamic devices, in a fully decentralized and automatic way,
- For devices with sensing, smarter algorithms can be designed and analyzed carefully,
- Empirically, even though collisions break the *i.i.d.* hypothesis, stationary MAB algorithms (UCB, TS, KL-UCB) outperform more generic algorithms (adversarial ones, like Exp3).

---

### Conclusion (2/2)

#### But more work is still needed...

- **Theoretical guarantees** are still missing for the IoT model, and can be (slightly) improved for the OSA model.
- Maybe study **other emission models**.
- Implement this on **real-world radio devices** (*TestBed*).

#### **Thanks!**

*Any questions?*