# Multi-Armed Bandit Learning in IoT Networks: Learning helps even in non-stationary settings

## Abstract

Setting up the future Internet of Things (IoT) networks will require supporting more and more communicating devices. We show that intelligent devices in unlicensed bands can use Multi-Armed Bandit (MAB) learning algorithms to improve resource exploitation. We evaluate the performance of two classical MAB learning algorithms, UCB1 and Thompson Sampling, to handle the decentralized decision-making of Spectrum Access, applied to IoT networks, as well as the learning performance with a growing number of intelligent end-devices. We show that using learning algorithms does help to fit more devices in such networks, even when all end-devices are intelligent and dynamically change channel. In the studied scenario, stochastic MAB learning provides a significant gain in terms of successful transmission probability, and has near-optimal performance even in non-stationary and non-i.i.d. settings with a majority of intelligent devices.

## 1. Introduction

Unlicensed bands are increasingly used and considered for mobile and LAN communication standards (WiFi, LTE-U), and for Internet of Things (IoT) standards for short-range (ZigBee, Z-Wave, Bluetooth) and long-range (LoRaWAN, SIGFOX, Ingenu, Weightless) communications [?]. This heavy use of unlicensed bands will cause performance drops, and could even compromise the promises of the IoT.

Efficient Medium Access Control (MAC) policies allow devices to avoid interfering traffic and can significantly reduce the spectrum contention problem in unlicensed bands. As end-device battery life is a key constraint of IoT networks, IoT protocols use as little signaling overhead as possible and simple ALOHA-based mechanisms. In this article, we analyze the performance of Multi-Armed Bandit (MAB) algorithms [?], used in combination with a time- and frequency-slotted ALOHA-based protocol. We consider the Upper-Confidence Bound (UCB1) [?] and the Thompson Sampling (TS) [?] algorithms.

MAB learning has already been proposed in Cognitive Radio (CR) [?], and in particular for sensing-based Dynamic Spectrum Access (DSA) in licensed bands [?]. Recently, TS and UCB algorithms have been used to improve spectrum access in (unlicensed) WiFi networks [?], and the UCB1 algorithm was used in an unlicensed, frequency- and time-slotted IoT network [?]. Many recent works show that MAB algorithms work well for real-world radio signals. However, even with only one dynamic user running the learning algorithm, the background traffic, i.e., the traffic of the other devices, is never really stationary or i.i.d. (independent and identically distributed). In recent works such as [?], several devices use bandit algorithms, and the assumptions made by stochastic bandit algorithms are not satisfied: as several agents learn simultaneously, their behavior is neither stationary nor i.i.d. As far as we know, we provide the first study confirming the robustness of stochastic bandit algorithms for decision making in IoT networks with a large number of intelligent devices, a setting in which the environment is not stationary at all, violating the hypotheses required for mathematical proofs of convergence and efficiency of bandit algorithms.

The aim of this article is to assess the potential gain of learning algorithms in IoT scenarios, even when the number of intelligent devices in the network increases, and the stochastic hypothesis becomes more and more questionable. To do so, we consider an IoT network made of two types of devices: static devices that use only one channel (fixed in time), and dynamic devices that can choose the channel for each of their transmissions. Static devices form an interfering traffic, which could as well have been generated by devices using other standards. We first evaluate the probability of collision if dynamic devices select channels at random (naive approach), and if a centralized controller optimally distributes them across channels (ideal approach). These reference scenarios then allow us to evaluate the performance of the UCB1 and TS algorithms in a decentralized network, in terms of successful communication rate, as it reflects the network efficiency.

The rest of this article is organized as follows. The system model is introduced in Section 2. Reference policies are described in Section 3, and MAB algorithms are introduced in Section 4. Numerical results are presented in Section 5.

## 2. System model and notations

As illustrated in Figure 1, we suppose a slotted protocol. All devices share a synchronized time, and know in advance the finite number of available RF channels. In each time slot, devices try to send packets to the unique base station, which listens continuously to all channels, following an ALOHA-based communication (no sensing). Each time slot is divided into two parts: the first is for uplink communications, in which data packets are sent by end-devices to the base station. If only one packet is sent in this part of the slot, the base station can decode it and sends an acknowledgement to the device in the second part. If two or more devices send an uplink packet in the same slot, the uplink packets collide (i.e., there is a collision), and the acknowledgement (Ack) is not transmitted. This way, no collision can occur on the downlink messages, easing the analysis of collisions.

There are two types of end-devices in the network:

• Static end-devices have poor RF abilities, and each of them uses only one channel to communicate with the base station. Their choice is assumed to be fixed in time (stationary) and independent (i.i.d.). The traffic generated by these devices is considered as an interfering traffic for other devices.

• Dynamic (or smart) end-devices have richer RF abilities, they can use all the available channels, by quickly reconfiguring their RF transceiver on the fly. They can also store communication successes or failures they experienced in each channel, in order to change channel, possibly at every time slot.

There are N_c channels, D dynamic end-devices, and S static devices. Furthermore, channel i contains S_i static devices (so S_1 + … + S_{N_c} = S). We focus on dense networks, in which the total number of devices D + S is very large compared to the number of channels N_c. As this problem is only interesting if devices are able to communicate reasonably efficiently with the base station, we assume devices only communicate occasionally, i.e., with a low duty cycle, as is always considered for IoT.

We suppose that all devices follow the same emission pattern, fixed in time, and we choose to model it as a simple Bernoulli process: in any (discrete) time slot, each device sends a packet with the same probability, denoted p (see footnote 1). The parameter p essentially controls the frequency of communication for each device, once the time scale is fixed (i.e., the real time between two messages), and is proportional to the duty cycle.
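As an illustration, the Bernoulli emission pattern can be sketched in a few lines of Python (a minimal sketch; the value p = 0.05 and the horizon are arbitrary choices, not taken from the text):

```python
import random

def emission_pattern(p, n_slots, rng):
    """One device's Bernoulli emission process: in each discrete
    time slot, the device transmits with probability p."""
    return [rng.random() < p for _ in range(n_slots)]

rng = random.Random(42)
p = 0.05
slots = emission_pattern(p, 100_000, rng)
rate = sum(slots) / len(slots)  # empirical duty cycle, concentrates around p
```

Over a long horizon, the empirical transmission frequency `rate` concentrates around the duty-cycle parameter p.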

The goal is to design a simple sequential algorithm, to be applied identically by each dynamic device, in a fully distributed setting (each device runs its own algorithm, from its observations), in order to minimize collisions and maximize the fraction of successful transmissions of all the dynamic devices.

Before explaining how this goal relates to a multi-armed bandit problem, we present some natural baseline policies (i.e., algorithms).

## 3. Three reference policies

This section presents three different policies that will be used to assess the efficiency of the learning algorithms presented in the next section. The first one is naive but can be used in practice, while the two others are very efficient but require full knowledge of the system (i.e., an oracle) and are thus impractical.

### 3.1 Naive policy: Random Channel Selection

We derive here the probability of a successful transmission, for a dynamic device, in the case where all the dynamic devices make a purely random channel selection (i.e., uniform among the N_c channels).

In this case, for one dynamic device transmitting in channel i, a successful transmission happens if it is the only device to choose channel i at that time slot. The probability of successful transmission can be computed as follows, because the static devices in each channel are assumed to be independent, and both static and dynamic devices transmit independently at each time slot, with a fixed probability p: P(success in channel i) = (1 − p)^{S_i} × (1 − p/N_c)^{D−1}.

All dynamic devices follow the same policy in this case, so the probability that any other given dynamic device transmits at that time in channel i is p/N_c, and there are D − 1 other dynamic devices. As they are independent, the probability that no other dynamic device sent in channel i is (1 − p/N_c)^{D−1}. The probability that no static device of channel i transmits is (1 − p)^{S_i}, by the Bernoulli emission hypothesis, and a transmitting dynamic device picks channel i with probability 1/N_c, by uniform choice. Thus we can conclude that the probability of a successful transmission is P = (1 − p/N_c)^{D−1} × (1/N_c) × Σ_{i=1}^{N_c} (1 − p)^{S_i}.

This expression is constant in time, and easy to compute numerically. Comparing the successful transmission rate of any policy against this naive policy is important, as any efficient learning algorithm should outperform it.
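To make the computation concrete, here is a small Python sketch of this closed-form probability (as reconstructed above), cross-checked by a Monte-Carlo simulation of the slotted protocol; all function names and numerical values are ours, chosen for illustration:

```python
import random

def p_success_random(p, S, D):
    """Closed-form success probability for a transmitting dynamic device
    under uniform random channel selection: each of the D-1 other dynamic
    devices hits the same channel with probability p/Nc, and channel i is
    free of static traffic with probability (1-p)^S_i."""
    Nc = len(S)
    no_dynamic = (1 - p / Nc) ** (D - 1)
    return no_dynamic * sum((1 - p) ** Si for Si in S) / Nc

def p_success_monte_carlo(p, S, D, n_trials, rng):
    """Monte-Carlo check: simulate the slotted protocol and count the
    success rate of the transmissions of one tagged dynamic device."""
    Nc = len(S)
    sent = ok = 0
    for _ in range(n_trials):
        if rng.random() >= p:      # tagged device stays silent this slot
            continue
        sent += 1
        ch = rng.randrange(Nc)
        # collision with a static device of that channel?
        busy = any(rng.random() < p for _ in range(S[ch]))
        # collision with another dynamic device choosing the same channel?
        busy = busy or any(rng.random() < p / Nc for _ in range(D - 1))
        ok += not busy
    return ok / sent

rng = random.Random(1)
p, S, D = 0.1, [2, 3, 5], 5
exact = p_success_random(p, S, D)
approx = p_success_monte_carlo(p, S, D, 200_000, rng)
```

The two estimates agree up to Monte-Carlo noise, which supports the independence argument used in the derivation.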

One point deserving more justification is the following: the probability that no other dynamic device sends in channel i in the same slot is (1 − p/N_c)^{D−1}. This holds because each other dynamic device chose its channel uniformly, so the transmission process it induces in channel i is a Bernoulli process of mean p/N_c.

### 3.2 (Unachievable) Optimal oracle policy

We investigate in this section the optimal policy that can be achieved if the dynamic devices have a perfect knowledge of everything, and a fully centralized decision making is possible (see footnote 2). We want to find the stationary repartition of devices into channels that maximizes the probability of a successful transmission.

If the oracle fixes, once and for all, a configuration of dynamic devices in which the number D_i of devices affected to channel i is fixed (in time, i.e., stationary), then this probability is computed as before: P = Σ_{i=1}^{N_c} (D_i/D) (1 − p)^{S_i + D_i − 1}.

Consequently, the optimal allocation vector (D_1, …, D_{N_c}) is the solution of the following real-valued constrained optimization problem: maximize Σ_{i=1}^{N_c} D_i (1 − p)^{S_i + D_i − 1}, subject to Σ_{i=1}^{N_c} D_i = D and D_i ≥ 0 for all i.

• The objective function is quasi-convex [?] in each of its coordinates.

• The Lagrange multipliers method can be applied to this optimization problem, with a quasi-convex objective function, a linear equality constraint Σ_i D_i = D, and linear inequality constraints D_i ≥ 0. Thanks to Theorem 1 from [?] for quasi-convex functions, the strong duality condition is satisfied in this case, so finding the saddle points is enough to find the maximizers.

The saddle point yields a closed-form expression of each D_i in terms of the system parameters, the repartition of static devices and a Lagrange multiplier λ, where W denotes the Lambert W function, the reciprocal bijection of x ↦ x e^x on [0, +∞) [?]. Moreover, the equality constraint Σ_i D_i = D implies that the Lagrange multiplier λ is the solution of a one-dimensional equation.

This equation can be solved numerically, with simple one-dimensional root-finding algorithms. Solving the optimization problem provides the optimal real-valued allocation D_i, which has to be rounded to find the optimal integer number of devices for each channel.
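Since only an integer allocation is needed in practice, the oracle allocation can also be found by exhaustive search on small instances. The following Python sketch brute-forces the oracle objective as reconstructed above (all parameter values are illustrative assumptions):

```python
from itertools import combinations

def success_prob(D_alloc, S, p):
    """Success probability of a uniformly chosen transmitting dynamic device,
    for a fixed allocation D_alloc of dynamic devices to channels: a device
    in channel i succeeds iff none of the S_i static devices and none of the
    D_i - 1 other dynamic devices of that channel transmit."""
    D = sum(D_alloc)
    return sum(Di / D * (1 - p) ** (Si + Di - 1)
               for Di, Si in zip(D_alloc, S) if Di > 0)

def compositions(total, parts):
    """All ways to split `total` devices into `parts` channels
    (stars-and-bars enumeration)."""
    for cuts in combinations(range(total + parts - 1), parts - 1):
        prev, comp = -1, []
        for c in cuts:
            comp.append(c - prev - 1)
            prev = c
        comp.append(total + parts - 2 - prev)
        yield tuple(comp)

p, S, D = 0.1, [0, 3, 8], 8
best = max(compositions(D, len(S)), key=lambda a: success_prob(a, S, p))
```

As expected, the best allocation puts more dynamic devices in the channels with fewer static devices, and strictly beats unbalanced allocations.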

### 3.3 A greedy approach to the oracle strategy

We propose a sequential approximation of the optimal policy: this third solution is a sub-optimal naive policy, simple to set up, but also impractical as it needs an oracle too. End-devices are iteratively inserted into the channel with the lowest current load (i.e., the channel index minimizing the number of devices already present). Once the number of devices in each channel is computed, the probability of successfully sending a message is again given by the previous equation. This is the policy that would be used by dynamic devices if they were inserted one after the other, and if they had a perfect knowledge of the channel loads.
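A possible Python sketch of this greedy insertion (the "load" of a channel is interpreted here as the number of devices, static plus already-inserted dynamic; this interpretation and the tie-breaking rule are assumptions on our part):

```python
def greedy_allocation(D, S):
    """Insert D dynamic devices one by one, each joining the channel
    with the lowest current load (static + already-inserted dynamic).
    Ties are broken by the lowest channel index."""
    load = list(S)
    alloc = [0] * len(S)
    for _ in range(D):
        i = load.index(min(load))
        alloc[i] += 1
        load[i] += 1
    return alloc

alloc = greedy_allocation(8, [0, 3, 8])  # the crowded channel 2 stays empty
```

On this toy instance the greedy rule, like the oracle, avoids the heavily loaded channel.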

## 4. Sequential policies based on bandit algorithms

We now present the stochastic Multi-Armed Bandit (MAB) model, and the two stochastic MAB algorithms used in our experiments [?]. While the stochastic MAB model has been used to describe some aspects of Cognitive Radio systems, it is in principle not suitable for our IoT model, due to the non-stationarity of the channel occupancy caused by the learning policies used by the dynamic objects.

### 4.1 Stochastic Multi-Armed Bandits

A Multi-Armed Bandit problem is defined as follows [?]. There is a fixed number K of levers, or "arms", and a player has to choose one arm at each discrete time step t, the chosen arm being denoted A(t). Selecting arm k at time t yields a (random) reward r_k(t), and the goal of the player is to maximize the sum of his rewards, Σ_{t=1}^T r_{A(t)}(t).

A well-studied version of this problem is the so-called "stochastic" MAB, where the sequence of rewards drawn from a given arm k is assumed to be independent and identically distributed (i.i.d.) under some distribution ν_k that has a mean μ_k. Several types of reward distributions have been considered, for example distributions belonging to a one-dimensional exponential family (e.g., Gaussian, Exponential, Poisson or Bernoulli distributions). We consider Bernoulli bandit models, in which r_k(t) ∈ {0, 1}, that is, ν_k is a Bernoulli distribution and P(r_k(t) = 1) = μ_k.

The problem parameters μ_1, …, μ_K are unknown to the player, so to maximize his cumulated reward, he must learn the distributions of the channels, to be able to progressively focus on the best arm (i.e., the arm with the largest mean). This requires tackling the so-called exploration-exploitation dilemma: a player has to try all arms a sufficient number of times to get a robust estimate of their qualities, while not selecting the worst arms too many times.

In a Cognitive Radio application, arms model the channels, and players are the dynamic end-devices. For example, in the classical OSA setting with sensing [?], a single dynamic device (a player) sequentially tries to access channels (the arms), and collects a reward of 1 if the channel is available and 0 otherwise. So rewards represent the availability of channels, and the parameter μ_k represents the mean availability of channel k.

Before discussing the relevance of a multi-armed bandit model for our IoT application, we present two bandit algorithms, UCB1 and Thompson Sampling, which both strongly rely on the assumption that rewards are i.i.d..

### 4.2 The UCB1 algorithm

A naive approach could be to use an empirical mean estimator of the rewards in each channel, and select the channel with the highest estimated mean at each time step. This greedy approach is known to fail dramatically [?]: with this policy, the selection of arms depends highly on the first draws; if the first transmission in a channel fails, the device will never use it again. Rather than relying only on the empirical mean reward, Upper Confidence Bound algorithms instead use a confidence interval on the unknown mean of each arm, which can be viewed as adding an exploration "bonus" to the empirical mean. They follow the "optimism in the face of uncertainty" principle: at each step, they play according to the best model, as the statistically best possible arm (i.e., the one with the highest upper confidence bound) is selected.

More formally, for one device, let N_k(t) be the number of times channel k was selected up to time t. The empirical mean estimator μ̂_k(t) of channel k is defined as the mean reward obtained by selecting it up to time t. For UCB1, the confidence term is B_k(t) = sqrt(α log(t) / N_k(t)), giving the upper confidence bound U_k(t) = μ̂_k(t) + B_k(t), which is used by the device to decide the channel for communicating at time step t + 1: A(t + 1) = argmax_k U_k(t). UCB1 is called an index policy.

The UCB1 algorithm uses an exploration parameter α, originally set to a conservative value [?]; smaller values are empirically known to work better (uniformly across problems), and are advised by the theory [?]. In our model, every dynamic device implements its own UCB1 algorithm, independently. For one device, the time index t counts the number of times it accessed the network (following its Bernoulli transmission process, i.e., its duty cycle), not the total number of time slots from the beginning, as rewards are only obtained after a transmission, and IoT objects only transmit sporadically, due to low transmission duty cycles.
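A minimal Python sketch of one device's UCB1 (the exact index form and the choice α = 0.5 are common conventions assumed here, not taken from the text), with a toy run on stationary i.i.d. Bernoulli channels:

```python
import math
import random

class UCB1:
    """UCB1 index policy for one device; alpha is the exploration parameter."""
    def __init__(self, n_channels, alpha=0.5):
        self.alpha = alpha
        self.counts = [0] * n_channels   # N_k(t): selections of channel k
        self.sums = [0.0] * n_channels   # cumulated rewards from channel k
        self.t = 0                       # device-local time (transmissions)

    def choose(self):
        self.t += 1
        for k, n in enumerate(self.counts):
            if n == 0:                   # initialization: try each channel once
                return k
        # index = empirical mean + exploration bonus
        ucb = [self.sums[k] / self.counts[k]
               + math.sqrt(self.alpha * math.log(self.t) / self.counts[k])
               for k in range(len(self.counts))]
        return max(range(len(ucb)), key=ucb.__getitem__)

    def update(self, k, reward):
        self.counts[k] += 1
        self.sums[k] += reward

# Toy check on Bernoulli channels with hypothetical mean availabilities:
rng = random.Random(7)
mus = [0.2, 0.5, 0.8]
dev = UCB1(len(mus))
for _ in range(5000):
    k = dev.choose()
    dev.update(k, 1 if rng.random() < mus[k] else 0)
```

After a few thousand steps, the device concentrates its selections on the best channel while still occasionally exploring the others.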

### 4.3 Thompson Sampling

Thompson Sampling [?] was introduced in 1933 as the very first bandit algorithm, in the context of clinical trials (in which each arm models the efficacy of one treatment across patients). Given a prior distribution on the mean of each arm, the algorithm selects the next arm to draw based on samples from the conjugate posterior distribution, which for Bernoulli rewards is a Beta distribution.

A Beta prior, initially uniform (Beta(1, 1)), is assumed on each mean μ_k, and at time t the posterior is Beta(a_k(t), b_k(t)). After every channel selection, the posterior is updated so that a_k(t) and b_k(t) count the number of successful and failed transmissions made on channel k. So if the Ack message is received, a_k(t+1) = a_k(t) + 1 and b_k(t+1) = b_k(t); otherwise, a_k(t+1) = a_k(t) and b_k(t+1) = b_k(t) + 1. Then, the decision is made by sampling an index for each arm, at each time step t, from the arm posteriors: X_k(t) ∼ Beta(a_k(t), b_k(t)), and the chosen channel is simply the channel with the highest index X_k(t). For this reason, Thompson Sampling is a randomized index policy.
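These sampling and update steps can be sketched in a few lines of Python (a minimal illustration; the channel availabilities in the toy run are arbitrary):

```python
import random

class ThompsonSampling:
    """Bernoulli Thompson Sampling for one device:
    a Beta(a_k, b_k) posterior per channel, updated from Ack feedback."""
    def __init__(self, n_channels, rng):
        self.a = [1] * n_channels   # successes + 1 (uniform Beta(1,1) prior)
        self.b = [1] * n_channels   # failures  + 1
        self.rng = rng

    def choose(self):
        samples = [self.rng.betavariate(a, b) for a, b in zip(self.a, self.b)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, k, ack_received):
        if ack_received:
            self.a[k] += 1
        else:
            self.b[k] += 1

# Toy check on Bernoulli channels with hypothetical mean availabilities:
rng = random.Random(3)
mus = [0.2, 0.5, 0.8]
dev = ThompsonSampling(len(mus), rng)
pulls = [0, 0, 0]
for _ in range(5000):
    k = dev.choose()
    pulls[k] += 1
    dev.update(k, rng.random() < mus[k])
```

The posterior counts stay consistent with the number of selections, and the best channel quickly dominates the selections.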

Thompson Sampling, although being very simple, is known to perform well for stochastic problems, for which it was proven to be asymptotically optimal [?]. It is known to be empirically efficient, and for these reasons it has been used successfully in various applications, including on problems from Cognitive Radio [?], and also in previous work on decentralized IoT-like networks [?].

### 4.4 A bandit model for IoT

Our IoT application is challenging in that there are multiple players (the dynamic devices) interacting with the same arms (the channels), without any centralized communication (they do not even know the total number of dynamic devices).

Considered alone, each dynamic device implements a learning algorithm to play a bandit game; the device is consequently a smart device. In each time slot, if it has to communicate (which happens with probability p), then it chooses a channel and receives a reward of 1 if the transmission is successful, and 0 otherwise. Each device aims at maximizing the sum of the rewards collected during its communication instants, which indeed amounts to maximizing its fraction of successful transmissions. Apart from the modified time scale (rewards are no longer collected at every time step), this looks like a bandit problem. However, it cannot be modeled as a stochastic MAB, as the rewards are clearly not i.i.d.: they depend not only on the (stationary, i.i.d.) behavior of the static devices, but also on the behavior of the other smart devices, which is not stationary (because of learning).

Despite this, we show in the next section that running a stochastic bandit algorithm for each device based on its own rewards is surprisingly successful.
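To illustrate this point, here is a self-contained Python sketch of the decentralized setting: several dynamic devices, each running its own Thompson Sampling instance fed only by its own Ack feedback, share channels with Bernoulli static traffic (all numerical values are illustrative assumptions, not the experimental setup of Section 5):

```python
import random

def simulate(n_channels, S, n_dynamic, p, horizon, rng):
    """Each dynamic device runs its own Bernoulli Thompson Sampling,
    observing only its own Ack/no-Ack feedback.
    S[i] = number of static devices using channel i; p = duty cycle."""
    a = [[1] * n_channels for _ in range(n_dynamic)]  # posterior successes + 1
    b = [[1] * n_channels for _ in range(n_dynamic)]  # posterior failures + 1
    sent = ok = 0
    for _ in range(horizon):
        # static interfering traffic in each channel, for this slot
        static_busy = [any(rng.random() < p for _ in range(S[i]))
                       for i in range(n_channels)]
        # each dynamic device independently decides whether and where to send
        senders = {}
        for d in range(n_dynamic):
            if rng.random() < p:
                k = max(range(n_channels),
                        key=lambda i: rng.betavariate(a[d][i], b[d][i]))
                senders.setdefault(k, []).append(d)
        # a transmission succeeds iff it is alone in its channel
        for k, devs in senders.items():
            success = len(devs) == 1 and not static_busy[k]
            for d in devs:
                sent += 1
                ok += success
                (a if success else b)[d][k] += 1
    return ok / max(sent, 1), a, b

rng = random.Random(0)
S = [50, 0, 0, 0]    # channel 0 crowded with static devices, others free
rate, a, b = simulate(4, S, 3, 0.1, 20_000, rng)
```

Even though each device's rewards depend on the other learning devices (so the environment is not stationary), the devices quickly learn to avoid the crowded channel, and the overall success rate is high.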

#### Multi-Player MAB with collision avoidance?

Another idea could be to try to use a multi-player MAB model, as proposed by [?], to describe our problem.

In that case, the effects of the static and dynamic devices are decoupled, and arms only model the availability of the channels in the absence of dynamic devices: they are i.i.d. with mean μ_k. Moreover, dynamic devices are assumed to be able to sense a channel before sending [?], and thus communicate only if no static device is detected on the channel. The smart devices try to learn the arms with the highest means, while coordinating to choose different arms, i.e., to avoid collisions in their choices, in a decentralized manner. However, in this model it is assumed that the multiple agents can detect that they experienced a collision with another agent, which is unrealistic for the problem at stake: our model of smart device can neither do sensing nor differentiate collisions between smart and non-smart devices.

Instead of using MAB algorithms assuming a stochastic hypothesis on the system, we could try to use MAB algorithms designed to tackle a more general problem, making no hypothesis on the interfering traffic. Adversarial MAB algorithms are such a broader family, and a well-known and efficient example is the Exp3 algorithm [?]. Empirically, however, this algorithm turned out to perform worse than both UCB1 and TS in the same experiments. Contrary to the two stochastic algorithms, the use of Exp3 is correctly justified, even in the non-stationary and non-i.i.d. setting, as its performance guarantees hold in any setting. But it is not so surprising that it performs worse, as the theoretical performance guarantees of adversarial MAB algorithms are an order of magnitude worse than those of stochastic ones. More is left on this aspect for future work.
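For reference, a minimal Python sketch of the Exp3 algorithm (assuming the classical exponential-weights variant of Auer et al. is the one meant; the exploration rate gamma and the toy arm means are arbitrary choices):

```python
import math
import random

class Exp3:
    """Exp3 adversarial bandit algorithm: exponential weights with
    importance-weighted reward estimates; gamma is the exploration rate."""
    def __init__(self, n_arms, gamma, rng):
        self.w = [1.0] * n_arms
        self.gamma = gamma
        self.rng = rng

    def probs(self):
        total = sum(self.w)
        K = len(self.w)
        return [(1 - self.gamma) * wk / total + self.gamma / K
                for wk in self.w]

    def choose(self):
        u, acc = self.rng.random(), 0.0
        for k, pk in enumerate(self.probs()):
            acc += pk
            if u < acc:
                return k
        return len(self.w) - 1

    def update(self, k, reward):
        pk = self.probs()[k]
        est = reward / pk            # unbiased importance-weighted estimate
        self.w[k] *= math.exp(self.gamma * est / len(self.w))

rng = random.Random(5)
mus = [0.2, 0.5, 0.8]                # hypothetical stationary availabilities
alg = Exp3(len(mus), gamma=0.1, rng=rng)
for _ in range(5000):
    k = alg.choose()
    alg.update(k, 1.0 if rng.random() < mus[k] else 0.0)
```

On a stationary problem Exp3 does identify the best arm, but its forced uniform exploration (the gamma term) keeps it from exploiting as aggressively as UCB1 or TS, consistent with the empirical observation above.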

We should instead write the mean reward of an arm as a time-dependent quantity: the probability that no static and no other dynamic device chose to send a packet on channel i at that time t.

## 5. Experiments and numerical results

We suppose a network with one IoT base station and a large number of end-devices. Each device sends packets following a Bernoulli process of probability p (a realistic order of magnitude: one packet sent every few minutes). The RF band is divided into N_c channels. Each static device only uses one channel, and their repartition in the channels is uneven, with the proportions kept fixed when the number of static devices decreases. The dynamic devices have access to all the channels, and use learning algorithms. We simulate the network during a large number of discrete time slots, during which each device transmits many packets (this is the learning horizon of each algorithm).

Figure 3 presents the evolution of the successful transmission rate, as a function of time. The two MAB algorithms, UCB1 and Thompson Sampling (TS), are compared against the naive random policy from below, and the two oracle policies (optimal and greedy) from above. The results are displayed for several proportions of the traffic being generated by dynamic devices.

We can see in Figure 3 that the TS algorithm (in red) outperforms the UCB1 algorithm (in blue) when the proportion of dynamic end-devices is below 50%. When the proportion of dynamic end-devices is higher, both algorithms have almost the same performance, and perform well after very few transmissions (quick convergence). Moreover, we can see in Figures ?, ?, and ? that both have a better success rate than the random policy, and that their probability of successful transmission lies between those of the oracle optimal and oracle suboptimal policies. For instance, after enough transmissions, using UCB1 instead of the naive uniform policy clearly improves the successful transmission rate, and using Thompson Sampling improves it even more. Increasing the proportion of dynamic end-devices decreases the gap between the optimal and random policies: the more dynamic devices there are, the less useful learning algorithms are, and for networks with only dynamic devices, the random policy is essentially as efficient as the optimal one, as seen in Figures ? and 4.

To better assess the evolution of the optimal policy compared to the random one, we display in Figure 4 the evolution of the gain, in terms of successful transmission rate, provided by the optimal oracle and the two learning policies at the end of the simulation horizon. We can see that when the proportion of dynamic end-devices is low, the optimal policy provides a large improvement compared to random channel selection. The TS algorithm always provides near-optimal performance, but the UCB1 algorithm has a slower rate of convergence and consequently performs worse at the end of the same horizon, providing a smaller gain for the same proportion of dynamic devices.

Figure 4 also shows that learning keeps near-optimal performance even when the proportion of dynamic devices becomes large. Note that when this proportion increases, the assumptions of a stochastic MAB model are clearly violated, and there is no justification for the efficiency of the TS and UCB1 algorithms. Hence, it is surprising to obtain near-optimal performance with stochastic MAB algorithms applied to partly dynamic and fully dynamic scenarios.

## 6. Conclusion

In this article, we evaluated the performance of MAB learning algorithms in IoT networks, with a focus on the convergence of the algorithms, in terms of successful transmission rates, when the proportion of intelligent dynamic devices changes. Concretely, increasing this proportion allows more objects to be inserted in the same network, while maintaining a good Quality of Service. We showed that UCB1 and TS have near-optimal performance, even when their underlying i.i.d. assumption is violated by the many "intelligent" end-devices.

This is both a surprising and a very encouraging result, showing that bandit algorithms tailored for a stochastic model remain useful in broader settings. The fully decentralized application of classic stochastic MAB algorithms is almost as efficient as the best possible centralized policy in this setting, after a short learning period, even though the dynamic devices cannot communicate with each other and do not know the system parameters. We will investigate this behavior in order to understand it better theoretically. We will also experiment more with adversarial algorithms, to confirm that they work less efficiently than stochastic bandit algorithms in our non-stochastic setting.

Moreover, for the sake of simplicity, we supposed that all devices use the same standard. Our future work will consider more realistic interference scenarios and IoT networks, with, e.g., non-slotted time, or more than one base station.

We presented a simple model for IoT networks, slotted in frequency and in (discrete) time, where some of the end-devices are assumed to stay in one fixed channel and the others can reconfigure themselves dynamically, possibly accessing a different channel for every message they send. In a LoRaWAN-like communication protocol with acknowledgements (Ack), dynamic end-devices can use the reception of an Ack to know whether their choice of channel was smart, and so we presented the use of Reinforcement Learning algorithms to help dynamic devices improve their performance in this setting, as for Opportunistic Spectrum Access in Cognitive Radio.

If there are not many dynamic devices, the stationarity hypothesis on the other devices allows the problem to be modeled as a Multi-Armed Bandit (MAB) problem, but this stochastic model becomes less and less accurate as the number of dynamic devices grows. We focused on decentralized learning, where each dynamic device makes its own decision based only on its previous observations, with no central control; and we applied two classical bandit algorithms, UCB1 and Thompson Sampling, for the decision making of the dynamic devices. Their performance was measured in terms of the improvement of the probability of successful transmission, and they were compared against the best possible allocation (if a central controller with full knowledge of the system were possible, computed theoretically and approximated numerically), and against a naive, fully uniform random spectrum access.

We showed that both learning algorithms always perform better than the naive allocation policy, and that after a short learning period, they are almost as efficient as the best allocation scheme, even when the proportion of dynamic devices is high, and even in the limit scenario where there is no static device. This is both a surprising and a very encouraging result, showing that bandit algorithms tailored for stochastic problems remain useful when the problem becomes less stochastic.

#### Acknowledgements

This work is supported by the French National Research Agency (ANR), under the projects SOGREEN (grant code: ANR-14-CE28-0025-02) and BADASS (grant code: ANR-16-CE40-0002), by Région Bretagne, France, by the French Ministry of Higher Education and Research (MENESR) and by ENS Paris-Saclay.

Note: The simulation code used for the experiments in Section 5 is written for MATLAB or GNU Octave, and is open-sourced under the MIT License, at: https://Bitbucket.org/SCEE_IETR/MAB_for_Non-stochastic_IoT_networks.

### A.1 Possible extensions of this model

This work could easily be generalized with two different transmission probabilities, one for static and one for dynamic devices, if the two types had different transmission patterns, and less easily with one probability per device. Also, other emission patterns could be considered, instead of a Bernoulli process for each device. We prefer to consider that all devices have the same transmission probability p, to keep the notations as simple as possible.

Another extension could be to consider not a Bernoulli process (or any random process), but a fixed rate of transmission, e.g., one transmission a day. Then, in addition to deciding the channel for communication, each device would also have to decide when to communicate. This is another direction of research, which we will investigate in the future.

### A.2 Proof of Proposition

This short subsection contains the missing details of the proof of Proposition ?.

1. First, we need to justify that the objective function is quasi-convex, in each of its coordinates.

2. Then, we develop the computation of the optimal allocation, as a closed-form expression of the system parameters (p, N_c), the repartition of static devices (S_i), and the Lagrange multiplier λ.

#### Quasi-convexity

• For each channel i, the function of one real variable obtained by fixing all coordinates of the objective but D_i is quasi-convex on its domain, i.e., for any two points, its value at any convex combination of the two points is at most the maximum of its values at those points. Indeed, the value at the convex combination can be upper bounded by a convex combination of the two values, which is itself smaller than the larger of the two values.

• The objective function is quasi-convex in each of its coordinates: in each coordinate D_i, it is the sum of a quasi-convex function (thanks to the first point) and of terms that do not depend on that coordinate.

#### Derivation of the Lagrange multiplier solution

Let us now prove the expression of the optimal allocation given above.

• On the domain of the constraints, we introduce the Lagrangian of the problem; its partial derivative with respect to each D_i can be computed in closed form.

• The gradient of the Lagrangian vanishes if and only if each D_i satisfies a transcendental equation of the form x e^x = c, for a constant c depending on p, S_i and the Lagrange multiplier λ.

• By using the Lambert W function [?], the reciprocal of x ↦ x e^x, we can solve this equation and obtain a closed-form expression of D_i, keeping only the non-negative solution. This gives the i-th coordinate of the unique saddle point of the Lagrangian, and so the unique solution to the maximization problem, thanks to strong duality.


### Footnotes

1. In the experiments below, the probability p is small, because in a crowded network p has to be small enough for all devices to be able to communicate successfully (on average).
2. This optimal policy needs an oracle seeing the entire system, and assigning all the dynamic devices, once and for all, in order to avoid any signaling overhead.