Aggregation of Multi-Armed Bandits Learning Algorithms for Opportunistic Spectrum Access


Multi-armed bandit algorithms have been recently studied and evaluated for Cognitive Radio (CR), especially in the context of Opportunistic Spectrum Access (OSA) [?]. Several solutions have been explored based on various models, but it is hard to exactly predict which could be the best for real-world conditions at every instants. Hence, expert aggregation algorithms can be useful to select on the run the best algorithm for a specific situation. Aggregation algorithms, such as Exp4 dating back from 2002 [?], have never been used for OSA learning, and we show that it appears empirically sub-efficient when applied to simple stochastic problems. In this article, we present an improved variant, called Aggregator. For synthetic OSA problems modeled as Multi-Armed Bandit (MAB) problems, simulation results are presented to demonstrate its empirical efficiency. We combine classical algorithms, such as Thompson sampling, Upper-Confidence Bounds algorithms ( and variants), and Bayesian or Kullback-Leibler UCB. Our algorithm offers good performance compared to state-of-the-art algorithms (Exp4, CORRAL or LearnExp [?]), and appears as a robust approach to select on the run the best algorithm for any stochastic MAB problem, being more realistic to real-world radio settings than any tuning-based approach.


Cognitive Radio (CR), introduced in [?], states that a radio, by collecting information about its environment, can dynamically reconfigure itself in order to improve its functionality regarding various metrics. One of the main direction of research, called Dynamic Spectrum Access [?], is focused on the spectrum access when devices reconfigure themselves by simply changing the frequency of their wireless communication. The model of Opportunistic Spectrum Access (OSA) for CR considers one Secondary User (SU) trying to use a licensed radio network occupied by Primary Users (PU). The network usage from the PU determines the availability patterns of the radio channels, and the goal of the SU is to communicate as efficiently as possible, without interfering with the PU. Thus at each step, a SU first senses one channel, and only transmits if this channel is unoccupied by a PU.

A common simple model in the literature is to describe the PU impact on the availability of the channels in the following way: channels are independent and identically distributed (i.i.d.), and their qualities follow parametric distributions, e.g., Bernoulli of means for availabilities when dealing with binary sensing feedback. The SU has to select the best expected channel each time to maximize its throughput: if successful communications are seen as rewards, the SU has to maximize its cumulative rewards, as in the Multi-Armed Bandit (MAB) problem [?].

MAB learning algorithms have been shown to be useful for the OSA setting [?], and algorithms and other variants (e.g., -or Bayes-UCB, [?]) have been successfully applied to both numerical and physically simulated CR problems [?]. The performance of such learning algorithm can be measured by different criteria. For example, it is common in the bandit literature to study the regret [?] which compares the loss in rewards between the algorithm and the full-knowledge strategy which always picks the best arm, i.e., the most available of mean . Good algorithms are expected to have slow-growing expected regret, but other criterion include the best arm pull frequency, or the throughput of the SU.

Many different learning algorithms have been proposed by the machine learning community, and most of them depend on several parameters, for instance for , the prior for Thompson sampling or BayesUCB, the function for - etc. Every time a new MAB algorithm is introduced, it is compared and benchmarked on some bandit instance, parameterized by , usually by focusing on its expected regret . For a known and specific instance, simulations help to select the best algorithm in a pool of algorithms. But when one wants to tackle an unknown real-world problem, one expects to be efficient against any problem, of any kind, size and complexity: ideally one would like to use an algorithm that can be applied identically against any problem. To choose the best algorithm, two approaches can be followed: either extensive benchmarks are done beforehand – if this is possible – to select the algorithm and its optimal parameters, or an adaptive algorithm is used to learn on the fly its parameters. We present a simple adaptive solution, that aggregates several learning algorithms in parallel and adaptively chooses which one to trust the most.

This paper is organized as follows: our OSA model is described in Section 2, and MAB learning algorithms are briefly presented in Section 3. We explain in Section 4 how to combine such algorithms for aggregation. Our proposed algorithm, called Aggregator, is detailed in Section 4.1, with numerical experiments presented in Section 5, comparing the regret of several algorithms against different aggregation algorithms. Theoretical guarantees are shortly discussed in Section 6, and Section 7 concludes.

2OSA Model for Cognitive Radio

We consider radio channels, also called arms, of different characteristics, unknown to the user. The radio protocol is slotted in both time and frequency, meaning that at each time step , the Secondary User (SU) tries to communicate in a channel . In the OSA model, the SU first senses one channel at a time, and can use it to communicate only if it was sensed free from any PU (they have full priority over the SU).

In the stochastic model considered in this paper, after choosing the arm , it is assumed that the sensing provides a reward , randomly drawn from a certain distribution depending on the arm index. Rewards are assumed to be bounded in , and generally they follow one-parameter exponential families. We present our algorithm by restricting to Bernoulli distributions1, for sake of simplicity, meaning that arm has a parameter and rewards are drawn from , , which can be simply interpreted by the SU: it is if the channel is not used by any PU during the time slot , and is otherwise. Aggregation algorithms usually deal with losses rather rewards [?], so we also introduce the quantity .

3Classical MAB Algorithms : , -, Ts

An algorithm has to maximize its cumulative rewards, by choosing the arm at time , or, equivalently, to minimize its pseudo-regret is defined as

where is the mean of the best arm: , for an expectation taken on the arm distributions. This pseudo-regret is random so we prefer to focus on the expected regret, .

The algorithm [?] selects the arm with highest index, where each index is an Upper Confidence Bound on the unknown mean, computed as the sum of the empirical mean of each arm (if 1(A(t)=k) and ), and an exploration term defined by . is known to yield logarithmic regret on all problems, but on some specific instance a better value of may be found empirically [?].

The - algorithm is similar, but instead it uses a Kullback-Leibler divergence function to compute a statistically better UCB [?]. As a different KL function exists for each different exponential family, this algorithm also requires a prior knowledge of the problem to be efficient.

The Thompson sampling (TS) [?] algorithm is Bayesian: it maintains a posterior distribution on each means (e.g., Beta posteriors for Bernoulli arms), updated after each observation, and chooses an arm by sampling a random mean from each posterior and playing the arm with highest mean. The posterior distribution has to be chosen according to the exponential family as the conjugated posterior.

Both , - and TS have been proved to have logarithmic regrets [?], meaning that , in Bernoulli bandit problems and also under more general assumptions. The constant in the big- is important, and [?] showed that in this setting, the regret of any (uniformly efficient) algorithm is at least when is large, for a constant depending only on the problem instance : (with a unique best arm), where is the binary KL divergence between two Bernoulli distributions of parameters and .

4Aggregating Bandit Algorithms

We assume to have MAB algorithms, , and let be an aggregation algorithm, which runs the algorithms in parallel (with the same slotted time), and use them to choose its channels based on a voting from their decisions. depends on a pool of algorithms and a set of parameters. We would like that performs almost as well as the best of the , with a good choice of its parameters, independently of the MAB problem. Ideally should perform similarly to the best of the . To simplify the presentation, we only aggregate bandit algorithms that give deterministic recommendations: one arm is chosen with probability and the others with probability . However, both Exp4 and Aggregator can be adapted to aggregate randomized bandit algorithms, i.e., algorithms that output a probability distribution over the arms at each time step, and draw the next selected arm according to this distribution.

The aggregation algorithm maintains a probability distribution on the algorithms , starting from a uniform distribution: is the probability of trusting the decision made by algorithm at time . then simply performs a weighted vote on its algorithms: it decides whom to trust by sampling from , then follows ’s decision. The main questions are then to know what observations (i.e., arms and rewards) should be given as feedback to which algorithms, and how to update the trusts at each step, and our proposal Aggregator differs from Exp4 on these very points.

4.1The Aggregator algorithm

Our proposed Aggregator is detailed in Algorithm 1 below.

At every time step, after having observed a loss for its chosen action , the algorithm updates the trust probabilities from to by a multiplicative exponential factor (using the learning rate and the unbiased loss). Only the algorithms who advised the last decision get their trust updated, in order to trust more the “reliable” algorithms. The loss estimate is unbiased in the following sense. If one had access to the rewards (or the losses ) for all arms , the loss incurred by algorithm at time would be . This quantity can only be observed for those algorithms for which . However, by dividing by the probability of observing this recommendation, one obtains an unbiased estimate of . More precisely,


satisfies , for all , where the expectation is taken conditionally to the history of observations up to round , . Observe that for all algorithms such that , and otherwise.

An important feature of Aggregator is the feedback provided to each underlying bandit algorithm, upon the observation of arm . Rather than updating only the trusted algorithms (that is the algorithms which would have drawn arm ) with the observed reward , we found that updating each algorithm with the (original) loss observed for arm improves the performance drastically. As expected, the more feedback they get, the faster the underlying algorithms learn, and the better the aggregation algorithm will be [?].

Regarding the update of , one can note that the trust probabilities are not all updated before the normalization step, and an alternative would be to increase if and to decrease it otherwise. It would not be so different, as there is a final renormalization step, and empirically this variation has little impact on the performance of Aggregator.

4.2 Aggregator versus Exp4

The Exp4 algorithm (see, e.g. [?]) is similar to Aggregator, presented in Algorithm ?, but differs in the two following points. First, is sampled first and the arm chosen by is trusted, whereas Aggregator needs to listen to the decisions to perform the updates on , and Exp4 gives back an observation (arm, reward) only to the last trusted algorithm whereas Aggregator gives it to all algorithms. Second, after having computed the loss estimate , Exp4 updates the estimated cumulative loss for each algorithm, . Instead of updating multiplicatively as we do for our proposal, Exp4 recomputes it, proportionally to .

The sequence of non-negative learning rates used by Exp4 can be arbitrary, it can be constant but should be non-increasing [?]. If the horizon is known (and fixed), the best choice is given by . However, for real-world communication problems, it is unrealistic to assume a fixed and known time horizon, so we prefer the alternative horizon-free choice of learning rates, suggested by [?]. We compare both approaches empirically, and the second one usually performs better. We also stick to this choice of for Aggregator.

5Experiments on Simulated MAB Problems

Figure 1: On a simple Bernoulli problem (semi\log-y scale).
Figure 1: On a “simple” Bernoulli problem (semi- scale).
Figure 2: On a harder Bernoulli problem, they all have similar performances, except LearnExp.
Figure 2: On a “harder” Bernoulli problem, they all have similar performances, except LearnExp.
Figure 3: On an easy Gaussian problem, only Aggregator shows reasonable performances, thanks to BayesUCB and Thompson sampling.
Figure 3: On an “easy” Gaussian problem, only Aggregator shows reasonable performances, thanks to BayesUCB and Thompson sampling.

We focus on i.i.d. MAB problems, with channels2. For Bernoulli problem, the first one uses , and the second one is divided in three groups: 2 very bad arms (), 5 average arms ( to ) and 3 very good arms (). The horizon is set to (but its value is unknown to all algorithms), and simulations are repeated times, to estimate the expected regret. This empirical estimation of the expected regret is plotted below, as a function of , comparing some algorithms (for ), and their aggregation with Aggregator (displayed in orange bold). The Lai & Robbins’ logarithmic lower-bound [?] is also plotted, and it is crucial to note that it is only asymptotic and to not be surprised by having regret curves smaller than the lower-bound (e.g., for the easier Bernoulli problem). Note that for each of the simulations, we choose to generate all the rewards beforehand, i.e., one full matrix for every repetition, in order to compare the algorithms on the same realizations of the MAB problem.

We compare our Aggregator algorithm, as well as other aggregation algorithms, Exp4, CORRAL and LearnExp (both with default parameters) [?]. The aggregated algorithms consist in a naive uniform exploration (to have at least one algorithm with bad performances, i.e. linear regret, but it is not included in the plots), with , three - with Bernoulli, Gaussian and exponential functions, and BayesUCB and Thompson sampling with uniform prior.

Figures Figure 1 and Figure 4 are in semi- scale, this helps to see that the best algorithms can be an order of magnitude more efficient than the worst, and the Aggregator performs similarly to the best ones, when the other aggregation algorithms are usually amongst the worst. Figure 5 is in semi- scale to show that the regret of efficient algorithms are indeed logarithmic.

Figure 4: On a harder problem, mixing Bernoulli, Gaussian, Exponential arms, with 3 arms of each types with the same mean.
Figure 4: On a harder problem, mixing Bernoulli, Gaussian, Exponential arms, with 3 arms of each types with the same mean.
Figure 5: The semi\log-x scale clearly shows the logarithmic growth of the regret for the best algorithms and our proposal Aggregator, even in a hard mixed problem (cf. Figure ).
Figure 5: The semi- scale clearly shows the logarithmic growth of the regret for the best algorithms and our proposal Aggregator, even in a hard “mixed” problem (cf. Figure ).

For Bernoulli problems (Figures Figure 1 and Figure 2), UCB with , Thompson sampling, BayesUCB and - (with the binary function) all perform similarly, and Aggregator is found to be as efficient as all of them. For Gaussian and exponential arms, rewards are truncated into , and the variance of Gaussian distributions is fixed to for all arms, and can be known to the algorithms (the function is adapted to this one-dimensional exponential family). Figure 3 uses only Gaussian arms, with a large gap between their means and a relatively small variance, giving an “easy” problem. And Figure 4 shows a considerably harder “mixed” problem, when the distributions are no longer in the same one-dimensional exponential family and so the Lai & Robbins’ lower-bound no longer holds (even if there still exists a lower-bound).

For each of the 4 problems considered, the Aggregator algorithm with default option (broadcast loss to all players) is the best of all the aggregation algorithms, and its regret is very close to the best of the aggregated algorithms. Especially in difficult problems with mixed or unknown distributions, Aggregator showed to be more efficient that Exp4 and orders of magnitude better than the other reference aggregation algorithms LearnExp and CORRAL (see Figures Figure 4 and Figure 5).

6Theoretical Guarantees

The Aggregator does not have satisfying theoretical guarantees in terms of regret yet, unlike many bandit algorithms. Another notion, the adversarial regret, denoted by , measures the difference in terms of rewards, between the aggregation algorithm and the best aggregated algorithm . This is in contrast with the (classical) regret, which measure the difference with the best fixed-arm strategy (Eq. ). Thus, even if the aggregated algorithms have logarithmic (classical) regret, having an adversarial regret scaling as does not permit to exhibit a logarithmic (classical) regret for the aggregation algorithm. Under some additional hypotheses, [?] proves that Exp4 satisfies a bound on adversarial regret, , with the good choice of the learning rate sequence . Our proposed algorithm follows quite closely the architecture of Exp4, and a similar bound for Aggregator is expected to hold.

This would be a first theoretical guarantee, but not satisfactory as simple algorithms like UCB have regrets scaling as [?], not . Regret bounds in several different settings are proved for the CORRAL algorithm [?], but no logarithmic upper-bound can be obtained from their technique, even in the simplest setting of stochastic bandits. However, Aggregator always seems to have a (finite-horizon) logarithmic regret in all the experiments we performed, for both Bernoulli and non-Bernoulli problems (e.g., Gaussian, exponential and Poisson distributions). Further theoretical developments are left as future work.


We presented the use of aggregation algorithms in the context of Opportunistic Spectrum Access for Cognitive Radio, especially for the real-world setting of unknown problem instances, when tuning parameters before-hand is no longer possible and an adaptive algorithm is preferable. Our proposed Aggregator was presented in details, and we also highlighted its differences with Exp4.

We realized experiments on simple MAB problems already used in the community of bandit algorithms for OSA [?], and the simulations results showed that Aggregator works as expected, being able to identify on the fly the best algorithm to trust for a specific problem. Experiments on problems mixing different families of distributions were also presented, with similar conclusions in favor of Aggregator. It is not presented in this article, but our proposed algorithm also works well in dynamic scenarios, in which the distribution of the arms can change abruptly at some time, and appears to be more robust than simple non-aggregated algorithms. Exp4 has theoretical guarantees in terms of adversarial regret, and even if the same result could hold for Aggregator, results in terms of classical regret are yet to be proved. Empirically, Aggregator showed to always have a logarithmic regret if it aggregates algorithms with logarithmic regrets (like UCB, -, Thompson sampling, BayesUCB etc). It usually succeeds to be close to the best of the aggregated algorithms, both in term of regret and best arm pull frequency. As expected, the Aggregator is never able to outperform any of the aggregated algorithms, but this was an over-optimistic goal. What matters the most is that, empirically, Aggregator is able to quickly discover on the fly the best algorithms to trust, and then performs almost as well as if it was following it from the beginning.

Our Aggregator algorithm can probably be rewritten as an Online Mirror Descent, as Exp4 and CORRAL, but this does not appear useful as in the case of CORRAL the analysis cannot bring a logarithmic bound on the regret even by aggregating asymptotically optimal algorithms. We will continue investigating regret bounds for Aggregator, and other directions include possible applications to the non-stochastic case (e.g., rested or restless Markovian problems, like it was very recently studied in [?]).


The authors would like to thank Odalric-Ambrym Maillard at Inria Lille, Rémi Bonnefoi and Quentin Bodinier at CentraleSupélec Rennes, for fruitful discussions.

Note: the source code (Python or ) used for the simulations and the figures is open-sourced, at, and a full documentation is available at


  1. The model is similar for other distributions, and we also experimented and tested our proposal Aggregator with Gaussian, exponential and Poisson distributions, with unbounded or finite in support, and similar conclusions were observed. Non-discrete rewards are interpreted as a relative communication efficiency, but we do not cover this aspect here.
  2. Similar behaviors are observed for any not-too-large values of , we tried up-to and the same results were obtained.