% \documentclass[anon,12pt]{article}  % Anonymized submission
\documentclass[12pt]{article}         % Include author names
% http://www.jmlr.org/format/format.html
\usepackage{jmlr2e}
% The following packages will be automatically loaded:
% amsmath, amssymb, natbib, graphicx, url, algorithm2e
\usepackage{macrosArticle}
\usepackage{macrosText}
\usepackage{times}
\usepackage{array}
\usepackage{float}

% -----------------------------------------------------------------
% Heading arguments are {volume}{year}{pages}{submitted}{published}{author-full-names}
% \jmlrheading{20}{2018}{1-48}{10/17}{04/18}{besson18a}{Lilian Besson and Emilie Kaufmann}
\jmlrshortheading{2018}{Lilian Besson and Emilie Kaufmann}

\firstpageno{1}  % the page number you are assigned to start with by the editor.

\begin{document}

\title{Structure and Sparsity of Stochastic Multi-Armed Bandits}

% -----------------------------------------------------------------
\editor{Kevin Murphy and Bernhard Sch{\"o}lkopf}  % FIXME uncomment the \editor part in the jmlr2e.sty

\author{\name Lilian Besson \email {Lilian}{.}{Besson}{@}{CentraleSupelec}{.}{fr} \\
\addr CentraleSup\'elec (campus of Rennes), IETR, SCEE Team,\\
Avenue de la Boulaie -- CS $47601$, F-$35576$ Cesson-S\'evign\'e, France
\AND
\name Emilie Kaufmann \email {emilie}{.}{kaufmann}{@}{univ}{-}{lille1}{.}{fr} \\
\addr CNRS \& Universit\'e de Lille, Inria SequeL team\\
UMR 9189 -- CRIStAL, F-$59000$ Lille, France
% \AND
% \name Christophe Moy \email {Christophe}{.}{Moy}{@}{Univ}-{Rennes1}{.}{fr} \\
% \addr Universit\'e de Rennes 1, IETR, SCEE Team,\\
% 263 ave. du G\'en\'eral Leclerc -- CS $74205$, F-$35042$ Rennes, France
}

\date{}  % remove date

\maketitle

% -----------------------------------------------------------------
\begin{abstract}%   <- trailing '%' for backward compatibility of .sty file
% DEADLINE
% - Paper submission deadline: 15 June 2018, 12am CET
% \verb+https://ewrl.wordpress.com/ewrl14-2018/#submit+
The Doubling Trick is a standard way to turn a bandit algorithm that relies on the knowledge of its horizon $T$ into an \emph{anytime} algorithm: the base algorithm is restarted on a sequence of increasing horizons. Motivated by the frequent use of this technique and the absence of a generic study of its effect on regret guarantees, we investigate two families of doubling sequences, geometric and exponential. We show how algorithms enjoying regret bounds of the generic form $c\,T^{\gamma}(\log T)^{\delta}$ can be turned into anytime algorithms with similar guarantees, at the price of a constant multiplicative overhead, and we show that geometric doubling should be avoided for stochastic problems. Numerical experiments illustrate these findings.
\end{abstract}

% -----------------------------------------------------------------
\begin{keywords}
Multi-Armed Bandits; Reinforcement Learning; Online Learning; Anytime Algorithms; Doubling Trick
\end{keywords}

% \iffalse
% \FIXME{Bring back table of content for the HAL version?}
\tableofcontents
% \fi

% -----------------------------------------------------------------
\section{Introduction}\label{sec:introduction}

Multi-Armed Bandit (MAB) problems are well-studied sequential decision-making problems in which an agent repeatedly chooses an action (the ``arm'' of a one-armed bandit) in order to maximize some total reward \citep{Robbins52,LaiRobbins85}. Initial motivation for their study came from the modeling of clinical trials, as early as 1933 with the seminal work of \cite{Thompson33}. In this example, arms correspond to different treatments with unknown, random effects.
Since then, MAB models have proved useful for many more applications, ranging from cognitive radio \citep{Jouini09} to online content optimization (\eg, news article recommendation \citep{Li10contextual}, online advertising \citep{LiChapelle11}, A/B testing \citep{Kaufmann14,Jamieson17ABTest}) or portfolio optimization \citep{Sani2012risk}.
While the number of patients involved in a clinical study (and thus the number of treatments to select) is often decided in advance, in other contexts the total number of decisions to make (the horizon $T$) is unknown. It may correspond to the total number of visitors of a website optimizing its displays for a certain period of time, or to the number of attempted communications in a smart radio device. In such cases, it is thus crucial to devise \emph{anytime algorithms}, that is, algorithms that do not rely on the knowledge of this horizon $T$ to sequentially select arms.
A general way to turn any base algorithm into an anytime algorithm is the use of the so-called Doubling Trick, first proposed by \cite{auer1995gambling}, which consists in repeatedly running the base algorithm with increasing horizons. Motivated by the frequent use of this technique and the absence of a generic study of its effect on the algorithm's efficiency, this paper investigates in detail two families of doubling sequences (geometric and exponential), and shows that the former should be avoided for stochastic problems.

More formally, a MAB model is a set of $K$ arms, each arm $k$ being associated with an (unknown) \emph{reward stream} $(Y_{k,t})_{t\in\N}$. Fix a finite (possibly unknown) horizon $T$. At each time step $t \in \{1,\dots,T\}$, an agent selects an arm $A(t) \in \{1,\dots,K\}$ and receives as a reward the current value of the associated reward stream, $r(t) := Y_{A(t), t}$. The agent's decision strategy (or \emph{bandit algorithm}) $\cA_T := (A(t), t\in \{1,\dots,T\})$ is such that $A(t)$ can only rely on the past observations
% $\cF_{t-1} = (A(1),r(1),\dots,A(t-1),r(t-1))$,
$A(1),r(1),\dots,A(t-1),r(t-1)$, on external randomness and (possibly) on the knowledge of the horizon $T$. The objective is to design an algorithm $\cA$ that maximizes the expected cumulated reward, where the expectation is taken over the possible randomness used by the algorithm and the possible randomness in the generation of the reward streams. In the oblivious case, in which the reward streams are independent of the algorithm's choices, this is equivalent to minimizing the \emph{regret}, defined as
\begin{equation}\label{eq:regret}
R_T(\cA_T) := \max_{k \in \{1,\dots,K\}}
% \mu^* T - \E_{\mu}\left[\sum_{t=1}^{T} r(t)\right]
\E\left[\sum_{t=1}^{T} \left(Y_{k,t} - Y_{A(t),t}\right)\right].
% = \mu^* T - \sum_{k=1}^K \mu_k \E_{\mu}\left[ T_k(T) \right]
\end{equation}
This quantity, referred to as \emph{pseudo-regret} in \cite{Bubeck12}, quantifies the difference between the expected cumulated reward of the best fixed action and that of the strategy $\cA_T$.

For the general adversarial bandit problem \citep{Auer02NonStochastic}, in which the reward streams are arbitrary (picked by an adversary), a \emph{worst-case} lower bound has been given: for every algorithm, there exist (stochastic) reward streams such that the regret is larger than $(1/20)\sqrt{KT}$ \citep{Auer02NonStochastic}.
% \citep[Th.5.1]{Auer02NonStochastic}.
Besides, the \ExpThree{} algorithm has been shown to have a regret of order $\sqrt{KT\log(K)}$.
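To make the interaction protocol and the pseudo-regret \eqref{eq:regret} concrete, the following short Python sketch simulates a stochastic bandit with Bernoulli arms. It is only an illustration, independent of the \texttt{SMPyBandits} library used for our experiments; the policy (a naive $\varepsilon$-greedy baseline, not one of the algorithms discussed in this paper) and all names are purely illustrative.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1234)

class EpsilonGreedy:
    """Simple baseline policy, only used to illustrate the protocol."""
    def __init__(self, nb_arms, epsilon=0.1):
        self.K, self.epsilon = nb_arms, epsilon
        self.counts = np.zeros(nb_arms)
        self.sums = np.zeros(nb_arms)

    def choice(self):
        # Explore at random with probability epsilon,
        # or while some arm has never been pulled.
        if rng.random() < self.epsilon or self.counts.min() == 0:
            return int(rng.integers(self.K))
        return int(np.argmax(self.sums / self.counts))  # exploit

    def get_reward(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward

def run(policy, means, T):
    """Play T rounds against Bernoulli arms of the given means and return
    mu^* T - sum_t mu_{A(t)}, whose expectation (for i.i.d. reward
    streams) is the pseudo-regret defined above."""
    sum_of_means = 0.0
    for t in range(T):
        arm = policy.choice()                      # A(t): based on the past only
        reward = float(rng.random() < means[arm])  # r(t) = Y_{A(t), t}
        policy.get_reward(arm, reward)
        sum_of_means += means[arm]
    return max(means) * T - sum_of_means

print(run(EpsilonGreedy(3), means=[0.1, 0.5, 0.6], T=10000))
\end{verbatim}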
Much smaller regret may be obtained in \emph{stochastic} MAB models, in which the reward stream from each arm $k$ is assumed to be \iid{} from some (unknown) distribution $\nu_k$ with mean $\mu_k$. In that case, various algorithms have been proposed with \emph{problem-dependent} regret upper bounds of the form $C(\bm\nu) \log(T)$, where $C(\bm\nu)$ is a constant that only depends on the arm distributions. Different assumptions on the arm distributions lead to different problem-dependent constants. In particular, under some parametric assumptions (\eg, Gaussian distributions, exponential families), \emph{asymptotically optimal} algorithms have been proposed and analyzed (\eg, \KLUCB{} \citep{KLUCBJournal} or Thompson sampling \citep{AgrawalGoyal11,Kaufmann12Thompson}), for which the constant $C(\bm\nu)$ obtained in the regret upper bound matches exactly that of the lower bound given by \cite{LaiRobbins85}. Under the non-parametric assumption that the $\nu_k$ are bounded in $[0,1]$, the regret of the \UCBone{} algorithm \citep{Auer02} is of the above form with $C(\bm\nu) = 8 \times \sum_{k : \mu_k < \mu^*} (\mu^* - \mu_k)^{-1}$, where $\mu^* = \max_k \mu_k$ is the mean of the best arm.
%\UCBone{} is referred to as \emph{order-optimal} as its regret has, up to a multiplicative constant, the right dependency on the gaps.
As in this last example, all the available constants $C(\bm\nu)$ become very large on ``hard'' instances, in which some arms are very close to the best arm. On such instances, $C(\bm\nu)\log(T)$ may be much larger than the worst-case $(1/20)\sqrt{KT}$, and distribution-independent guarantees may actually be preferred. The \MOSS{} algorithm, proposed by \cite{Audibert2009minimax}, is the first stochastic bandit algorithm to enjoy a problem-dependent logarithmic regret and to be optimal in a \emph{minimax} sense, as its regret is proved to be of order $\sqrt{KT}$ for bandit models with rewards in $[0,1]$. However, the corresponding constant $C(\bm \nu)$ is proportional to $K/\Delta_{\min}$, where $\Delta_{\min} = \min_{k : \mu_k < \mu^*} (\mu^*-\mu_k)$ is the minimal gap, which worsens the constant of \UCBone{}. Another drawback of \MOSS{} is that it is \emph{not} anytime.
These two shortcomings have been overcome recently in two different works. On the one hand, the \MOSSAnytime{} algorithm \citep{Degenne16} is minimax optimal and anytime, but its problem-dependent regret does not improve on that of \MOSS. On the other hand, the \KLUCBpp{} algorithm \citep{Menard17} is simultaneously minimax optimal and asymptotically optimal (\ie, it has the best problem-dependent constant $C(\bm\nu)$), but it is not anytime. A natural question is thus to know whether a Doubling Trick could overcome this limitation. This question is the starting point of our comprehensive study of the Doubling Trick: can a single Doubling Trick be used to preserve both problem-dependent (logarithmic) regret and minimax (square-root) regret? We answer this question partially, by showing that two different types of Doubling Trick may actually be needed.

In this paper, we investigate how algorithms enjoying regret guarantees of the generic form
\begin{equation}\label{eq:genericWithLogT}
\forall T \geq 1, \;\;\;\; R_T(\cA_T) \leq c \, T^{\gamma} (\log T)^{\delta} + o\!\left( T^{\gamma} (\log T)^{\delta} \right)
\end{equation}
may be turned into anytime algorithms enjoying \emph{similar} regret guarantees with an appropriate Doubling Trick.
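To illustrate the Doubling Trick itself, the sketch below wraps a base policy and restarts it from scratch on a sequence of increasing horizons $T_0 < T_1 < T_2 < \dots$ Again, this is only an illustrative sketch and not the implementation used in our experiments: all names are ours, the base policy shown (\UCBone{}) happens to ignore its horizon argument, whereas a horizon-dependent algorithm such as \MOSS{} or \KLUCBpp{} would use it, and the two \texttt{next\_horizon} choices at the end correspond respectively to a geometric sequence ($T_i = 2\,T_{i-1}$) and to one possible ``faster'' exponential sequence ($T_i = T_{i-1}^2$).
\begin{verbatim}
import numpy as np

class UCB1:
    """Example base policy (UCB1); a horizon-dependent base policy
    (e.g. MOSS) would actually use the horizon argument."""
    def __init__(self, nb_arms, horizon=None):
        self.K, self.horizon, self.t = nb_arms, horizon, 0
        self.counts = np.zeros(nb_arms)
        self.sums = np.zeros(nb_arms)

    def choice(self):
        self.t += 1
        if self.t <= self.K:                 # pull each arm once first
            return self.t - 1
        means = self.sums / self.counts
        bonus = np.sqrt(2 * np.log(self.t) / self.counts)
        return int(np.argmax(means + bonus))

    def get_reward(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward

class DoublingTrick:
    """Anytime wrapper: restart the base policy (forgetting its history)
    whenever the current phase of length T_i is over."""
    def __init__(self, nb_arms, base=UCB1, T0=100,
                 next_horizon=lambda T_i: 2 * T_i):
        self.K, self.base, self.next_horizon = nb_arms, base, next_horizon
        self.T_i, self.t_in_phase = T0, 0
        self.policy = base(nb_arms, horizon=T0)

    def choice(self):
        if self.t_in_phase >= self.T_i:      # phase over: new, larger horizon
            self.T_i = self.next_horizon(self.T_i)
            self.t_in_phase = 0
            self.policy = self.base(self.K, horizon=self.T_i)
        self.t_in_phase += 1
        return self.policy.choice()

    def get_reward(self, arm, reward):
        self.policy.get_reward(arm, reward)

# Geometric doubling (T_i = 2 T_{i-1}) vs. one "faster" exponential doubling.
geometric   = DoublingTrick(9, next_horizon=lambda T_i: 2 * T_i)
exponential = DoublingTrick(9, next_horizon=lambda T_i: T_i ** 2)
\end{verbatim}
The wrapper sketched here forgets all past observations at each restart; keeping them instead yields a natural heuristic variant.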
% This does not come for free, and we exhibit a ``price of Doubling Trick'', that is a constant factor larger than $1$, referred to as a \emph{constant multiplicative overhead}.

% Outline of this article
%\paragraph{Outline}
The rest of the paper is organized as follows. The Doubling Trick is formally defined in Section~\ref{sec:genericAlgo}, along with a generic tool for its analysis. In Section~\ref{sec:geometricDoublingTrick}, we present upper and lower bounds on the regret of algorithms to which a geometric Doubling Trick is applied. Section~\ref{sec:exponentialDoublingTrick} investigates the regret guarantees that can be obtained with a ``faster'' exponential Doubling Trick. Experimental results are then reported in Section~\ref{sec:experiments}. Complementary elements of proof are deferred to the appendix.
% We give missing proofs and additional results in Appendix~\ref{sec:missingproofs}, a heuristic $\DTnr$ improving $\DTr$ in App.~\ref{sec:DTnr} along with more experimental results in App.~\ref{sec:otherExperiments}, and other technical tools in App.~\ref{sec:otherproofs}.

% -----------------------------------------------------------------
\section{The Doubling Trick}\label{sec:genericAlgo}

TODO

% -----------------------------------------------------------------
\section{Geometric Doubling Trick}\label{sec:geometricDoublingTrick}

TODO

% -----------------------------------------------------------------
\section{Exponential Doubling Trick}\label{sec:exponentialDoublingTrick}

TODO

% -----------------------------------------------------------------
\section{Numerical Experiments}\label{sec:experiments}

TODO

% -----------------------------------------------------------------
\section{Conclusion}\label{sec:conclusion}

TODO
% \FIXME{More future work?}

% % % % Regular plots of centralized regrets % % %
% \begin{figure}[!h]
% \centering
% % \begin{subfigure}[!h]{0.49\textwidth}
% \includegraphics[width=0.93\textwidth]{figures/main____env1-1_2223860464453456415.pdf}
% % \end{subfigure}
% % % ~
% % \begin{subfigure}[!h]{0.49\textwidth}
% % \includegraphics[width=1.00\textwidth]{XXX.pdf}
% % \end{subfigure}
% \caption{Regret for $K=9$ Gaussian arms $\cN(\mu, 1)$, horizon $T=45678$, $n=1000$ repetitions, and $\boldsymbol{\mu}$ taken uniformly in $[-5,5]^K$ with variance $V=1$. On ``hard'' problems like this one, both \UCB{} and \AFHG{} perform similarly and poorly \emph{w.r.t.} the lower bound from \cite{LaiRobbins85}. As before, we check that geometric doubling ($b=2$) and slow exponential doubling ($b=1.1$) are too slow, but a fast enough doubling sequence does give reasonable performance for the anytime \AFHG{} obtained by Doubling Trick.}
% \label{fig:gaussianBandits_DoublingTrick_Restart_fixedProblem}
% % \vspace*{-15pt}  % XXX remove if problem
% \end{figure}
% %

% \FIXME{For a newpage to not have the references starting in an ugly place}
\clearpage
\newpage

% -----------------------------------------------------------------
% Acknowledgments---Will not appear in anonymized version
\acks{
This work is supported by the French National Research Agency (ANR), under the project BADASS (grant \texttt{ANR-16-CE40-0002}),
% by CentraleSup\'elec,
% by the CNRS, under the PEPS project BIO,
by the French Ministry of Higher Education and Research (MENESR) and ENS Paris-Saclay.
%
% Thanks to Odalric-Ambrym Maillard and Florian Strub at Inria Lille for useful discussions.
%
% The authors wish to thank the anonymous reviewers for their valuable comments that helped to improve the paper.
}

% -----------------------------------------------------------------
% \phantom{\cite{Lattimore16b}}

\bibliography{sparseBandits}
% \nocite{*}  % XXX remove at the end, to only include cited references!

% \iffalse
\vskip 0.2in  % XXX Remove if problem?
\hr{}
% \vfill{}
\emph{Note}: the simulation code used for the experiments is written in Python 3.
% https://www.Python.org/
It is open-sourced at \verb|https://GitHub.com/SMPyBandits/SMPyBandits|
% but is available upon request,
and fully documented at \newline
\verb|https://SMPyBandits.GitHub.io|.
%~\citep{SMPyBandits}.
% \fi
% \hr{}

\clearpage
\newpage

% -----------------------------------------------------------------
\appendix

% -----------------------------------------------------------------
\section{Omitted Proofs}\label{sec:missingproofs}

We include here the proofs omitted from the main text.

% \hr{}

\end{document}