Thompson Sampling for Contextual Bandits with Linear Payoffs

Shipra Agrawal,Navin Goyal

Published 2012 in International Conference on Machine Learning

ABSTRACT

Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated it to have better empirical performance compared to the state-of-the-art methods. However, many questions regarding its theoretical performance remained open. In this paper, we design and analyze a generalization of Thompson Sampling algorithm for the stochastic contextual multi-armed bandit problem with linear payoff functions, when the contexts are provided by an adaptive adversary. This is among the most important and widely studied version of the contextual bandits problem. We prove a high probability regret bound of O(d2/e√T1+e) in time T for any 0 < e < 1, where d is the dimension of each context vector and e is a parameter used by the algorithm. Our results provide the first theoretical guarantees for the contextual version of Thompson Sampling, and are close to the lower bound of Ω(d√T) for this problem. This essentially solves a COLT open problem of Chapelle and Li [COLT 2012].

PUBLICATION RECORD

  • Publication year

    2012

  • Venue

    International Conference on Machine Learning

  • Publication date

    2012-09-14

  • Fields of study

    Mathematics, Computer Science

  • Identifiers
  • External record

    Open on Semantic Scholar

  • Source metadata

    Semantic Scholar

CITATION MAP

EXTRACTION MAP

CLAIMS

  • No claims are published for this paper.

CONCEPTS

  • No concepts are published for this paper.

REFERENCES

Showing 1-32 of 32 references · Page 1 of 1

CITED BY

Showing 1-100 of 1095 citing papers · Page 1 of 11