An Overview of Contextual Dueling Bandits
The paper "Contextual Dueling Bandits" introduces an innovative framework to solve decision-making problems using partial feedback, a natural scenario in many real-world contexts like information retrieval and recommendation systems. The problem domain tackled is that of contextual dueling bandits, which extends traditional bandit problems to involve contextual information and relative comparisons instead of absolute feedback. Specifically, the researchers integrate a contextual approach into the dueling bandit framework, thereby expanding on the model initially proposed by Yue et al.
Key Contributions
The principal contribution of this work is the introduction and formalization of the von Neumann winner. This game-theoretic solution concept is a randomized policy that, when its action is compared against the action of any other policy, wins with probability at least 1/2 in expectation. Unlike the traditional Condorcet winner, which may fail to exist when preferences are non-transitive (for example, when they cycle as in rock-paper-scissors), the von Neumann winner always exists, offering a more practical and robust solution.
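To make the concept concrete, the sketch below computes a von Neumann winner of a small preference matrix by linear programming. This illustrates the definition rather than any algorithm from the paper; the matrix `P`, where `P[i, j]` is the probability that action `i` beats action `j`, is a hypothetical rock-paper-scissors example with no Condorcet winner.

```python
import numpy as np
from scipy.optimize import linprog

def von_neumann_winner(P):
    """Return a distribution w over actions with (w @ P)[j] >= 1/2
    for every action j.  Such a w always exists: P - 1/2 is a
    skew-symmetric zero-sum game, whose value is zero."""
    k = P.shape[0]
    return linprog(
        c=np.zeros(k),                        # pure feasibility problem
        A_ub=-P.T, b_ub=-0.5 * np.ones(k),    # (w @ P)[j] >= 1/2 for all j
        A_eq=np.ones((1, k)), b_eq=[1.0],     # w sums to one
        bounds=[(0.0, None)] * k,             # w is nonnegative
    ).x

# Cyclic preferences (rock-paper-scissors): action 0 beats 1,
# 1 beats 2, and 2 beats 0, so no action beats every other and
# no Condorcet winner exists.
P = np.array([[0.5, 1.0, 0.0],
              [0.0, 0.5, 1.0],
              [1.0, 0.0, 0.5]])
print(von_neumann_winner(P))  # the uniform mix, approx. [1/3, 1/3, 1/3]
```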
The paper presents three algorithms for approximating a von Neumann winner in the contextual dueling bandit framework:
- Sparring with Exp4: Two independently running copies of the Exp4 contextual bandit algorithm play against each other, each choosing one side of the duel. This approach enjoys strong regret guarantees even against adversarial data, but it requires time and space linear in the size of the policy space (see the sparring sketch after this list).
- Follow-the-Perturbed-Leader (FPL): This approach extends the FPL algorithm to compute an approximate von Neumann winner, gaining computational efficiency by operating on a reduced game matrix rather than the full policy space (a sketch of the core FPL mechanic also follows the list).
- Projected Gradient Descent (PGD): This algorithm applies online projected gradient descent to the game-theoretic optimization problem of finding a von Neumann winner over large policy spaces. Given access to a classification oracle, it achieves low regret with time and space complexity logarithmic in the size of the policy space (a minimal sketch follows the list).
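The following is a minimal sketch of the sparring idea. To keep it self-contained, it uses Exp3 over k fixed arms in place of Exp4 over a policy class, so contexts are ignored; the class, the learning rate, and the duel setup are illustrative assumptions, not the paper's construction.

```python
import numpy as np

class Exp3:
    """Minimal Exp3 learner, standing in for Exp4: each 'policy'
    here is simply one of k fixed arms."""
    def __init__(self, k, gamma=0.07, rng=None):
        self.k, self.gamma = k, gamma
        self.log_w = np.zeros(k)              # log-weights for stability
        self.rng = rng or np.random.default_rng()

    def act(self):
        w = np.exp(self.log_w - self.log_w.max())
        self.p = (1 - self.gamma) * w / w.sum() + self.gamma / self.k
        self.arm = self.rng.choice(self.k, p=self.p)
        return self.arm

    def update(self, reward):
        # Importance-weighted estimate: only the played arm is credited.
        self.log_w[self.arm] += self.gamma * reward / (self.k * self.p[self.arm])

def spar(P, T=50_000, seed=0):
    """Two independent Exp3 copies duel; each round's winner gets reward 1.
    In a zero-sum game, the empirical play frequencies of two no-regret
    learners converge to an equilibrium, here a von Neumann winner."""
    k = P.shape[0]
    rng = np.random.default_rng(seed)
    A, B = Exp3(k, rng=rng), Exp3(k, rng=rng)
    plays = np.zeros(k)
    for _ in range(T):
        a, b = A.act(), B.act()
        a_wins = rng.random() < P[a, b]       # stochastic duel outcome
        A.update(float(a_wins))
        B.update(float(not a_wins))
        plays[a] += 1
    return plays / T                          # row player's empirical mix
```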
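The paper's FPL-based algorithm operates on a reduced game over policies; the sketch below shows only the core Follow-the-Perturbed-Leader mechanic in self-play on an explicit game matrix. The perturbation schedule and the self-play setup are illustrative assumptions.

```python
import numpy as np

def fpl_von_neumann(P, T=50_000, seed=0):
    """Self-play with Follow-the-Perturbed-Leader on the matrix game P.
    Each side plays a best response to its cumulative payoffs plus fresh
    random noise; the row player's empirical play frequencies then
    approximate a von Neumann winner."""
    k = P.shape[0]
    rng = np.random.default_rng(seed)
    row_payoff = np.zeros(k)      # cumulative payoff of each row action
    col_payoff = np.zeros(k)      # cumulative payoff of each column action
    plays = np.zeros(k)
    for t in range(1, T + 1):
        scale = np.sqrt(t)        # illustrative perturbation schedule
        i = np.argmax(row_payoff + rng.exponential(scale, size=k))
        j = np.argmax(col_payoff + rng.exponential(scale, size=k))
        row_payoff += P[:, j]         # how every row fares vs. the column's pick
        col_payoff += 1.0 - P[i, :]   # the column wins whenever the row loses
        plays[i] += 1
    return plays / T
```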
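Finally, here is a minimal sketch of the projected-gradient idea on an explicit game matrix. The paper's algorithm works in the policy space through a classification oracle; this version simply ascends the worst-case duel value of a mixture `w` directly, using a standard Euclidean projection onto the simplex. Step sizes and iteration counts are illustrative.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * idx > css - 1)[0][-1]
    theta = (css[rho] - 1) / (rho + 1)
    return np.maximum(v - theta, 0.0)

def pgd_von_neumann(P, steps=5_000, eta=0.5):
    """Projected (sub)gradient ascent on f(w) = min_j (w @ P)[j].
    f is concave, and its maximum value 1/2 is attained exactly
    at a von Neumann winner."""
    k = P.shape[0]
    w = np.ones(k) / k
    for t in range(1, steps + 1):
        j = np.argmin(w @ P)      # the column w is currently worst against
        w = project_simplex(w + eta / np.sqrt(t) * P[:, j])
    return w
```

On the rock-paper-scissors matrix above, all three sketches drift toward the uniform mixture, matching the linear-programming solution.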
Implications
The implications of these findings are both practical and theoretical. Practically, the proposed algorithms have the potential to significantly improve systems that rely on feedback through pairwise comparisons, such as web search engines and recommender systems. Theoretically, the paper broadens the understanding of dueling bandit problems by moving beyond the Condorcet assumption, thereby accommodating more realistic and complex preference scenarios.
Furthermore, integrating game-theoretic concepts into the bandit literature through the notion of the von Neumann winner highlights an innovative cross-disciplinary approach. The proposed methods emphasize computational efficiency, making them suitable for scenarios with vast policy spaces and limited feedback.
Future Directions
Future research could explore extending these algorithms to even more complex settings, such as incorporating dynamic or evolving contexts. Moreover, investigating the deployment of these methods in large-scale, real-world applications will be crucial for assessing scalability and versatility. Enhancements in classification oracle implementations may also yield further improvements in handling large policy spaces.
The paper's algorithms also open avenues for blending reinforcement learning and bandit theories to develop models capable of approximating solutions in uncertain environments—a notion that holds considerable promise for advancements in artificial intelligence and machine learning.
In summary, the work on Contextual Dueling Bandits presents a sophisticated and well-founded advancement in the area of bandit learning with relative feedback, with significant implications for theory and practice in dynamic decision-making scenarios.