Large Language Model-Enhanced Multi-Armed Bandits (2502.01118v1)

Published 3 Feb 2025 in cs.LG and cs.AI

Abstract: LLMs have been adopted to solve sequential decision-making tasks such as multi-armed bandits (MAB), in which an LLM is directly instructed to select the arms to pull in every iteration. However, this paradigm of direct arm selection using LLMs has been shown to be suboptimal in many MAB tasks. Therefore, we propose an alternative approach which combines the strengths of classical MAB and LLMs. Specifically, we adopt a classical MAB algorithm as the high-level framework and leverage the strong in-context learning capability of LLMs to perform the sub-task of reward prediction. Firstly, we incorporate the LLM-based reward predictor into the classical Thompson sampling (TS) algorithm and adopt a decaying schedule for the LLM temperature to ensure a transition from exploration to exploitation. Next, we incorporate the LLM-based reward predictor (with a temperature of 0) into a regression oracle-based MAB algorithm equipped with an explicit exploration mechanism. We also extend our TS-based algorithm to dueling bandits where only the preference feedback between pairs of arms is available, which requires non-trivial algorithmic modifications. We conduct empirical evaluations using both synthetic MAB tasks and experiments designed using real-world text datasets, in which the results show that our algorithms consistently outperform previous baseline methods based on direct arm selection. Interestingly, we also demonstrate that in challenging tasks where the arms lack semantic meanings that can be exploited by the LLM, our approach achieves considerably better performance than LLM-based direct arm selection.

Summary

  • The paper introduces hybrid algorithms that combine classical MAB methods with LLM-based prediction to improve reward estimation.
  • The paper details three novel variants—TS-LLM, RO-LLM, and TS-LLM-DB—each pairing LLM-based prediction with a tailored exploration strategy (a decaying temperature schedule or an explicit exploration mechanism).
  • The paper demonstrates, through experiments on synthetic and real-world tasks, lower cumulative regret than direct LLM arm selection.

This paper addresses the limitations of directly using LLMs for arm selection in Multi-Armed Bandit (MAB) problems, an approach previously shown to be suboptimal in many scenarios. Instead of replacing classical MAB algorithms with LLMs, the authors propose enhancing them by leveraging LLMs' strong in-context learning capabilities for specific sub-tasks, particularly reward or preference prediction.

The core idea is to use a classical MAB algorithm as the high-level decision-making framework while employing an LLM as a powerful, flexible predictor within that framework. This allows the algorithms to benefit from the theoretical guarantees and exploration/exploitation strategies of classical methods while using the LLM to handle complex, unknown reward functions without requiring explicit function class specification or fine-tuning.

The paper introduces three specific LLM-enhanced algorithms (illustrative sketches of each follow the list):

  1. Thompson Sampling with LLM (TS-LLM): For classical stochastic MAB.
    • Implementation: In each iteration, the LLM predicts the reward for each arm based on the history of observed arm features and rewards. The arm with the highest predicted reward is selected.
    • Exploration/Exploitation: Relies on the inherent randomness in LLM outputs for exploration. A decaying temperature schedule for the LLM is used to transition from high exploration (large temperature) in initial stages to high exploitation (low temperature) later on.
    • Practical Details: The prompt includes the history of (arm feature, reward) pairs together with the feature of the arm whose reward is to be predicted, and a fixed output format is specified so that the predicted reward value can be parsed from the LLM's response.
  2. Regression Oracle-Based Bandit with LLM (RO-LLM): Also for classical stochastic MAB, based on the SquareCB algorithm.
    • Implementation: The LLM predicts the loss (negated reward) for each arm based on the history. The arm with the minimum predicted loss ($j_t$) is identified. A sampling distribution $p_t$ over arms is constructed from the predicted losses, where arms predicted to have lower loss (especially $j_t$) have higher sampling probability. The next arm is sampled from $p_t$.
    • Exploration/Exploitation: Uses an explicit exploration mechanism embedded in the sampling distribution $p_t$, controlled by the parameter $\gamma$.
    • Practical Details: The LLM temperature is set to 0 to obtain deterministic loss predictions, as exploration is handled explicitly by the algorithm. The prompt structure is similar to TS-LLM but uses losses.
  3. Thompson Sampling with LLM for Dueling Bandits (TS-LLM-DB): For dueling bandits where binary preference feedback between arm pairs is observed.
    • Implementation: The LLM predicts the probability that one arm is preferred over another based on the history of (arm pair features, preference feedback).
    • Arm Pair Selection: The first arm ($i_{t,1}$) is selected by approximately maximizing the Borda function, estimated for each arm $i$ by averaging the LLM's predicted probability that arm $i$ is preferred over each of $N$ uniformly sampled arms. The second arm ($i_{t,2}$) is selected by maximizing the predicted probability that arm $j$ is preferred over the chosen first arm $i_{t,1}$.
    • Practical Details: Features for the LLM prompt are either the difference ($x_i - x_j$) or the concatenation ($[x_i, x_j]$) of the arm features, depending on the suspected form (linear/non-linear) of the latent reward function. A decaying temperature schedule is used, potentially with lower temperatures for selecting the first arm (exploitation-focused) than the second (balancing exploration/exploitation). A larger $N$ improves the quality of the Borda function approximation but increases the number of LLM calls and hence the cost.
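
A minimal sketch of the TS-LLM loop (item 1 above) is given below. It is not the authors' implementation: `query_llm_reward` and `pull_arm` are hypothetical placeholders, and the exponential temperature decay is just one plausible instance of the paper's decaying schedule.

```python
import math

def query_llm_reward(history, arm_feature, temperature):
    """Hypothetical wrapper: formats the (arm feature, reward) history and
    the candidate arm's feature into a prompt, samples the LLM at the given
    temperature, and parses a scalar reward prediction from the response."""
    raise NotImplementedError  # depends on the chosen LLM API

def ts_llm(arms, pull_arm, horizon, temp0=1.0, decay=0.05):
    """TS-LLM sketch: each round, predict every arm's reward from the history
    and pull the arm with the highest prediction. Exploration comes from the
    randomness of LLM sampling at a decaying temperature."""
    history = []  # list of (arm_feature, observed_reward) pairs
    for t in range(horizon):
        # High temperature early (exploration), low temperature later (exploitation).
        temperature = temp0 * math.exp(-decay * t)
        predictions = [query_llm_reward(history, x, temperature) for x in arms]
        chosen = max(range(len(arms)), key=lambda i: predictions[i])
        reward = pull_arm(chosen)  # environment feedback for the pulled arm
        history.append((arms[chosen], reward))
    return history
```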
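
The inverse-gap weighting behind RO-LLM (item 2) can be sketched as follows, assuming a SquareCB-style sampling rule over LLM-predicted losses; `query_llm_loss` is again a hypothetical wrapper, and the exact form of the weights is an assumption rather than the paper's code.

```python
import random

def query_llm_loss(history, arm_feature):
    """Hypothetical wrapper: prompts the LLM at temperature 0 with the
    (arm feature, loss) history and returns a deterministic loss estimate."""
    raise NotImplementedError

def ro_llm_step(history, arms, gamma):
    """One RO-LLM round: arms with lower predicted loss get higher sampling
    probability; gamma controls how sharply probability concentrates on the
    predicted-best arm j_t (in this sketch, larger gamma means less exploration)."""
    k = len(arms)
    losses = [query_llm_loss(history, x) for x in arms]
    j_t = min(range(k), key=lambda i: losses[i])  # arm with minimum predicted loss
    probs = [0.0] * k
    for i in range(k):
        if i != j_t:
            # Inverse-gap weighting: probability shrinks with the predicted loss gap.
            probs[i] = 1.0 / (k + gamma * (losses[i] - losses[j_t]))
    probs[j_t] = 1.0 - sum(probs)  # remaining probability mass goes to j_t
    chosen = random.choices(range(k), weights=probs, k=1)[0]
    return chosen, probs
```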
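
Finally, a sketch of the arm-pair selection in TS-LLM-DB (item 3), where `query_llm_pref` is a hypothetical wrapper returning the LLM's predicted probability that the first of two arms is preferred; the Monte-Carlo Borda estimate over $N$ sampled opponents follows the description above, while the concrete temperature values are assumptions.

```python
import random

def query_llm_pref(history, x_i, x_j, temperature):
    """Hypothetical wrapper: prompts the LLM with the history of
    (arm pair features, preference feedback) and returns the predicted
    probability that the arm with feature x_i is preferred over x_j."""
    raise NotImplementedError

def select_duel(history, arms, n_opponents, temp_first=0.2, temp_second=0.8):
    """TS-LLM-DB pair selection sketch: the first arm approximately maximizes
    the Borda function (average win probability against uniformly sampled
    opponents); the second arm maximizes the predicted probability of being
    preferred over the first arm."""
    k = len(arms)
    opponents = [random.randrange(k) for _ in range(n_opponents)]  # N uniform samples
    borda = [
        sum(query_llm_pref(history, arms[i], arms[j], temp_first) for j in opponents)
        / n_opponents
        for i in range(k)
    ]
    first = max(range(k), key=lambda i: borda[i])
    # Second arm: the arm most likely to be preferred over the chosen first arm.
    beat_first = [
        query_llm_pref(history, arms[j], arms[first], temp_second) for j in range(k)
    ]
    second = max(range(k), key=lambda j: beat_first[j])
    return first, second
```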

The paper evaluates these algorithms through extensive experiments:

  • Synthetic MAB: Experiments with linear, square, sinusoidal, and Gaussian process reward functions demonstrate that both TS-LLM and RO-LLM achieve lower cumulative regret than baseline methods that directly instruct the LLM to select arms. TS-LLM generally performs better, which the authors attribute to its exploration via LLM output randomness, while RO-LLM shows lower variance due to its deterministic predictions.
  • Synthetic Dueling Bandits: TS-LLM-DB shows significantly better performance (lower cumulative regret of the first arm) than random search in tasks with linear and square latent reward functions.
  • Real-World Text Contextual Bandits: Using OneShotWikiLinks (semantic arm features) and AmazonCat-13K (non-semantic integer arm tags), TS-LLM is compared against a baseline of direct LLM arm selection adapted for contextual bandits.
    • In OneShotWikiLinks, where the arms' semantic meanings help the LLM, TS-LLM performs comparably to the direct selection baseline.
    • In AmazonCat-13K, where semantic meaning is less useful and exploration is critical, TS-LLM dramatically outperforms the direct selection baseline, especially with more arms ($K=30$). This highlights the benefit of the classical MAB exploration mechanism when the LLM's prior knowledge isn't sufficient.

Ablation studies confirm the importance of a decaying temperature schedule in TS-LLM for balancing exploration and exploitation, the trade-off between approximation quality and cost when choosing $N$ in TS-LLM-DB, and the impact of the exploration parameter $\gamma$ in RO-LLM on the exploitation-exploration balance.

In summary, the paper proposes a practical paradigm for leveraging LLMs in sequential decision-making by integrating them into classical MAB algorithms for prediction tasks. This hybrid approach combines the predictive power of LLMs with the principled exploration/exploitation strategies of established algorithms, leading to improved performance, particularly in challenging MAB tasks where efficient exploration is crucial. Practical implementation involves careful prompt design, managing LLM temperature schedules, and considering the computational cost of LLM calls.
