- The paper introduces hybrid algorithms that combine classical MAB methods with LLM-based prediction to improve reward estimation.
- The paper details three novel variants (TS-LLM, RO-LLM, and TS-LLM-DB), each pairing an LLM-based predictor with an exploration strategy tailored to its setting.
- The paper demonstrates through experiments that these variants achieve lower cumulative regret on synthetic and real-world tasks than directly instructing the LLM to select arms.
This paper addresses the limitations of directly using LLMs for arm selection in Multi-Armed Bandit (MAB) problems, an approach previously shown to be suboptimal in many scenarios. Instead of replacing classical MAB algorithms with LLMs, the authors propose enhancing them by leveraging LLMs' strong in-context learning capabilities for specific sub-tasks, particularly reward or preference prediction.
The core idea is to use a classical MAB algorithm as the high-level decision-making framework while employing an LLM as a powerful, flexible predictor within that framework. This allows the algorithms to benefit from the theoretical guarantees and exploration/exploitation strategies of classical methods while using the LLM to handle complex, unknown reward functions without requiring explicit function class specification or fine-tuning.
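As a concrete illustration of this predictor-inside-a-bandit pattern, here is a minimal sketch of what the LLM reward-prediction sub-task could look like. The prompt wording, function names, and output parsing are illustrative assumptions, not the paper's exact implementation; `llm_complete` stands in for any chat-completion call.

```python
from typing import Callable, List, Sequence, Tuple

# One history entry is an (arm feature vector, observed reward) pair.
History = List[Tuple[Sequence[float], float]]

def build_prompt(history: History, arm_feature: Sequence[float]) -> str:
    """Format past (feature, reward) pairs as in-context examples, then ask for the
    reward of the query arm. The exact wording here is a hypothetical template."""
    lines = ["Below are arm features and the rewards they produced."]
    for x, r in history:
        lines.append(f"features: {list(x)} -> reward: {r:.3f}")
    lines.append(f"features: {list(arm_feature)} -> reward:")
    lines.append("Reply with a single number only.")
    return "\n".join(lines)

def predict_reward(
    llm_complete: Callable[[str, float], str],  # (prompt, temperature) -> completion text
    history: History,
    arm_feature: Sequence[float],
    temperature: float,
) -> float:
    """Query the LLM once and parse the predicted reward from its reply."""
    reply = llm_complete(build_prompt(history, arm_feature), temperature)
    return float(reply.strip().split()[0])
```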
The paper introduces three specific LLM-enhanced algorithms:
- Thompson Sampling with LLM (TS-LLM): For classical stochastic MAB (see the TS-LLM sketch after this list).
  - Implementation: In each iteration, the LLM predicts the reward for each arm based on the history of observed arm features and rewards. The arm with the highest predicted reward is selected.
  - Exploration/Exploitation: Relies on the inherent randomness in LLM outputs for exploration. A decaying temperature schedule for the LLM is used to transition from high exploration (large temperature) in initial stages to high exploitation (low temperature) later on.
  - Practical Details: The prompt includes the history of (arm feature, reward) pairs and the current arm feature to be predicted. The output format is specified to extract the reward value.
- Regression Oracle-Based Bandit with LLM (RO-LLM): Also for classical stochastic MAB, based on the SquareCB algorithm (see the RO-LLM sketch after this list).
  - Implementation: The LLM predicts the loss (negated reward) for each arm based on history. The arm with the minimum predicted loss (j_t) is identified. A sampling distribution p_t over arms is constructed based on the predicted losses, where arms predicted to have lower loss (especially j_t) have higher sampling probability. The next arm is sampled from p_t.
  - Exploration/Exploitation: Uses an explicit exploration mechanism embedded in the sampling distribution p_t, controlled by the parameter γ.
  - Practical Details: The LLM temperature is set to 0 to obtain deterministic loss predictions, as exploration is handled explicitly by the algorithm. The prompt structure is similar to TS-LLM but uses losses.
- Thompson Sampling with LLM for Dueling Bandits (TS-LLM-DB): For dueling bandits, where binary preference feedback between arm pairs is observed (see the TS-LLM-DB sketch after this list).
  - Implementation: The LLM predicts the probability that one arm is preferred over another based on the history of (arm pair features, preference feedback).
  - Arm Pair Selection: The first arm (i_t,1) is selected by approximately maximizing the Borda function, estimated by averaging the LLM's predicted probability that arm i is preferred over each of N uniformly sampled arms. The second arm (i_t,2) is selected by maximizing the predicted probability that arm j is preferred over the chosen first arm i_t,1.
  - Practical Details: Features in the LLM prompt are either the difference (x_i − x_j) or the concatenation ([x_i, x_j]) of the two arms' features, depending on whether the latent reward function is suspected to be linear or non-linear. A decaying temperature schedule is used, potentially with lower temperatures for selecting the first arm (exploitation-focused) than the second (balancing exploration and exploitation). A larger N improves the quality of the Borda approximation but increases the number of LLM calls and thus the cost.
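A minimal sketch of the TS-LLM loop described in the first item above, assuming a reward predictor like the one sketched earlier. The linear temperature decay and the default temperatures are illustrative assumptions; the paper only specifies that the schedule decays over time.

```python
from typing import Callable, List, Sequence, Tuple

History = List[Tuple[Sequence[float], float]]

def ts_llm(
    predict_reward: Callable[[History, Sequence[float], float], float],  # LLM-backed predictor
    arm_features: List[Sequence[float]],
    pull_arm: Callable[[int], float],  # environment: returns the observed reward of an arm
    horizon: int,
    temp_start: float = 1.0,  # high temperature early -> more exploration
    temp_end: float = 0.1,    # low temperature late -> more exploitation
) -> List[int]:
    """TS-LLM: play the arm with the highest LLM-predicted reward; exploration comes
    from the LLM's own sampling randomness, shrunk over time by the temperature decay."""
    history: History = []
    chosen: List[int] = []
    for t in range(horizon):
        # Illustrative linear decay from temp_start to temp_end over the horizon.
        temperature = temp_start + (temp_end - temp_start) * t / max(horizon - 1, 1)
        predictions = [predict_reward(history, x, temperature) for x in arm_features]
        arm = max(range(len(arm_features)), key=lambda i: predictions[i])
        history.append((arm_features[arm], pull_arm(arm)))
        chosen.append(arm)
    return chosen
```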
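A minimal sketch of the RO-LLM sampling step from the second item above. It uses the standard SquareCB inverse-gap weighting; the exact constants (here the number of arms K in the denominator) are an assumption about how the paper instantiates p_t. The LLM, queried at temperature 0, supplies the predicted losses.

```python
import random
from typing import List

def squarecb_distribution(predicted_losses: List[float], gamma: float) -> List[float]:
    """Build p_t from LLM-predicted losses: arms with smaller predicted loss get more
    mass, and the greedy arm j_t receives all remaining probability."""
    k = len(predicted_losses)
    j_t = min(range(k), key=lambda i: predicted_losses[i])  # arm with minimum predicted loss
    p_t = [0.0] * k
    for i in range(k):
        if i != j_t:
            # Standard SquareCB weighting: a larger gap to j_t means a smaller probability;
            # gamma controls how sharply the gap suppresses exploration.
            p_t[i] = 1.0 / (k + gamma * (predicted_losses[i] - predicted_losses[j_t]))
    p_t[j_t] = 1.0 - sum(p_t)  # leftover mass goes to the greedy arm
    return p_t

def sample_arm(p_t: List[float]) -> int:
    """Draw the next arm from p_t."""
    return random.choices(range(len(p_t)), weights=p_t, k=1)[0]
```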
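A minimal sketch of the TS-LLM-DB arm-pair selection from the third item above. Here `predict_pref` stands for an LLM call returning the predicted probability that the first argument's arm beats the second at the given temperature; the choice of feature encoding (difference vs. concatenation) is assumed to happen inside it, and the variable names are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

def select_duel_pair(
    predict_pref: Callable[[Sequence[float], Sequence[float], float], float],  # P(arm a beats arm b)
    arm_features: List[Sequence[float]],
    n_comparisons: int,   # N: number of comparison arms used for the Borda estimate
    temp_first: float,    # lower temperature for the first arm (exploitation-focused)
    temp_second: float,   # somewhat higher temperature for the second arm
) -> Tuple[int, int]:
    """Pick (i_t1, i_t2): i_t1 approximately maximizes the Borda score, estimated by
    averaging predicted win probabilities against N uniformly sampled arms; i_t2
    maximizes the predicted probability of beating i_t1."""
    k = len(arm_features)
    comparison_arms = [random.randrange(k) for _ in range(n_comparisons)]

    def borda_estimate(i: int) -> float:
        wins = [predict_pref(arm_features[i], arm_features[c], temp_first) for c in comparison_arms]
        return sum(wins) / n_comparisons

    i_t1 = max(range(k), key=borda_estimate)
    i_t2 = max(range(k), key=lambda j: predict_pref(arm_features[j], arm_features[i_t1], temp_second))
    return i_t1, i_t2
```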
The paper evaluates these algorithms through extensive experiments:
- Synthetic MAB: Experiments with linear, square, sinusoidal, and Gaussian Process reward functions demonstrate that both TS-LLM and RO-LLM achieve lower cumulative regret compared to baseline methods that directly instruct the LLM to select arms. TS-LLM generally performs better, attributed to its exploration strategy via LLM randomness, while RO-LLM shows lower variance due to deterministic predictions.
- Synthetic Dueling Bandits: TS-LLM-DB shows significantly better performance (lower cumulative regret of the first arm) than random search in tasks with linear and square latent reward functions.
- Real-World Text Contextual Bandits: Using OneShotWikiLinks (semantic arm features) and AmazonCat-13K (non-semantic integer arm tags), TS-LLM is compared against a baseline of direct LLM arm selection adapted for contextual bandits.
  - In OneShotWikiLinks, where arm semantic meaning helps the LLM, TS-LLM performs comparably to the direct selection baseline.
  - In AmazonCat-13K, where semantic meaning is less useful and exploration is critical, TS-LLM dramatically outperforms the direct selection baseline, especially with more arms (K=30). This highlights the benefit of the classical MAB exploration mechanism when the LLM's prior knowledge isn't sufficient.
Ablation studies confirm the importance of the decaying temperature schedule in TS-LLM for balancing exploration and exploitation, illustrate the trade-off between Borda-approximation quality and LLM-call cost when choosing N in TS-LLM-DB, and show how the exploration parameter γ in RO-LLM shifts the exploitation-exploration balance.
In summary, the paper proposes a practical paradigm for leveraging LLMs in sequential decision-making by integrating them into classical MAB algorithms for prediction tasks. This hybrid approach combines the predictive power of LLMs with the principled exploration/exploitation strategies of established algorithms, leading to improved performance, particularly in challenging MAB tasks where efficient exploration is crucial. Practical implementation involves careful prompt design, managing LLM temperature schedules, and considering the computational cost of LLM calls.