- The paper introduces TLB and TMAB frameworks that adapt bandit problems for sequential token selection in LLM decoding.
- It provides formal regret bounds under the DDMC assumption and proposes novel algorithms, EOFUL and GreedyETC, that achieve sublinear regret.
- The work offers a theoretical foundation for the near-optimality of greedy decoding, enabling efficient LLM alignment without extensive retraining.
An Overview of Tokenized Bandit for LLM Decoding and Alignment
This paper introduces tokenized variants of the linear and multi-armed bandit problems designed for the challenges of LLM decoding and alignment. The proposed frameworks, Tokenized Linear Bandit (TLB) and Tokenized Multi-Armed Bandit (TMAB), aim to maximize the utility of generated sequences by selecting tokens one at a time in a manner aligned with user preferences, without requiring extensive retraining or updates to the underlying model.
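To make the setup concrete, here is a minimal sketch of the kind of decoding loop the TLB framework models, under the assumption that sequence utility is linear in some feature map; the vocabulary, feature map, the estimate `theta_hat`, and all other names below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Minimal sketch (assumed linear-utility model): u(sequence) ~ <theta, phi(sequence)>.
# The toy vocabulary, feature map, and parameter estimate are all illustrative.

VOCAB = ["good", "okay", "bad", "<eos>"]   # toy vocabulary
DIM = 8                                     # assumed feature dimension

def phi(sequence):
    """Hypothetical feature map from a token sequence to a fixed-length vector."""
    vec = np.zeros(DIM)
    for pos, tok in enumerate(sequence):
        vec[hash(tok) % DIM] += 1.0 / (pos + 1)
    return vec

theta_hat = np.ones(DIM)   # current estimate of the (unknown) preference vector

def decode(max_len=10):
    """Sequentially append the token that maximizes the estimated utility
    of the extended sequence under the current estimate theta_hat."""
    seq = []
    for _ in range(max_len):
        best = max(VOCAB, key=lambda tok: theta_hat @ phi(seq + [tok]))
        seq.append(best)
        if best == "<eos>":
            break
    return seq

print(decode())
```

The bandit aspect, updating `theta_hat` from noisy sequence-level feedback while controlling the regret incurred along the way, is what the TLB/TMAB formulation and analysis address.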
Key Contributions
The paper makes several significant contributions to the understanding and application of bandit problems in the context of LLMs:
- Introduction of TLB and TMAB: The authors present these tokenized variants as adaptations of standard linear and stochastic multi-armed bandit problems, tailored to the task of token selection in LLMs. The frameworks model how text outputs can be constructed token by token so that they align with human preferences while remaining computationally efficient.
- Regret Analysis: For both TLB and TMAB, the paper provides formal regret bounds under a novel structural assumption termed "Diminishing Distance with More Commons" (DDMC). This assumption posits that as the same tokens are appended to two different sequences, the gap between their utilities shrinks, which lets the algorithms exploit the shared structure of token sequences for prediction (an informal statement of the condition is sketched after this list).
- Algorithmic Solutions: The authors propose two algorithms, EOFUL for TLB and GreedyETC for TMAB, each achieving sublinear regret in its setting. EOFUL builds on optimistic confidence-bound techniques standard in the linear bandit literature, while GreedyETC combines an explore-then-commit phase with greedy token selection; both are supplemented by heuristics tailored to token-sequence construction. A hedged explore-then-commit sketch appears after this list.
- Theoretical Justification for Greedy Decoding: A noteworthy finding is the near-optimality of greedy decoding under the DDMC assumption. This provides theoretical backing for the empirical success of greedy decoding across a variety of LLM tasks.
- LLM Alignment Application: The paper shows how these bandit variants can align LLM outputs with user preferences at decoding time, offering an efficient alternative to resource-intensive fine-tuning methods such as RLHF.
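To illustrate the DDMC property referenced above, one informal way to write it is the following; the exact formalization, constants, and notation in the paper may differ, and $u$, $\oplus$, and $g$ here are illustrative symbols rather than the paper's:

$$
\bigl|\,u(s \oplus c) - u(s' \oplus c)\,\bigr| \;\le\; g(|c|)\,\bigl|\,u(s) - u(s')\,\bigr|,
\qquad g(|c|) \to 0 \ \text{as } |c| \to \infty,
$$

where $s$ and $s'$ are two partial token sequences, $c$ is a common block of tokens appended to both, $\oplus$ denotes concatenation, $u$ is the sequence utility, and $g$ is non-increasing. Under such a condition, utility information gathered for one prefix transfers to other prefixes that share enough appended tokens, which is the shared structure that both the regret bounds and the near-optimality of greedy decoding exploit.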
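Likewise, the following is a minimal explore-then-commit sketch in the spirit of GreedyETC for the multi-armed (TMAB) setting; the exploration budget, the crude per-token credit assignment, and all names are assumptions made for illustration, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = list(range(5))        # toy token ids
SEQ_LEN = 8                   # length of each generated sequence
EXPLORE_ROUNDS = 200          # assumed exploration budget (not the paper's schedule)
COMMIT_ROUNDS = 1800

def sequence_reward(seq):
    """Unknown environment: noisy utility of a full sequence (toy stand-in)."""
    return float(np.mean(seq)) + rng.normal(0.0, 0.1)

# Phase 1: explore with random sequences, crediting each reward to its tokens.
counts = np.zeros(len(VOCAB))
totals = np.zeros(len(VOCAB))
for _ in range(EXPLORE_ROUNDS):
    seq = rng.choice(VOCAB, size=SEQ_LEN)
    reward = sequence_reward(seq)
    for tok in seq:
        counts[tok] += 1
        totals[tok] += reward
estimates = totals / np.maximum(counts, 1)

# Phase 2: commit to greedy decoding under the learned per-token estimates.
best_token = int(np.argmax(estimates))
for _ in range(COMMIT_ROUNDS):
    seq = [best_token] * SEQ_LEN      # greedy choice at every position
    _ = sequence_reward(seq)          # feedback still arrives but is no longer used
```

The two-phase structure, a short exploration window followed by greedy commitment, is what drives sublinear regret in explore-then-commit analyses; the paper's GreedyETC and its guarantee are of course more refined.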
Implications and Future Directions
This research has implications for enhancing the personalization and efficiency of LLMs in real-world applications. By aligning LLM outputs during generation without heavy computational overhead, the work broadens the applicability of LLMs in dynamic settings where user preferences can change rapidly.
Future work could relax the DDMC assumption further, extend these frameworks to more general sequence-level utility functions, or examine decoding strategies beyond greedy selection. Integrating these bandit models with more adaptive LLM architectures could also improve robustness across a wider range of language tasks.
Conclusion
The introduction of TLB and TMAB represents a step toward harnessing bandit frameworks for optimizing LLM outputs. The theoretical contributions, together with the accompanying empirical validation, chart a promising course for aligning machine-generated text with human preferences in a resource-efficient manner. This foundational work sets the stage for further innovations in interactive AI systems and personalized LLM applications.