Multi-Turn Beam Search
- Multi-turn beam search is an extension of classical beam search that uses multi-step lookahead to evaluate cumulative scores over growing sequences.
- It enhances performance in tasks like dialogue generation, dense retrieval, and reinforcement learning by simulating future states and maintaining multiple candidate paths.
- Empirical studies demonstrate that this approach improves recall, human evaluation scores, and task efficiency while balancing computational trade-offs through parameters like beam width and lookahead depth.
Multi-turn beam search is an extension of classical beam search designed to address sequence generation or decision-making tasks where each choice influences not only immediate outputs but also future states or actions over multiple turns. Unlike single-step or greedy search, multi-turn beam search explicitly performs lookahead or rollout over several steps, maintaining multiple hypotheses (beams) across the growing sequence and pruning less promising paths according to cumulative or proxy scores. This principle is now widely applied in natural language generation, multi-step retrieval, and multi-turn reinforcement learning, where the distribution of future outcomes is often multi-modal and critically dependent on earlier decisions.
1. Core Algorithms and Formulations
At its heart, multi-turn beam search generalizes the traditional beam search algorithm to cover sequences (utterances, passages, actions) spanning multiple decision points, incorporating explicit future modeling.
Given a prefix (partial solution) at turn $t$, each beam is expanded by enumerating a set of candidate continuations (tokens, passages, actions), scoring each using domain-appropriate functions (e.g., log-likelihood, inner product, or reward), and advancing each hypothesis with its candidate expansions. After all expansions, only the top-$B$ hypotheses (where $B$ is the beam width), taken across all beams and candidates, are retained for the next turn. This procedure is iterated over a specified horizon (turns, steps, or reasoning hops).
Generic Pseudocode Skeleton
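A minimal Python sketch of this skeleton, with `expand` and `score` as assumed domain-specific callables (the names and signatures are illustrative, not from any of the cited papers):

```python
import heapq

def multi_turn_beam_search(initial_state, expand, score, beam_width, horizon):
    """Generic multi-turn beam search sketch.

    expand(state) -> iterable of (candidate, next_state) pairs
    score(prev_score, state, candidate) -> cumulative score of the extended hypothesis
    """
    # Each beam is (cumulative_score, state, path-so-far).
    beams = [(0.0, initial_state, [])]
    for _ in range(horizon):
        candidates = []
        for cum_score, state, path in beams:
            for cand, next_state in expand(state):
                new_score = score(cum_score, state, cand)
                candidates.append((new_score, next_state, path + [cand]))
        # Keep only the top-B hypotheses across all beams and all candidates.
        beams = heapq.nlargest(beam_width, candidates, key=lambda b: b[0])
    # Return the single highest-scoring hypothesis after the horizon.
    return max(beams, key=lambda b: b[0])
```

Specializations plug in their own `expand` (token sampling, passage retrieval, action sampling) and `score` (log-likelihood, inner product, reward proxy).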
Specializations differ in the expansion mechanism, scoring, and how the next state is composed.
2. Multi-Turn Beam Search in Dialogue and Language Generation
In neural dialogue modeling and autoregressive sequence generation, multi-turn beam search addresses myopia in conventional decoding by simulating both self and dialogue partner over several future turns. The search alternates between hypothesized utterances of the model and an approximated partner model, enabling the evaluation of each candidate utterance by rolling out likely conversation trajectories.
Formally, the objective can be either marginal over all possible futures or, in practical settings, over the single most likely future (optimistic), yielding a combinatorial argmax. Exact optimization is intractable, but multi-turn beam search approximates this by:
- Generating a set of candidate next utterances.
- For each candidate, simulating $d$ future turns of conversation with a partner model (mindless, egocentric, or transparent).
- Maintaining the top-$B$ overall rollouts according to log-likelihood or a composite score.
This yields a per-decision time complexity that is multiplicative in the beam width, the number of candidate utterances, the lookahead depth, and the (maximum) utterance length.
Empirically, this approach on the Persona-Chat corpus (≈$9.9$k training dialogues, beam widths $10$–$100$, varying lookahead depths) produces more coherent, higher-rated responses: moving from utterance-level to multi-turn beam search with an egocentric partner model improves human evaluation scores (1.98 vs 1.67) and yields less myopic behavior (Kulikov et al., 2019).
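The rollout loop above can be sketched as follows; `self_model` and `partner_model` are hypothetical interfaces (each maps a dialogue history to an utterance and its log-probability), standing in for whatever decoder the system actually uses:

```python
def rollout_score(candidate, history, self_model, partner_model, depth):
    """Score a candidate utterance by simulating `depth` future exchanges.

    self_model / partner_model: callables mapping a dialogue history to a
    (next_utterance, log_prob) pair -- assumed interfaces for illustration.
    """
    history = history + [candidate]   # hypothesize the candidate was uttered
    total_logp = 0.0
    for _ in range(depth):
        # The approximated partner responds to the simulated conversation.
        reply, logp = partner_model(history)
        history.append(reply)
        total_logp += logp
        # The model itself continues, extending the simulated trajectory.
        utterance, logp = self_model(history)
        history.append(utterance)
        total_logp += logp
    return total_logp

def best_utterance(candidates_with_logp, history, self_model, partner_model, depth):
    # Rank candidates by their own log-prob plus the rolled-out future score.
    return max(
        candidates_with_logp,
        key=lambda c: c[1] + rollout_score(c[0], history, self_model,
                                           partner_model, depth),
    )
```

In a full system the rollouts themselves would be beam-searched rather than greedy; this sketch shows only the scoring structure.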
3. Multi-Turn Beam Search for Multi-Step Dense Retrieval
In multi-hop question answering over unstructured text, multi-turn beam search underpins frameworks such as "Beam Dense Retrieval" (BeamDR). Here, each step in the chain involves retrieving supporting passages conditioned on both the current query and the evidence collected so far. Query embeddings are iteratively composed by a trainable composition function, and at each retrieval hop:
- Each beam tracks a partial evidence chain and its cumulative score.
- At each step, for each beam, the top-$k$ passages are selected by inner-product similarity with the composed query vector.
- All $B \cdot k$ candidate beam extensions are scored and pruned to the top-$B$ for the next step.
The composition function is crucial; omitting it reduces recall by 3–4%. A diversity penalty on candidate tails (with a tunable weight) increases multi-hop recall by up to 1%. On HotpotQA (distractor setting), BeamDR achieves higher chain recall-at-2 than greedy DPR and improves end-to-end F1 (Zhao et al., 2021).
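A pure-Python sketch of the retrieval loop above, assuming passages are given as dense embedding vectors and `compose` stands in for the trained composition function (all names and the toy similarity setup are illustrative, not BeamDR's actual code):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def beam_dense_retrieval(q0, passages, compose, beam_width, hops, k):
    """Multi-hop dense retrieval with beam search over evidence chains.

    q0: initial query embedding; passages: list of passage embeddings;
    compose(query_vec, passage_vec) -> new query embedding (assumed to
    approximate the trainable composition function).
    """
    # Each beam: (cumulative score, composed query, chain of passage ids).
    beams = [(0.0, q0, [])]
    for _ in range(hops):
        candidates = []
        for cum, q, chain in beams:
            sims = [dot(p, q) for p in passages]   # inner-product similarity
            top = sorted(range(len(passages)), key=lambda i: sims[i],
                         reverse=True)[:k]          # top-k passages per beam
            for pid in top:
                if pid in chain:
                    continue                        # avoid repeating evidence
                candidates.append((cum + sims[pid],
                                   compose(q, passages[pid]),
                                   chain + [pid]))
        # Prune all B*k extensions back to the top-B chains.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]
```

A real system would use batched matrix products over an ANN index rather than explicit loops, but the chain-tracking and pruning logic is the same.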
4. Multi-Turn Beam Search in Multi-Turn Reinforcement Learning
Multi-turn beam search is also integral to trajectory generation in multi-turn reinforcement learning of LLM agents, such as in the TSR ("Trajectory-Search Rollouts") framework. In partially observed MDP (POMDP) environments, naive trajectory sampling often incurs high variance, sparse rewards, and mode collapse. TSR instead generates K-step rollouts by:
- At each turn, expanding each beam prefix by sampling candidate actions from the policy.
- Each expanded trajectory is scored via a task-specific function (e.g., immediate reward, a progress proxy).
- The accumulated score guides pruning to the top-$B$ partial trajectories per turn.
- At the end of the episode or horizon, the highest-scoring trajectory is selected for RL policy-gradient updates (PPO or GRPO).
This yields absolute success-rate improvements across tasks such as Sokoban and FrozenLake, with moderate search budgets (small beam widths and expansion factors) offering the best compute-performance trade-off. Notably, TSR does not alter the standard RL objective; it only refines the trajectory distribution used for return and advantage estimation, yielding substantial variance reduction and robustness against mode collapse ("Echo Trap") (Djuhera et al., 12 Feb 2026).
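The rollout procedure can be sketched as follows; the environment and policy interfaces (`env_reset`, `env_step`, `policy_sample`, `score_fn`) are assumptions for illustration, not the TSR framework's actual API:

```python
import random

def tsr_rollouts(env_reset, env_step, policy_sample, score_fn,
                 beam_width, expand_k, horizon, seed=0):
    """Trajectory-search rollout sketch.

    policy_sample(state, rng) -> candidate action sampled from the policy
    env_step(state, action)   -> (next_state, reward)
    score_fn(trajectory)      -> scalar proxy score (e.g., accumulated reward)
    Returns the highest-scoring trajectory for use in the policy update.
    """
    rng = random.Random(seed)
    # Each beam: (state, trajectory of (state, action, reward) triples).
    beams = [(env_reset(), [])]
    for _ in range(horizon):
        candidates = []
        for state, traj in beams:
            for _ in range(expand_k):
                action = policy_sample(state, rng)
                next_state, reward = env_step(state, action)
                candidates.append((next_state, traj + [(state, action, reward)]))
        # Prune to the top-B partial trajectories by the proxy score.
        candidates.sort(key=lambda c: score_fn(c[1]), reverse=True)
        beams = candidates[:beam_width]
    return max(beams, key=lambda c: score_fn(c[1]))[1]
```

The returned trajectory feeds into the unchanged PPO/GRPO update; only the sampling distribution of rollouts is refined.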
5. Hyperparameterization, Trade-Offs, and Complexity
Critical parameters for multi-turn beam search are:
| Parameter | Typical Values | Effect/Trade-off |
|---|---|---|
| Beam width $B$ | 1–5 | Higher $B$: more exploration, slower search |
| Per-beam expansion $k$ | 4–20 | Higher $k$ increases breadth, with diminishing gains |
| Lookahead depth $d$ | 1–8 | Longer lookahead improves non-myopia at higher overhead |
| Composition function or partner model | Variable | Omission degrades recall or human score by 3–4% |
| Diversity penalty weight | [0.05, 0.2] | Mitigates beam collapse; up to +1% multi-hop recall |
Complexity is multiplicative in beam width, expansion factor, and lookahead depth, in both the dialogue and dense-retrieval settings.
Empirically, gains level off beyond moderate beam widths and expansion factors; past that point, compute grows with minimal accuracy improvement.
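The multiplicative cost structure is easy to make concrete; the helper below (an illustrative accounting, not from any cited paper) counts how many candidate hypotheses must be scored over a full search:

```python
def hypotheses_scored(beam_width, expand_k, depth):
    """Total candidate hypotheses scored over one search: B beams, each
    expanded with k candidates, at every one of d decision points."""
    return beam_width * expand_k * depth
```

Doubling any single knob doubles the total scoring cost, which is why budgets are usually tuned jointly rather than maximizing each parameter independently.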
6. Empirical Findings and Ablations
Quantitative and ablation studies across domains demonstrate that multi-turn beam search:
- Consistently outperforms greedy and single-step search, both in recall and human evaluations (Kulikov et al., 2019, Zhao et al., 2021, Djuhera et al., 12 Feb 2026).
- In dialogue, additional lookahead and informed partner models result in higher-quality, more contextually consistent completions.
- In dense retrieval, the trained query composition function is critical; omitting it or using untuned expansion/pruning worsens both intermediate (recall) and final (F1) metrics.
- In RL, beam-searched TSR rollouts yield higher success rates, stabilize learning, shorten LLM responses, and reduce mode collapse versus best-of-N or shallow lookahead sampling.
Ablations confirm that hyperparameter choices (beam width, expansion, penalty) and the form of future modeling or candidate scoring significantly affect both the efficiency and robustness of the search.
7. Significance, Limitations, and Connections
Multi-turn beam search directly addresses the non-myopic nature of sequential generation and decision-making tasks. By systematically expanding possible futures and globally optimizing over multi-turn sequences, it enables more effective navigation of complex, multi-modal search spaces.
However, the method introduces additional computation, with complexity scaling multiplicatively in beam width, expansion factor, and lookahead depth. Diminishing returns beyond moderate beam sizes are routinely observed. Approximate partner models or scoring proxies may introduce bias but generally outperform mindless or myopic alternatives. Notably, the approach is orthogonal to the underlying model architecture and learning objective, serving as a plug-in inference or rollout strategy.
Multi-turn beam search now underpins a spectrum of state-of-the-art systems in dialogue modeling (Kulikov et al., 2019), dense retrieval for multi-step reasoning (Zhao et al., 2021), and RL for LLM agents (Djuhera et al., 12 Feb 2026). Common themes are improvements in sample efficiency, stability, and task-general applicability—grounded not in model rewrites but in search regime alteration. Future developments may focus on adaptive beam budgeting, hybrid global-local search, or learned scoring heuristics, leveraging the strengths of multi-turn exploration with computational tractability.