Multi-Turn Beam Search

Updated 7 April 2026
  • Multi-turn beam search is an extension of classical beam search that uses multi-step lookahead to evaluate cumulative scores over growing sequences.
  • It enhances performance in tasks like dialogue generation, dense retrieval, and reinforcement learning by simulating future states and maintaining multiple candidate paths.
  • Empirical studies demonstrate that this approach improves recall, human evaluation scores, and task efficiency while balancing computational trade-offs through parameters like beam width and lookahead depth.

Multi-turn beam search is an extension of classical beam search designed to address sequence generation or decision-making tasks where each choice influences not only immediate outputs but also future states or actions over multiple turns. Unlike single-step or greedy search, multi-turn beam search explicitly performs lookahead or rollout over several steps, maintaining multiple hypotheses (beams) across the growing sequence and pruning less promising paths according to cumulative or proxy scores. This principle is now widely applied in natural language generation, multi-step retrieval, and multi-turn reinforcement learning, where the distribution of future outcomes is often multi-modal and critically dependent on earlier decisions.

1. Core Algorithms and Formulations

At its heart, multi-turn beam search generalizes the traditional beam search algorithm to cover sequences (utterances, passages, actions) spanning multiple decision points, incorporating explicit future modeling.

Given a prefix (partial solution) at turn t, each beam is expanded by enumerating a set of candidate continuations (tokens, passages, actions), scoring each using domain-appropriate functions (e.g., log-likelihood, inner product, or reward), and advancing each hypothesis with its candidate expansions. After all expansions, only the top-B (beam width) hypotheses, taken across all beams and candidates, are retained for the next turn. This procedure is iterated for a specified horizon T (turns, steps, or reasoning hops).

Generic Pseudocode Skeleton

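A minimal Python sketch of this generic skeleton (the `expand` callback, the step scores, and the state representation are placeholders for the domain-specific mechanisms described in later sections):

```python
import heapq
from typing import Any, Callable, Iterable, Tuple

def multi_turn_beam_search(
    init_state: Any,
    expand: Callable[[Any], Iterable[Tuple[Any, float]]],  # state -> (next_state, step_score)
    horizon: int,
    beam_width: int,
) -> Tuple[float, Any]:
    """Keep the top-B hypotheses by cumulative score over a multi-turn horizon."""
    beams = [(0.0, init_state)]  # (cumulative score, state)
    for _ in range(horizon):
        candidates = []
        for score, state in beams:
            for next_state, step_score in expand(state):
                candidates.append((score + step_score, next_state))
        if not candidates:
            break
        # Prune across all beams and all candidates to the top-B.
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    # Return the best surviving hypothesis and its cumulative score.
    return max(beams, key=lambda c: c[0])
```

With sequences as states and log-likelihoods as step scores, this reduces to classical beam search; the specializations below swap in different `expand` and scoring functions.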

Specializations differ in the expansion mechanism, scoring, and how the next state is composed.

2. Multi-Turn Beam Search in Dialogue and Language Generation

In neural dialogue modeling and autoregressive sequence generation, multi-turn beam search addresses myopia in conventional decoding by simulating both self and dialogue partner over several future turns. The search alternates between hypothesized utterances of the model and an approximated partner model, enabling the evaluation of each candidate utterance by rolling out likely conversation trajectories.

Formally, the objective can be either marginal over all possible futures or, in practical settings, over the single most likely future (optimistic), yielding a combinatorial argmax. Exact optimization is intractable, but multi-turn beam search approximates this by:

  • Generating a set of candidate next utterances.
  • For each, simulating L turns of conversation with a partner model (mindless, egocentric, or transparent).
  • Maintaining the top K overall rollouts according to log-likelihood or composite score.
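The steps above can be sketched as follows, assuming hypothetical model interfaces: `propose(history, n)` returns n candidate utterances with their log-likelihoods, and `partner_reply(history)` returns the partner model's most likely reply with its log-likelihood.

```python
import heapq
from typing import Callable, List, Tuple

def multi_turn_dialogue_search(
    history: List[str],
    propose: Callable[[List[str], int], List[Tuple[str, float]]],
    partner_reply: Callable[[List[str]], Tuple[str, float]],
    n_candidates: int,
    lookahead: int,
    top_k: int,
) -> str:
    """Pick the next utterance by rolling out future conversation turns."""
    rollouts = []  # (cumulative log-likelihood of rollout, first utterance)
    for utt, ll in propose(history, n_candidates):
        traj, score = history + [utt], ll
        for _ in range(lookahead):
            # Simulate the partner's reply, then our own most likely continuation.
            reply, r_ll = partner_reply(traj)
            traj, score = traj + [reply], score + r_ll
            cont, c_ll = propose(traj, 1)[0]
            traj, score = traj + [cont], score + c_ll
        rollouts.append((score, utt))
    # Keep the top-K rollouts; the chosen utterance heads the best one.
    best = heapq.nlargest(top_k, rollouts, key=lambda r: r[0])
    return best[0][1]
```

This is a sketch under the optimistic single-future approximation (each rollout tracks only the most likely continuation), not the full marginal over futures.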

This yields a time complexity of O(L × K T log K) per decision, with T the (maximum) utterance length.

Empirically, this approach on the Persona-Chat corpus (≈9.9k training dialogues, beam widths 10–100) produces more coherent, higher-rated responses: moving from utterance-level to multi-turn beam search with an egocentric partner model improves human evaluation scores (1.98 vs 1.67) and yields more non-myopic behavior (Kulikov et al., 2019).

3. Multi-Turn Beam Search for Multi-Step Dense Retrieval

In multi-hop question answering over unstructured text, multi-turn beam search underpins frameworks such as "Beam Dense Retrieval" (BeamDR). Here, each step in the chain involves retrieving supporting passages conditioned on both the current query and evidence collected so far. Query embeddings are iteratively composed by a trainable composition function, and at each retrieval hop:

  • Each beam tracks a partial evidence chain and its cumulative score.
  • At each step, for each beam, the top-K passages are selected by inner-product similarity with the composed query vector.
  • All B × K candidate beam extensions are scored and pruned to the top B for the next step.

The composition function is crucial; omitting it reduces recall by 3–4%. A diversity penalty on candidate tails (with a tunable weight) increases multi-hop recall by up to 1%. When applied to HotpotQA (distractor setting), BeamDR achieves higher chain recall-at-2 than greedy DPR and improves end-to-end F1 (Zhao et al., 2021).
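A minimal sketch of this retrieval loop, assuming dense query/passage vectors and a hypothetical `compose(query_vec, passage_vec)` function that produces the next-hop query embedding (BeamDR's actual composition function is trained; here it is a placeholder):

```python
import numpy as np

def beam_dense_retrieval(query_vec, passage_vecs, hops, beam_width, k, compose):
    """Beam search over evidence chains, scored by cumulative inner product."""
    beams = [(0.0, query_vec, [])]  # (cumulative score, current query, chain of passage ids)
    for _ in range(hops):
        candidates = []
        for score, q, chain in beams:
            sims = passage_vecs @ q  # inner-product similarity with all passages
            for idx in np.argsort(sims)[::-1][:k]:  # per-beam top-K passages
                if idx in chain:
                    continue  # avoid re-retrieving a passage already in the chain
                candidates.append(
                    (score + float(sims[idx]),
                     compose(q, passage_vecs[idx]),  # next-hop query embedding
                     chain + [int(idx)])
                )
        # Prune all B x K extensions to the top-B chains.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams  # top-B evidence chains with cumulative scores
```

A real system would add the diversity penalty on candidate tails at the pruning step; it is omitted here for brevity.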

4. Multi-Turn Beam Search in Multi-Turn Reinforcement Learning

Multi-turn beam search is also integral to trajectory generation in multi-turn reinforcement learning of LLM agents, such as in the TSR ("Trajectory-Search Rollouts") framework. In partially observed MDP (POMDP) environments, naive trajectory sampling often incurs high variance, sparse rewards, and mode collapse. TSR instead generates K-step rollouts by:

  • At each turn t, expanding each beam prefix by sampling N candidate actions from the policy.
  • Each expanded trajectory is scored via a task-specific function (e.g., immediate reward, progress proxy).
  • The accumulated score guides pruning to the top-B partial trajectories per turn.
  • At the end of the episode/horizon T, the highest-scoring trajectory is selected for RL policy-gradient updates (PPO or GRPO).

This yields absolute success-rate improvements across tasks such as Sokoban and FrozenLake, with moderate search budgets (beam width and per-turn expansion) offering the best compute-performance trade-off. Notably, TSR does not alter the standard RL objective; it only refines the trajectory distribution used for return and advantage estimation, yielding substantial variance reduction and robustness against mode collapse ("Echo Trap") (Djuhera et al., 12 Feb 2026).
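The rollout procedure above can be sketched as follows (a simplified, environment-agnostic version; `policy_sample`, `step`, and `score` are assumed interfaces, and in practice `policy_sample` is stochastic):

```python
from typing import Any, Callable, List, Tuple

def trajectory_search_rollout(
    init_state: Any,
    policy_sample: Callable[[Any], Any],   # samples one action from the policy
    step: Callable[[Any, Any], Any],       # environment transition
    score: Callable[[List[Any]], float],   # task-specific trajectory score
    horizon: int,
    n_samples: int,
    beam_width: int,
) -> Tuple[Any, List[Any]]:
    """Beam search over sampled actions; the best trajectory feeds the RL update."""
    beams = [(init_state, [])]  # (current state, action trajectory)
    for _ in range(horizon):
        candidates = []
        for state, traj in beams:
            for _ in range(n_samples):
                action = policy_sample(state)
                new_traj = traj + [action]
                # Score the partial trajectory (e.g., reward or progress proxy).
                candidates.append((score(new_traj), step(state, action), new_traj))
        # Prune to the top-B partial trajectories for the next turn.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = [(s, t) for _, s, t in candidates[:beam_width]]
    # Highest-scoring full trajectory is used for the policy-gradient update.
    return max(beams, key=lambda b: score(b[1]))
```

Because only the rollout distribution changes, the same PPO/GRPO update code can consume the returned trajectory unchanged.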

5. Hyperparameterization, Trade-Offs, and Complexity

Critical parameters for multi-turn beam search are:

Parameter                             | Typical Values | Effect/Trade-off
Beam width B                          | 1–5            | Higher B: more exploration, slower search
Per-beam expansion N (or K)           | 4–20           | Higher values increase breadth; diminishing gains beyond moderate settings
Lookahead depth L                     | 1–8            | Longer lookahead improves non-myopia at multiplicative overhead
Composition function / partner model  | task-dependent | Omission degrades recall or human score by 3–4%
Diversity penalty λ                   | 0.05–0.2       | Mitigates beam collapse; +1% multi-hop recall

Complexity is multiplicative in beam width, expansion factor, and lookahead (e.g., O(L × K T log K) per decision in dialogue; B × K candidate scorings per hop in dense retrieval).

Empirically, gains level off beyond moderate beam widths and lookahead depths; past that point, compute grows with minimal accuracy improvement.
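To make the multiplicative scaling concrete, a small sketch (with hypothetical budget numbers, not figures from the cited studies) counts candidate evaluations per decision:

```python
def candidates_per_decision(beam_width: int, expansion: int, lookahead: int) -> int:
    """Candidate evaluations per decision: every surviving beam expands
    `expansion` ways at each of `lookahead` steps (an upper-bound sketch)."""
    return beam_width * expansion * lookahead

# Cost scales linearly in each factor: doubling the beam width doubles the work,
# so accuracy gains must justify it; empirically they flatten at moderate widths.
```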

6. Empirical Findings and Ablations

Quantitative and ablation studies across domains demonstrate that multi-turn beam search:

  • Consistently outperforms greedy and single-step search, both in recall and human evaluations (Kulikov et al., 2019, Zhao et al., 2021, Djuhera et al., 12 Feb 2026).
  • In dialogue, additional lookahead depth and informed partner models result in higher-quality, more contextually consistent completions.
  • In dense retrieval, the trained query composition function is critical; omitting it or using untuned expansion/pruning worsens both intermediate (recall) and final (F1) metrics.
  • In RL, beam-searched TSR rollouts yield higher success rates, stabilize learning, shorten LLM responses, and reduce mode collapse versus best-of-N or shallow lookahead sampling.

Ablations confirm that hyperparameter choices (beam width, expansion, penalty) and the form of future modeling or candidate scoring significantly affect both the efficiency and robustness of the search.

7. Significance, Limitations, and Connections

Multi-turn beam search directly addresses the non-myopic nature of sequential generation and decision-making tasks. By systematically expanding possible futures and globally optimizing over multi-turn sequences, it enables more effective navigation of complex, multi-modal search spaces.

However, the method introduces additional computation, with complexity scaling linearly (or quadratically in naive implementations) with beam width, expansion, and lookahead. Diminishing returns beyond moderate beam sizes are routinely observed. Approximate partner models or scoring proxies may introduce bias but generally outperform mindless or myopic alternatives. Notably, the approach is orthogonal to underlying model architectures or learning objectives, serving as a plug-in inference or rollout strategy.

Multi-turn beam search now underpins a spectrum of state-of-the-art systems in dialogue modeling (Kulikov et al., 2019), dense retrieval for multi-step reasoning (Zhao et al., 2021), and RL for LLM agents (Djuhera et al., 12 Feb 2026). Common themes are improvements in sample efficiency, stability, and task-general applicability—grounded not in model rewrites but in search regime alteration. Future developments may focus on adaptive beam budgeting, hybrid global-local search, or learned scoring heuristics, leveraging the strengths of multi-turn exploration with computational tractability.
