Multi-query Beam Search Overview

Updated 7 December 2025
  • Multi-query beam search is a framework that concurrently maintains and expands multiple query hypotheses for enhanced exploration.
  • It employs state representation, expansion operators, and scoring methods to optimize multi-hop retrieval and decoding tasks.
  • Applications include query rewriting in conversational systems, vectorized decoding in speech recognition, and iterative LLM reasoning with notable performance gains.

Multi-query beam search is a general family of algorithms that apply beam search principles to scenarios involving the expansion, composition, or exploration of multiple query hypotheses in parallel. This design paradigm is central to a class of approaches in information retrieval, question answering, sequence generation, and decoding for large language and speech models. Distinct from classic single-sequence beam search, multi-query beam search enables more exhaustive reasoning over multiple alternatives, supports exploration in multi-hop or multi-step contexts, and leverages parallelism for computational gains.

1. Conceptual Foundations and Algorithmic Patterns

Multi-query beam search generalizes the conventional left-to-right beam search, allowing multiple queries, query representations, or hypotheses to be maintained and expanded in parallel. In this framework, the beam is a set of states (partial queries, candidate reasoning chains, or output hypotheses), and at each expansion step, a subset of the most promising states is retained according to a scoring function. Each state may encode not only an output sequence but also a structured history (retrieved passages, reasoning chains, or sub-query expansions), and the search may involve composition, retrieval, or re-querying at each hop.

Key algorithmic elements include:

  • State representation: Each beam item encodes a hypothesis, which may be a completed sequence, evidence chain, or tuple of queries and associated metadata.
  • Expansion operator: Each hypothesis is expanded (e.g., via token-level generation, retrieval call, or LLM-driven sub-query generation), producing multiple new candidates.
  • Score and prune: Newly generated candidates are scored; only the top-K or top-B are retained.
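
These three elements can be sketched as a generic beam loop. The code below is a minimal illustration, assuming caller-supplied `expand` and `score` functions; it is not tied to any particular paper's implementation:

```python
import heapq

def multi_query_beam_search(initial_states, expand, score, beam_size, max_steps):
    """Maintain a beam of hypothesis states; expand, score, and prune each step."""
    beam = list(initial_states)
    for _ in range(max_steps):
        candidates = []
        for state in beam:
            candidates.extend(expand(state))  # expansion operator
        if not candidates:
            break
        # Score and prune: retain only the top-B candidates.
        beam = heapq.nlargest(beam_size, candidates, key=score)
    return beam

# Toy run: states are strings, expansion appends a character,
# and scoring rewards occurrences of "a".
best = multi_query_beam_search(
    initial_states=[""],
    expand=lambda s: [s + c for c in "ab"],
    score=lambda s: s.count("a"),
    beam_size=2,
    max_steps=3,
)
```

Because the state, expansion, and scoring hooks are opaque to the loop, the same skeleton covers token-level decoding, retrieval chains, and sub-query generation.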

This formalism supports applications in multi-step dense retrieval (Zhao et al., 2021), conversational rewriting (Kostric et al., 27 Jun 2024), LLM-based reasoning (Sun et al., 2023), and vectorized decoding for speech recognition (Seki et al., 2018).

2. Formal Models: Multi-Hop Retrieval and Embedding Composition

In multi-hop or multi-step retrieval, as instantiated by Beam Dense Retrieval (BeamDR), each state in the beam is a tuple of a composed query embedding, a retrieved evidence sequence, and a cumulative score. The process unfolds over $T$ hops:

  • The initial query embedding $\phi(q_1)$ is generated from the input question.
  • At step $t$, each beam element retrieves the top-$K$ passages via maximum similarity in a learned dense space.
  • For each expansion with passage $d_t$, a new composed query embedding is created:

$$\phi(q_{t+1}) = f_{\mathrm{comp}}(\phi(q_t), \phi(d_t))$$

where $f_{\mathrm{comp}}$ is typically a parametric function such as a 2-layer MLP or a recurrent unit (e.g., a GRU).

  • The cumulative reasoning score for a chain $b = (d_1, \dots, d_t)$ is accumulated as:

$$s(b) = \sum_{\tau=1}^{t} \log \mathrm{sim}(\phi(q_\tau), \phi(d_\tau))$$

  • At each step, all $B \times K$ expanded candidates are scored and the beam is pruned back to size $B$ (Zhao et al., 2021).
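
As a concrete sketch of this hop loop, the fragment below uses a toy element-wise average as a stand-in for the learned composition function and a raw dot product as the similarity; both are illustrative assumptions, not BeamDR's trained components:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def f_comp(q_vec, d_vec):
    # Stand-in for the learned composition (e.g., a 2-layer MLP or GRU):
    # an element-wise average, purely for illustration.
    return [(a + b) / 2 for a, b in zip(q_vec, d_vec)]

def beam_dense_retrieval(q1_vec, corpus, B, K, T):
    """Each beam state: (composed query embedding, evidence chain, cumulative log-score)."""
    beam = [(q1_vec, [], 0.0)]
    for _ in range(T):
        candidates = []
        for q_vec, chain, score in beam:
            # Retrieve the top-K passages by similarity in the dense space.
            ranked = sorted(corpus.items(), key=lambda kv: dot(q_vec, kv[1]), reverse=True)
            for doc_id, d_vec in ranked[:K]:
                if doc_id in chain:  # skip passages already on this chain
                    continue
                new_q = f_comp(q_vec, d_vec)
                new_score = score + math.log(dot(q_vec, d_vec))
                candidates.append((new_q, chain + [doc_id], new_score))
        # Prune the up-to-B*K expansions back to beam size B.
        beam = sorted(candidates, key=lambda s: s[2], reverse=True)[:B]
    return beam
```

With a tiny corpus of positive vectors, the beam returns the highest-scoring evidence chains of length $T$.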

This structure captures the combinatorial space of evidence chains, supports reasoning via composed query semantics, and optimizes for multi-hop retrieval objectives.

3. Applications in Query Rewriting and Expansion

In conversational retrieval, multi-query beam search is leveraged to produce a set of plausible rewritten queries from a given user utterance. Standard seq2seq query rewriters run beam search at the token level and typically return only the most likely (top-1) rewrite. Multi-query beam search instead emits the top-n full hypotheses from the final beam, exploiting the fact that beam search already enumerates these alternatives "for free" (no additional forward passes are necessary).

Formally, for each input utterance and context:

  • Run beam search with width $K$: this yields up to $K$ complete rewrites.
  • Compute a length-normalized rewrite score for each output:

$$RS(q_i^j) = \exp\!\left( \frac{\mathrm{score}_j}{|q_i^j|} \right)$$

  • For sparse retrieval, all rewrites are fused into a single weighted bag-of-words; for dense retrieval, their embeddings are linearly combined, weighted by $RS$.
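
A minimal sketch of the scoring and fusion steps, assuming each final-beam hypothesis arrives as a (token list, log-probability) pair; the function names are illustrative, not taken from the paper:

```python
import math
from collections import Counter

def rewrite_scores(rewrites):
    """Length-normalized rewrite scores: RS = exp(log-prob / length)."""
    return [math.exp(logp / len(tokens)) for tokens, logp in rewrites]

def fuse_sparse(rewrites, weights):
    """Fuse all rewrites into a single weighted bag-of-words (sparse retrieval)."""
    bag = Counter()
    for (tokens, _), w in zip(rewrites, weights):
        for tok in tokens:
            bag[tok] += w
    return bag

def fuse_dense(embeddings, weights):
    """RS-weighted linear combination of rewrite embeddings (dense retrieval)."""
    dim = len(embeddings[0])
    return [sum(w * e[i] for w, e in zip(weights, embeddings)) for i in range(dim)]
```

Tokens shared by many high-scoring rewrites accumulate the largest weights in the fused bag, which is what lets the fused query emphasize the consensus interpretation.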

This approach yields statistically significant improvements in standard passage retrieval benchmarks, with absolute MRR improvements between +1.06 and +6.31 points in sparse retrieval, and +3.52 to +4.45 in dense retrieval (Kostric et al., 27 Jun 2024).

4. Batched and Vectorized Search for Efficient Decoding

Multi-query beam search is also realized through vectorized and batched hypothesis expansion, which is critical for high-throughput decoding in speech recognition and sequence-to-sequence generation. Rather than expanding each hypothesis through serial (for-loop) computation, the hypothesis space is organized into tensors (of shape $S \cdot B$), enabling a single forward decoder call and efficient batched top-K selection at every step.

For speech recognition:

  • Each time step operates on a tensor representing $S$ utterances, each with $B$ beam hypotheses.
  • Local and global top-K pruning are executed over these batched structures.
  • Optional integration with external RNNLM and CTC prefix-scorers is achieved via shallow fusion, and all scoring is performed on batched tensors for efficiency.
  • Empirical results show a 3.7x speedup on CPU and 10.5x on GPU for $S=1$ (online), and substantial gains for $S=8$ (offline), with no loss in accuracy (Seki et al., 2018).
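
The batched pruning step can be sketched in plain Python, with nested lists standing in for tensors; a real implementation would express the same logic as one vectorized top-K call per utterance:

```python
import heapq

def batched_beam_step(scores, beam_scores, B):
    """
    One pruning step over S utterances with B hypotheses each.
    scores: S x B x V nested lists of per-token log-probs (one batched
    decoder call would produce this tensor in a real system).
    beam_scores: S x B cumulative hypothesis scores.
    Returns, per utterance, the top-B (hypothesis index, token, new score) triples.
    """
    out = []
    for s in range(len(scores)):  # each utterance in the batch
        # Flatten the B x V candidate grid so global top-B is a single selection.
        flat = (
            (beam_scores[s][b] + scores[s][b][v], b, v)
            for b in range(len(scores[s]))
            for v in range(len(scores[s][b]))
        )
        top = heapq.nlargest(B, flat)
        out.append([(b, v, sc) for sc, b, v in top])
    return out
```

Because every utterance's candidates are pruned by the same flatten-then-top-K pattern, the whole step maps directly onto a batched tensor operation on GPU.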

Analogous batched techniques are fundamental to scalable beam search implementations for LLM decoding, especially when adopting trie-structured search spaces to exploit prefix sharing (Chan et al., 31 Jan 2025).

5. Iterative LLM Query Expansion and Reasoning

The iterative use of multi-query beam search for open-domain or multi-hop question answering with LLMs is exemplified by the ALLIES framework. At each depth:

  • Each beam element comprises the original query, a history of generated sub-queries, evidence sets, a current candidate answer, and a confidence score.
  • Expansion is driven by LLM prompts that generate up to $K$ new sub-questions, each designed to elicit different aspects or decomposition strategies for the original question.
  • Each child sub-query triggers independent evidence retrieval, answer generation, and confidence scoring.
  • Beam pruning by confidence ensures only the top-$B$ reasoning paths are maintained.
  • Stopping is conditioned on a score threshold, fostering both exploration and early convergence.
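
A schematic of this loop, with the LLM prompting, retrieval, and confidence-scoring calls abstracted as caller-supplied hooks (hypothetical stand-ins, not the ALLIES API):

```python
def allies_search(query, generate_subqueries, retrieve, answer_with_confidence,
                  K=2, B=2, max_depth=2, threshold=0.9):
    """
    Beam over reasoning states: (sub-query history, evidence, answer, confidence).
    The three callables stand in for LLM prompting, evidence retrieval, and
    answer scoring; they are illustrative hooks, not a fixed interface.
    """
    ans, conf = answer_with_confidence(query, [])
    beam = [([], [], ans, conf)]
    for _ in range(max_depth):
        if max(state[3] for state in beam) >= threshold:
            break  # early convergence on a confident answer
        candidates = []
        for history, evidence, _, _ in beam:
            for sub_q in generate_subqueries(query, history)[:K]:
                new_evidence = evidence + retrieve(sub_q)
                ans, conf = answer_with_confidence(query, new_evidence)
                candidates.append((history + [sub_q], new_evidence, ans, conf))
        # Prune by confidence: keep only the top-B reasoning paths.
        beam = sorted(candidates, key=lambda s: s[3], reverse=True)[:B]
    return max(beam, key=lambda s: s[3])
```

With $K$ sub-queries per state and beam width $B$, each depth costs at most $B \cdot K$ retrieval-plus-answer calls, which is what keeps the total LLM call budget small.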

With typical hyperparameters ($K=2$–$3$, $B=2$–$3$, $D=1$–$2$), the method achieves its best performance with a minimal increase in LLM call overhead (<50 calls per query), and demonstrates state-of-the-art results on NQ, TriviaQA, and WebQ (e.g., 38.0 EM on NQ) (Sun et al., 2023).

6. Complexity and Empirical Performance

The complexity of multi-query beam search depends on the nature of the hypothesis space and the downstream expansion operation:

  • Multi-hop dense retrieval: $O(BT(\log N + d^2 + K \log(BK)))$ per query, where $B$ is the beam size, $T$ the number of hops, $N$ the corpus size, and $d$ the embedding dimension (Zhao et al., 2021).
  • Vectorized decoding: for batched beam search with vocabulary size $V$, scoring and pruning per step require $O(BV)$ and $O(B^2 + B \log B)$ respectively; the entire process benefits substantially from parallel execution and high memory bandwidth (Seki et al., 2018).
  • Trie-based decoding: reduces memory from $O(bTC)$ for batch-based beam search to $O(TC)$, where $b$ is the beam width, $T$ the sequence length, and $C$ the per-token KV-cache footprint. End-to-end latency is within 5–10% of the conventional batch method, while enabling an order-of-magnitude memory reduction, especially for $b > 8$ or $T > 10^3$ (Chan et al., 31 Jan 2025).
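
To make the memory argument concrete, a back-of-the-envelope comparison with hypothetical sizes (the numbers below are illustrative, not measured values from the paper):

```python
def kv_cache_bytes(seq_len, per_token_bytes, copies=1):
    """Rough KV-cache footprint when each of `copies` hypotheses stores the sequence."""
    return copies * seq_len * per_token_bytes

# Hypothetical sizes purely for illustration:
# b = 16 beams, T = 2048 decoded tokens, C = 1 MiB of KV cache per token.
b, T, C = 16, 2048, 1 << 20
batch_mem = kv_cache_bytes(T, C, copies=b)  # O(bTC): every beam keeps a full copy
trie_mem = kv_cache_bytes(T, C)             # O(TC): shared prefixes stored once
```

Under these assumptions the batch layout uses $b$ times the memory of the trie layout, matching the asymptotic gap above.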

Empirical evidence consistently demonstrates gains in coverage, retrieval recall, and throughput. For example, BeamDR achieves 68.3% support recall (both gold hops) on HotpotQA versus 45.2% for a single-step method (Zhao et al., 2021). In LLM-based ODQA, multi-query expansion yields +3.5 EM improvement over strong baselines and higher retrieval R@1 (e.g., 69.6% vs. 59.2% on NQ) (Sun et al., 2023).

7. Practical Considerations and Trade-offs

  • Beam size and branching factor: while a larger beam size or number of expansions ($K$) boosts evidence coverage and recall, it multiplies computational and memory requirements. In practice, modest values ($B=2$–$5$, $K=2$–$10$) yield good recall/efficiency trade-offs.
  • Fusion and scoring: Length normalization, sparse/dense fusion, and confidence scoring are critical for aggregating multi-query evidence.
  • Hardware and memory: Vectorized and trie-based multi-query approaches reduce total model calls, leverage hardware parallelism, and mitigate memory bottlenecks for large-scale or real-time workloads (Seki et al., 2018, Chan et al., 31 Jan 2025).
  • System integration: Multi-query beam search methods are model-agnostic and can be integrated into any retrieval pipeline, sequence generator, LLM API stack, or multi-modal system, provided the expansion, pruning, and scoring mechanisms are appropriately defined.

In summary, multi-query beam search provides a generic, extensible, and empirically validated approach for multi-path exploration in reasoning, retrieval, and generation tasks, trading modest increases in search complexity for consistent improvements in accuracy, recall, and robustness across diverse application domains (Zhao et al., 2021, Kostric et al., 27 Jun 2024, Sun et al., 2023, Seki et al., 2018, Chan et al., 31 Jan 2025).
