Bidirectional Beam Search (BiBS)

Updated 13 April 2026

Bidirectional Beam Search is a family of approximate inference algorithms that use both past and future context to improve sequence-level coherence.
Its variants include coordinated search, synchronous expansion, and LM fusion, applied to tasks such as machine translation and summarization.
BiBS methods balance forward and backward probability contributions and are reported to score higher than unidirectional approaches on metrics such as BLEU, ROUGE, and CIDEr.

Bidirectional Beam Search (BiBS) comprises a family of approximate inference algorithms designed for decoding with bidirectional neural sequence models. By leveraging both past and future context during sequence generation, BiBS methods address intrinsic limitations of unidirectional decoding, particularly when applied to challenging tasks such as fill-in-the-blank generation, abstractive summarization, neural response generation, machine translation, and connectionist temporal classification (CTC) decoding. The BiBS framework encompasses variants that differ in how bidirectional information is combined and exploited during inference, but all aim to improve sequence-level coherence and global optimality relative to conventional left-to-right (L2R) or right-to-left (R2L) beam search.

1. Motivation and Foundations

Traditional sequence generation in neural models (e.g., RNN, LSTM, GRU, Transformer) employs unidirectional decoding, in which the model generates each token $y_t$ conditioned only on the prefix $y_{1:t-1}$ and, optionally, a representation of the source input. This sequential paradigm limits the decoder’s ability to use future information, resulting in characteristic pathologies such as overproduction of initial phrases, repetition, and unbalanced coverage of source content, all of which are especially pronounced in long output sequences (Al-Sabahi et al., 2018; Sun et al., 2017; Zhang et al., 2019). Bidirectional sequence models, including bidirectional decoders and language models, offer a richer conditional structure by conditioning each output token on both its prefix and suffix (i.e., $p(y_t \mid y_{1:t-1}, y_{t+1:T}, x)$), but they make exact inference intractable because the space of possible output sequences grows exponentially with length.

BiBS algorithms are motivated by the need for tractable, approximate decoding methods that allow bidirectional models to realize their potential for globally coherent sequence generation without prohibitive computational cost (Sun et al., 2017). They achieve this by various means, such as coordinate descent over prefix and suffix beams, synchronous expansion of L2R and R2L hypotheses, or reranking based on bidirectional scoring criteria.

2. Formal Decoding Objectives and Key Variants

The general BiBS decoding objective is to approximate $\hat{Y} = \arg\max_{Y} p(Y \mid X)$ under a model that factorizes conditionally on both left and right context. Two principal mathematical formulations are prevalent:

Joint Bidirectional Log-Score Objective (Al-Sabahi et al., 2018):

$S_{\text{fw}}(Y) = \sum_{t=1}^T \log p_{\text{fw}}(y_t | y_{1:t-1}, x), \quad S_{\text{bw}}(Y) = \sum_{t=1}^T \log p_{\text{bw}}(y_t | y_{t+1:T}, x)$

$\hat{Y} = \arg\max_Y[\gamma S_{\text{fw}}(Y) + (1-\gamma) S_{\text{bw}}(Y)]$

where $\gamma \in [0,1]$ balances forward and backward contributions.
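
As a minimal sketch of the joint objective above, the following Python snippet scores an explicit list of candidate sequences with $\gamma S_{\text{fw}} + (1-\gamma) S_{\text{bw}}$ and returns the best one. The per-token probabilities and the names `joint_score` and `best_candidate` are illustrative assumptions, not part of any cited implementation; a real decoder would search the candidate space rather than enumerate it.

```python
import math

def joint_score(fw_probs, bw_probs, gamma=0.5):
    """Return gamma * S_fw + (1 - gamma) * S_bw for one candidate sequence.

    fw_probs[t] stands for p_fw(y_t | y_{1:t-1}, x) and bw_probs[t] for
    p_bw(y_t | y_{t+1:T}, x); both are hypothetical per-token probabilities.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    if len(fw_probs) != len(bw_probs):
        raise ValueError("both directions must score the same tokens")
    s_fw = sum(math.log(p) for p in fw_probs)
    s_bw = sum(math.log(p) for p in bw_probs)
    return gamma * s_fw + (1.0 - gamma) * s_bw

def best_candidate(candidates, gamma=0.5):
    """Pick the argmax of the joint score over an explicit candidate list."""
    return max(candidates, key=lambda c: joint_score(c["fw"], c["bw"], gamma))

# Toy candidates (made-up numbers): the first is favoured by the forward
# model, the second by the backward model.
candidates = [
    {"tokens": ["a", "cat", "sat"], "fw": [0.9, 0.5, 0.4], "bw": [0.2, 0.3, 0.5]},
    {"tokens": ["a", "dog", "sat"], "fw": [0.9, 0.4, 0.4], "bw": [0.6, 0.6, 0.5]},
]

forward_only = best_candidate(candidates, gamma=1.0)
balanced = best_candidate(candidates, gamma=0.5)
```

With these toy numbers, setting $\gamma = 1$ selects the forward-favoured candidate, while $\gamma = 0.5$ lets the backward score change the choice.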

Bidirectional RNN Product-of-Conditionals (Sun et al., 2017):

$p(Y|X) \propto \prod_{t=1}^T p(y_t|y_{1:t-1}, y_{t+1:T}, X)$

This objective is approximated via coordinate updates that alternate between left-to-right and right-to-left passes, with each pass holding the opposite direction’s beams fixed.

In addition, bidirectional strategies can be classified as:

Coordinated Bidirectional Beam Search (standard BiBS; e.g., (Sun et al., 2017)): Iterative forward/backward passes update and coordinate prefix/suffix beams to maximize the approximate joint probability.
Synchronous Interactive Decoding (Zhang et al., 2019): L2R and R2L hypotheses are expanded in parallel at each time step, with cross-directional state-sharing.
Bidirectional Scoring/Aggregation (BidiS/BidiA) (Colombo et al., 2021): Combines the scores of independent forward and backward beam searches via log-probability addition or similarity-based agreement.
Bidirectional LM Fusion for CTC (Jung et al., 2021): Shallow fusion of forward and backward LMs using a noisy (greedy) future estimate during prefix beam search.

3. Algorithmic Instantiations

3.1 Bidirectional Beam Search for Bidirectional Decoder Models

The quintessential BiBS for fully bidirectional decoders comprises the following sequence (Al-Sabahi et al., 2018):

Backward Beam Presearch: Compute a backward beam of $K$ high-scoring suffixes by running beam search from position $T$ to $1$; cache cumulative backward log-probability arrays for these sequences.
Forward Beam with Bidirectional Scoring: At each decoding step $t$, extend prefixes in the forward beam by a candidate token $y_t$, and for each, retrieve the best-matching backward beam suffix and sum their prefix and suffix scores, weighted by $\gamma$.
Beam Maintenance: Retain the top $K$ prefixes at each step according to the joint score. Decoding proceeds until $t = T$ or until EOS is reached in all beams.
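
The three steps above can be sketched in Python under strong simplifying assumptions: toy bigram tables stand in for the forward and backward decoders, the output length is fixed, and a backward hypothesis is treated as "matching" when it has the candidate token at the current position (with a fallback to all cached hypotheses). These choices, and every name in the block, are illustrative and not taken from the cited work.

```python
import math

VOCAB = ["a", "b", "c"]
# Hypothetical bigram tables: FW[prev][tok] and BW[next][tok] are probabilities.
FW = {None: {"a": 0.6, "b": 0.3, "c": 0.1}, "a": {"a": 0.1, "b": 0.6, "c": 0.3},
      "b": {"a": 0.2, "b": 0.2, "c": 0.6}, "c": {"a": 0.5, "b": 0.3, "c": 0.2}}
BW = {None: {"a": 0.2, "b": 0.2, "c": 0.6}, "a": {"a": 0.2, "b": 0.1, "c": 0.7},
      "b": {"a": 0.7, "b": 0.2, "c": 0.1}, "c": {"a": 0.1, "b": 0.7, "c": 0.2}}

def backward_presearch(K, T):
    """Beam search from the last position to the first.

    Returns (tokens, suffix_scores) pairs where suffix_scores[t] is the cached
    backward log-probability of tokens[t:], so suffix_scores[T] == 0.0.
    """
    beams = [((), (0.0,))]
    for _ in range(T):
        grown = []
        for toks, scores in beams:
            nxt = toks[0] if toks else None
            for tok in VOCAB:
                total = scores[0] + math.log(BW[nxt][tok])
                grown.append(((tok,) + toks, (total,) + scores))
        beams = sorted(grown, key=lambda b: b[1][0], reverse=True)[:K]
    return beams

def best_suffix_score(backward_beams, t, tok):
    """Best cached score of a suffix that starts right after position t.

    Assumption of this sketch: a hypothesis matches if it has tok at
    position t; if none matches, all cached hypotheses are considered.
    """
    matching = [s[t + 1] for toks, s in backward_beams if toks[t] == tok]
    pool = matching or [s[t + 1] for _, s in backward_beams]
    return max(pool)

def bibs_decode(K, T, gamma):
    """Forward beam search scored by gamma * prefix + (1 - gamma) * suffix."""
    backward_beams = backward_presearch(K, T)
    beams = [((), 0.0)]  # (prefix tokens, forward prefix log-probability)
    for t in range(T):
        grown = []
        for toks, fw in beams:
            prev = toks[-1] if toks else None
            for tok in VOCAB:
                fw_new = fw + math.log(FW[prev][tok])
                bw = best_suffix_score(backward_beams, t, tok)
                joint = gamma * fw_new + (1.0 - gamma) * bw
                grown.append((joint, toks + (tok,), fw_new))
        grown.sort(key=lambda g: g[0], reverse=True)
        beams = [(toks, fw) for _, toks, fw in grown[:K]]  # keep the top K
    return beams[0][0]

decoded = bibs_decode(K=2, T=3, gamma=0.5)
```

Caching the suffix scores once is what keeps the forward pass cheap: each candidate extension needs only a lookup into the backward beam rather than a fresh backward model call.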

3.2 Synchronous Interactive Beam Search

Fixed-size L2R and R2L beams are expanded synchronously at each time step, with scoring and attention incorporating the opposite direction’s decoder states (via cross-attention in LSTM or Transformer decoders). At each step, candidates are scored by combining context from both sides and then pruned. Decoding continues until enough complete hypotheses have been gathered (Zhang et al., 2019).

3.3 Coordinate-Descent BiBS

BiBS as originally proposed for bidirectional models (Sun et al., 2017) executes alternating left-to-right and right-to-left passes over the current beam set, at each step updating beam hypotheses using cached log-probabilities from both directions. Candidate sequences are built from the Cartesian product of prefix beams, possible next tokens, and suffix beams.
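
The candidate-construction step described above, a Cartesian product of prefix beams, next tokens, and suffix beams, can be illustrated with a short Python sketch. It covers a single position update only; the alternating passes and the cached log-probabilities are omitted, and `toy_score` is a hypothetical stand-in for the model's conditional $\log p(y_t \mid y_{1:t-1}, y_{t+1:T}, x)$.

```python
from itertools import product

def update_position(prefix_beams, suffix_beams, vocab, score_fn, K):
    """Re-select the top-K fills for one position with neighbours held fixed.

    prefix_beams / suffix_beams are lists of token tuples to the left / right
    of the position being updated. score_fn(prefix, tok, suffix) is a
    hypothetical scorer; higher is better.
    """
    candidates = []
    for prefix, tok, suffix in product(prefix_beams, vocab, suffix_beams):
        candidates.append((score_fn(prefix, tok, suffix), prefix + (tok,) + suffix))
    candidates.sort(key=lambda c: c[0], reverse=True)
    return candidates[:K]

def toy_score(prefix, tok, suffix):
    # Made-up scorer: rewards a token that continues an alphabetical run
    # from the last prefix token and into the first suffix token.
    left = abs(ord(tok) - ord(prefix[-1]) - 1)
    right = abs(ord(suffix[0]) - ord(tok) - 1)
    return -(left + right)

prefix_beams = [("a",), ("b",)]
suffix_beams = [("c",), ("d",)]
vocab = ["a", "b", "c", "d"]

top = update_position(prefix_beams, suffix_beams, vocab, toy_score, K=2)
```

Here 2 prefixes, 4 tokens, and 2 suffixes give 16 scored candidates for one position, which is the source of the extra cost relative to unidirectional beam search.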

3.4 Bidirectional CTC Decoding

For CTC, BiBS employs a greedy CTC decoding output as a noisy estimate of the future label sequence. During prefix beam search, the current forward LM state is fused with the corresponding backward LM state precomputed from the approximated suffix, with the joint (forward + backward) LM score added to the acoustic model’s log-probabilities (Jung et al., 2021).
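
A rough Python sketch of the two ingredients described above follows: a greedy CTC pass that produces the noisy future estimate, and a fused score that adds the joint LM score to the acoustic log-probability. The fusion weight `lm_weight` and the callables `fw_lm` and `bw_lm` are hypothetical placeholders (the source gives neither a weight nor an LM interface), and full prefix beam search with blank and non-blank bookkeeping is not shown.

```python
import math

def greedy_ctc(frame_logps, blank=0):
    """Per-frame argmax, then collapse repeats and drop blanks."""
    path = [max(range(len(f)), key=f.__getitem__) for f in frame_logps]
    out, prev = [], None
    for lab in path:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return out

def fused_score(am_logp, fw_lm_logp, bw_lm_logp, lm_weight=0.5):
    """Acoustic log-probability plus a weighted (forward + backward) LM score."""
    return am_logp + lm_weight * (fw_lm_logp + bw_lm_logp)

def rank_extensions(am_logps, prefix, future, fw_lm, bw_lm, lm_weight=0.5):
    """Order candidate labels for extending `prefix`, best first.

    `future` is the approximated suffix taken from the greedy output.
    """
    scored = {
        lab: fused_score(am, fw_lm(prefix, lab), bw_lm(future, lab), lm_weight)
        for lab, am in am_logps.items()
    }
    return sorted(scored, key=scored.get, reverse=True)

# Toy frames over labels {0: blank, 1, 2}; values are log-probabilities.
frames = [[-2.0, -0.1, -3.0], [-2.0, -0.2, -3.0], [-0.1, -3.0, -3.0],
          [-3.0, -0.2, -2.0], [-3.0, -2.0, -0.1]]
greedy = greedy_ctc(frames)

# Made-up LMs: the forward LM is uninformative, the backward LM prefers
# label 2 whenever the estimated future starts with label 1.
fw_lm = lambda prefix, lab: math.log(0.5)
bw_lm = lambda future, lab: math.log(0.8 if lab == 2 and future[:1] == [1] else 0.2)

am_logps = {1: math.log(0.5), 2: math.log(0.5)}
ranking = rank_extensions(am_logps, prefix=[1], future=[1, 2], fw_lm=fw_lm, bw_lm=bw_lm)
```

In this toy case the acoustic model cannot separate the two labels, so the backward LM's view of the estimated future decides the ranking.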

4. Computational Complexity and Resource Analysis

The added complexity of BiBS relative to standard beam search arises primarily from candidate set expansion and the maintenance/coordination of multiple beams:

Standard Beam Search: $O(K \cdot |V| \cdot T)$ candidate scorings per pass, for beam width $K$, vocabulary size $|V|$, and output length $T$.
Coordinate-Descent BiBS: on the order of $K^2 \cdot |V|$ candidate scorings per time step in each round, due to the cross-beam Cartesian product of prefix beams, next tokens, and suffix beams (Sun et al., 2017).
BiBS with Suffix Presearch: one backward presearch plus one forward beam search, i.e., roughly twice the cost of unidirectional search (Al-Sabahi et al., 2018).
Synchronous Bidirectional Beam Search: candidate scoring for both directions at each step, plus the overhead of full cross-attention over the opposite direction’s states in the LSTM case (Zhang et al., 2019).
Bidirectional LM Fusion for CTC: one backward LM pass over the greedy sequence plus standard prefix beam search, yielding a minor wall-clock overhead (Jung et al., 2021).

In all cases, space complexity is dominated by the need to store beam hypotheses and their associated RNN/Transformer hidden states, which grows with beam width and sequence length in each direction.
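
To make the relative costs concrete, the following Python sketch counts candidate scorings under simple assumptions: beam width `K`, vocabulary size `V`, output length `T`, one scoring per (hypothesis, token) pair in unidirectional search, and one scoring per (prefix, token, suffix) triple in the coordinate-descent variant. The function names and the example sizes are illustrative; actual costs depend on the model and implementation.

```python
def standard_beam_candidates(K, V, T):
    """One unidirectional pass: K hypotheses x V tokens x T steps."""
    return K * V * T

def presearch_bibs_candidates(K, V, T):
    """One backward presearch plus one forward beam search."""
    return 2 * standard_beam_candidates(K, V, T)

def coordinate_bibs_candidates(K, V, T, rounds=1):
    """Each step crosses K prefixes, V tokens, and K suffixes, per round."""
    return rounds * K * V * K * T

K, V, T = 5, 10_000, 20
baseline = standard_beam_candidates(K, V, T)
overhead = coordinate_bibs_candidates(K, V, T, rounds=2) / baseline
```

Under these assumptions the coordinate-descent overhead relative to the baseline is the beam width times the number of rounds, which is why such variants are usually run with moderate beams.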

5. Empirical Results and Applications

Empirical evaluations across tasks consistently demonstrate that BiBS and related bidirectional inference strategies yield improved sequence-level metrics over standard unidirectional beam search.

Fill-in-the-Blank Image Captioning (COCO, Visual Madlibs): BiBS achieves higher CIDEr, BLEU-4, and METEOR scores compared to unidirectional and naive bidirectional baselines. For instance, on COCO (r=50%, B=5, M=4): BiBS obtained CIDEr=4.26, BLEU-4=0.408, METEOR=0.368 versus URNN-f+b CIDEr=4.15 (Sun et al., 2017).
Abstractive Summarization (CNN/DailyMail, DUC-2004): Bidirectional beam search models outperform state-of-the-art unidirectional models when the weighting parameter $\gamma$ is tuned (Al-Sabahi et al., 2018). For example, BI‐RNMT produced ROUGE-1/2/L scores of 29.05/10.90/26.05 on DUC-2004 versus 28.22/10.21/25.14 for standard RNMT (Zhang et al., 2019).
Neural Machine Translation (NIST, WMT14): Synchronous BiBS yielded substantial BLEU gains, e.g. Transformer baseline BLEU 47.19 (MT03-06 avg) versus BIFT 51.11 (Δ+3.92) (Zhang et al., 2019).
Neural Response Generation: Both BidiS and BidiA strategies on dialog corpora provide nontrivial BLEU-4 and diversity gains, especially at large beam sizes (e.g., Cornell Movie Dialog BLEU-4: VBS 1.30, BidiA 1.42 at B=50) (Colombo et al., 2021).
Speech Recognition (CTC Decoding, Librispeech): BiBS reduces character error rate at the start of utterances by >20% relative and overall by up to 8% compared to strong unidirectional baselines (Jung et al., 2021).

6. Model Architectures, Training, and Interaction Mechanisms

Model Architecture: All BiBS frameworks require bidirectional models: bidirectional encoder-decoders (forward and backward LSTM, GRU, or self-attention decoders) (Al-Sabahi et al., 2018), independently trained forward and reverse sequence decoders (Colombo et al., 2021), or forward and backward LMs for LM-augmented tasks such as CTC decoding (Jung et al., 2021).
Decoder Initialization: In architectures with bidirectional encoders and decoders, the forward decoder initializes from the backward encoder’s final state, and vice versa (Al-Sabahi et al., 2018).
Score Aggregation: Forward and backward contributions are linearly combined via a balancing parameter ($\gamma$), or combined post-hoc via reranking (Al-Sabahi et al., 2018; Colombo et al., 2021).
LM Fusion (CTC context): Backward LM context is computed from a noisy future obtained with a greedy CTC pass. A future-shifting mechanism and synthetic noise ensure robustness to imperfect future estimates (Jung et al., 2021).
Cross-Attention Mechanisms: Synchronous BiBS exploits cross-directional attention between hidden states at each time step, with both LSTM/GRU and Transformer decoders implementing specialized attention/fusion structures (Zhang et al., 2019).

7. Design Principles, Hyperparameter Tuning, and Limitations

Beam Width ($K$): Typical values 4–10; larger beams provide more hypothesis diversity at linear runtime cost (Al-Sabahi et al., 2018; Zhang et al., 2019).
Score Weighting Parameter ($\gamma$): Intermediate values in $(0, 1)$ balance past and future; $\gamma = 1$ recovers purely forward decoding (Al-Sabahi et al., 2018).
Sequence Length Constraint: $T$ controls output length; can be used to enforce compression ratios in summarization (Al-Sabahi et al., 2018).
Length Normalization: Standard length-normalization exponents can correct biases toward very short or very long outputs.
Bidirectional LM Training (CTC): Future-noising, with a shifted future context and corruption matching the error profile of the greedy future, is essential for robust performance (Jung et al., 2021).
Scaling Limitations: All BiBS approaches incur higher computational and memory cost than standard beam search, e.g., the cross-beam Cartesian-product expansion in coordinate-descent variants and the need to cache beams from both directions (Sun et al., 2017). Many variants are practical only for moderate beam sizes.
Approximation Quality: BiBS does not recover the exact MAP assignment for the full bidirectional model; every algorithmic form (coordinate descent, additive scoring, synchronous expansion) is limited by the quality of its approximation. However, empirical convergence is rapid (typically 1–4 passes) and is reported to yield higher joint log-probabilities and better output coherence (Sun et al., 2017).
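
The length-normalization item in the list above can be illustrated with a common scheme in which a cumulative log-probability is divided by the hypothesis length raised to an exponent. The exponent name `alpha`, its default, and the toy scores below are assumptions for illustration; the source does not specify which normalization the cited systems use.

```python
def length_normalized(logp, length, alpha=1.0):
    """Divide a cumulative log-probability by length ** alpha.

    alpha = 0 leaves the raw score unchanged; larger alpha favours
    longer hypotheses more strongly.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    return logp / (length ** alpha)

def pick(hyps, alpha):
    """Return the hypothesis with the best normalized score."""
    return max(hyps, key=lambda h: length_normalized(h["logp"], len(h["tokens"]), alpha))

# Toy hypotheses: the short one has the higher raw log-probability.
hyps = [
    {"tokens": ["ok", "."], "logp": -4.0},
    {"tokens": ["that", "sounds", "fine", "to", "me", "."], "logp": -6.0},
]

raw_choice = pick(hyps, alpha=0.0)
normalized_choice = pick(hyps, alpha=1.0)
```

Without normalization the two-token hypothesis wins on raw score; with `alpha=1.0` the per-token average favours the longer one, showing how the exponent counteracts the bias toward short outputs.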

Table: Summary of Key BiBS Algorithms

Algorithm	Core Mechanism	Cost (per pass)
Coord. BiBS	Forward/backward alt. beam expansion	Cross-beam Cartesian product of prefix beams, next tokens, and suffix beams at each step
Synchronous BS	Parallel L2R/R2L with cross-attention	Candidate scoring in both directions plus cross-attention overhead
BidiS/BidiA	Forward/reverse reranking/agreement	Two independent beam searches (plus similarity-based agreement for BidiA)
CTC BiBS	Prefix beam + future-augmented BiLM	Standard prefix beam search plus one backward LM pass over the greedy sequence