
Local Bidirectional Attention (LBA)

Updated 3 January 2026
  • Local Bidirectional Attention (LBA) is an attention mechanism that integrates local sliding-window processing with bidirectional latent synthesis to capture both short-range and global dependencies.
  • It employs local and cross-attention blocks to efficiently model sequential data and multimodal correspondences, reducing computational complexity compared to full self-attention.
  • Empirical studies demonstrate LBA’s effectiveness in boosting accuracy for long-range text parsing and enhancing cross-lingual style transfer in automatic dubbing.

Local Bidirectional Attention (LBA) is an attention mechanism that synthesizes local and bidirectional context, enabling efficient and expressive modeling of sequential data and multimodal word-level correspondences. It is employed in high-efficiency long-range parsers such as BLRP for textual and vision data (Leotescu et al., 2024), as well as in cross-lingual style transfer frameworks for automatic dubbing (Li et al., 2023). LBA constructs local sliding-window or local cross-lingual attention blocks and aggregates bidirectional latent representations, capturing both short-range dependencies and global structure with computational scalability.

1. Formal Definition and Core Principles

LBA builds on the principle of local attention, restricting each input segment’s receptive field to a short neighborhood, and bidirectional passes, whereby local features are synthesized in both forward and backward directions and merged through a global latent representation. This is instantiated in BLRP by partitioning the input $X \in \mathbb{R}^{N \times d}$ into $T$ non-overlapping segments $X_1, \dots, X_T$ of length $t$ and constructing, for each segment $i$, a window $W_i$ containing $w$ tokens from adjacent segments:

$W_i = [X_{i-1}^{(s)};\, X_i;\, X_{i+1}^{(p)}]$

where $X_{i-1}^{(s)}$ denotes the last $\lfloor w/2 \rfloor$ tokens of the previous segment and $X_{i+1}^{(p)}$ the first $\lfloor w/2 \rfloor$ tokens of the next segment. Standard scaled dot-product self-attention is applied within each window.
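As a concrete illustration, the following is a minimal NumPy sketch of this window construction; the helper name get_window and the behavior at the sequence edges (simply dropping the missing neighbor) are assumptions made for illustration.

import numpy as np

def get_window(segments, i, w):
    # Sketch of W_i = [X_{i-1}^{(s)}; X_i; X_{i+1}^{(p)}]: the segment plus
    # floor(w/2) boundary tokens from each neighbor (edge segments omit the missing side).
    half = w // 2
    parts = []
    if i > 0:
        parts.append(segments[i - 1][-half:])   # last floor(w/2) tokens of the previous segment
    parts.append(segments[i])                   # the segment itself
    if i < len(segments) - 1:
        parts.append(segments[i + 1][:half])    # first floor(w/2) tokens of the next segment
    return np.concatenate(parts, axis=0)

# Example: 12 tokens of dimension 4, segment length t = 4, boundary budget w = 2
X = np.random.randn(12, 4)
segments = np.split(X, 3)            # three (4, 4) segments
W_1 = get_window(segments, 1, 2)     # shape (6, 4): 1 + 4 + 1 tokens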

In cross-lingual sequence modeling (Li et al., 2023), LBA takes two word-level feature matrices (source and target sequences), projects keys and values for each, and computes a shared bidirectional attention matrix:

$A_{1 \rightarrow 2} = f_1(s_1^{t})\, f_2(s_2^{t})^\top$

$A_{2 \rightarrow 1} = A_{1 \rightarrow 2}^\top$

Row-wise softmax yields attention weights in both directions, supporting simultaneous fusion of multimodal features across languages.

2. Mathematical Formulation

BLRP Local Sliding-Window Attention

For each segment ii, queries, keys, and values are:

  • $Q_i = q^S(X_i) \in \mathbb{R}^{t \times d}$
  • $K_i = k^S(W_i) \in \mathbb{R}^{w \times d}$
  • $V_i = v^S(W_i) \in \mathbb{R}^{w \times d}$

The local attention output is:

$\Theta^{\rm SELF}(X_i) = \operatorname{softmax}\Bigl(\frac{Q_i K_i^\top}{\sqrt d}\Bigr) V_i$

with each pre-softmax element:

$s_{j,m} = \frac{1}{\sqrt d}\, q^S(x_{(i-1)t + j}) \cdot k^S(w_m)$

Softmax normalization over window positions $m$ yields attention weights $\alpha_{j,m}$:

$\bigl[\Theta^{\rm SELF}(X_i)\bigr]_{j,:} = \sum_{m=1}^{w} \alpha_{j,m}\, v^S(w_m)$
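A minimal NumPy sketch of this per-segment windowed attention follows; the linear projections standing in for $q^S$, $k^S$, $v^S$ and the toy shapes are assumptions for illustration, not the paper's exact parameterization.

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def local_self_attention(X_i, W_i, Wq, Wk, Wv):
    # Theta^SELF for one segment: queries come from the segment,
    # keys and values come from its local window.
    d = Wq.shape[1]
    Q = X_i @ Wq                        # (t, d)
    K = W_i @ Wk                        # (window, d)
    V = W_i @ Wv                        # (window, d)
    scores = Q @ K.T / np.sqrt(d)       # pre-softmax scores s_{j,m}
    alpha = softmax(scores, axis=-1)    # weights alpha_{j,m} over window positions m
    return alpha @ V                    # (t, d): one attended row per segment token

# Example: a 4-token segment attending over a 6-token window, d = 8
rng = np.random.default_rng(0)
X_i, W_i = rng.normal(size=(4, 8)), rng.normal(size=(6, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = local_self_attention(X_i, W_i, Wq, Wk, Wv)   # shape (4, 8)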

Bidirectional Latent-Space Synthesis

A latent block $L \in \mathbb{R}^{l \times d}$ is updated via two passes:

  • Forward: $L_i^F = \Theta_L^{\rm CROSS}\bigl[L_{i-1}^F,\ [\tilde X_i^F ; L^{\rm INIT}]\bigr]$
  • Backward: $L_i^B = \Theta_L^{\rm CROSS}\bigl[L_{i+1}^B,\ [\tilde X_i^B ; L^{\rm INIT}]\bigr]$

Each cross-attention step uses scaled-dot-product attention analogous to self-attention.
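As a hedged sketch, one such latent update might look as follows, with the latent block providing the queries and the concatenated context $[\tilde X_i ; L^{\rm INIT}]$ providing keys and values; the single-matrix projections and toy sizes are assumptions, not the paper's exact parameterization.

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def latent_cross_attention(L_prev, context, Wq, Wk, Wv):
    # Theta_L^CROSS sketch: the previous latent block attends over the context
    # and is replaced by the attended summary.
    d = Wq.shape[1]
    Q = L_prev @ Wq                     # (l, d)
    K = context @ Wk                    # (l + t, d)
    V = context @ Wv                    # (l + t, d)
    return softmax(Q @ K.T / np.sqrt(d)) @ V   # updated latent block, (l, d)

# Forward step i: L_i^F = CrossAttn(L_{i-1}^F, [X_tilde_i^F ; L_init])
rng = np.random.default_rng(1)
l, t, d = 4, 4, 8
L_prev, X_tilde, L_init = (rng.normal(size=(n, d)) for n in (l, t, l))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
L_next = latent_cross_attention(L_prev, np.concatenate([X_tilde, L_init]), Wq, Wk, Wv)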

Cross-Lingual Bidirectional Attention

Local bidirectional attention between source and target sequences uses two sets of keys/values, $[f_1(s_1^t), s_1^m]$ and $[f_2(s_2^t), s_2^m]$:

  • $A_{1 \rightarrow 2}$ yields the source $\rightarrow$ target alignment.
  • $W_{1 \rightarrow 2} = \operatorname{softmax}(A_{1 \rightarrow 2})$
  • Summarized outputs $O_{1 \rightarrow 2} = W_{1 \rightarrow 2}^\top s_1^m$ and $O_{2 \rightarrow 1} = W_{2 \rightarrow 1}^\top s_2^m$ are concatenated with textual features and projected to generate predicted local style tokens (see the sketch after this list).
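The following is a minimal NumPy sketch of this cross-lingual computation; treating $f_1$, $f_2$ as plain linear projections, the toy dimensions, and the final concatenate-and-project head are all illustrative assumptions.

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(2)
n1, n2, d = 5, 7, 16                                               # source/target word counts, feature dim
s1_t, s2_t = rng.normal(size=(n1, d)), rng.normal(size=(n2, d))    # textual features
s1_m, s2_m = rng.normal(size=(n1, d)), rng.normal(size=(n2, d))    # multimodal (style) features
F1, F2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))          # stand-ins for f_1, f_2

A_12 = (s1_t @ F1) @ (s2_t @ F2).T    # shared alignment matrix, (n1, n2)
A_21 = A_12.T                         # tied reverse-direction alignment
W_12 = softmax(A_12, axis=-1)         # row-wise weights, source -> target
W_21 = softmax(A_21, axis=-1)         # row-wise weights, target -> source

O_12 = W_12.T @ s1_m                  # (n2, d): source style summarized per target word
O_21 = W_21.T @ s2_m                  # (n1, d): target style summarized per source word

# Hypothetical output head: concatenate with textual features and project
P = rng.normal(size=(2 * d, d))
local_style_tokens = np.concatenate([O_12, s2_t], axis=-1) @ P    # (n2, d)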

3. Algorithm and Implementation

Pseudocode for BLRP’s LBA module (the projection and cross-attention helpers q_S, k_S, v_S, Theta_X_cross, and Theta_L_cross are assumed to be defined elsewhere):

def BLRP_LocalBidirectional(X):
    # Split the input into T non-overlapping segments of length t.
    segments = split(X, t)                    # segments[0], ..., segments[T-1]
    L_init_F = Theta_proj_forward(X)          # forward latent initialization
    L_init_B = Theta_proj_backward(X)         # backward latent initialization
    L_F_prev = L_init_F
    L_B_next = L_init_B
    Xtilde_F = [None] * T                     # cache forward features for the backward pass

    # Forward pass over segments
    for i in range(T):
        W_i = get_window(segments, i, w)      # segment i plus boundary tokens of its neighbors
        S_i = softmax((q_S(segments[i]) @ k_S(W_i).T) / sqrt(d)) @ v_S(W_i)   # local self-attention
        Xtilde_F[i] = Theta_X_cross(S_i, L_F_prev)
        L_F_prev = Theta_L_cross(L_F_prev, concat(Xtilde_F[i], L_init_F))

    # Backward pass over segments, reusing the cached forward features
    for i in reversed(range(T)):
        W_i = get_window(segments, i, w)
        S_i = softmax((q_S(segments[i]) @ k_S(W_i).T) / sqrt(d)) @ v_S(W_i)
        Xtilde_i_B = Theta_X_cross(S_i, L_B_next)
        L_B_next = Theta_L_cross(L_B_next, concat(Xtilde_i_B, Xtilde_F[i], L_init_B))

    return L_B_next                           # final latent representation (backward latent at the first segment)

Key implementation recommendations include segment/window size $t = w \approx 100$, latent block size $l = t$, overlapping tokens at segment boundaries, independent forward/backward projection heads, and AdamW optimization with fine-tuned learning rates and hyperparameters (Leotescu et al., 2024).

4. Computational Efficiency

LBA achieves substantial improvements in runtime and memory usage relative to full self-attention. For BLRP:

  • Full self-attention: $O(N^2 d)$ time, $O(N^2)$ memory.
  • Local attention (LBA): for $T = N/t$ segments, each attending over a window $w \approx t$, $O(N t d)$ total time.
  • Cross-attention: $O(N l d)$; if $l = O(t)$, the overall cost is $O(N t d)$, nearly linear in $N$ for fixed $t$.
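As a rough worked example (counting only attention-score entries and assuming they dominate): for $N = 16{,}000$ and $t = w = 100$, full self-attention evaluates $N^2 = 2.56 \times 10^8$ scores, whereas windowed local attention evaluates on the order of $N \cdot w = 1.6 \times 10^6$, roughly a 160-fold reduction before the (likewise near-linear) cross-attention term is added.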

This scaling allows efficient processing of very long sequences ($N$ up to 16k), with only linear memory growth, in contrast to the quadratic blowup of vanilla Transformers (Leotescu et al., 2024).

5. Empirical Performance and Ablation Studies

LBA contributes significant gains in accuracy and modeling fidelity across tasks.

BLRP (Long-Range-Arena Text Benchmarks):

  • ListOps accuracy: 41.43% (Longformer: 37.5%, TLB: 38.2%)
  • Text classification: 82.83% (Longformer: 66.0%, TLB: 82.08%)
  • Retrieval: 83.43% (Longformer: 81.79%, TLB: 76.91%)

Ablations confirm LBA’s effectiveness:

  • Small segment sizes (e.g., $t = 10$) degrade performance (below 38% accuracy).
  • Bidirectional updates outperform unidirectional passes, with ListOps accuracy dropping into the high-39% range for unidirectional variants.
  • Performance scales linearly up to sequence length $N = 16{,}000$ without GPU memory exhaustion (Leotescu et al., 2024).

Automatic Dubbing (Cross-lingual Multi-scale Style Transfer):

  • Mel-spectrogram MSE (en→zh): FastSpeech 2 baseline 4.694, duration-transfer 3.695, multi-scale transfer with LBA 1.392.
  • MOS for style: None 3.16±0.08, duration-only 3.92±0.07, multi-scale (GST+LST via LBA) 4.12±0.07.
  • User preference: Multi-scale LBA solution preferred by 65.5%.
  • Removing local LBA raises mel-MSE from 1.39 to 1.68, confirming its necessity for style fidelity (Li et al., 2023).

6. Architectural Variants and Cross-Task Generalization

LBA enables versatile pattern modeling. In BLRP, short-range dependencies are modeled with sliding windows, while recurrent latent blocks synthesize context bidirectionally. In cross-lingual frameworks, LBA mediates word-level style and text correspondence across languages by simultaneously computing source-to-target and target-to-source attention and enabling joint training objectives.

A shared pair of projection networks and attention blocks enforce a unified semantic-style mapping. Bidirectionality is realized not only by using both sequence orientations, but also by tying the model parameters and alignment matrices across translation directions.

7. Practical Recommendations and Limitations

Select window/segment sizes matched to the task domain, typically $t = w = 100$ for long-text or word-level applications. Latent block size should equal segment length ($l = t$) for optimal accuracy. Use two independent dynamic-projection heads for the forward and backward passes, and include overlapping tokens at segment boundaries to mitigate loss of context. Each layer should comprise two local self-attention operations and two cross-attentions. The AdamW optimizer with batch size 32, embedding size 64, hidden size 128, and 8 attention heads is recommended for LRA tasks; a learning rate of $4 \times 10^{-4}$, $\beta_1 = 0.9$, $\beta_2 = 0.98$, $\epsilon = 10^{-9}$, weight decay 0.8, and linear decay scheduling yield strong convergence (Leotescu et al., 2024).
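A minimal PyTorch configuration sketch reflecting these optimizer settings follows; the placeholder model, the total step count, and the exact shape of the linear decay schedule are assumptions.

import torch
from torch import nn

model = nn.Linear(64, 128)   # placeholder standing in for the BLRP encoder

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=4e-4,                 # 4 x 10^-4
    betas=(0.9, 0.98),
    eps=1e-9,
    weight_decay=0.8,
)

# Linear decay of the learning rate over training (schedule details are assumed)
num_training_steps = 100_000
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: max(0.0, 1.0 - step / num_training_steps),
)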

A plausible implication is that LBA’s local partitioning is sensitive to hyperparameter choices, particularly segment size and overlap. Under-sizing severely limits expressiveness, whereas excessive overlap or overly large windows may erode the method’s scalability. Modelers should balance locality against latent block dimension to avoid performance collapse.


In summary, Local Bidirectional Attention integrates local, context-sensitive modeling with bidirectional latent-space aggregation, demonstrably improving long-sequence parsing efficiency and performance as well as multimodal cross-lingual alignment for tasks requiring rich local structure and global context synthesis (Leotescu et al., 2024, Li et al., 2023).
