Local Bidirectional Attention (LBA)
- Local Bidirectional Attention (LBA) is an attention mechanism that integrates local sliding-window processing with bidirectional latent synthesis to capture both short-range and global dependencies.
- It employs local and cross-attention blocks to efficiently model sequential data and multimodal correspondences, reducing computational complexity compared to full self-attention.
- Empirical studies demonstrate LBA’s effectiveness in boosting accuracy for long-range text parsing and enhancing cross-lingual style transfer in automatic dubbing.
Local Bidirectional Attention (LBA) is an attention mechanism that synthesizes local and bidirectional context, enabling efficient and expressive modeling of sequential data and multimodal word-level correspondences. It is employed in high-efficiency long-range parsers such as BLRP for textual and vision data (Leotescu et al., 2024), as well as in cross-lingual style transfer frameworks for automatic dubbing (Li et al., 2023). LBA constructs local sliding-window or local cross-lingual attention blocks and aggregates bidirectional latent representations, capturing both short-range dependencies and global structure with computational scalability.
1. Formal Definition and Core Principles
LBA builds on the principle of local attention, restricting each input segment’s receptive field to a short neighborhood, and bidirectional passes, whereby local features are synthesized in both forward and backward directions and merged through a global latent representation. This is instantiated in BLRP by partitioning the input into non-overlapping segments $X_1, \dots, X_T$ of length $t$ and constructing, for each segment $X_i$, a window $W_i$ containing $w$ tokens from each adjacent segment:
$$W_i = [\overleftarrow{X}_{i-1};\, X_i;\, \overrightarrow{X}_{i+1}],$$
where $\overleftarrow{X}_{i-1}$ are the last $w$ tokens of the previous segment and $\overrightarrow{X}_{i+1}$ are the first $w$ tokens of the next segment. Standard scaled-dot-product self-attention is applied within each window.
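A minimal sketch of the segmentation and windowing step, assuming NumPy arrays; the helper names (`split_segments`, `get_window`) and the sizes used are illustrative rather than taken from the BLRP implementation:

```python
import numpy as np

def split_segments(X, t):
    """Partition an (n, d) token matrix into n // t non-overlapping segments of length t."""
    n, d = X.shape
    return X[: (n // t) * t].reshape(-1, t, d)

def get_window(segments, i, w):
    """Window for segment i: last w tokens of segment i-1, segment i, first w tokens of segment i+1."""
    parts = []
    if i > 0:
        parts.append(segments[i - 1][-w:])
    parts.append(segments[i])
    if i + 1 < len(segments):
        parts.append(segments[i + 1][:w])
    return np.concatenate(parts, axis=0)

X = np.random.randn(512, 64)              # n = 512 tokens with d = 64 features (illustrative sizes)
segments = split_segments(X, t=128)       # 4 segments of length 128
W_1 = get_window(segments, 1, w=16)       # shape (16 + 128 + 16, 64)
```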
In cross-lingual sequence modeling (Li et al., 2023), LBA takes two word-level feature matrices $F_{\mathrm{src}}$ and $F_{\mathrm{tgt}}$ (source and target sequences), projects keys and values for each, and computes a shared bidirectional attention matrix
$$A = \frac{K_{\mathrm{src}} K_{\mathrm{tgt}}^{\top}}{\sqrt{d}}, \qquad K_{\mathrm{src}} = F_{\mathrm{src}} W_K^{\mathrm{src}}, \quad K_{\mathrm{tgt}} = F_{\mathrm{tgt}} W_K^{\mathrm{tgt}}.$$
Row-wise softmax of $A$ and of $A^{\top}$ yields attention weights in both directions, supporting simultaneous fusion of multimodal features across languages.
2. Mathematical Formulation
BLRP Local Sliding-Window Attention
For each segment $X_i$ with window $W_i$, queries, keys, and values are
$$Q_i = q^{S}(X_i), \qquad K_i = k^{S}(W_i), \qquad V_i = v^{S}(W_i).$$
The local attention output is
$$S_i = A_i V_i,$$
with each pre-softmax element given by the scaled dot product
$$e_{jl} = \frac{q_j \cdot k_l}{\sqrt{d}}$$
for query position $j$ and window position $l$. Softmax normalization over window positions yields the attention weights $A_i$:
$$A_{i,jl} = \frac{\exp(e_{jl})}{\sum_{l'} \exp(e_{jl'})}.$$
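The per-segment computation can be sketched as follows, with random matrices standing in for the learned $q^{S}$, $k^{S}$, $v^{S}$ projections; the function name and sizes are illustrative assumptions:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def local_window_attention(segment, window, Wq, Wk, Wv):
    """Scaled dot-product attention: segment queries over the window's keys and values."""
    d = Wq.shape[1]
    Q, K, V = segment @ Wq, window @ Wk, window @ Wv   # (t, d), (t + 2w, d), (t + 2w, d)
    A = softmax(Q @ K.T / np.sqrt(d), axis=-1)         # attention weights over window positions
    return A @ V                                       # local output S_i of shape (t, d)

d = 64
Wq, Wk, Wv = (np.random.randn(d, d) / np.sqrt(d) for _ in range(3))
segment = np.random.randn(128, d)                      # one segment of t = 128 tokens (illustrative)
window = np.random.randn(160, d)                       # its window of t + 2w tokens, with w = 16
S_i = local_window_attention(segment, window, Wq, Wk, Wv)
```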
Bidirectional Latent-Space Synthesis
A latent block $L$ is updated via two passes over the segments, with $L_0^{F} = L_{\mathrm{init}}^{F}$ and $L_{T+1}^{B} = L_{\mathrm{init}}^{B}$:
- Forward ($i = 1, \dots, T$): $\tilde{X}_i^{F} = \Theta_X^{\mathrm{CROSS}}(S_i, L_{i-1}^{F})$ and $L_i^{F} = \Theta_L^{\mathrm{CROSS}}\!\left(L_{i-1}^{F}, [\tilde{X}_i^{F}; L_{\mathrm{init}}^{F}]\right)$.
- Backward ($i = T, \dots, 1$): $\tilde{X}_i^{B} = \Theta_X^{\mathrm{CROSS}}(S_i, L_{i+1}^{B})$ and $L_i^{B} = \Theta_L^{\mathrm{CROSS}}\!\left(L_{i+1}^{B}, [\tilde{X}_i^{B}; \tilde{X}_i^{F}; L_{\mathrm{init}}^{B}]\right)$.
Each cross-attention step uses scaled-dot-product attention analogous to self-attention.
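To make one such step concrete, the sketch below performs a single simplified forward-style update between a latent block and local features; the shared random projections are a stand-in assumption for the papers' learned $\Theta_X^{\mathrm{CROSS}}$ and $\Theta_L^{\mathrm{CROSS}}$ modules:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries_from, keys_values_from, Wq, Wk, Wv):
    """Scaled dot-product cross-attention: queries from one block, keys/values from the other."""
    d = Wq.shape[1]
    Q, K, V = queries_from @ Wq, keys_values_from @ Wk, keys_values_from @ Wv
    return softmax(Q @ K.T / np.sqrt(d), axis=-1) @ V

d, t, l = 64, 128, 128
Wq, Wk, Wv = (np.random.randn(d, d) / np.sqrt(d) for _ in range(3))
L_prev = np.random.randn(l, d)    # latent block carried over from the previous segment
S_i = np.random.randn(t, d)       # local attention output for segment i

# Local features attend to the latent block (stand-in for Theta_X^CROSS),
# then the latent block attends to the refined features (stand-in for Theta_L^CROSS).
X_tilde_F = cross_attention(S_i, L_prev, Wq, Wk, Wv)
L_next = cross_attention(L_prev, X_tilde_F, Wq, Wk, Wv)
```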
Cross-Lingual Bidirectional Attention
Local bidirectional attention between source and target sequences uses two sets of keys/values, $(K_{\mathrm{src}}, V_{\mathrm{src}})$ and $(K_{\mathrm{tgt}}, V_{\mathrm{tgt}})$, and the shared matrix
$$A = \frac{K_{\mathrm{src}} K_{\mathrm{tgt}}^{\top}}{\sqrt{d}}.$$
- $\mathrm{softmax}(A)$ yields the source$\to$target alignment; $\mathrm{softmax}(A^{\top})$ yields the target$\to$source alignment.
- Summarized outputs $O_{\mathrm{src}\to\mathrm{tgt}} = \mathrm{softmax}(A)\,V_{\mathrm{tgt}}$ and $O_{\mathrm{tgt}\to\mathrm{src}} = \mathrm{softmax}(A^{\top})\,V_{\mathrm{src}}$ are concatenated with the textual features and projected to generate the predicted local style tokens.
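A sketch of this shared bidirectional attention between source and target word-level features; the projection matrices, feature sizes, and pooling are illustrative and not the exact parameterization of Li et al. (2023):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

d = 64
F_src = np.random.randn(20, d)    # source word-level features (20 words, illustrative)
F_tgt = np.random.randn(24, d)    # target word-level features (24 words, illustrative)
Wk_s, Wv_s, Wk_t, Wv_t = (np.random.randn(d, d) / np.sqrt(d) for _ in range(4))

K_src, V_src = F_src @ Wk_s, F_src @ Wv_s
K_tgt, V_tgt = F_tgt @ Wk_t, F_tgt @ Wv_t

A = K_src @ K_tgt.T / np.sqrt(d)          # shared bidirectional attention matrix, shape (20, 24)
O_s2t = softmax(A, axis=-1) @ V_tgt       # source -> target: each source word summarizes target values
O_t2s = softmax(A.T, axis=-1) @ V_src     # target -> source: each target word summarizes source values
# O_s2t and O_t2s would then be concatenated with textual features and projected to local style tokens.
```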
3. Algorithm and Implementation
Pseudocode for BLRP’s LBA module:
```python
def BLRP_LocalBidirectional(X):
    # Partition into T non-overlapping segments of length t
    segments = split(X, t)
    # Initial forward/backward latent blocks from independent projection heads
    L_init_F = Theta_proj_forward(X)
    L_init_B = Theta_proj_backward(X)

    L_F_prev = L_init_F
    S, Xtilde_F = {}, {}

    # Forward pass: left-to-right latent updates
    for i in range(1, T + 1):
        W_i = get_window(segments, i, w)     # segment i plus w boundary tokens per side
        S[i] = softmax((q_S(segments[i]) @ k_S(W_i).T) / sqrt(d)) @ v_S(W_i)
        Xtilde_F[i] = Theta_X_cross(S[i], L_F_prev)
        L_F_prev = Theta_L_cross(L_F_prev, concat(Xtilde_F[i], L_init_F))

    # Backward pass: right-to-left latent updates, reusing the stored forward features
    L_B_next = L_init_B
    for i in range(T, 0, -1):
        Xtilde_B = Theta_X_cross(S[i], L_B_next)
        L_B_next = Theta_L_cross(L_B_next, concat(Xtilde_B, Xtilde_F[i], L_init_B))

    # Final bidirectional latent block L_B_1
    return L_B_next
```
Key implementation recommendations include matched segment and window sizes $t$ and $w$, a latent block whose size equals the segment length, overlapping tokens at segment boundaries, independent forward/backward projection heads, and AdamW optimization with fine-tuned learning rates and hyperparameters (Leotescu et al., 2024).
4. Computational Efficiency
LBA achieves substantial improvements in runtime and memory usage relative to full self-attention. For BLRP:
- Full self-attention: $O(n^2)$ time and $O(n^2)$ memory in the sequence length $n$.
- Local attention (LBA): for $T = n/t$ segments, each attending a window of $t + 2w$ tokens, total $O\!\left(T \cdot t(t + 2w)\right) = O\!\left(n(t + 2w)\right)$ time.
- Cross-attention: $O(T \cdot t \cdot l)$ for a latent block of size $l$; if $l = t$, this is $O(nt)$, so the overall cost remains nearly linear in $n$ for fixed $t$ and $w$.
This scaling allows efficient processing of very long sequences ($n$ up to 16k) with only linear memory growth, in contrast to the quadratic blowup of vanilla Transformers (Leotescu et al., 2024).
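As a rough back-of-the-envelope check, the sketch below counts attention score entries for full self-attention versus windowed local attention; the segment length and overlap used are illustrative, not values reported by the authors:

```python
# Rough operation counts (attention score entries), ignoring the feature dimension.
n, t, w = 16_384, 128, 16                      # sequence length, segment length, boundary overlap (illustrative)
T = n // t                                     # number of segments

full_self_attention = n * n                    # ~2.7e8 score entries
local_attention = T * t * (t + 2 * w)          # ~2.6e6 entries: each segment attends only to its window
print(full_self_attention / local_attention)   # ~100x fewer score computations
```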
5. Empirical Performance and Ablation Studies
LBA contributes significant gains in accuracy and modeling fidelity across tasks.
BLRP (Long-Range-Arena Text Benchmarks):
- ListOps accuracy: 41.43% (Longformer: 37.5%, TLB: 38.2%)
- Text classification: 82.83% (Longformer: 66.0%, TLB: 82.08%)
- Retrieval: 83.43% (Longformer: 81.79%, TLB: 76.91%)
Ablations confirm LBA’s effectiveness:
- Overly small segment sizes degrade performance (accuracy drops to roughly 38%).
- Bidirectional updates outperform unidirectional passes: ListOps accuracy drops into the high-39% range for unidirectional variants.
- Linear scaling up to 16k-token sequence lengths without GPU memory exhaustion (Leotescu et al., 2024).
Automatic Dubbing (Cross-lingual Multi-scale Style Transfer):
- Mel-spectrogram MSE (en→zh): FastSpeech 2 baseline 4.694, duration-transfer 3.695, multi-scale transfer with LBA 1.392.
- MOS for style: None 3.16±0.08, duration-only 3.92±0.07, multi-scale (GST+LST via LBA) 4.12±0.07.
- User preference: Multi-scale LBA solution preferred by 65.5%.
- Removing local LBA raises mel-MSE from 1.39 to 1.68, confirming its necessity for style fidelity (Li et al., 2023).
6. Architectural Variants and Cross-Task Generalization
LBA enables versatile pattern modeling. In BLRP, short-range dependencies are modeled with sliding windows, while recurrent latent blocks synthesize context bidirectionally. In cross-lingual frameworks, LBA mediates word-level style and text correspondence across languages by simultaneously computing source-to-target and target-to-source attention and enabling joint training objectives.
A shared pair of projection networks and attention blocks enforces a unified semantic-style mapping. Bidirectionality is realized not only by using both sequence orientations, but also by tying the model parameters and alignment matrices across translation directions.
7. Practical Recommendations and Limitations
Select window and segment sizes matched to the task domain, whether long-text or word-level applications. The latent block size should equal the segment length ($l = t$) for optimal accuracy. Use two independent dynamic-projection heads for the forward and backward passes, and include overlapping tokens at segment boundaries to mitigate loss of context. Each layer should consist of two local self-attention operations and two cross-attention operations. An AdamW optimizer with batch size 32, embedding size 64, hidden size 128, and 8 attention heads is recommended for LRA tasks; a learning rate and AdamW hyperparameters tuned per task, weight decay 0.8, and linear decay scheduling yield strong convergence (Leotescu et al., 2024).
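A minimal configuration sketch reflecting the settings above, using PyTorch's AdamW and a linear-decay scheduler; the learning rate, decay horizon, and the `model` stand-in are placeholders, since the exact published values are not reproduced here:

```python
import torch

config = {
    "batch_size": 32,
    "embedding_size": 64,
    "hidden_size": 128,
    "num_heads": 8,
    "weight_decay": 0.8,
    "lr": 1e-4,               # placeholder: tune per task; the exact LRA value is not reproduced here
}

model = torch.nn.Linear(config["embedding_size"], config["hidden_size"])  # stand-in for an LBA model
optimizer = torch.optim.AdamW(model.parameters(),
                              lr=config["lr"],
                              weight_decay=config["weight_decay"])
scheduler = torch.optim.lr_scheduler.LinearLR(optimizer,
                                              start_factor=1.0,
                                              end_factor=0.0,
                                              total_iters=10_000)  # placeholder decay horizon
```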
A plausible implication is that LBA’s local partitioning is sensitive to hyperparameter choices, particularly segment size and overlap. Under-sizing severely limits expressiveness, whereas excessive overlap or overly large windows may erode the method’s scalability. Modelers should balance locality against latent block dimension to avoid performance collapse.
In summary, Local Bidirectional Attention integrates local, context-sensitive modeling with bidirectional latent-space aggregation, demonstrably improving long-sequence parsing efficiency and performance as well as multimodal cross-lingual alignment for tasks requiring rich local structure and global context synthesis (Leotescu et al., 2024, Li et al., 2023).