Bi-Directional Block Self-Attention
- Bi-Directional Block Self-Attention is a mechanism that partitions input into blocks to capture both local and long-range dependencies via bidirectional context propagation.
- It employs intra-block and inter-block self-attention with masked permutations and fusion gates to balance efficiency and model expressiveness.
- Empirical models like Bi-BloSAN, BlockBERT, and NBSA demonstrate competitive performance in NLP and vision tasks with significant gains in memory and speed.
Bi-directional block self-attention refers to a class of self-attention mechanisms that exploit blockwise partitioning of the input and introduce mechanisms for context propagation in both forward and backward directions. This design is motivated by the need to capture long-range dependencies while significantly reducing memory and compute costs compared to full global self-attention. Bi-directional block self-attention formalizes context exchange within small local regions (blocks) while enabling either explicit or implicit bidirectional message passing between blocks. This approach underlies models such as Bi-BloSAN for sequence modeling (Shen et al., 2018), BlockBERT for long-document transformers (Qiu et al., 2019), and nested-block architectures employed in vision and medical imaging (Veeraraghavan et al., 2021).
1. Formal Definition and Core Mechanisms
Nearly all blockwise self-attention architectures begin by partitioning an input sequence or tensor into non-overlapping, contiguous blocks of fixed size. Attention is then made sparse: most query-key pairs are masked, and only those permitted by intra- or inter-block rules are computed.
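Let $X = (x_1, \dots, x_N)$ denote the input sequence of length $N$ (for images, an analogous tensor applies). Splitting $X$ into $n$ blocks of size $N/n$, define block indices $\{1, \dots, n\}$, and let $b(i) = \lceil i\,n / N \rceil$ be the index of the block containing position $i$.
Blockwise Sparse Attention: For each attention head $h$, construct a block adjacency permutation $\pi_h$ of $\{1, \dots, n\}$. The binary mask matrix $M^{(h)} \in \{0, 1\}^{N \times N}$ is defined as:
$$M^{(h)}_{ij} = \begin{cases} 1, & \pi_h(b(i)) = b(j), \\ 0, & \text{otherwise.} \end{cases}$$
Self-attention for head $h$ then is:
$$\mathrm{Attn}_h(Q_h, K_h, V_h) = \operatorname{softmax}_{M^{(h)}}\!\left(\frac{Q_h K_h^{\top}}{\sqrt{d}}\right) V_h,$$
where $d$ is the per-head key dimension and $\operatorname{softmax}_{M^{(h)}}$ normalizes each row over only those entries with $M^{(h)}_{ij} = 1$, so masked pairs receive zero weight.
Bidirectionality in context propagation is realized by assigning different heads or passes to different permutations, e.g., identity (self), forward shift, backward shift, or raster-scan.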
Intra-block attention operates within a block, so each token in block $k$ attends to all tokens in that same block. Inter-block attention allows tokens to attend to representatives or contexts from other blocks.
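The permutation-based masking rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from any of the cited papers: it assumes 0-indexed blocks of equal size, a single head, and no learned query/key/value projections, and the names `block_mask` and `masked_attention` are hypothetical.

```python
import numpy as np

def block_mask(seq_len, n_blocks, perm):
    """Boolean (seq_len, seq_len) mask: query i may attend to key j
    iff perm[block(i)] == block(j). Blocks are 0-indexed in this sketch."""
    if seq_len % n_blocks != 0:
        raise ValueError("this sketch assumes seq_len is divisible by n_blocks")
    block_size = seq_len // n_blocks
    block_of = np.arange(seq_len) // block_size      # block index of each token
    target = np.asarray(perm)[block_of]              # block each query attends to
    return target[:, None] == block_of[None, :]

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention; disallowed pairs get zero weight."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
N, n, d = 8, 4, 16
x = rng.standard_normal((N, d))

identity = [0, 1, 2, 3]        # intra-block head: each block attends to itself
forward_shift = [1, 2, 3, 0]   # inter-block head: each block attends to the next

out, w = masked_attention(x, x, x, block_mask(N, n, forward_shift))
```

Because every permutation maps each block to exactly one target block, each query row keeps `N / n` admissible keys, so the softmax is always well defined.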
2. Principal Algorithms and Variants
2.1 Bi-Directional Block Self-Attention Network (Bi-BloSAN)
Bi-BloSAN employs a two-stage strategy:
- Intra-block self-attention: Each block is processed using a masked (temporal) self-attention over its $r$ tokens (with $r$ the block length), using forward or backward masks to encode sequence order.
- Inter-block self-attention: Each block output is compressed (e.g., via pooled source-to-token attention) to a vector, then these block vectors interact via a second masked self-attention, capturing long-range dependencies.
- Fusion: A gating mechanism fuses intra- and inter-block (local/global) information at each token for the final encoding.
Directionality is handled by running two parallel modules, one with forward and one with backward masks, then concatenating their outputs to yield full bidirectional context at the token level. Bi-BloSAN uses feature-level multi-dimensional attention and enables highly parallel computation (Shen et al., 2018).
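The two-stage strategy and the forward/backward concatenation can be sketched as follows. This is a deliberately simplified illustration under several assumptions that differ from the published model: attention is token-level rather than feature-level multi-dimensional, blocks are compressed by mean pooling instead of source-to-token attention, the masks keep the diagonal so that no row is empty, and the gate weights (`w_gate_fw`, `w_gate_bw`) are random placeholders. All function names are hypothetical.

```python
import numpy as np

def attend(x, mask):
    """Self-attention over the rows of x, restricted by a boolean mask."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return (weights / weights.sum(axis=-1, keepdims=True)) @ x

def directional_mask(length, direction):
    """Forward: position i sees j <= i. Backward: position i sees j >= i."""
    lower = np.tril(np.ones((length, length), dtype=bool))
    return lower if direction == "forward" else lower.T

def block_pass(x, block_len, direction, w_gate):
    """One directional pass; assumes len(x) is divisible by block_len."""
    n_tokens, d = x.shape
    blocks = x.reshape(n_tokens // block_len, block_len, d)
    # Stage 1: masked self-attention inside each block.
    tok_mask = directional_mask(block_len, direction)
    local = np.stack([attend(b, tok_mask) for b in blocks])
    # Compress each block to one vector (mean pooling is an assumption here).
    summaries = local.mean(axis=1)
    # Stage 2: masked self-attention across block vectors.
    context = attend(summaries, directional_mask(len(blocks), direction))
    context_tok = np.repeat(context[:, None, :], block_len, axis=1)
    # Fusion: a sigmoid gate mixes local and block-level context per feature.
    gate_in = np.concatenate([local, context_tok], axis=-1)
    gate = 1.0 / (1.0 + np.exp(-(gate_in @ w_gate)))
    fused = gate * local + (1.0 - gate) * context_tok
    return fused.reshape(n_tokens, d)

def bi_block_encode(x, block_len, w_gate_fw, w_gate_bw):
    """Run forward and backward passes and concatenate their outputs."""
    fw = block_pass(x, block_len, "forward", w_gate_fw)
    bw = block_pass(x, block_len, "backward", w_gate_bw)
    return np.concatenate([fw, bw], axis=-1)

rng = np.random.default_rng(0)
x = rng.standard_normal((12, 8))            # 12 tokens, 3 blocks of length 4
w_fw = rng.standard_normal((16, 8))
w_bw = rng.standard_normal((16, 8))
encoded = bi_block_encode(x, 4, w_fw, w_bw)  # shape (12, 16)
```

In this sketch the first half of each output vector carries context from earlier blocks only and the second half from later blocks only, which is why the two are concatenated rather than averaged.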
2.2 Bidirectional Blockwise Attention in Transformers (BlockBERT)
BlockBERT restricts attention via binary block masks, but achieves bidirectionality by assigning each attention head to a different shifted permutation of the block indices (a cyclic shift of the block index modulo the number of blocks). Tokens thus acquire context from their own block, subsequent blocks, and prior blocks through the ensemble of heads, so the aggregate context available at each position is fully bidirectional despite the sparsity (Qiu et al., 2019).
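To make the head-to-permutation assignment concrete, the sketch below builds one block-level adjacency matrix per head and takes their union. The particular shifts (identity, next block, previous block) and the helper names are assumptions chosen for illustration, not the exact configuration reported for BlockBERT.

```python
import numpy as np

def shifted_permutation(n_blocks, shift):
    """Cyclic shift of block indices: block k attends to block (k + shift) mod n."""
    return (np.arange(n_blocks) + shift) % n_blocks

def head_block_adjacency(n_blocks, shift):
    """Boolean (n_blocks, n_blocks) matrix with one allowed target per block."""
    adjacency = np.zeros((n_blocks, n_blocks), dtype=bool)
    adjacency[np.arange(n_blocks), shifted_permutation(n_blocks, shift)] = True
    return adjacency

n_blocks = 4
head_shifts = [0, 1, -1]   # hypothetical assignment: self, next block, previous block

coverage = np.zeros((n_blocks, n_blocks), dtype=bool)
for shift in head_shifts:
    coverage |= head_block_adjacency(n_blocks, shift)

print(coverage.astype(int))
```

Each head touches only `n_blocks` of the `n_blocks ** 2` block pairs, yet the union over heads reaches both earlier and later blocks. With three blocks these three shifts would cover every block pair; with four, the block two positions away is reached only by stacking layers.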
2.3 Nested-Block Self-Attention (NBSA) with Bidirectional Stream
NBSA is a plug-in module for dense prediction in images. Each feature map is partitioned into a grid of fixed-size blocks. The process is:
- Intra-block attention within each block;
- Forward raster-scan inter-block attention (e.g., top-left to bottom-right), where each block aggregates context from its predecessors;
- Backward raster-scan attention (reverse order), propagating context from successors;
- The outputs of the three pathways are fused (via 1×1 convolution or sum) for further processing (Veeraraghavan et al., 2021).
This dual-direction sweep allows context to route around local corruptions or missing regions and provides robustness in segmentation tasks.
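The three pathways listed above can be sketched roughly as follows. This is not the authors' implementation: it assumes square blocks that tile the feature map exactly, uses mean-pooled block summaries as the inter-block context (an assumption), lets each block also see itself in both sweeps so that no attention row is empty, and fuses by summation. The helper names are hypothetical.

```python
import numpy as np

def to_blocks(feat, block):
    """(H, W, C) -> (n_blocks, block * block, C), blocks in raster order."""
    H, W, C = feat.shape
    gh, gw = H // block, W // block
    tiles = feat.reshape(gh, block, gw, block, C).transpose(0, 2, 1, 3, 4)
    return tiles.reshape(gh * gw, block * block, C)

def from_blocks(blocks, H, W, block):
    """Inverse of to_blocks."""
    gh, gw = H // block, W // block
    tiles = blocks.reshape(gh, gw, block, block, -1).transpose(0, 2, 1, 3, 4)
    return tiles.reshape(H, W, -1)

def attend(x, mask=None):
    """Self-attention over the rows of x, optionally restricted by a mask."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return (weights / weights.sum(axis=-1, keepdims=True)) @ x

def nested_block_pass(feat, block):
    H, W, _ = feat.shape
    blocks = to_blocks(feat, block)
    # Pathway 1: attention inside each block.
    intra = np.stack([attend(b) for b in blocks])
    summaries = intra.mean(axis=1)               # one vector per block (assumption)
    order = np.arange(len(blocks))               # raster-scan index of each block
    # Pathway 2: forward sweep, block i sees blocks j <= i.
    forward = attend(summaries, order[None, :] <= order[:, None])
    # Pathway 3: backward sweep, block i sees blocks j >= i.
    backward = attend(summaries, order[None, :] >= order[:, None])
    # Fuse the three pathways by summation and restore the spatial layout.
    fused = intra + forward[:, None, :] + backward[:, None, :]
    return from_blocks(fused, H, W, block)

rng = np.random.default_rng(0)
feature_map = rng.standard_normal((4, 6, 3))     # H=4, W=6, C=3
refined = nested_block_pass(feature_map, block=2)
```

A learned 1×1 convolution could replace the sum in the fusion step; summation is used here only to keep the sketch free of trainable parameters.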
3. Computational Complexity and Scalability
All bi-directional block self-attention architectures are motivated by the $O(N^2)$ time and space bottleneck of global self-attention over an input of length $N$.
- For blockwise models with $n$ blocks and block size $N/n$ (sequence length $N$):
- Time, memory: Each attention head processes only $n$ of the $n^2$ block pairs, each costing $O(N^2/n^2)$. Total cost becomes $O(H N^2 / n)$ for $H$ heads, an $n$-fold reduction over dense attention (Qiu et al., 2019).
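- Bi-BloSAN memory: Comprises $O(N r)$ for intra-block and $O(N^2 / r^2)$ for inter-block attention, where $N$ is the sequence length and $r$ the block length; the optimal $r \propto N^{1/3}$ yields subquadratic scaling $O(N^{4/3})$ (Shen et al., 2018).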
- NBSA (images): The per-pass cost of the intra-block and inter-block passes depends on the block size and a memory threshold. For moderate block sizes, the cost is asymptotically linear in image size (Veeraraghavan et al., 2021).
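The scaling arguments above can be checked with simple counting. The sketch below assumes that blockwise attention scores exactly one target block per query block per head, and that Bi-BloSAN memory is proportional to `N * r` for the intra-block stage plus `(N / r) ** 2` for the inter-block stage; constants are ignored and the function names are hypothetical.

```python
def dense_pairs(seq_len):
    """Query-key pairs scored by one dense attention head."""
    return seq_len * seq_len

def blockwise_pairs(seq_len, n_blocks):
    """Pairs scored by one blockwise head: n blocks of (N / n)^2 pairs each."""
    block_size = seq_len // n_blocks
    return n_blocks * block_size * block_size

def two_stage_memory(seq_len, block_len):
    """Assumed Bi-BloSAN-style memory: intra-block N * r plus inter-block (N / r)^2."""
    return seq_len * block_len + (seq_len / block_len) ** 2

N = 4096
for n_blocks in (2, 4, 8):
    ratio = blockwise_pairs(N, n_blocks) / dense_pairs(N)
    print(f"n = {n_blocks}: blockwise / dense = {ratio:.4f}")

# Brute-force search for the block length that minimizes the assumed memory model.
best_r = min(range(1, N + 1), key=lambda r: two_stage_memory(N, r))
analytic_r = (2 * N) ** (1 / 3)   # from setting d/dr [N r + N^2 / r^2] = 0
print(best_r, round(analytic_r, 2))
```

Setting the derivative of $N r + N^2 / r^2$ to zero gives $r^3 = 2N$, so under this model the optimal block length grows as $N^{1/3}$ and the total memory as $N^{4/3}$, in line with the subquadratic scaling cited for Bi-BloSAN.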
Bi-directional schemes preserve a long effective receptive field, since information from all directions is exchanged over a small number of passes, and the blockwise sparsity permits operation at sequence/image lengths previously intractable for full self-attention.
4. Empirical Performance and Applications
Sequence Modeling and NLP: Bi-BloSAN achieved state-of-the-art or competitive accuracy on nine NLP benchmarks, including SNLI (85.7% vs. 85.6% for DiSAN, 85.0% for Bi-LSTM), with memory and speed advantages approaching those of RNNs/CNNs but without recurrence (Shen et al., 2018).
Document Transformers: BlockBERT achieved up to 36% memory reduction, 25% faster training, and 28% faster inference compared to RoBERTa, with less than 1 F1 point difference on SQuAD 1.1/2.0 and parity or minor gains for longer context benchmarks (SearchQA, TriviaQA, NewsQA). CPU/GPU memory scaling permitted larger batch sizes and context windows (Qiu et al., 2019).
Vision and Medical Imaging: NBSA improved segmentation Dice scores by +3–5 points on head-and-neck organ segmentation compared to criss-cross attention, with particular robustness in low-contrast and artifact-rich settings. For brain-stem, NBSA yielded Dice ≈ 0.92 vs. 0.88 for CCA. The gains extend to 3D/volumetric and other dense-prediction tasks by adapting block partitioning (Veeraraghavan et al., 2021).
5. Directionality, Information Flow, and Practical Implications
Block self-attention’s expressivity hinges on the design of context flow across blocks:
- In language, bidirectionality encodes both past- and future-context at a granular level, overcoming the limitations of strictly causal or uni-directional models.
- In vision, bi-directional raster-scan passes allow information to circumvent occlusions and propagate global context efficiently, versus axis-limited CCA or unidirectional scans.
- Multiple block permutations or passes (with distinct heads or explicit forward/backward sweeps) guarantee that, in aggregate, tokens can access both local and arbitrarily distant information from all directions.
A plausible implication is that blockwise bidirectional attention can recover much of the expressiveness of dense global self-attention while controlling compute and memory requirements to the level of RNN/CNN baselines.
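How quickly block-level information spreads can be illustrated by composing the per-layer block adjacency across stacked layers. The sketch below assumes every layer uses the same set of cyclic block shifts, one per head; the shift sets and function names are hypothetical choices for illustration.

```python
import numpy as np

def reachable_after(n_blocks, shifts, n_layers):
    """Boolean matrix R where R[i, j] is True if block i can receive
    information from block j after n_layers stacked blockwise layers."""
    step = np.zeros((n_blocks, n_blocks), dtype=int)
    idx = np.arange(n_blocks)
    for shift in shifts:
        step[idx, (idx + shift) % n_blocks] = 1
    reach = np.eye(n_blocks, dtype=int)
    for _ in range(n_layers):
        reach = ((reach @ step) > 0).astype(int)
    return reach.astype(bool)

def layers_for_full_coverage(n_blocks, shifts):
    """Smallest depth at which every block reaches every other, or None."""
    for depth in range(1, n_blocks + 1):
        if reachable_after(n_blocks, shifts, depth).all():
            return depth
    return None

print(layers_for_full_coverage(8, (0, 1, -1)))   # bidirectional shifts
print(layers_for_full_coverage(8, (0, 1)))       # forward shift only
print(layers_for_full_coverage(8, (0,)))         # intra-block only
```

With eight blocks, the bidirectional shift set reaches full coverage in four layers and the forward-only set needs seven, while a purely intra-block configuration never connects distinct blocks. This is one way to see why propagating in both directions shortens the path between distant positions.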
6. Limitations and Extension Opportunities
Block size and partitioning: Block length is a hyperparameter affecting the local/global tradeoff—small blocks favor detail but increase inter-block bottlenecks, while large blocks reduce global coverage.
Expressivity loss: Extremely small blocks or overly aggressive pooling may bottleneck the flow of fine-grained information across distant tokens. In settings where per-token detail is critical, inter-block pooling or masking may cause some signal loss (Shen et al., 2018).
Design Opportunities:
- Multi-head blockwise attention can increase representational power, allowing multiple context flows or longer-reach permutations.
- Hierarchical or nested block decompositions may further bridge the gap to global attention, especially if multi-level or adaptive blocks are used (though standard BlockBERT and NBSA do not introduce extra hierarchy or global tokens).
- For high-dimensional or multimodal data, extending blockwise bi-directional attention to non-grid partitions or 3D/volumetric contexts is straightforward (Veeraraghavan et al., 2021).
Empirical context: All cited architectures are general and can be ported to tasks in NLP, vision, and dense prediction, where bandwidth and memory demand are critical, with applications shown in question answering, segmentation, object detection, and long-text modeling.
Summary Table: Core Properties of Leading Bidirectional Block Self-Attention Variants
| Architecture | Data Modality | Block Usage | Bidirectionality Mechanism |
|---|---|---|---|
| Bi-BloSAN | Sequence | Intra/inter blocks | Forward/backward masks, fusion |
| BlockBERT | Sequence | Blockwise masking | Multi-head block shift permutations |
| NBSA | Vision | Nested/raster blocks | Forward and backward raster scan |
These mechanisms provide computational efficiency and bidirectional context spread across a variety of dense prediction and representation learning applications.