Random Batch Attention for Scalable Transformers
- Random Batch Attention (RBA) is a family of mechanisms that partition tokens into random batches to perform localized self-attention for scalability.
- RBA reduces computational and memory complexity by limiting attention to local groups, enabling linear time complexity and effective GPU parallelism.
- Empirical studies show that RBA enhances performance in graph learning and semantic segmentation while preserving the expressive power of full attention.
Random Batch Attention (RBA) denotes a family of attention mechanisms that enhance computational efficiency and scalability in neural architectures—especially Transformers—by partitioning data into randomly selected batches and applying localized attention within each partition. RBA approaches, including those formalized through Random Batch Methods (RBM) from computational mathematics, deliver theoretically bounded approximations to full self-attention, retain the expressive power of global attention under expectation, and enable new parallelism strategies for large-scale models. In recent literature, RBA frameworks have been concretely realized in both graph representation learning (Liu et al., 8 Nov 2025) and domain-generalized semantic segmentation tasks via intra-batch attention (Sun et al., 2023).
1. Formalization and Core Mechanisms
Let $X \in \mathbb{R}^{N \times d}$ represent the input token matrix for a Transformer block, where $N$ is the number of tokens and $d$ their embedding dimension. Standard self-attention computes dense interactions via an $N \times N$ affinity matrix:

$$\mathrm{Attn}(X) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V,$$

with $Q = XW_Q$, $K = XW_K$, and $V = XW_V$ denoting standard linear projections.
RBA departs from this paradigm by randomly partitioning the $N$ tokens into $N/p$ disjoint batches $\{\mathcal{B}_1, \dots, \mathcal{B}_{N/p}\}$, each of size $p$. On each batch $\mathcal{B}_j$, RBA computes local self-attention:

$$\mathrm{Attn}(X_{\mathcal{B}_j}) = \mathrm{softmax}\!\left(\frac{Q_{\mathcal{B}_j} K_{\mathcal{B}_j}^{\top}}{\sqrt{d}}\right)V_{\mathcal{B}_j}.$$

Batch outputs are then concatenated and cropped (to manage padding) to restore the original ordering, yielding $\mathrm{RBA}(X) \in \mathbb{R}^{N \times d}$. Formally, the expectation over all possible random partitions $\sigma$ defines the RBA operator as:

$$\mathrm{RBA}(X) = \mathbb{E}_{\sigma}\!\left[\, P_{\sigma}^{\top}\, \mathrm{softmax}\!\left(\frac{\tilde{Q}_{\sigma}\tilde{K}_{\sigma}^{\top}}{\sqrt{d}}\right)\tilde{V}_{\sigma} \right],$$

where $P_{\sigma}$ projects outputs back to the canonical token order, and $\tilde{Q}_{\sigma}$, $\tilde{K}_{\sigma}$, $\tilde{V}_{\sigma}$ are block-diagonal matrices collecting the respective projections over all batches in $\sigma$ (Liu et al., 8 Nov 2025).
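The following NumPy sketch shows one stochastic realization of this scheme: shuffle the tokens, attend within disjoint batches of size $p$, and scatter the results back to the original token order. The function names (`dense_attention`, `rba_attention`), the shuffle-based partitioning, and the toy shapes are illustrative assumptions, not the reference implementation of Liu et al.

```python
# Minimal sketch of Random Batch Attention (RBA) for a single attention head.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dense_attention(X, Wq, Wk, Wv):
    """Standard self-attention: softmax(QK^T / sqrt(d)) V."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def rba_attention(X, Wq, Wk, Wv, p, rng):
    """One stochastic realization of RBA: attend only within random
    disjoint batches of size p, then restore the original token order."""
    N = X.shape[0]
    perm = rng.permutation(N)                   # random partition via a random shuffle
    out = np.empty((N, Wv.shape[1]))
    for start in range(0, N, p):
        idx = perm[start:start + p]             # tokens assigned to this batch
        out[idx] = dense_attention(X[idx], Wq, Wk, Wv)  # local attention only
    return out

# Toy usage: N tokens of dimension d, batch size p.
rng = np.random.default_rng(0)
N, d, p = 64, 16, 8
X = rng.standard_normal((N, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * d**-0.5 for _ in range(3))
Y_dense = dense_attention(X, Wq, Wk, Wv)
Y_rba = rba_attention(X, Wq, Wk, Wv, p, rng)    # one random-batch realization
```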
2. Theoretical Properties and Complexity
RBA is motivated as both a mathematical sparsification of attention and a practical computational scheme:
- Time Complexity: Standard self-attention is $O(N^2 d)$. RBA reduces this to $O\!\big(\tfrac{N}{p} \cdot p^2 d\big) = O(Npd)$, which is sub-quadratic for $p \ll N$; when the batch size $p$ is a constant, complexity is linear in $N$.
- Memory Complexity: Dense attention stores an $N \times N$ affinity matrix ($O(N^2)$). RBA requires only $O(Np)$ for local affinities, plus $O(Nd)$ for activations (a back-of-the-envelope comparison follows this list).
- Parallelization: Each batch is processed independently, enabling allocation across multiple GPUs or compute units, reducing per-worker memory and scaling throughput nearly linearly with the number of workers (ignoring communication overhead) (Liu et al., 8 Nov 2025).
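To make the memory argument concrete, the arithmetic below compares how many attention scores each scheme materializes; the token count, batch size, and fp32 storage are illustrative assumptions, not figures from the cited work.

```python
# Back-of-the-envelope comparison of affinity-score storage (assuming fp32 scores).
N, p = 100_000, 256
dense_scores = N * N          # O(N^2) entries for the dense affinity matrix
rba_scores = N * p            # O(Np) entries across all local batches combined
print(f"dense: {dense_scores * 4 / 1e9:.1f} GB, RBA: {rba_scores * 4 / 1e9:.3f} GB")
# -> dense: 40.0 GB, RBA: 0.102 GB
```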
Theoretical analysis based on stochastic differential equations (SDEs) for interacting particle systems yields an error bound: the mean-square error per token between RBA and full attention is bounded by a quantity that decreases with the batch size $p$ and vanishes as $p \to N$. This quantifies the closeness of RBA outputs to those of full attention both in expectation and variance.
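The following check illustrates this property empirically. It reuses `dense_attention`, `rba_attention`, and the toy tensors from the sketch in Section 1, averages RBA over many random partitions, and measures the per-token mean-square error against dense attention; the number of draws and the batch sizes are arbitrary choices for the toy setup, not values from the paper.

```python
# Sanity check: averaging RBA over random partitions should track dense attention,
# with the residual error shrinking as the batch size p grows toward N.
import numpy as np

def mean_rba(X, Wq, Wk, Wv, p, n_draws, rng):
    """Monte Carlo estimate of E_sigma[RBA(X)] over random partitions."""
    acc = np.zeros((X.shape[0], Wv.shape[1]))
    for _ in range(n_draws):
        acc += rba_attention(X, Wq, Wk, Wv, p, rng)
    return acc / n_draws

for p_test in (4, 16, 64):
    est = mean_rba(X, Wq, Wk, Wv, p_test, 500, rng)
    err = np.mean((est - Y_dense) ** 2)
    print(f"p={p_test:3d}  mean-square error per token: {err:.5f}")
```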
3. Relationship to Intra-Batch Attention and Architectural Variants
The RBA paradigm is further instantiated in intra-batch attention schemes for domain-generalized semantic segmentation (Sun et al., 2023). Here, attention leverages relationships not only among the tokens of a single sample but also across independent samples in a batch:
- Mean-based Intra-Batch Attention (MIBA): Each instance's queries attend to the mean feature representation of the other batch members. With $B$ samples per batch, the auxiliary reference for sample $i$ is $\bar{X}_{-i} = \frac{1}{B-1}\sum_{j \neq i} X_j$. This is followed by standard attention over tokens, where the keys $K$ and values $V$ come from $\bar{X}_{-i}$.
- Element-wise Intra-Batch Attention (EIBA): Queries of each sample attend element-wise to every other batch sample, inducing higher computational complexity (quadratic in the batch size $B$ per block) but greater cross-sample expressivity.
Both approaches inject cross-sample contextual information, improving feature diversity and model generalization under domain shift. These intra-batch strategies are algorithmically and conceptually direct realizations of RBA; they differ in the manner and granularity with which random batch context is incorporated. MIBA offers the more efficient variant (linear in $B$), while EIBA provides finer-grained but costlier cross-sample interactions (Sun et al., 2023). A minimal sketch of MIBA is given below.
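The sketch below follows the MIBA description above for a single attention head: each sample's queries attend to the mean features of the other $B-1$ samples. The function name `miba`, the tensor shapes, and the scaling choices are illustrative assumptions rather than the authors' implementation (Sun et al., 2023).

```python
# Minimal sketch of mean-based intra-batch attention (MIBA).
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def miba(Xb, Wq, Wk, Wv):
    """Xb: (B, N, d) batch of B samples, N tokens each. Each sample's queries
    attend to the mean feature map of the *other* B-1 samples in the batch."""
    B, N, d = Xb.shape
    total = Xb.sum(axis=0)                      # (N, d) sum over all samples
    out = np.empty_like(Xb)
    for i in range(B):
        ref = (total - Xb[i]) / (B - 1)         # mean of the other B-1 samples
        Q = Xb[i] @ Wq                          # queries from sample i
        K, V = ref @ Wk, ref @ Wv               # keys/values from the mean reference
        out[i] = softmax(Q @ K.T / np.sqrt(d)) @ V
    return out

# Toy usage: B samples, N tokens, embedding dimension d.
rng = np.random.default_rng(0)
B, N, d = 4, 32, 16
Xb = rng.standard_normal((B, N, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * d**-0.5 for _ in range(3))
Yb = miba(Xb, Wq, Wk, Wv)                       # (B, N, d) cross-sample attended features
```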
4. Empirical Performance and Parallelism
Empirical results highlight the benefits of RBA for large-scale graph data:
- On the ogbn-arxiv dataset, Graph Transformers with RBA (RB-SGFormer) slightly exceed dense-attention baselines in node classification accuracy: 72.90% vs. 72.63%.
- On pokec, accuracy increases from 73.76% to 75.13% with RBA.
- On large-scale benchmarks such as ogbn-papers100M, RBA enables multi-GPU training and inference with memory demands compatible with distributed settings; standard dense attention causes out-of-memory errors (Liu et al., 8 Nov 2025).
For domain-generalized semantic segmentation, integrating MIBA/EIBA into SegFormer yields substantial improvements in mean IoU (mIoU) on standard transfers such as GTAV → Cityscapes/BDD/Mapillary (e.g., EIBA achieves +3.90 mIoU over baseline). Performance peaks at moderate batch sizes (4–8), and architectural ablations indicate that introducing intra-batch attention at early feature extraction stages is most beneficial (Sun et al., 2023).
5. Comparison with Other Efficient Attention Mechanisms
RBA is distinguished from other efficient or sparse attention mechanisms in several respects:
- Preservation of Nonlinearity: RBA retains the original softmax structure, only restricting the attention scope via random subsampling, unlike methods such as Performer or BigBird, which modify the softmax or kernelize the attention.
- Expressivity Guarantees: Convergence proofs ensure that RBA maintains full-attention expressivity both in expectation and with vanishing mean-square error as batch size increases, a property not always established in other sparsified or approximate approaches.
- Natural Parallelization: The independence of random batches enables efficient distribution and memory savings across compute devices.
A summary comparison is presented below:
| Mechanism | Memory Complexity | Maintains Softmax | Theoretical Bound |
|---|---|---|---|
| Dense Attn | O(N²) | Yes | Exact |
| RBA | O(Np) | Yes | Mean-square approx. |
| BigBird | O(N√N) | No (block+rand) | (task-dependent) |
| Performer | O(Nd) | No (kernelized) | Approximate |
6. Interpretive and Future Directions
RBA extends naturally to multi-head and cross-attention in LLMs, and the underlying particle-system viewpoint invites further research into permutation-invariance and stability of batch partitioning. Communication-optimized implementations of random-batch parallelism are an active area, as is the potential combination of RBA with kernelized or low-rank approximations for extremely long sequence modeling.
In the context of domain generalization, the success of intra-batch attention variants suggests that broader classes of random batch strategies—involving distinct sample groupings and global context infusion—could benefit other vision or sequence tasks. The main practical constraint remains the trade-off between batch size, computational cost, and cross-sample context richness. A plausible implication is that hybrid models, combining RBA with other approximation techniques, could achieve further gains in scalability and generalization (Liu et al., 8 Nov 2025, Sun et al., 2023).