
Random Batch Attention for Scalable Transformers

Updated 26 February 2026
  • Random Batch Attention (RBA) is a family of mechanisms that partition tokens into random batches to perform localized self-attention for scalability.
  • RBA reduces computational and memory complexity by limiting attention to local groups, enabling linear time complexity and effective GPU parallelism.
  • Empirical studies show that RBA enhances performance in graph learning and semantic segmentation while preserving the expressive power of full attention.

Random Batch Attention (RBA) denotes a family of attention mechanisms that enhance computational efficiency and scalability in neural architectures—especially Transformers—by partitioning data into randomly selected batches and applying localized attention within each partition. RBA approaches, including those formalized through Random Batch Methods (RBM) from computational mathematics, deliver theoretically bounded approximations to full self-attention, retain the expressive power of global attention under expectation, and enable new parallelism strategies for large-scale models. In recent literature, RBA frameworks have been concretely realized in both graph representation learning (Liu et al., 8 Nov 2025) and domain-generalized semantic segmentation tasks via intra-batch attention (Sun et al., 2023).

1. Formalization and Core Mechanisms

Let $X \in \mathbb{R}^{N \times d}$ represent the input token matrix for a Transformer block, where $N$ is the number of tokens and $d$ their embedding dimension. Standard self-attention computes dense interactions via an $N \times N$ affinity matrix:

$$A(X) = \mathrm{Softmax}(X W X^\top)\, X \hat{W}$$

with $W = W_Q W_K^\top / \sqrt{d}$, $\hat{W} = W_V W_O$, and $W_Q, W_K, W_V, W_O$ denoting the standard linear projections.

RBA departs from this paradigm by randomly partitioning the $N$ tokens into $n$ disjoint batches $C_1, \ldots, C_n$, each of size $p \approx N/n$. On each batch $C$, RBA computes local self-attention:

$$Q_C = X_C W_Q, \quad K_C = X_C W_K, \quad V_C = X_C W_V$$

$$A_C = \mathrm{Softmax}(Q_C K_C^\top)\, V_C, \quad X_C' = A_C W_O$$

Batch outputs $X_C'$ are then concatenated and cropped (to manage padding) to restore the original ordering, yielding $X' \in \mathbb{R}^{N \times d}$. Formally, the expectation over all possible random partitions $\pi$ defines the RBA operator:

$$\mathrm{RBA}(X) = \mathbb{E}_{\pi}\!\left[\mathcal{P}_{\pi}\left(\mathrm{Softmax}(Q_{\pi} K_{\pi}^\top)\, V_{\pi}\right)\right]$$

where $\mathcal{P}_{\pi}$ projects outputs back to the canonical token order, and $Q_\pi, K_\pi, V_\pi$ are block-diagonal matrices collecting the respective projections over all batches in $\pi$ (Liu et al., 8 Nov 2025).
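As a concrete illustration, the mechanism above can be sketched in a few lines of NumPy. This is a minimal single-head sketch, not the authors' reference implementation: the weight matrices, the shuffle-based partitioning, and the $1/\sqrt{d}$ scaling inside the softmax are illustrative assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def rba_forward(X, Wq, Wk, Wv, Wo, n_batches, rng):
    """One Random Batch Attention pass: shuffle tokens, attend locally
    within each random batch, then scatter back to the original order."""
    N, d = X.shape
    perm = rng.permutation(N)                    # random partition via shuffle
    out = np.empty_like(X)
    for C in np.array_split(perm, n_batches):    # disjoint batches C_1..C_n
        Xc = X[C]
        Q, K, V = Xc @ Wq, Xc @ Wk, Xc @ Wv
        A = softmax(Q @ K.T / np.sqrt(K.shape[1])) @ V  # local self-attention
        out[C] = A @ Wo                          # restore canonical token order
    return out

rng = np.random.default_rng(0)
N, d, n = 12, 4, 3                               # toy sizes
X = rng.normal(size=(N, d))
Ws = [rng.normal(size=(d, d)) for _ in range(4)] # Wq, Wk, Wv, Wo
Y = rba_forward(X, *Ws, n_batches=n, rng=rng)
print(Y.shape)  # (12, 4)
```

Note that only $p \times p$ affinity matrices are ever materialized, which is the source of the memory savings discussed in the next section.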

2. Theoretical Properties and Complexity

RBA is motivated as both a mathematical sparsification of attention and a practical computational scheme:

  • Time Complexity: Standard self-attention is $O(N^2 d)$. RBA reduces this to $O(N p d) = O(N^2 d / n)$; when the batch size $p \ll N$ is held fixed, the cost is linear in $N$.
  • Memory Complexity: Dense attention stores an $N \times N$ affinity matrix ($O(N^2)$). RBA requires only $O(n p^2) = O(N p)$ for the local affinities, plus $O(N d)$ for activations.
  • Parallelization: Each batch $C$ is processed independently, enabling allocation across multiple GPUs or compute units, reducing per-worker memory and scaling throughput nearly linearly with $n$ (ignoring communication overhead) (Liu et al., 8 Nov 2025).
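The memory figures above can be sanity-checked with back-of-the-envelope arithmetic. The sizes below ($N = 10^5$, $p = 10^3$, hence $n = 100$) are hypothetical, chosen only to make the ratio concrete:

```python
def affinity_entries(N, p):
    """Entries stored in attention affinity matrices: dense attention keeps
    an N*N matrix, while RBA keeps n = N/p local blocks of p*p entries each,
    i.e. (N/p) * p^2 = N*p entries in total."""
    n = N // p
    return N * N, n * p * p

dense, rba = affinity_entries(N=100_000, p=1_000)
print(f"dense: {dense:,}  rba: {rba:,}  ratio: {dense // rba}x")
# dense: 10,000,000,000  rba: 100,000,000  ratio: 100x
```

With $n = 100$ batches, the affinity storage drops by exactly a factor of $n$, matching the $O(N^2) \to O(Np)$ bound.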

Theoretical analysis based on stochastic differential equations (SDEs) for interacting particle systems yields an error bound: the mean-square error per token is $\mathcal{O}(1/(p-1))$, vanishing as $p \to N$. This quantifies how closely RBA outputs track those of full attention, both in expectation and in variance.
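A toy Monte Carlo experiment illustrates this behavior. For simplicity the sketch drops the learned projections (identity weights) and averages the per-token mean-square error against full attention over many random partitions; it is an illustrative check of the trend, not the paper's analysis.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def full_attn(X):
    """Dense softmax attention with identity projections."""
    return softmax(X @ X.T / np.sqrt(X.shape[1])) @ X

def rba_once(X, p, rng):
    """One random partition into batches of size p, local attention per batch."""
    N = X.shape[0]
    perm = rng.permutation(N)
    out = np.empty_like(X)
    for C in np.array_split(perm, N // p):
        Xc = X[C]
        out[C] = softmax(Xc @ Xc.T / np.sqrt(X.shape[1])) @ Xc
    return out

rng = np.random.default_rng(1)
X = rng.normal(size=(64, 8))
target = full_attn(X)
mse = {}
for p in (4, 16, 32):
    errs = [np.mean((rba_once(X, p, rng) - target) ** 2) for _ in range(200)]
    mse[p] = float(np.mean(errs))
    print(p, round(mse[p], 5))  # per-token MSE shrinks as batch size p grows
```

The printed errors decrease as $p$ grows toward $N$, consistent with the $\mathcal{O}(1/(p-1))$ bound.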

3. Relationship to Intra-Batch Attention and Architectural Variants

The RBA paradigm is further instantiated in intra-batch attention schemes for domain-generalized semantic segmentation (Sun et al., 2023). Here, attention leverages relationships not just within tokens of a single sample but across independent samples in a batch:

  • Mean-based Intra-Batch Attention (MIBA): Each instance's queries attend to the mean feature representation of the other batch members. With $B$ samples per batch, the auxiliary reference for sample $i$ is $\hat{F}^l_i = \frac{1}{B-1}\left(\sum_{j=1}^{B} F^l_j - F^l_i\right)$. This is followed by standard attention over $B \times N$ tokens, where $K$ and $V$ come from $\hat{F}^l$.
  • Element-wise Intra-Batch Attention (EIBA): Queries of each sample attend element-wise to every other batch sample, inducing higher computational complexity ($O(B^2 N^2 d)$ per block) but greater cross-sample expressivity.

Both approaches inject cross-sample contextual information, improving feature diversity and model generalization under domain shift. These intra-batch strategies are direct algorithmic and conceptual realizations of RBA; they differ in the manner and granularity with which random batch context is incorporated. MIBA is the more efficient variant (linear in $B$), while EIBA provides finer-grained but costlier cross-sample interactions (Sun et al., 2023).
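The MIBA reference features can be computed in a single vectorized pass, since the leave-one-out mean is just the batch sum minus the sample itself. The sketch below uses hypothetical toy shapes and is only an illustration of the formula for $\hat{F}^l_i$:

```python
import numpy as np

def miba_reference(F):
    """Mean-based intra-batch reference: for each sample i, the mean of the
    other B-1 samples' features, F_hat_i = (sum_j F_j - F_i) / (B - 1)."""
    B = F.shape[0]
    total = F.sum(axis=0, keepdims=True)   # batch sum, broadcast over samples
    return (total - F) / (B - 1)           # leave-one-out mean per sample

F = np.arange(24, dtype=float).reshape(4, 3, 2)  # toy (B, N, d) features
F_hat = miba_reference(F)
# sanity check: the reference for sample 0 is the mean of samples 1..3
assert np.allclose(F_hat[0], F[1:].mean(axis=0))
```

Keys and values for the subsequent attention step would then be drawn from `F_hat` rather than from the sample's own features.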

4. Empirical Performance and Parallelism

Empirical results highlight the benefits of RBA for large-scale graph data:

  • On the ogbn-arxiv dataset, Graph Transformers with RBA (RB-SGFormer) slightly exceed dense-attention baselines in node classification accuracy: 72.90% vs. 72.63%.
  • On pokec, accuracy increases from 73.76% to 75.13% with RBA.
  • On large-scale benchmarks such as ogbn-papers100M, RBA enables multi-GPU training and inference with memory demands compatible with distributed settings; standard dense attention causes out-of-memory errors (Liu et al., 8 Nov 2025).

For domain-generalized semantic segmentation, integrating MIBA/EIBA into SegFormer yields substantial improvements in mean IoU (mIoU) on standard transfers such as GTAV → Cityscapes/BDD/Mapillary (e.g., EIBA achieves +3.90 mIoU over the baseline). Performance peaks at moderate batch sizes (4–8), and architectural ablations indicate that introducing intra-batch attention at early feature-extraction stages is most beneficial (Sun et al., 2023).

5. Comparison with Other Efficient Attention Mechanisms

RBA is distinguished from other efficient or sparse attention mechanisms in several respects:

  • Preservation of Nonlinearity: RBA retains the original softmax structure, only restricting the attention scope via random subsampling, unlike methods such as Performer or BigBird, which modify the softmax or kernelize the attention.
  • Expressivity Guarantees: Convergence proofs ensure that RBA maintains full-attention expressivity both in expectation and with vanishing mean-square error as batch size increases, a property not always established in other sparsified or approximate approaches.
  • Natural Parallelization: The independence of random batches enables efficient distribution and memory savings across compute devices.

A summary comparison is presented below:

| Mechanism | Memory Complexity | Maintains Softmax | Theoretical Bound |
|-----------|-------------------|-------------------|-------------------|
| Dense Attn | $O(N^2)$ | Yes | Exact |
| RBA | $O(Np)$ | Yes | Mean-square approx. |
| BigBird | $O(N\sqrt{N})$ | No (block + random) | (task-dependent) |
| Performer | $O(Nd)$ | No (kernelized) | Approximate |

6. Interpretive and Future Directions

RBA extends naturally to multi-head and cross-attention in LLMs, and the underlying particle-system viewpoint invites further research into permutation-invariance and stability of batch partitioning. Communication-optimized implementations of random-batch parallelism are an active area, as is the potential union of RBA with kernelized or low-rank approximations for extremely long sequence modeling.

In the context of domain-generalization, the success of intra-batch attention variants suggests that broader classes of random batch strategies—involving distinct sample groupings and global context infusion—could benefit other vision or sequence tasks. The main practical constraint remains the trade-off between batch size, computational cost, and cross-sample context richness. A plausible implication is that hybrid models, combining RBA with other approximation techniques, could achieve further gains in scalability and generalization (Liu et al., 8 Nov 2025, Sun et al., 2023).
