
SBM-Transformer: Data-Adaptive Sparse Attention

Updated 5 January 2026
  • SBM-Transformer is a Transformer variant that uses mixed-membership stochastic block models to generate dynamic, sparse attention masks, reducing quadratic computation.
  • It employs a straight-through estimator for differentiable stochastic sampling, enabling gradient-based training of input-dependent attention mechanisms.
  • Empirical results on benchmarks like LRA and GLUE show competitive accuracy with significantly lower memory and computational requirements.

The SBM-Transformer is a variant of the Transformer architecture that replaces expensive dense self-attention with a data-adaptive, learnable sparse attention mechanism parameterized by Stochastic Block Models (SBMs). Instead of attending to all $n^2$ possible query-key pairs, each attention head infers a sparse bipartite graph over the input tokens, drastically reducing computation while retaining expressiveness. The SBM-Transformer achieves this by learning both latent cluster assignments and inter-cluster affinities, generating input-dependent attention masks whose topology adapts to the sequence. The use of a straight-through estimator (STE) renders the stochastic edge sampling differentiable for gradient-based learning. This attention mechanism has been shown to provide competitive or superior accuracy to full attention and other sparse attention variants on both synthetic and standard benchmarks, while realizing considerable reductions in computation and memory (Cho et al., 2022). The SBM-Transformer paradigm has also been adopted in other architectural contexts, such as code summarization over ASTs, confirming its broad applicability (Oh et al., 2024).

1. Data-Adaptive Attention via Mixed-Membership Stochastic Block Models

In the SBM-Transformer, each attention head is paired with a mixed-membership SBM. For an input feature matrix $X \in \mathbb{R}^{n \times d}$, each head forms queries $Q$, keys $K$, and values $V$ as in the standard Transformer:

$$Q = X W^Q, \quad K = X W^K, \quad V = X W^V \in \mathbb{R}^{n \times d_h}$$

Each head introduces a learnable cluster embedding matrix $C \in \mathbb{R}^{k \times d_h}$ for $k$ latent clusters, and a small MLP $\phi(\cdot)$. Nodes (tokens) receive soft cluster assignments:

$$Y = \sigma(\phi(Q) C^T) \in (0,1)^{n \times k}, \quad Z = \sigma(\phi(K) C^T) \in (0,1)^{n \times k}$$

The inter-cluster affinity matrix is

$$B = \text{softmax}_{2d}(C C^T) \in (0,1)^{k \times k}$$

yielding a probability of attention (edge) between query $i$ and key $j$:

$$p_{ij} = Y_i B Z_j^T \in [0,1]$$

The attention mask $M \in \{0,1\}^{n \times n}$ is then sampled from these edge probabilities (e.g., using fastRG, at $O(m + n)$ cost, where $m$ is the number of sampled edges). The mask $M$ selects the active query-key pairs for attention.
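As a concrete illustration of the construction above, the following PyTorch sketch computes $Y$, $Z$, $B$, the edge-probability matrix $P$, and a sampled mask $M$ for a single head. It is a minimal sketch, not the reference implementation: the MLP architecture for $\phi$ is an assumption, and $M$ is sampled densely via Bernoulli draws rather than with the $O(m+n)$ fastRG procedure.

```python
import torch
import torch.nn as nn

class SBMMaskHead(nn.Module):
    """Illustrative sketch of one head's SBM mask generator (not the reference implementation)."""

    def __init__(self, d_h: int, k: int = 128):
        super().__init__()
        self.cluster_emb = nn.Parameter(torch.randn(k, d_h))        # C in R^{k x d_h}
        self.phi = nn.Sequential(                                    # small MLP phi(.) (assumed form)
            nn.Linear(d_h, d_h), nn.ReLU(), nn.Linear(d_h, d_h)
        )

    def forward(self, Q: torch.Tensor, K: torch.Tensor):
        # Soft membership of query/key tokens in the k latent clusters.
        Y = torch.sigmoid(self.phi(Q) @ self.cluster_emb.T)          # (n, k)
        Z = torch.sigmoid(self.phi(K) @ self.cluster_emb.T)          # (n, k)
        # Inter-cluster affinities: softmax taken jointly over all k*k entries.
        logits = self.cluster_emb @ self.cluster_emb.T               # (k, k)
        B = torch.softmax(logits.flatten(), dim=0).view_as(logits)
        # Edge probabilities p_ij = Y_i B Z_j^T; every entry of P lies in [0, 1].
        P = Y @ B @ Z.T                                              # (n, n)
        # Dense Bernoulli sampling shown for clarity; the paper uses fastRG at O(m + n) cost.
        M = torch.bernoulli(P)
        return P, M
```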

2. Sparse Masked Attention and Computational Advantages

Given the binary attention mask $M$, the SBM-Transformer computes attention only along the sampled edges. The masked attention operation is

$$\text{Attn}(Q, K, V; M) = \sigma_M(M \odot S) V$$

where $S = Q K^T / \sqrt{d_h}$ and $\sigma_M$ is the softmax applied only where $M_{ij} = 1$ (entries with $M_{ij} = 0$ are set to $-\infty$ before the softmax). This yields complexity $O(m)$, where $m$ is the number of active edges, substantially lower than the $O(n^2)$ cost of dense attention. The number $m$ is data-adaptive and input-dependent, typically $O(n)$ or $O(n \log n)$ in practice.
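A minimal dense sketch of this masked attention step, assuming the score matrix is materialized in full (a truly sparse implementation would touch only the $m$ sampled edges):

```python
import torch

def sbm_masked_attention(Q, K, V, M):
    """Dense sketch of Attn(Q, K, V; M) = softmax_M(M * S) V with S = Q K^T / sqrt(d_h)."""
    d_h = Q.shape[-1]
    S = (Q @ K.transpose(-2, -1)) / d_h ** 0.5       # (n, n) raw scores
    S = S.masked_fill(M == 0, float("-inf"))         # keep only the sampled edges
    A = torch.softmax(S, dim=-1)                     # row-wise softmax over active edges
    A = torch.nan_to_num(A)                          # rows with no sampled edge contribute zeros
    return A @ V
```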

3. Differentiable Sampling and Training via the Straight-Through Estimator

Attention mask sampling is non-differentiable. The SBM-Transformer handles this with the straight-through estimator (STE): in the backward pass, gradients are taken with respect to the expectation $P = (p_{ij})$ of the mask, treating $M$ as if it were continuous. For a sampled edge $(i, j)$ with $M_{ij} = 1$, gradients accumulate as:

$$\frac{\partial L}{\partial p_{ij}} = \frac{\partial L}{\partial A_{ij}} \cdot \frac{Q_i \cdot K_j}{\sqrt{d_h}}$$

An exploration enhancement $p_{ij} \gets p_{ij} + \delta$, with a small constant $\delta > 0$, is adopted to avoid permanently dropping edges, ensuring sufficient exploration during training.
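A sketch of this straight-through sampling under the stated assumptions (the exploration constant $\delta$ and its placement are illustrative): the forward pass returns the hard Bernoulli sample, while the backward pass routes gradients to the probabilities as if the sample were continuous.

```python
import torch

def sample_mask_ste(P: torch.Tensor, delta: float = 0.01) -> torch.Tensor:
    """Straight-through sampling sketch: hard Bernoulli sample in the forward pass,
    identity gradient into the edge probabilities P in the backward pass."""
    P_explore = (P + delta).clamp(max=1.0)   # small additive term keeps dropped edges reachable
    M_hard = torch.bernoulli(P_explore)      # non-differentiable 0/1 sample
    # detach() blocks gradients through the sampling; they flow through P_explore instead.
    return M_hard.detach() + P_explore - P_explore.detach()
```

Numerically the returned tensor equals the hard mask, but $\partial L / \partial P$ passes through it unchanged, which is the straight-through approximation described above.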

4. Theoretical Expressiveness and Universality

The SBM-Transformer is a universal approximator of sequence-to-sequence functions in expectation. For any continuous sequence-to-sequence function $f$, an SBM-Transformer $g$ can be constructed such that:

$$\int \| f(X) - \mathbb{E}[g(X)] \|_p^p \, dX \leq \varepsilon$$

for any $\varepsilon > 0$ and $1 \leq p < \infty$, given sufficient model capacity. This follows from constructing SBMs that realize various graph structures (block-diagonal, global-relay) and concatenating Hamiltonian paths, guaranteeing the path-wise and global connectivity requirements for universal sequence modeling (Cho et al., 2022).

5. Empirical Performance

Extensive benchmarks in both synthetic and real-world scenarios demonstrate the competitiveness of SBM-Transformer. On Long Range Arena (LRA) tasks (sequence lengths up to 4K), SBM-Transformer achieves accuracy equal to or surpassing full-attention Transformers while using only 20–30% of the possible edges in attention masks. For instance:

| Model | ListOps | Text | Retrieval | Image | Pathfinder | Avg. |
|---|---|---|---|---|---|---|
| Full-attention | 37.22% | 64.93% | 79.55% | 40.38% | 74.26% | 59.27% |
| SBM-Transformer | 37.45% | 65.79% | 80.00% | 41.31% | 75.12% | 59.93% |
| (Mask density) | (20.1%) | (26.1%) | (29.5%) | (20.5%) | (18.6%) | |

On GLUE, in a BERT-style setup, SBM-Transformer matches or exceeds dense and other sparse variants at 13.5% average mask density; for example, on SST-2 both SBM-Transformer and full attention reach 89.8. Relative FLOP counts of 0.07–0.29$\times$ those of full attention, with comparable or reduced peak memory, are consistently observed.

Ablations confirm the adaptive sparsity: SBM-Transformer densifies masks in harder instances and learns to specialize attention, in contrast to hand-crafted or uniform sparsifiers (Cho et al., 2022).

6. Application to Tree-Structured and Code Data

The SBM attention mechanism has been adapted for Transformers operating on Abstract Syntax Trees (ASTs) in code summarization tasks. In this context, each node–node pair is assigned an attention probability via learned node-cluster and cluster-cluster affinity matrices:

$$P_{ij} = \sum_{r=1}^{k} \sum_{s=1}^{k} \hat{Q}_{i,r} \, S_{r,s} \, \hat{K}_{j,s}, \qquad \text{i.e.,} \quad P = \hat{Q} S \hat{K}^T$$

where $\hat{Q}_{i,r}$ and $\hat{K}_{j,s}$ are node–cluster dot products, and $S$ is the symmetric block-affinity matrix. Sampling and the STE enable dynamic, data-adaptive masks during training. Empirically, SBM attention in the CSA-Trans encoder yields improved summarization accuracy (BLEU-4 increases of 0.38–0.43), with 10–40% reductions in backward-pass time and peak memory over standard and graph-based variants. Notably, attention heatmaps are found to be sparser and more interpretable, preserving non-local relationships that fixed AST sparsity patterns discard (Oh et al., 2024).
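A brief sketch of the bilinear score computation $P = \hat{Q} S \hat{K}^T$, under the assumption that node embeddings and a cluster embedding matrix are given; how CSA-Trans maps these scores into $[0,1]$ probabilities is not stated here, so that step is omitted.

```python
import torch

def ast_pair_scores(node_q, node_k, cluster_emb, S):
    """Sketch: pairwise scores P = Q_hat S K_hat^T over AST nodes.
    Q_hat, K_hat hold node-cluster dot products; S is the block-affinity matrix.
    The mapping of scores to [0, 1] probabilities is omitted (model-specific)."""
    Q_hat = node_q @ cluster_emb.T          # (n, k) node-cluster dot products (queries)
    K_hat = node_k @ cluster_emb.T          # (n, k) node-cluster dot products (keys)
    S_sym = 0.5 * (S + S.T)                 # keep the k x k affinity matrix symmetric
    return Q_hat @ S_sym @ K_hat.T          # (n, n) pairwise scores P
```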

7. Implementation, Hyperparameters, and Limitations

The reference SBM-Transformer implementation defaults to dense operations with masked entries, due to limited support for unstructured sparsity in standard deep learning libraries; a pure sparse graph-attention implementation would reveal the full computational benefit. The cluster count $k$ is typically set to 128. Task-dependent configurations (layer and head counts, embedding sizes) mirror those of common Transformer experiments.
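To illustrate what such a pure sparse implementation could look like, here is a hypothetical edge-list sketch in plain PyTorch: attention is computed only over the $m$ sampled edges, with a manual per-row (segment) softmax. The function and variable names are assumptions, not the reference code.

```python
import torch

def edge_list_attention(Q, K, V, edge_index):
    """Sketch of O(m) attention restricted to sampled edges.
    edge_index is a (2, m) LongTensor of (query i, key j) pairs drawn from the mask M."""
    src, dst = edge_index[0], edge_index[1]                  # query rows i, key columns j
    n, d_h = Q.shape
    scores = (Q[src] * K[dst]).sum(-1) / d_h ** 0.5          # (m,) scores on active edges only

    # Segment softmax: normalize edges that share the same query row i.
    row_max = torch.full((n,), float("-inf"), device=scores.device, dtype=scores.dtype)
    row_max = row_max.scatter_reduce(0, src, scores, reduce="amax")
    exp_scores = torch.exp(scores - row_max[src])
    row_sum = torch.zeros(n, device=scores.device, dtype=scores.dtype).index_add_(0, src, exp_scores)
    attn = exp_scores / row_sum[src]                         # (m,) attention weight per edge

    # Weighted aggregation of values along the sampled edges.
    out = torch.zeros(n, V.shape[-1], device=V.device, dtype=V.dtype)
    out.index_add_(0, src, attn.unsqueeze(-1) * V[dst])
    return out                                               # (n, d_v); rows with no edges stay zero
```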

Limitations include the lack of hardware-optimized sparse kernel support, which prevents realizing maximal speedups. Potential extensions identified include degree-corrected and hierarchical SBMs, dynamic adjustment of the cluster count $k$, and integration with block-sparse attention.


SBM-Transformer introduces a principled, universal, and data-adaptive sparse attention mechanism, leveraging mixed-membership Stochastic Block Models for computational efficiency and increased interpretability, with demonstrated performance gains across language, vision, and structured code domains (Cho et al., 2022, Oh et al., 2024).
