Query-Broadcast Attention
- Query-broadcast attention is a mechanism that contracts token representations into a compact set and broadcasts critical information back to the full sequence for efficiency.
- The CBSA method employs a two-step block-coordinate process—contraction and broadcast—to achieve linear computational complexity and improved interpretability.
- In diffusion models, Pyramid Attention Broadcast reuses attention outputs across timesteps, significantly reducing redundant computation in video generation.
Query-broadcast attention encompasses a family of attention mechanisms that compress or cache attention computations—typically by contracting token representations into a compact set and then broadcasting information back to the full sequence—to improve efficiency and interpretability. Prominent methods include Contract-and-Broadcast Self-Attention (CBSA) for efficient representation learning in transformers and Pyramid Attention Broadcast (PAB) for minimizing redundant attention computation in diffusion-based video generation. Both approaches exemplify query-broadcast dynamics: a small set of queries (or representatives) is computed or identified, attention is evaluated over this compact set, and the resulting outputs are broadcast back to the full sequence, reducing the need for frequent or full attention computation.
1. Contract-and-Broadcast Self-Attention (CBSA): Theoretical Underpinnings
CBSA is derived from a unified optimization perspective, starting with a maximal coding-rate reduction (MCR²) objective that seeks to compress token sets into a compact yet informative representation. This approach explicitly formalizes the goals of interpretability and efficiency within self-attention layers.
In standard rate-reduction form, the MCR² objective for a token matrix $Z \in \mathbb{R}^{d \times n}$ can be written as
$$\max_{Z}\; \Delta R\big(Z;\, U_{[K]}\big) \;=\; R(Z) \;-\; \sum_{k=1}^{K} R\big(U_k^{\top} Z\big),$$
where $U_{[K]} = \{U_k \in \mathbb{R}^{d \times p}\}_{k=1}^{K}$ spans $K$ incoherent $p$-dimensional subspaces and $R(Z) = \tfrac{1}{2}\log\det\big(I + \tfrac{d}{n\varepsilon^{2}}\, Z Z^{\top}\big)$ is a coding-rate function based on log-determinant regularization. The expansion term keeps the representation informative, while the subspace terms compress tokens toward the incoherent subspaces.
CBSA introduces a small set of representatives per subspace and enforces that compressing these representatives provides information equivalent to contracting all of the tokens in that subspace. This constraint yields a regularized objective that couples the coding rate of the representatives to that of the full token set, with contraction coefficients determining how tokens are pooled into the representatives (Wen et al., 21 Sep 2025).
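To make the coding-rate quantity above concrete, here is a minimal NumPy sketch of the log-determinant coding rate and its subspace-projected sum. The dimensions, ε value, and random subspaces are illustrative choices and are not taken from the paper.

```python
import numpy as np

def coding_rate(Z: np.ndarray, eps: float = 0.5) -> float:
    """Standard MCR^2 coding rate R(Z) = 1/2 * logdet(I + d/(n*eps^2) * Z Z^T)."""
    d, n = Z.shape
    alpha = d / (n * eps ** 2)
    _, logdet = np.linalg.slogdet(np.eye(d) + alpha * Z @ Z.T)
    return 0.5 * logdet

def projected_coding_rate(Z: np.ndarray, subspaces, eps: float = 0.5) -> float:
    """Sum of coding rates of tokens projected onto each p-dimensional subspace U_k."""
    return sum(coding_rate(U.T @ Z, eps) for U in subspaces)

# Toy example: d=64 channels, n=196 tokens, K=4 subspaces of dimension p=16.
rng = np.random.default_rng(0)
Z = rng.standard_normal((64, 196))
subspaces = [np.linalg.qr(rng.standard_normal((64, 16)))[0] for _ in range(4)]
print(coding_rate(Z), projected_coding_rate(Z, subspaces))
```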
2. Algorithmic Realization: Contract–and–Broadcast Flow
CBSA unrolls a two-step block-coordinate optimization—contract and broadcast—into a purely feedforward operator:
- Contraction: Project each token block into its p-dimensional head subspace, extract a small set of representatives via cross-attention, and then perform a self-attention-style contraction over those representatives.
- Broadcast: Propagate the contracted information back to all tokens by applying the transpose of the contraction (attention) coefficients.
The full CBSA transformation composes these two steps, with contraction and broadcast computed independently per attention head (Wen et al., 21 Sep 2025).
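The sketch below illustrates this contract-and-broadcast flow for a single head under simplifying assumptions: soft-assignment coefficients A are computed against m learned representative anchors, the representatives are contracted with a self-attention-style step, and the result is broadcast back with the transposed coefficients. All names (U, R, A, eta) and the residual-style update are illustrative; this is not the paper's exact operator.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cbsa_head(Z, U, R, eta=1.0):
    """
    One-head contract-and-broadcast sketch (illustrative, not the exact CBSA operator).
    Z : (d, n) tokens; U : (d, p) head subspace basis; R : (p, m) representative anchors.
    """
    p, m = R.shape
    Zp = U.T @ Z                                        # project tokens into the p-dim head subspace: (p, n)
    A = softmax(Zp.T @ R / np.sqrt(p), axis=1)          # contraction coefficients, one row per token: (n, m)
    reps = Zp @ A                                       # contraction: pool n tokens into m representatives: (p, m)
    attn = softmax(reps.T @ reps / np.sqrt(p), axis=0)  # self-attention-style interaction among representatives
    reps = reps @ attn                                  # contracted representatives: (p, m)
    Zp_new = reps @ A.T                                 # broadcast back to all tokens via the transposed coefficients
    return Z + eta * U @ (Zp_new - Zp)                  # lift back to d dims with a residual-style update
```

With n tokens and m ≪ n representatives per head, the token-level interactions in this sketch scale as n·m rather than n².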
3. Complexity and Generalization
CBSA achieves complexity linear in the number of tokens n in the practical regime where the number of representatives per head is much smaller than n, compared with the quadratic cost of softmax self-attention. Specifically, as illustrated by the back-of-envelope sketch after this list:
- CBSA per-layer cost: linear in n, scaling with the number of tokens times the number of representatives per head rather than with n².
- Memory usage: activations scale linearly in n, versus the O(n²) attention maps stored by full softmax attention.
- Special cases: under corresponding choices of representatives and projections, CBSA recovers full softmax attention, linear attention (when the representatives are the principal components of the tokens), and channel attention (when the representatives are fixed orthonormal bases) (Wen et al., 21 Sep 2025).
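The following back-of-envelope comparison illustrates the linear-versus-quadratic scaling. The operation counts are a simplified cost model for pairwise similarity computations, not the paper's exact constants.

```python
# Illustrative similarity-operation counts (simplified cost model, not the paper's exact one).
def softmax_attention_pairs(n: int) -> int:
    return n * n                    # every token attends to every token

def contract_broadcast_pairs(n: int, m: int, heads: int = 1) -> int:
    return heads * (n * m + m * m)  # token-representative plus representative-representative interactions

for n in (196, 1024, 4096):
    m = 16
    full = softmax_attention_pairs(n)
    cb = contract_broadcast_pairs(n, m)
    print(f"n={n}: softmax={full}, contract-broadcast={cb}, ratio={full / cb:.1f}x")
```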
4. Broadcast Attention and Redundancy in Diffusion
Broadcast attention has further applications in iterative architectures like diffusion models, where per-step computations exhibit high redundancy. In Pyramid Attention Broadcast (PAB) for DiT-based video diffusion, broadcast attention directly reuses the output of an attention layer over multiple timesteps, conditioned on measured redundancy.
- Formal definition: if O_{t*} is the attention output at the broadcast origin step t*, then for subsequent steps t within the broadcast range B (t* < t ≤ t* + B), set O_t = O_{t*} instead of recomputing attention at each step (a minimal sketch of this reuse rule appears after this list).
- Redundancy diagnosis: The per-step output change is empirically U-shaped, with high variance at the ends but minimal in the central 70% of diffusion steps (Zhao et al., 22 Aug 2024).
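A minimal sketch of the broadcast-versus-recompute decision and of a per-step redundancy diagnostic is given below. The stable window and broadcast range follow the description above, but the cache structure, threshold-free diagnostic, and function names are illustrative rather than the paper's implementation.

```python
import numpy as np

def attention_output_change(prev: np.ndarray, curr: np.ndarray) -> float:
    """Relative change between consecutive-step attention outputs (redundancy diagnostic)."""
    return float(np.linalg.norm(curr - prev) / (np.linalg.norm(prev) + 1e-8))

def pab_step(t, attn_fn, x_t, cache, broadcast_range, stable_range):
    """
    Return the attention output for diffusion step t, reusing the cached output when t lies
    in the stable window and within the broadcast range of the last full computation.
    `cache` holds {'t': last_compute_step, 'out': cached_output}.
    """
    lo, hi = stable_range
    if cache['out'] is not None and lo <= t <= hi and (t - cache['t']) <= broadcast_range:
        return cache['out']            # broadcast: reuse the output from the broadcast origin
    out = attn_fn(x_t)                 # full attention compute
    cache['t'], cache['out'] = t, out
    return out

# Example usage with a dummy attention function:
cache = {'t': -1, 'out': None}
outputs = [pab_step(t, lambda x: x * 0.5, np.ones(4), cache,
                    broadcast_range=4, stable_range=(5, 25)) for t in range(30)]
```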
5. Pyramid-Style and Distributed Broadcast Mechanisms
PAB adapts broadcast window sizes hierarchically according to attention type:
- Spatial attention: Shortest broadcast range
- Temporal attention: Intermediate broadcast range
- Cross attention: Longest broadcast range
In the default (2, 4, 6) configuration reported below, these correspond to broadcast ranges of 2, 4, and 6 steps, respectively.
The algorithm proceeds as follows:
    for t in 1..T:
        for each attention type m in {S, T, C}:
            if stable_start ≤ t ≤ stable_end and (t - last_compute[m]) ≤ B[m]:
                O_m[t] = O_m[last_compute[m]]   # broadcast: reuse cached output
            else:
                O_m[t] = Attention_m(X_t)       # full compute
                last_compute[m] = t
PAB further extends to distributed settings (broadcast sequence parallel), eliminating inter-GPU communication during broadcasted temporal attention steps, reducing required bandwidth and increasing throughput (Zhao et al., 22 Aug 2024).
6. Empirical Findings and Comparative Performance
CBSA
- Compression via representatives: The coding rate decreases progressively across CBSA layers and correlates well with classification accuracy; contraction over a handful of representatives suffices.
- Role of broadcast: Ablations that remove the broadcast step (replacing it with an identity map) cause considerable accuracy degradation.
- Emergent segmentation and robustness: Early layers exhibit emergent object-segmentation behavior, and the model is robust to perturbations of the subspace bases.
- Efficiency: On ImageNet-1K, a CBSA-Small model matches ViT-Small accuracy (≈71.4%) with ∼40% of the pairwise similarity operations of standard attention.
PAB
- Speedup: On single-GPU (Open-Sora, 30 steps, 480p video), PAB with (2,4,6) broadcasting achieves 1.34× acceleration with negligible quality drop (<1% VBench loss). More aggressive ranges (3,5,7) and (5,7,9) further improve speed at slight cost to quality.
- Scaling: On 8×H100 GPUs, broadcast sequence parallel yields 10.6× end-to-end speedup, with ∼50% communication reduction.
- Latencies: Attention operations constitute 10–20% of runtime, with attention-related overhead (norm, proj, reshape) ∼30%. PAB eliminates most of this overhead in redundant steps (Zhao et al., 22 Aug 2024).
7. Connections, Special Cases, and Implications
Both CBSA and PAB generalize classical attention via query-broadcast mechanisms:
- CBSA as a unifying framework: Varies the number of representatives, projection structure, and contraction mechanics to interpolate between full, linear, and channel attention.
- PAB in iterative models: Empirically validates that much attention activity is redundant across diffusion steps, motivating output-level reuse without retraining.
A plausible implication is that structured query-broadcast paradigms will continue to underlie both general-purpose transformers and domain-specific architectures where efficiency and interpretability are paramount. These methods further suggest that the contraction–broadcast decomposition is a powerful axis for analyzing and optimizing attention beyond mere computation reduction.
References:
- "Towards Interpretable and Efficient Attention: Compressing All by Contracting a Few" (Wen et al., 21 Sep 2025)
- "Real-Time Video Generation with Pyramid Attention Broadcast" (Zhao et al., 22 Aug 2024)