MoGA: Mixture-of-Groups Attention
- MoGA is a sparse, semantic-aware attention mechanism that uses learnable token routing to group tokens, reducing full self-attention's quadratic complexity.
- It integrates with high-performance kernels like FlashAttention and supports distributed processing for long-context and high-resolution video generation.
- MoGA employs auxiliary losses for balanced token grouping, ensuring both global semantic coherence and preservation of local spatiotemporal details.
Mixture-of-Groups Attention (MoGA) is a sparse, semantic-aware attention mechanism for scaling transformers to extremely long sequences, introduced to address computational inefficiencies in full self-attention—particularly in high-dimensional video generation tasks. MoGA leverages learnable token routing to allocate tokens into groups and computes self-attention within these groups, yielding efficiency gains while preserving or improving sequence-level coherence. The framework draws conceptual lineage from mixture-of-experts and mixture-of-attentive-experts literature, yet is defined by its token-level learnable routing and compatibility with advanced attention kernels and distributed processing (Jia et al., 21 Oct 2025).
1. Core Mechanism: Learnable Token Routing and Grouped Attention
At the heart of MoGA lies a trainable token router. For each input token $x_i$, the router applies a linear transformation, producing routing scores $s_i = W_r x_i \in \mathbb{R}^G$. A softmax converts these into per-group probabilities:

$$p_{i,g} = \frac{\exp(s_{i,g})}{\sum_{g'=1}^{G} \exp(s_{i,g'})},$$

where $G$ is the number of groups. Each token is assigned to the group with maximal probability:

$$g(i) = \arg\max_{g} \, p_{i,g}.$$

Self-attention is then performed within each group:

$$\mathrm{Attn}(x_i) = \mathrm{softmax}\!\left(\frac{q_i K_{g(i)}^{\top}}{\sqrt{d}}\right) V_{g(i)},$$

where $q_i$ is the token's projected query, and $K_{g(i)}$, $V_{g(i)}$ are the keys and values of the group to which $x_i$ is assigned. Under uniform group assignments, each of the $G$ groups attends over only about $N/G$ tokens, so the standard quadratic cost $O(N^2)$ drops to approximately $G \cdot (N/G)^2 = N^2/G$. The router's trainable parameters serve as cluster centers, shaping groups according to semantic affinities in feature space (Jia et al., 21 Oct 2025).
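The following PyTorch sketch illustrates the routing and grouped attention described above; the tensor shapes, parameter names, and the explicit per-group loop are illustrative simplifications rather than the paper's implementation.

```python
import torch

def moga_attention(x, w_router, w_q, w_k, w_v, n_groups):
    """x: (N, d) token features; w_router: (d, G); w_q/w_k/w_v: (d, d)."""
    scores = x @ w_router                  # (N, G) routing scores s_i = W_r x_i
    probs = scores.softmax(dim=-1)         # per-group probabilities p_{i,g}
    group_id = probs.argmax(dim=-1)        # hard assignment g(i)

    q, k, v = x @ w_q, x @ w_k, x @ w_v    # per-token queries, keys, values
    d = q.shape[-1]
    out = torch.zeros_like(q)
    for g in range(n_groups):
        idx = (group_id == g).nonzero(as_tuple=True)[0]
        if idx.numel() == 0:
            continue
        # Scaled dot-product attention restricted to tokens routed to group g.
        attn = (q[idx] @ k[idx].T / d**0.5).softmax(dim=-1)
        out[idx] = attn @ v[idx]
    return out

# Example: 1024 tokens of width 64 routed into 8 groups.
x = torch.randn(1024, 64)
y = moga_attention(x, torch.randn(64, 8), torch.randn(64, 64),
                   torch.randn(64, 64), torch.randn(64, 64), n_groups=8)
```

The per-group Python loop is for clarity only; in practice the grouped computation is dispatched to fused kernels, as described in Section 3.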
2. Motivation and Theoretical Underpinnings
The inefficiency of full attention, scaling as $O(N^2)$ for $N$ tokens, limits its adoption in minute-scale high-resolution video generation. Empirical observations in existing literature demonstrate considerable sparsity in optimal attention maps: only a small fraction of query-key pairs exert significant influence on model output (Jia et al., 21 Oct 2025). MoGA's learnable router is architected to exploit these redundancies by matching semantically related (or contextually similar) tokens, effectively discovering groups in which intra-group attention preserves the crucial dependencies.
Conceptually, this approach can be viewed as extending mixture-of-experts or the Mixture of Attentive Experts (MAE) (Peng et al., 2020) paradigm from head-level to token-group level, where the router acts akin to a gating network, but at the granularity of input token features. Unlike static sparse attention (block-wise or fixed patterns), MoGA’s group assignments are input-adaptive and parameterized, supporting specialization and dynamic composition.
3. Integration with FlashAttention and Parallelism
MoGA is engineered for compatibility with high-performance attention kernels:
- FlashAttention Integration: After token routing assignments are computed, tokens are permuted so that those sharing a group become contiguous in memory. FlashAttention or an equivalent kernel is then applied within each block/group independently, and the outputs are inverse-permuted back to the original token order (a minimal sketch of this permute-compute-unpermute pattern appears below).
- Sequence Parallelism: The router’s group assignments are computed and broadcasted across worker nodes prior to distributed attention computation, enabling each group/block of tokens to be processed independently. The group-based structure harmonizes with gather-scatter communication patterns commonly used in transformer parallelism frameworks.
This modular integration strategy ensures that the computational and memory gains of MoGA extend to large-scale training infrastructure and long-context deployment (Jia et al., 21 Oct 2025).
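A minimal sketch, assuming a generic fused kernel, of the permute → block-attention → inverse-permute pattern described above; `attn_block_fn` stands in for a FlashAttention (or equivalent) call on one contiguous group and is not part of the paper's API.

```python
import torch

def grouped_kernel_attention(q, k, v, group_id, attn_block_fn):
    """q, k, v: (N, d); group_id: (N,) integer group assignments."""
    perm = torch.argsort(group_id)     # tokens of the same group become contiguous
    inv_perm = torch.argsort(perm)     # restores the original token order
    qs, ks, vs = q[perm], k[perm], v[perm]

    counts = torch.bincount(group_id)  # block (group) sizes, in group-id order
    out = torch.empty_like(qs)
    start = 0
    for c in counts.tolist():
        if c == 0:
            continue
        blk = slice(start, start + c)
        # Run the fused kernel on one contiguous block, e.g. a FlashAttention call.
        out[blk] = attn_block_fn(qs[blk], ks[blk], vs[blk])
        start += c
    return out[inv_perm]               # scatter results back to original positions
```

In a sequence-parallel setting, `group_id` would be computed once and broadcast so that each worker can process its blocks independently, matching the gather-scatter pattern described above.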
4. Auxiliary Mechanisms: Group Balancing and Local Group Attention
To mitigate collapse (disproportionate assignment of tokens to only a few groups), MoGA introduces a group balancing auxiliary loss:

$$\mathcal{L}_{\mathrm{balance}} = \alpha \, G \sum_{g=1}^{G} f_g \, \bar{p}_g,$$

where $f_g$ is the fraction of tokens assigned to group $g$, $\bar{p}_g$ is the average routing probability for group $g$, $G$ is the number of groups, and $\alpha$ is a tunable hyperparameter. This loss drives the router toward a uniform token distribution across groups, supporting efficient exploitation of parallel resources.
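A short sketch of this balancing term, assuming the standard MoE-style load-balancing form implied by the symbol definitions above (the exact formulation and the value of $\alpha$ are assumptions):

```python
import torch

def group_balance_loss(probs, group_id, n_groups, alpha=1e-2):
    """probs: (N, G) routing probabilities; group_id: (N,) hard assignments."""
    # f_g: fraction of tokens routed to group g.
    f = torch.bincount(group_id, minlength=n_groups).float() / group_id.shape[0]
    # p_bar_g: mean routing probability mass placed on group g.
    p_bar = probs.mean(dim=0)
    # alpha * G * sum_g f_g * p_bar_g, minimized when both distributions are uniform.
    return alpha * n_groups * (f * p_bar).sum()
```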
Additionally, MoGA coexists with local spatiotemporal group attention (STGA). STGA clusters tokens by spatial and temporal locality (e.g., patches within a frame and adjacent frames) for separate intra-group attention, preserving fine-grained local dependencies. In practice, outputs from MoGA and STGA are often averaged, producing richer representations that capture both global semantic and local visual coherence (Jia et al., 21 Oct 2025).
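As an illustration of this combination, the fragment below pairs a simplified local-window attention (a stand-in for STGA; the fixed-window construction is an assumption, since STGA proper groups tokens by spatial and temporal locality) with the global routed branch and averages the two outputs:

```python
import torch

def stga_attention(q, k, v, window):
    """Local attention over fixed-size token windows (simplified STGA stand-in)."""
    d = q.shape[-1]
    out = torch.empty_like(q)
    for s in range(0, q.shape[0], window):
        blk = slice(s, s + window)
        attn = (q[blk] @ k[blk].T / d**0.5).softmax(dim=-1)
        out[blk] = attn @ v[blk]
    return out

# Average the global (MoGA) and local (STGA) branches, as described above:
# y = 0.5 * (moga_out + stga_attention(q, k, v, window=256))
```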
5. Applications and Experimental Validation
MoGA is validated primarily in end-to-end long video generation with Diffusion Transformers (DiTs). Its routing and grouped attention mechanism enables processing of context windows up to approximately 580k tokens (minute-level 480p video at 24 fps):
- Quantitative experiments (see Table 1, Table 2, Table 3 in the referenced work) demonstrate that MoGA achieves parity with or improvement over dense attention and other sparse baselines in subject consistency, motion smoothness, and background quality, even at high sparsity levels.
- In multi-shot generation, MoGA outperforms keyframe-initialized or segmented approaches in maintaining within-shot details and cross-shot coherence.
- Qualitative results show MoGA’s ability to maintain character identity across temporally distant frames and handle significant appearance variations.
- Ultra-long generation scenarios (e.g., generating over 1,400 frames in a video) confirm MoGA’s ability to preserve narrative and visual structure over very long sequences.
6. Comparison with Related Mixture and Grouped Attention Approaches
MoGA extends the principles of previous mixture-style attention and token grouping models:
| Method | Routing Granularity | Specialization | Dynamicity |
|---|---|---|---|
| MAE (Peng et al., 2020) | Head-level grouping | Per-input gating | Static grouping |
| GroupMixAttention (Ge et al., 2023) | Multi-scale token & group proxies | Fixed group kernels | Static grouping |
| mixSGA (Song et al., 16 Jun 2025) | Token-wise MoE, dynamic KV group | Token scoring and assignment | Learned, per-token |
| MoGA (Jia et al., 21 Oct 2025) | Token, data-driven group | Learnable semantic clustering | Fully dynamic |
MoGA’s technical distinction lies in its lightweight, input-adaptive router that performs semantic-aware token grouping, supporting flexible specialization and significant computational savings, and its explicit compatibility with high-efficiency attention kernels.
7. Technical Significance, Impact, and Limitations
MoGA addresses a critical scaling barrier in transformer models for long-sequence tasks, enabling minute-level, multi-shot, high-resolution, end-to-end video generation with context sizes previously infeasible for dense or blockwise sparse attention. By reducing the effective complexity from $O(N^2)$ to approximately $O(N^2/G)$, MoGA opens new domains within generative modeling and sequence processing.
Its strengths stem from:
- Semantic clustering of tokens via a learnable router, enabling long-range interaction while maintaining computational sparsity.
- Seamless integration with established kernel and parallel processing techniques.
- Empirical improvements not only in computational metrics but also subject-level video consistency and global coherence under challenging sequence lengths.
A potential limitation of the approach is the added complexity of training the router and balancing group assignments, as well as possible performance degradation when semantically disparate tokens are grouped together by the router. The method's effectiveness depends on careful calibration of router capacity, balancing losses, and group count.
Overall, MoGA represents a distinctive advance in the practical and theoretical scaling of attention mechanisms for long-context sequence modeling, offering a blueprint for further innovations in sparse, adaptive-attention transformer architectures (Jia et al., 21 Oct 2025).