- The paper establishes a comprehensive characterization of functional equivalence in multihead attention by deriving symmetry groups for different positional encodings.
- The paper introduces a two-stage alignment methodology that effectively minimizes loss barriers while preserving linear mode connectivity across architectures.
- The paper demonstrates that the use of RoPE reduces parameter redundancy, enhancing expressivity and generalization in various model scales and tasks.
Functional Equivalence in Attention and Its Implications for Linear Mode Connectivity
The paper "Functional Equivalence in Attention: A Comprehensive Study with Applications to Linear Mode Connectivity" (2606.17830) presents a rigorous analysis of the parameter symmetries in multihead attention (MHA), emphasizing the critical influence of positional encodings (PEs), especially sinusoidal and Rotary Positional Encoding (RoPE), on both theoretical functional equivalence and empirical linear mode connectivity (LMC). This work establishes the most comprehensive characterization of functional equivalence for attention-based models to date and delineates its ramifications for model alignment and interpolation in parameter space.
Theoretical Framework: Parameter Symmetry and Functional Equivalence in Attention
The inherent non-injectivity of neural parameterizations means distinct parameter vectors may realize identical input–output functions. For standard MHA without PEs, functional equivalence is governed by permutations of attention heads and independent invertible transformations of each head's internal subspace. These symmetries are encoded in the group $G_{\mathrm{Att}}(d_h, h) = S_h \times (\mathrm{GL}(d_h) \times \mathrm{GL}(d_h))^h$, where $d_h$ is the per-head dimension, $h$ is the number of heads, and $S_h$ is the symmetric group on the heads. The authors invoke a generic identifiability theorem: outside a measure-zero set, two MHA parameterizations represent the same function if and only if they are related by an element of $G_{\mathrm{Att}}$.
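The "if" direction of this statement is easy to check numerically. The following is a minimal sketch, not the authors' code: it assumes a row-vector convention with per-head weights `WQ`, `WK`, `WV` of shape `(d, dh)` and `WO` of shape `(dh, d)`, and the helper names `mha` and `apply_symmetry` are hypothetical. One `GL(dh)` factor acts on the query/key pair and the other on the value/output pair.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mha(X, heads):
    """Multihead attention without positional encoding.
    X: (n, d); heads: list of (WQ, WK, WV, WO) tuples (assumed layout)."""
    out = np.zeros_like(X)
    for WQ, WK, WV, WO in heads:
        scores = (X @ WQ) @ (X @ WK).T / np.sqrt(WQ.shape[1])
        out += softmax(scores) @ (X @ WV) @ WO
    return out

def apply_symmetry(heads, perm, As, Bs):
    """Reorder heads by `perm` and apply (A, B) in GL(dh) x GL(dh) per head.
    A cancels inside Q K^T, B cancels inside V W_O."""
    new_heads = []
    for i in perm:
        WQ, WK, WV, WO = heads[i]
        A, B = As[i], Bs[i]
        new_heads.append((WQ @ A, WK @ np.linalg.inv(A).T,
                          WV @ B, np.linalg.inv(B) @ WO))
    return new_heads

rng = np.random.default_rng(0)
n, d, dh, h = 5, 8, 4, 3
X = rng.normal(size=(n, d))
shapes = [(d, dh), (d, dh), (d, dh), (dh, d)]
heads = [tuple(rng.normal(size=s) for s in shapes) for _ in range(h)]
# Well-conditioned invertible matrices close to the identity.
As = [np.eye(dh) + 0.1 * rng.normal(size=(dh, dh)) for _ in range(h)]
Bs = [np.eye(dh) + 0.1 * rng.normal(size=(dh, dh)) for _ in range(h)]

Y_original = mha(X, heads)
Y_transformed = mha(X, apply_symmetry(heads, [2, 0, 1], As, Bs))
print(np.allclose(Y_original, Y_transformed))
```

The check covers only the easy direction (group elements preserve the function); the converse, that generic parameters admit no other symmetries, is the content of the identifiability theorem and cannot be verified this way.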
In the presence of sinusoidal absolute positional encoding, the symmetry group remains unaltered. Since sinusoidal PE applies an invertible shift to inputs, the functional equivalence classes coincide precisely with the vanilla case. Thus, the addition of sinusoidal PE preserves the architectural symmetries and injects no new constraints on parameter re-mappings.
By contrast, RoPE, the preeminent relative encoding in recent architectures, alters the group structure fundamentally. The introduction of block-diagonal rotation matrices between query and key projections decouples the symmetry in those weights. The symmetry group collapses to a proper subgroup $G_{\mathrm{RoPE}}(d_h, h) = S_h \times (H(d_h) \times \mathrm{GL}(d_h))^h$, where $H(d_h)$ is the abelian group of invertible (scaled) rotational matrices. This reduction in symmetry expands the expressive function class of RoPE-MHA and accounts for the empirical advantages of RoPE in long-context and compositional generalization scenarios.
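The reason the query/key factor shrinks can be illustrated numerically. The sketch below is an illustration under stated assumptions rather than the paper's construction: it uses the common RoPE layout with 2x2 rotation blocks and frequencies `base ** (-2j / dh)`, and the helper names (`rope_matrix`, `rope_scores`, `scaled_rotation_blocks`) are hypothetical. A transformation `A` leaves the pre-softmax scores unchanged only if it commutes with every RoPE rotation, which holds for block-diagonal scaled rotations but not for a generic invertible matrix.

```python
import numpy as np

def rope_matrix(pos, dh, base=10000.0):
    """Block-diagonal rotation for position `pos` (assumed standard RoPE layout)."""
    R = np.zeros((dh, dh))
    for j in range(dh // 2):
        theta = pos * base ** (-2.0 * j / dh)
        c, s = np.cos(theta), np.sin(theta)
        R[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[c, -s], [s, c]]
    return R

def rope_scores(X, WQ, WK):
    """Pre-softmax attention scores with RoPE applied to queries and keys."""
    n, dh = X.shape[0], WQ.shape[1]
    Q = np.stack([rope_matrix(m, dh) @ (X[m] @ WQ) for m in range(n)])
    K = np.stack([rope_matrix(m, dh) @ (X[m] @ WK) for m in range(n)])
    return Q @ K.T

def scaled_rotation_blocks(dh, rng):
    """Block-diagonal matrix whose 2x2 blocks are [[a, -b], [b, a]]."""
    A = np.zeros((dh, dh))
    for j in range(dh // 2):
        a = 1.0 + 0.3 * rng.normal()
        b = 0.3 * rng.normal()
        A[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[a, -b], [b, a]]
    return A

rng = np.random.default_rng(0)
n, d, dh = 6, 8, 4
X = rng.normal(size=(n, d))
WQ, WK = rng.normal(size=(d, dh)), rng.normal(size=(d, dh))
S_ref = rope_scores(X, WQ, WK)

# Scaled rotations commute with the RoPE blocks, so the scores are preserved.
A_rot = scaled_rotation_blocks(dh, rng)
S_rot = rope_scores(X, WQ @ A_rot, WK @ np.linalg.inv(A_rot).T)

# A generic invertible matrix does not commute, so the scores change.
A_gen = np.eye(dh) + 0.3 * rng.normal(size=(dh, dh))
S_gen = rope_scores(X, WQ @ A_gen, WK @ np.linalg.inv(A_gen).T)

print("scaled rotation preserves scores:", np.allclose(S_ref, S_rot))
print("generic GL element preserves scores:", np.allclose(S_ref, S_gen))
```

The value/output pair is untouched by RoPE, which is consistent with the second factor of the group remaining the full `GL(dh)`.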
Generalization to Arbitrary Attention Architectures
The theoretical results extend to generalized attention mechanisms parameterized by blockwise symmetric bilinear forms that accommodate both absolute and relative (including RoPE) PE schemes. The functional identifiability result demonstrates that, given generic parameters, the only possible symmetries—subject to PE construction—are those formally derivable from the architecture itself.
Crucially, the paper's Theorems 4.1 and 4.2 precisely characterize the circumstances under which distinct parameterizations yield functionally identical attention outputs, showing that symmetry is fully determined by the PE type. For RoPE, the class of allowable transformations is strictly smaller than for absolute positional encodings (APEs) or vanilla MHA, quantifying the reduction in parameter space redundancy.
Of particular note, the authors rigorously prove the linear independence of attention head outputs (after symmetry reduction), leveraging concepts from the algebraic theory of exponential polynomials and classical combinatorial theorems (such as Hall's Marriage Theorem and properties of partition lattices). This structural result underpins well-posedness for model alignment and re-basin procedures in Transformers.
Linear Mode Connectivity in the Presence of Attention Symmetries
LMC captures the empirical phenomenon that interpolations between independently trained networks, once aligned modulo symmetry, can trace low-loss pathways in parameter space. However, permutations and deeper symmetries obscure true functional distances. The paper develops a two-stage, data-independent alignment methodology amenable to both APE and RoPE cases. The procedure first optimizes over the head permutation by solving a linear assignment problem with an $O(h^3)$ algorithm, then solves a subgroup-constrained optimal alignment for each matched pair's internal weights, leveraging analytical and numerical techniques (including SVD, orthogonal initialization, and efficient gradient flows for abelian subgroups).
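The two stages can be sketched as follows for the vanilla/APE case. This is a simplified stand-in, not the authors' algorithm: the matching cost (built from the products `WQ @ WK.T` and `WV @ WO`, which the per-head `GL(dh)` action leaves unchanged) and the unconstrained least-squares fit in stage two are assumptions chosen for brevity, and the function names are hypothetical. The RoPE case would additionally require restricting the query/key transformation to the subgroup $H(d_h)$, which this sketch does not do.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_heads(heads_a, heads_b):
    """Stage 1: match heads by solving a linear assignment problem.
    Cost compares per-head products that are invariant to the GL(dh) action."""
    h = len(heads_a)
    cost = np.zeros((h, h))
    for i, (qa, ka, va, oa) in enumerate(heads_a):
        for j, (qb, kb, vb, ob) in enumerate(heads_b):
            cost[i, j] = (np.sum((qa @ ka.T - qb @ kb.T) ** 2)
                          + np.sum((va @ oa - vb @ ob) ** 2))
    _, cols = linear_sum_assignment(cost)
    return cols  # cols[i] is the head of model B matched to head i of model A

def align_head(head_a, head_b):
    """Stage 2 (simplified): least-squares fit of A and B so that
    WQ_b @ A ~ WQ_a and WV_b @ B ~ WV_a, then apply the paired inverses."""
    qa, ka, va, oa = head_a
    qb, kb, vb, ob = head_b
    A, *_ = np.linalg.lstsq(qb, qa, rcond=None)
    B, *_ = np.linalg.lstsq(vb, va, rcond=None)
    return (qb @ A, kb @ np.linalg.inv(A).T, vb @ B, np.linalg.inv(B) @ ob)

def align(heads_a, heads_b):
    cols = match_heads(heads_a, heads_b)
    return [align_head(heads_a[i], heads_b[j]) for i, j in enumerate(cols)]

# Toy check: model B is model A with heads shuffled and re-parameterized.
rng = np.random.default_rng(0)
d, dh, h = 8, 4, 3
shapes = [(d, dh), (d, dh), (d, dh), (dh, d)]
heads_a = [tuple(rng.normal(size=s) for s in shapes) for _ in range(h)]
heads_b = []
for i in [2, 0, 1]:
    q, k, v, o = heads_a[i]
    A0 = np.eye(dh) + 0.1 * rng.normal(size=(dh, dh))
    B0 = np.eye(dh) + 0.1 * rng.normal(size=(dh, dh))
    heads_b.append((q @ A0, k @ np.linalg.inv(A0).T, v @ B0, np.linalg.inv(B0) @ o))

aligned = align(heads_a, heads_b)
max_err = max(np.abs(wa - wb).max()
              for ha, hb in zip(heads_a, aligned) for wa, wb in zip(ha, hb))
print("max weight error after alignment:", max_err)
```

In this toy setting the second model is an exact symmetry image of the first, so alignment recovers the original weights. For independently trained models the fit is only approximate, which is where the choice of objective and subgroup constraint matters.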
The alignment algorithm is shown to be essential: ablation studies indicate that omitting either stage or using naive (e.g., orthogonal-only) matching yields significant connectivity barriers, while the full procedure consistently minimizes loss barriers between matched models, validating the importance of symmetry-respecting alignment, particularly when RoPE is present.
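For concreteness, a loss barrier along a linear path can be computed as below. This is a minimal sketch using one common definition (the largest excess of the interpolated loss over the linear interpolation of the endpoint losses); the paper's exact definition may differ, and `loss_barrier` and the toy loss are hypothetical. The toy loss has a sign symmetry, so it also shows how aligning modulo symmetry removes a barrier.

```python
import numpy as np

def interpolate(theta_a, theta_b, t):
    """Linear interpolation between two parameter dictionaries."""
    return {k: (1 - t) * theta_a[k] + t * theta_b[k] for k in theta_a}

def loss_barrier(loss_fn, theta_a, theta_b, num_points=11):
    """Max over t of L(theta_t) - [(1 - t) L(theta_a) + t L(theta_b)]."""
    loss_a, loss_b = loss_fn(theta_a), loss_fn(theta_b)
    gaps = []
    for t in np.linspace(0.0, 1.0, num_points):
        loss_t = loss_fn(interpolate(theta_a, theta_b, t))
        gaps.append(loss_t - ((1 - t) * loss_a + t * loss_b))
    return max(gaps)

def toy_loss(theta):
    # Invariant under w -> -w, with minima at w = +1 and w = -1.
    return float((theta["w"] ** 2 - 1.0) ** 2)

theta_a = {"w": np.array(1.0)}
theta_b = {"w": np.array(-1.0)}            # same function class, opposite sign
theta_b_aligned = {"w": -theta_b["w"]}     # apply the sign symmetry

barrier_naive = loss_barrier(toy_loss, theta_a, theta_b)
barrier_aligned = loss_barrier(toy_loss, theta_a, theta_b_aligned)
print(barrier_naive, barrier_aligned)
```

In practice `loss_fn` would evaluate a model on held-out data, and the symmetry applied before interpolation would be the head permutation and per-head transformations rather than a sign flip.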
Empirical Evidence: LMC Across Scales, Tasks, and PE Types
Experiments span vision (ViT on MNIST, CIFAR, ImageNet), language modeling (GPT-2, Llama on Enwik8, WikiText, One Billion Word), and text classification (BERT on AGNews, IMDB, DBPedia). The main findings are:
- LMC emerges robustly for encoder-only architectures across moderate-scale benchmarks given appropriate alignment, especially when perturbing only shallow/few layers.
- LMC is preserved with RoPE, and the accessible function space is richer because of the reduced symmetry, consistent with the theoretical predictions.
- For large-scale, decoder-only models and challenging datasets, LMC generally fails under full-model re-initialization. This indicates increasing loss landscape complexity and possibly inherent functional isolation in such settings.
- Aligning models under LMC preserves and even improves generalization under distribution shifts (e.g., on ImageNet-C corruptions) if LMC barriers are low, whereas naive model matching in the absence of LMC yields catastrophic forgetting or loss spikes.
Implications and Future Work
The theoretical results provide a first-principles explanation for the increasing adoption of RoPE: by reducing parameter redundancy and enriching expressivity, RoPE not only facilitates sophisticated relative reasoning but also imposes beneficial constraints on functional equivalence classes, making weight-space ensembling and model alignment better-behaved.
Practically, this impacts model merging, transfer, and continual learning paradigms. In alignment-based meta-learning, enforcing or exploiting these reduced symmetries could yield more robust algorithmic frameworks. However, the negative results for LMC at scale suggest new obstacles: as model complexity and data horizons increase, architecture and initialization may need to explicitly encourage (or exploit) the “star-shaped” connectivity observed in smaller models.
Theoretically, the completeness of the symmetry classification paves the way for future attempts to rigorously certify (non-)existence of LMC in deep, non-linear architectures modulo full architectural symmetry—a challenging but foundational open problem for understanding deep loss landscapes.
Conclusion
This work delivers a mathematically complete characterization of symmetries in multihead attention, resolving the functional equivalence structure of MHA as modulated by sinusoidal and rotary encodings. The reduction of allowable symmetries under RoPE, both theoretically and empirically, translates directly to increased functional expressivity, improved alignment, and practical benefits in model connectivity. The results lay essential groundwork for principled model merging and re-basin approaches in modern attention-based architectures and highlight important limitations and future challenges as model scales and tasks grow in scope and difficulty (2606.17830).