Attention Is Not What You Need

This lightning talk explores an approach to sequence modeling that removes the attention mechanism entirely. The researchers propose replacing Transformer self-attention with geometric features derived from Grassmann manifolds, reporting performance close to attention baselines while offering complexity that is linear in sequence length and potentially greater interpretability through structured mathematical invariants.
Script
What if the very mechanism that powers modern language models is actually unnecessary? The researchers behind this work challenge the fundamental assumption that attention is required for effective sequence modeling, proposing instead a completely different approach based on geometric manifolds.
To understand why this matters, we need to examine what makes attention both powerful and problematic.
The authors start from a diagnosis: attention is fundamentally a tensor lifting operation, one that expands a sequence of tokens into large pairwise weight matrices and becomes analytically intractable. Most efficiency improvements still rely on computing those pairwise weight matrices.
Instead of fixing attention, what if we replaced it entirely with something mathematically structured?
The core insight is elegant: instead of attention's unconstrained pairwise lifting, they use the geometry of Grassmann manifolds to create structured, finite-dimensional interaction features. Think of it as replacing chaotic tensor explosions with controlled geometric flows.
Let me walk you through how this geometric approach actually processes sequences.
The process starts by reducing token representations to a lower-dimensional geometric space. Then it forms local token pairs across multiple window sizes. Each pair of vectors spans a two-dimensional subspace, which is a point on a Grassmann manifold, and that point is encoded using Plücker coordinates.
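For readers following along in text, the encoding step can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `plucker_coordinates` and the example vectors are hypothetical, and the reduction to a lower-dimensional space is assumed to have already happened. It uses the standard definition of Plücker coordinates for a two-dimensional subspace: all 2x2 minors of the matrix whose rows are the two vectors.

```python
from itertools import combinations


def plucker_coordinates(u, v):
    """Return the Plücker coordinates of the subspace spanned by u and v.

    Each coordinate is a 2x2 minor u[i]*v[j] - u[j]*v[i] for i < j, so
    vectors of length r give r*(r-1)/2 coordinates.
    """
    if len(u) != len(v):
        raise ValueError("u and v must have the same length")
    return [u[i] * v[j] - u[j] * v[i]
            for i, j in combinations(range(len(u)), 2)]


# Two hypothetical reduced token vectors in a 4-dimensional space.
z_a = [1, 0, 2, 0]
z_b = [0, 1, 0, 3]
coords = plucker_coordinates(z_a, z_b)
print(coords)       # six coordinates, one per index pair i < j
print(len(coords))  # 4 * 3 / 2 = 6
```

The feature size depends only on the reduced dimension, not on the sequence length, which is what keeps the interaction features finite-dimensional.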
The implementation uses causal pairing with exponentially spaced offsets to capture multi-scale patterns. Plücker coordinates provide the mathematical bridge between geometry and computation, while learned gates control how much geometric information to incorporate.
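A rough sketch of the pairing and gating ideas follows. Several details here are assumptions made for illustration: the offsets are taken to be powers of two, each position is paired only with earlier positions, and the gate is a single scalar logit rather than a learned, input-dependent quantity. The names `causal_pairs` and `gated_mix` are hypothetical.

```python
import math


def causal_pairs(seq_len, num_scales):
    """List (earlier, current) index pairs at offsets 1, 2, 4, ...

    Only pairs whose earlier index is inside the sequence are kept, so
    no position is ever paired with a later one.
    """
    offsets = [2 ** k for k in range(num_scales)]
    return [(t - d, t)
            for t in range(seq_len)
            for d in offsets
            if t - d >= 0]


def gated_mix(hidden, geometric, gate_logit):
    """Blend two feature vectors with a sigmoid gate.

    Assumed convention: a gate near 1 keeps the original hidden state,
    a gate near 0 takes the geometric feature instead.
    """
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    return [g * h + (1.0 - g) * x for h, x in zip(hidden, geometric)]


pairs = causal_pairs(seq_len=6, num_scales=3)
print(pairs)
print(gated_mix([2.0], [4.0], gate_logit=0.0))  # equal blend at logit 0
```

With three scales, each position looks back at most four steps, yet stacking layers lets information travel further than any single offset.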
Here's the key computational advantage: while attention scales quadratically with sequence length, Grassmann mixing achieves linear scaling by constraining interactions to local windows and fixed-dimensional geometric features.
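To make the scaling difference concrete, the sketch below simply counts interactions. It is a back-of-the-envelope comparison under assumed settings (four offsets, chosen here only for illustration), not a benchmark, and it ignores the per-pair cost of each method.

```python
def attention_pair_count(seq_len):
    """Full self-attention scores every token against every token."""
    return seq_len * seq_len


def windowed_pair_count(seq_len, offsets):
    """Count causal pairs (t - d, t) that fit inside the sequence."""
    return sum(max(seq_len - d, 0) for d in offsets)


offsets = [1, 2, 4, 8]  # hypothetical choice of window offsets
for n in (1024, 2048, 4096):
    print(n, attention_pair_count(n), windowed_pair_count(n, offsets))
```

Doubling the sequence length quadruples the first count but only roughly doubles the second, since the number of offsets stays fixed.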
Now let's see how this theoretical elegance translates to real performance.
On WikiText-2 language modeling, the Grassmann approach comes remarkably close to Transformer performance. Importantly, deeper models show better relative performance, suggesting the geometric mixing can approximate richer interactions through layer stacking.
On the SNLI entailment task, the Grassmann approach actually outperforms the attention baseline slightly, showing this isn't just a language modeling phenomenon but applies to classification tasks as well.
Of course, this is early-stage research with some important caveats to consider.
The main practical limitation is that current implementations lack the hardware optimizations that make attention fast. The approach also has to reach long-range dependencies purely through local windows, which may not capture all the patterns attention can.
But perhaps the most intriguing aspect isn't just efficiency - it's what this could mean for understanding models.
Unlike attention weights, which resist mathematical analysis, Plücker coordinates live in well-understood geometric spaces with algebraic structure. This opens up possibilities for tracking how information flows through geometric transformations rather than opaque tensor operations.
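Two examples of that algebraic structure can be checked directly. The sketch below is an illustration using standard facts about Plücker coordinates, not code from the paper: in four dimensions the coordinates of any two-dimensional subspace satisfy the quadratic Plücker relation, and changing the basis of the subspace only rescales all coordinates by the determinant of the change of basis. The helper names and the example vectors are hypothetical.

```python
def plucker4(u, v):
    """Plücker coordinates (p01, p02, p03, p12, p13, p23) of span{u, v}."""
    index_pairs = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
    return [u[i] * v[j] - u[j] * v[i] for i, j in index_pairs]


def plucker_relation(p):
    """Left-hand side of p01*p23 - p02*p13 + p03*p12 = 0."""
    p01, p02, p03, p12, p13, p23 = p
    return p01 * p23 - p02 * p13 + p03 * p12


u = [1, 2, 0, -1]
v = [3, 0, 1, 2]
p = plucker4(u, v)
print(plucker_relation(p))  # the relation holds for any u, v

# Same subspace, different basis: u2 = 2u + v, v2 = u - v.
# The change-of-basis determinant is 2*(-1) - 1*1 = -3.
u2 = [2 * a + b for a, b in zip(u, v)]
v2 = [a - b for a, b in zip(u, v)]
q = plucker4(u2, v2)
print(q == [-3 * x for x in p])  # coordinates rescale by the determinant
```

Because the coordinates are defined only up to this common scale, they describe the subspace itself rather than the particular pair of vectors that spans it.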
This work opens several promising research directions that could transform sequence modeling.
The authors outline exciting extensions: moving beyond 2D subspaces, incorporating global geometric feedback, and hybrid designs. The engineering challenge of optimized kernels could unlock the theoretical complexity advantages.
More broadly, this could establish an entirely new paradigm for sequence modeling based on geometric flows rather than attention. The mathematical structure might finally give us tools to truly understand what these models learn.
Let me wrap up with the core insights that make this work significant.
This isn't just another efficiency improvement - it's a fundamental reconceptualization of how sequences can be modeled. By replacing attention's tensor chaos with geometric structure, the authors show that attention may not be as indispensable as we thought.
The geometric approach to sequence modeling represents a fascinating departure from attention-based architectures, suggesting that mathematical elegance and computational efficiency can coexist. For more cutting-edge research that challenges conventional wisdom in AI, visit EmergentMind.com.