
PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer

Published 7 Apr 2026 in cs.CV and cs.AI | (2604.06129v1)

Abstract: This paper introduces the Polynomial Mixer (PoM), a novel token mixing mechanism with linear complexity that serves as a drop-in replacement for self-attention. PoM aggregates input tokens into a compact representation through a learned polynomial function, from which each token retrieves contextual information. We prove that PoM satisfies the contextual mapping property, ensuring that transformers equipped with PoM remain universal sequence-to-sequence approximators. We replace standard self-attention with PoM across five diverse domains: text generation, handwritten text recognition, image generation, 3D modeling, and Earth observation. PoM matches the performance of attention-based models while drastically reducing computational cost when working with long sequences. The code is available at https://github.com/davidpicard/pom.

Summary

  • The paper introduces PoM as a linear complexity alternative to self-attention, using a polynomial mixer to aggregate token context efficiently.
  • It employs a two-stage process with shared state construction and token-specific gating, ensuring permutation equivariance and universal sequence representation.
  • Experimental evaluations across NLP, vision, OCR, and geospatial tasks show 2–4× speedups and memory efficiency without compromising performance.

PoM: Linear-Time Attention Replacement with the Polynomial Mixer

Introduction and Motivation

The paper "PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer" (2604.06129) presents the Polynomial Mixer (PoM), a novel sequence mixing module that achieves linear complexity with respect to sequence length and is proposed as a direct substitute for the self-attention mechanism in Transformer architectures. This development addresses the longstanding computational bottleneck of Multi-Head Attention (MHA), whose quadratic time and memory complexity in sequence length $n$ significantly restricts scalability across modalities, especially in vision and geospatial tasks, where input sequences scale rapidly with resolution and coverage.

The PoM design is inspired by representation learning with high-order moments and polynomial features, leveraging these concepts to aggregate token context efficiently. The paper extensively validates PoM in five disparate real-world domains, ranging from generative modeling to semantic segmentation, demonstrating that replacing attention with PoM does not compromise performance relative to quadratic-cost MHA. Instead, PoM yields substantial speed and memory efficiency gains, especially at long sequence lengths, a regime where even highly optimized attention implementations like FlashAttention begin to lag.

Polynomial Mixer Architecture

Mechanism and Formalism

PoM introduces a two-stage aggregation-query token mixing process. Given an input tensor $X \in \mathbb{R}^{d \times n}$ of $n$ tokens, each with feature dimension $d$, the process is as follows:

  1. Shared State Construction: Each input token is projected to a higher-dimensional space and then transformed via a degree-$k$ polynomial, parameterized by learnable coefficients. The result is aggregated over the sequence to form a single contextual state vector $H(X) \in \mathbb{R}^D$.
  2. Token-Specific Extraction: Each token produces a gating vector, which multiplicatively gates $H(X)$ to extract relevant context and projects this back to the original dimension.

The module ensures all operations are linear in both computation and memory with respect to sequence length $n$. The effective expressivity of PoM derives from the polynomial expansion, capable of capturing complex token interactions.
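The two-stage process above can be sketched in a few lines of NumPy. This is an illustrative reading of the mechanism, not the authors' implementation (see the linked repository for that); the sigmoid gating, mean pooling, and weight shapes here are assumptions made for the sketch.

```python
import numpy as np

def poly_mixer_sketch(x, W_e, coeff, W_g, W_o, degree=2):
    """Illustrative PoM-style two-stage mixer.

    x: (n, d) input tokens; W_e: (d, D) expansion; coeff: (degree, D)
    learned polynomial coefficients; W_g: (d, D) gating; W_o: (D, d) output.
    Every step is O(n) in sequence length.
    """
    z = x @ W_e                                   # stage 1: expand each token to D dims
    # element-wise learned polynomial, then mean-pool into shared state H(X)
    feats = sum(coeff[k] * z ** (k + 1) for k in range(degree))
    h = feats.mean(axis=0)                        # (D,) single contextual state
    g = 1.0 / (1.0 + np.exp(-(x @ W_g)))          # stage 2: per-token sigmoid gates
    return (g * h) @ W_o                          # gated read-out, back to d dims

rng = np.random.default_rng(0)
n, d, D = 6, 4, 8
x = rng.normal(size=(n, d))
W_e, W_g = rng.normal(size=(d, D)), rng.normal(size=(d, D))
coeff, W_o = rng.normal(size=(2, D)), rng.normal(size=(D, d))
y = poly_mixer_sketch(x, W_e, coeff, W_g, W_o)
assert y.shape == (n, d)
```

Note that the pooled state $h$ has a fixed size regardless of $n$, which is where the linear (rather than quadratic) scaling comes from.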

Permutation Equivariance and Universality

The PoM construction is formally shown to be permutation equivariant, a critical property for sequence models. The authors further prove, mirroring the universal approximation guarantees for MHA, that Transformers constructed with PoM (termed PolyMorphers) are universal sequence-to-sequence approximators. This hinges on the contextual mapping property and the inclusion of positional encoding.
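Permutation equivariance follows from the shared state being a symmetric (order-invariant) aggregate: shuffling the input tokens shuffles the outputs identically. A minimal numeric check, using a toy symmetric polynomial pooling in place of the full module:

```python
import numpy as np

rng = np.random.default_rng(0)
n, D = 5, 8
z = rng.normal(size=(n, D))                    # expanded token features
g = rng.normal(size=(n, D))                    # per-token gating vectors

def mix(z, g):
    # symmetric pooling: the mean is invariant to token order
    h = (z ** 2 + z).mean(axis=0)
    return g * h                               # token-wise read-out

perm = rng.permutation(n)
out = mix(z, g)
out_perm = mix(z[perm], g[perm])
assert np.allclose(out[perm], out_perm)        # equivariance holds
```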

Adapting PoM to Sequence Causality

PoM naturally extends to handle causal and arbitrary attention masks. By defining time-step-dependent or block-causal state representations, PoM can process causal or structured dependencies efficiently, supporting both parallel training and recursive inference with $\mathcal{O}(1)$ complexity per token.
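The causal variant can be illustrated with a running aggregate: token $t$ only reads from the state built over tokens $\le t$. The sketch below (a running mean is an assumption; the paper's exact aggregation may differ) shows that a parallel prefix computation used at training time and a constant-cost recurrent update used at inference time produce identical outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D = 7, 4
feats = rng.normal(size=(n, D))                # per-token polynomial features
gates = rng.normal(size=(n, D))                # per-token gating vectors

# Parallel form (training): causal state at step t = mean over tokens <= t.
prefix_mean = np.cumsum(feats, axis=0) / np.arange(1, n + 1)[:, None]
out_parallel = gates * prefix_mean

# Recurrent form (inference): O(1) work and memory per new token.
state = np.zeros(D)
out_recurrent = np.empty_like(feats)
for t in range(n):
    state += (feats[t] - state) / (t + 1)      # incremental running-mean update
    out_recurrent[t] = gates[t] * state

assert np.allclose(out_parallel, out_recurrent)
```

The recurrent form is what makes autoregressive generation cheap: unlike attention, no key-value cache growing with the sequence is needed, only the fixed-size state.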

Computational Analysis

A detailed comparison of computational costs highlights the transition point beyond which PoM becomes decisively more efficient than attention:

  • For typical Transformer settings (e.g., $d = 512$), PoM outpaces MHA beyond modest sequence lengths (typically a few thousand tokens).
  • Compared to highly engineered solutions such as FlashAttention, PoM's PyTorch-based implementation is already faster for long-context applications, despite not using custom kernels (Figure 1).
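The crossover can be made concrete with a back-of-the-envelope cost model. The constants and the state width $D$ below are illustrative assumptions, not the paper's exact cost accounting; the point is only that attention scales as $n^2 d$ while a PoM-style mixer scales as $n \, d \, D$, so the advantage grows linearly with sequence length.

```python
# Rough per-layer FLOP counts (illustrative constants).
def attn_flops(n, d):
    # QK^T score matrix plus attention-weighted value aggregation
    return 2 * n * n * d

def pom_flops(n, d, D):
    # expansion into the shared state plus gated read-out projections
    return 2 * n * d * D

d, D = 512, 2048
ratios = {n: attn_flops(n, d) / pom_flops(n, d, D)
          for n in (1_000, 4_000, 16_000, 64_000)}
# the attention/PoM cost ratio grows in proportion to n
assert ratios[16_000] > ratios[4_000] > ratios[1_000]
```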

Figure 2: Evaluation across NLP, OCR, 3D segmentation, and Earth observation, demonstrating PoM's retention of performance and speed advantage compared to MHA and hybrid models.

Experimental Evaluation Across Domains

The authors conduct a systematic empirical analysis of PoM in five diverse tasks: natural language modeling, image generation, OCR, point cloud segmentation, and remote sensing time series segmentation.

Natural Language Processing

GPT2 variants with PoM blocks achieve validation and downstream benchmark scores nearly matching standard MHA counterparts. Hybrid architectures (combining PoM and local attention) completely close the remaining performance gap, occasionally surpassing the vanilla Transformer. PoM-based models exhibit substantial throughput improvements versus Transformer baselines at long sequences, outpacing Mamba and maintaining stable accuracy.

Optical Character Recognition

Replacing self-attention in state-of-the-art handwritten text recognition models with PoM blocks yields competitive error rates (CER, WER) on both single- and multi-line tasks. The throughput improvements are more pronounced as the number of aggregated input sequences (lines) increases, highlighting the efficiency gain in more demanding, long-context regimes.

Earth Observation and 3D Point Cloud Segmentation

In dense, high-dimensional modalities (e.g., crop classification from satellite time series, point cloud semantic segmentation), PoM matches or slightly trails MHA in accuracy (mean IoU, OA), but inference throughput improves markedly, enabling real-time or large-scale inference in scenarios where MHA would face computational or memory barriers.

Image Generation

PoM substitutes for attention in scalable transformer-based diffusion models (e.g., SiT, DiT) trained on ImageNet for class-conditional synthesis. The replacement leaves FID scores unchanged, while drastically reducing computational latency per image as resolution increases (Figure 3).

Figure 1: Qualitative sample diversity for class-conditional image generation using SiPoM-XL/2, underscoring preservation of synthesis quality with PoM.

Discussion of Results and Claims

A salient claim, validated by extensive ablation and cross-domain evaluation, is that PoM yields negligible or no performance penalty relative to MHA for most practical settings when appropriately parameterized. The hybrid scheme, where local attention layers alternate with PoM blocks, demonstrates that global context provided by PoM can be complemented with fine-grained local modeling.

Numerical results reported underscore throughput gains:

  • In English Wikipedia-scale language modeling, PoM-based architectures process long sequences at markedly higher tokens/s than attention baselines (see Table 2).
  • PoM delivers a multi-fold speedup over FlashAttention at high resolutions in image generation.
  • For satellite time series, throughput (in km²/s) increases substantially with PoM versus attention-based models.

Importantly, these improvements are achieved using high-level implementations, implying the potential for even greater acceleration with custom kernels.

Implications and Future Directions

Practically, PoM's linear complexity enables real-time processing and direct modeling of longer sequences, including high-resolution visual content, extended text, or complex multimodal data, without imposing arbitrary restrictions. Theoretically, PoM retains the universality of Transformers for sequence modeling within a more efficient computational envelope.

Key implications include:

  • The architectural bottleneck imposed by quadratic self-attention can be circumvented without sacrificing model expressivity or accuracy, provided mixing operations are designed to preserve sequence context and permutation properties.
  • PoM unlocks architecturally feasible training and deployment for long-context and high-dimensional tasks, which will likely facilitate new instantiations of multimodal and generative AI models.

Looking ahead, further optimization, particularly of hardware-aware kernels, will reduce overhead and extend PoM's advantage in all regimes. Exploration of PoM within large-scale pretraining and autoregressive modeling pipelines, as well as its hybridization with other efficient architectures (e.g., SSMs, convolutional modules), represents a promising avenue.

Conclusion

The Polynomial Mixer establishes a new efficiency-accuracy frontier in sequence modeling, matching self-attention's empirical capabilities with dramatically lower resource requirements. The generality with which PoM can substitute for MHA across NLP, vision, geospatial, and generative modeling tasks, without specialized tuning or compromise, argues strongly for its adoption in future efficient Transformer frameworks. The modularity and formal guarantees of PoM suggest its seamless integration within existing architectures, providing a scalable path forward for the next generation of AI systems.
