Hyena Operators: Efficient Sequence Modeling
- Hyena Operators are attention-free sequence modeling mechanisms that combine long implicit convolutions with data-dependent gating to capture global context.
- They achieve subquadratic computational complexity by leveraging FFT-based convolutions, enabling efficient handling of very long sequences.
- Empirical benchmarks demonstrate that Hyena models match or surpass Transformer performance across language, vision, and scientific tasks while reducing training FLOPs.
Hyena Operators are a class of neural network sequence modeling mechanisms that serve as subquadratic, attention-free alternatives to self-attention within Transformer-like architectures. By interleaving implicitly parametrized long convolutions and data-controlled multiplicative gating, Hyena Operators enable efficient handling of very long-range dependencies without incurring the quadratic cost characteristic of classic attention mechanisms.
1. Principle and Mathematical Formulation
Hyena Operators are designed to provide global context modeling with improved computational efficiency. This is achieved through operator constructions that combine long (implicit) convolutional filters with data-dependent nonlinear gating. For an input sequence of length $L$ and operator order $N$, the core computation is formulated as the recurrence:

$$z^1_t = v_t, \qquad z^{n+1}_t = x^n_t \cdot \big(h^n * z^n\big)_t \quad (n = 1, \dots, N), \qquad y_t = z^{N+1}_t$$

Here, $v$ and $x^1, \dots, x^N$ are learned projections of the input, $h^1, \dots, h^N$ are learnable filters parameterized by neural networks, $*$ denotes (causal) convolution, and $\cdot$ represents pointwise multiplication (gating).
The filters are generated implicitly as:

$$h^n_t = \mathrm{Window}(t) \cdot \big(\mathrm{FFN} \circ \mathrm{PositionalEncoding}\big)(t),$$

where $\mathrm{Window}$ is typically an exponential decay or similarly structured function, and $\mathrm{FFN}$ is a feed-forward network acting on positional embeddings.
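The following is a minimal PyTorch sketch of such an implicit filter, not the reference implementation: the class name `ImplicitFilter`, the sinusoidal positional features, the FFN width, and the decay rate are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class ImplicitFilter(nn.Module):
    """Sketch: h_t = Window(t) * FFN(PositionalEncoding(t)); names and sizes are illustrative."""

    def __init__(self, d_model: int, emb_dim: int = 8, hidden: int = 64, decay: float = 0.3):
        super().__init__()
        self.decay = decay
        self.emb_dim = emb_dim
        # Small FFN mapping positional features to one filter value per channel.
        self.ffn = nn.Sequential(
            nn.Linear(emb_dim, hidden), nn.GELU(),
            nn.Linear(hidden, d_model),
        )

    def positional_encoding(self, seq_len: int) -> torch.Tensor:
        # Sinusoidal features of the normalized time index t in [0, 1].
        t = torch.linspace(0, 1, seq_len).unsqueeze(-1)                          # (L, 1)
        freqs = torch.arange(1, self.emb_dim // 2 + 1, dtype=torch.float) * torch.pi
        return torch.cat([torch.sin(t * freqs), torch.cos(t * freqs)], dim=-1)   # (L, emb_dim)

    def forward(self, seq_len: int) -> torch.Tensor:
        h = self.ffn(self.positional_encoding(seq_len))                          # (L, d_model)
        # Exponential-decay window biases the filter toward recent positions.
        window = torch.exp(-self.decay * torch.arange(seq_len, dtype=torch.float)).unsqueeze(-1)
        return (window * h).transpose(0, 1)                                      # (d_model, L)
```

Evaluating such a module at any `seq_len` yields a filter bank spanning the whole sequence while its parameter count stays fixed, which is the property that decouples receptive field from model size.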
Efficient implementation leverages the Fast Fourier Transform (FFT), which evaluates each long convolution as:

$$(h * u)_t = \mathrm{iFFT}\big(\mathrm{FFT}(h) \odot \mathrm{FFT}(u)\big)_t$$

in $O(L \log_2 L)$ time, rather than the $O(L^2)$ cost of direct convolution.
Causality is guaranteed by using lower triangular Toeplitz matrices in the convolution, ensuring compatibility with autoregressive tasks.
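Putting the pieces together, here is a minimal sketch, under the same assumptions as above, of the FFT-based causal convolution and the order-$N$ recurrence; `fft_causal_conv` and `hyena_recurrence` are illustrative names, and real implementations add input projections, bias terms, and fused kernels.

```python
import torch

def fft_causal_conv(h: torch.Tensor, u: torch.Tensor) -> torch.Tensor:
    """Causal convolution (h * u) via FFT in O(L log L); h and u have shape (..., L).

    Zero-padding to length 2L prevents circular wrap-around, so the result matches
    multiplication by a lower-triangular Toeplitz matrix built from h.
    """
    L = u.shape[-1]
    H = torch.fft.rfft(h.float(), n=2 * L)
    U = torch.fft.rfft(u.float(), n=2 * L)
    return torch.fft.irfft(H * U, n=2 * L)[..., :L]

def hyena_recurrence(v: torch.Tensor, xs: list, hs: list) -> torch.Tensor:
    """Order-N recurrence: z^1 = v, z^{n+1} = x^n ⊙ (h^n * z^n), y = z^{N+1}.

    v and each x in xs are (D, L) projections of the input; each h in hs is a
    (D, L) filter, e.g. produced by the ImplicitFilter sketch above.
    """
    z = v
    for x_n, h_n in zip(xs, hs):
        z = x_n * fft_causal_conv(h_n, z)   # gate ⊙ long convolution
    return z
```

For example, with width $D$ and order $N = 2$, `xs` and `hs` each hold two $(D, L)$ tensors and the whole computation remains a fixed number of FFT convolutions and elementwise gates per channel.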
2. Computational Complexity and Efficiency
The Hyena Operator achieves subquadratic complexity in both time and memory. While standard attention-based models require $O(L^2)$ computation for sequence length $L$, Hyena Operators operate in $O(N D L \log_2 L)$ time, where $D$ is the model width and $N$ is the operator order. The number of parameters is independent of sequence length, supporting efficient scaling.
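As a rough back-of-envelope illustration of these asymptotics (operation counts only; constant factors, projections, and hardware utilization are ignored, which is why measured speedups are far smaller than these raw ratios):

```python
import math

D = 768   # model width (illustrative)
N = 2     # Hyena operator order (illustrative)

def attention_ops(L: int) -> float:
    return L * L * D                      # ~O(L^2 · D): score matrix and value mixing

def hyena_ops(L: int) -> float:
    return N * D * L * math.log2(L)       # ~O(N · D · L · log2 L): FFT long convolutions

for L in (2_048, 8_192, 65_536):
    print(f"L={L:>6}: attention/hyena op-count ratio ≈ {attention_ops(L) / hyena_ops(L):,.0f}x")
```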
Empirically, Hyena-based models demonstrate:
- 2× speedup over highly optimized FlashAttention implementations at 8k tokens
- Up to 100× speedup at 64k tokens, a regime in which dense attention runs out of memory
- 20% reduction in training FLOPs compared to equivalent Transformer architectures at 2k sequence length
This computational profile enables practical processing of sequence lengths in the hundreds of thousands, which are infeasible for dense attention.
3. Empirical Performance and Benchmarks
Hyena Operators match or surpass attention-based and previous attention-free models on both synthetic and real-world tasks:
- Associative Recall (sequence length 131k): Hyena reaches 97.2% accuracy, far exceeding previous approaches.
- Language modeling:
  - WikiText-103 (125M params): Hyena achieves 18.6 perplexity, equal to baseline Transformer models.
  - The Pile (355M params): Hyena achieves 9.2 perplexity, nearly matching Transformer quality with substantially lower resource usage.
- Downstream NLP Tasks: Hyena performs competitively on SuperGLUE tasks despite often requiring less total pre-training data.
Performance is further validated in other domains, as Hyena matches Transformer-level top-1 ImageNet accuracy when serving as the backbone in ViT-style architectures and demonstrates strong capability in audio and scientific sequence modeling.
4. Design and Comparison with Prior Approaches
Compared to prior subquadratic alternatives such as low-rank (Linformer, Performer), sparse attention (BigBird, Longformer), and state-space models (S4, DSS, H3), Hyena Operators have several distinguishing features:
- Parameterization: Filter length and thus receptive field can be increased arbitrarily without increasing parameter count, as filters are output by learned functions.
- Expressivity: Hierarchically interleaved convolutions and gating operations enable modeling of complex, nonlinear, data-dependent token mixing, approaching or matching the empirical expressivity of full attention (see the data-controlled operator view sketched after this list).
- No hybridization required: Unlike low-rank or sparse variants, Hyena achieves state-of-the-art quality without mixing in dense attention blocks.
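To make the expressivity point concrete, here is a sketch in the notation of Section 1: for a fixed input sequence, each gating-and-convolution step is linear in $v$, so the whole order-$N$ operator can be written as a data-controlled matrix acting on $v$, with $\mathrm{D}_{x^n}$ the diagonal matrix formed from the gate $x^n$ and $\mathrm{S}_{h^n}$ the lower-triangular Toeplitz matrix of the filter $h^n$:

$$y \;=\; \mathrm{D}_{x^N}\,\mathrm{S}_{h^N}\,\cdots\,\mathrm{D}_{x^1}\,\mathrm{S}_{h^1}\, v$$

Because the $\mathrm{D}_{x^n}$ depend on the input, the effective $L \times L$ mixing matrix varies from sequence to sequence, emulating attention's data-dependent token mixing without ever materializing an $L \times L$ matrix.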
A summary table distinguishes Hyena from related methods:
| Property | Attention | Low-Rank/Sparse | SSMs | Hyena |
|---|---|---|---|---|
| Complexity (in sequence length $L$) | $O(L^2)$ | Subquadratic | $O(L \log_2 L)$ | $O(L \log_2 L)$ |
| Expressivity | Maximal | Partial | Intermediate | Maximal |
| State/Memory | $O(L^2)$ | $O(L)$ | $O(L)$ (constant-size recurrent state) | $O(L)$ |
| Hybridization with attention | N/A | Required | Optional | Not needed |
5. Applications and Deployment
Hyena Operators serve as drop-in replacements for attention within Transformer stacks: the only architectural change is swapping the attention sublayer for a Hyena block (a minimal sketch follows the list below). Due to their efficiency and long-context capability, Hyena-equipped models are suitable for:
- Long-context language modeling (processing chapters, books, code repositories)
- Vision models (replacing attention in ViTs, yielding similar or superior performance, especially in data-limited regimes)
- Audio, scientific, and biological sequence modeling
- Fluid dynamics and partial differential equation solution operators (as demonstrated by Hyena Neural Operator models)
Hyena's design supports extension to multi-dimensional data and structured inputs, broadening its applicability across domains.
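As a minimal sketch of the drop-in pattern referenced above (assuming a Hyena operator module that maps `(batch, seq_len, d_model)` tensors to the same shape; the class name and pre-norm layout here are illustrative, not the reference architecture):

```python
import torch.nn as nn

class HyenaBlock(nn.Module):
    """Transformer-style block with the attention sublayer swapped for a Hyena operator."""

    def __init__(self, d_model: int, hyena_operator: nn.Module, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.mixer = hyena_operator          # replaces nn.MultiheadAttention
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, mlp_ratio * d_model), nn.GELU(),
            nn.Linear(mlp_ratio * d_model, d_model),
        )

    def forward(self, x):                    # x: (batch, seq_len, d_model)
        x = x + self.mixer(self.norm1(x))    # token mixing: Hyena instead of attention
        x = x + self.mlp(self.norm2(x))      # channel mixing: unchanged MLP sublayer
        return x
```

The residual and MLP sublayers are untouched, which is what lets Hyena slot into existing Transformer training recipes with minimal changes.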
6. Extension, Limitations, and Future Directions
Key areas identified for further development include:
- Scaling Hyena architectures to larger model and context sizes with further optimizations in FFT kernels and hardware support
- Extending the operator to multi-dimensional convolution (e.g., images, 3D signals)
- Incorporating hybrid architectures (e.g., interleaving Hyena and attention layers) for task-specialized performance
- Mechanistic interpretability of Hyena's capabilities, especially in low-depth and single-layer instantiations
- Collaboration with hardware developers for FFT/convolution pipeline acceleration and support
Potential limitations include reliance on fast Fourier convolution (which may be less efficient on certain hardware) and the need for further research into initialization, filter parameterization, and deep stacking effects.
7. Summary and Impact
Hyena Operators present a subquadratic, attention-free approach to sequence modeling that achieves empirical performance on par with and, in some settings, superior to self-attention mechanisms. By breaking the quadratic bottleneck and decoupling parameter cost from context size, they enable efficient training and inference at scales previously inaccessible to dense attention models. As such, Hyena lays the foundation for new classes of large context and efficient deep learning models across modalities.