- The paper introduces ENA, demonstrating that combining linear recurrence with high-order sliding window attention enables efficient long-sequence processing.
- The paper pairs DeltaNet, which compresses global context, with Sliding Tile Attention, which preserves local detail, improving both classification and generation tasks.
- The paper shows that ENA scales to sequences of up to 16K tokens at high attention sparsity, achieving competitive performance with improved computational efficiency.
ENA: Efficient N-dimensional Attention
Introduction
The paper presents ENA (Efficient N-dimensional Attention), an architecture that combines linear recurrence models with high-order sliding window attention (SWA) to efficiently process long sequences of high-dimensional data. The focus is on overcoming the limitations of traditional Transformers, particularly their inefficiency on long sequences caused by the quadratic cost of softmax attention. ENA offers a promising alternative by interleaving linear-recurrence layers with local attention, combining efficiency with expressiveness in sequence modeling.
Architecture and Implementation
The ENA architecture is structurally simple yet effective. It interleaves layers of linear recurrence with layers of sliding window attention (SWA). This hybrid approach compresses global context with the linear layers while preserving local fidelity with the attention layers. High-order SWA is chosen because it models locality without the non-overlapping block constraints imposed by block attention.
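As a rough sketch of this interleaving, the PyTorch-style snippet below alternates a linear-recurrence mixer with a sliding-window attention mixer in a residual stack. The `linear_block` and `swa_block` constructors and the 1:1 alternation ratio are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class HybridStack(nn.Module):
    """Illustrative ENA-style hybrid stack: linear-recurrence layers
    interleaved with sliding-window attention layers (assumed 1:1 ratio)."""

    def __init__(self, dim, depth, linear_block, swa_block):
        super().__init__()
        self.layers = nn.ModuleList(
            # Even layers compress global context with a linear recurrence;
            # odd layers restore local fidelity with windowed attention.
            [linear_block(dim) if i % 2 == 0 else swa_block(dim) for i in range(depth)]
        )

    def forward(self, x):  # x: (batch, seq_len, dim)
        for layer in self.layers:
            x = x + layer(x)  # residual connection around each token mixer
        return x

# Smoke test with placeholder mixers standing in for the DeltaNet / STA blocks.
stack = HybridStack(dim=64, depth=4,
                    linear_block=lambda d: nn.Linear(d, d),
                    swa_block=lambda d: nn.Linear(d, d))
out = stack(torch.randn(2, 16, 64))  # (batch=2, seq_len=16, dim=64)
```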
The implementation of ENA uses DeltaNet as the linear recurrence model, chosen for its training efficiency and expressiveness, while STA (Sliding Tile Attention) is used for the attention layers due to its hardware efficiency. ENA avoids the overheads associated with scanning and sequence permutation, emphasizing simplicity and speed.
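For intuition, the snippet below implements the delta-rule recurrence that underlies DeltaNet in its naive, token-by-token form: each step reads the value currently stored under the incoming key and writes a rank-1 correction toward the new value, scaled by a per-token strength beta. This is a reference sketch for clarity (omitting details such as key normalization), not the chunked, hardware-efficient kernel used in practice.

```python
import torch

def delta_rule_recurrence(q, k, v, beta):
    """Naive delta-rule recurrence (reference only, not the fast chunked kernel).

    q, k, v: (batch, seq_len, dim) query/key/value streams
    beta:    (batch, seq_len)      per-token write strengths in [0, 1]
    The state S is a (dim x dim) associative memory mapping keys to values.
    """
    b, n, d = k.shape
    S = torch.zeros(b, d, d, dtype=q.dtype, device=q.device)
    outputs = []
    for t in range(n):
        q_t, k_t, v_t = q[:, t], k[:, t], v[:, t]           # (b, d) each
        beta_t = beta[:, t].unsqueeze(-1)                    # (b, 1)
        v_old = torch.einsum('bij,bi->bj', S, k_t)           # value currently stored under k_t
        S = S + torch.einsum('bi,bj->bij', k_t, beta_t * (v_t - v_old))  # rank-1 delta update
        outputs.append(torch.einsum('bij,bi->bj', S, q_t))   # read the state with the query
    return torch.stack(outputs, dim=1)                       # (b, n, d)
```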
Figure 1: Performance comparison between ENA and Flash Attention (FA)-based Transformer vision encoders across different sequence lengths.
Experimental Validation
The effectiveness of ENA is demonstrated through extensive experiments across various tasks, including image and video classification and generation. The key results include:
- Image and Video Classification: On datasets such as ImageNet-1K and K400, ENA matches or exceeds Transformer baselines, with significant gains in computational efficiency, particularly on long sequences.
- Image Generation: Applying ENA to image generation confirms its benefits. The model achieves competitive Inception Score (IS) and Fréchet Inception Distance (FID) results on ImageNet, underscoring the robustness of combining linear recurrence with sparse attention.
- Long Sequence Processing: For very long sequences (up to 16K tokens), ENA demonstrates its scalability, maintaining performance with reduced computational demands. Adjusting the attention sparsity lets ENA approach full-attention quality without incurring the full quadratic cost; a sketch of how window and grid size determine sparsity follows this list.
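To make the sparsity argument concrete, the snippet below computes the fraction of query-key pairs that a 2D sliding-window mask removes as a function of grid and window size. The per-token offset mask and the specific grid/window values are illustrative assumptions; the paper's STA uses a tiled block layout rather than exact per-token offsets.

```python
import torch

def sliding_window_sparsity_2d(height, width, window):
    """Fraction of query-key pairs masked out by a 2D sliding-window pattern.

    Each query at grid position (i, j) attends only to keys whose row and
    column offsets are both at most window // 2. This is an illustrative
    local mask; STA's tiled layout differs in its exact block structure.
    """
    half = window // 2

    def pairs_within(n):
        # ordered index pairs in [0, n) whose absolute difference is <= half
        idx = torch.arange(n)
        return ((idx[:, None] - idx[None, :]).abs() <= half).sum().item()

    attended = pairs_within(height) * pairs_within(width)  # mask is separable over the two axes
    total = (height * width) ** 2
    return 1.0 - attended / total

# Example: a 128 x 128 token grid (16K tokens) with a 25 x 25 local window
# keeps only about 3.5% of all query-key pairs.
print(f"sparsity: {sliding_window_sparsity_2d(128, 128, 25):.3f}")
```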
Figure 2: Selected image generation results from ena-deltanet-sta-w24x24-t8x8-xl-gen2d on ImageNet at a resolution of 512 × 512.
The paper provides a thorough discussion of the design choices behind ENA, particularly the trade-offs between scanning and hybrid architectures. The finding that hybrid models outperform scanning-based approaches is consistent with the observation that effective local attention compensates for linear models' weakness at capturing local patterns.
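For readers unfamiliar with scanning, the snippet below flattens a small 2D grid of token indices under a few common scan orders (row-major, column-major, and a boustrophedon "snake" order). These are generic examples of 2D scanning strategies, not necessarily the exact variants compared in the paper.

```python
import torch

def scan_orders(height, width):
    """Flatten a height x width grid of token indices under different scan orders.

    Illustrative only: generic 2D scanning strategies, not necessarily the
    exact variants evaluated in the paper.
    """
    grid = torch.arange(height * width).reshape(height, width)

    row_major = grid.flatten()            # left-to-right, top-to-bottom
    col_major = grid.t().flatten()        # top-to-bottom, left-to-right
    snake = grid.clone()
    snake[1::2] = snake[1::2].flip(-1)    # reverse every other row
    snake = snake.flatten()               # boustrophedon ("snake") order

    return row_major, col_major, snake

# A linear recurrence consumes tokens in one such 1D order, so spatial neighbours
# can end up far apart in the sequence; ENA instead handles locality directly
# with sliding-window attention.
for name, order in zip(("row-major", "column-major", "snake"), scan_orders(3, 4)):
    print(name, order.tolist())
```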
Attention sparsity emerges as a critical factor: ENA remains robust even at high sparsity levels (up to 70%), enabling faster computation without significant accuracy loss. ENA's scalability is further highlighted by its adaptability across data dimensionalities and sequence lengths.
Figure 3: A simple illustration of the operations performed by different scanning methods.
Implications and Future Directions
The introduction of ENA underscores a shift towards hybrid architectures in AI, especially for tasks involving high-dimensional data and long-range sequence dependencies. ENA's design prioritizes computational efficiency and scalability, making it well suited for deployment in resource-constrained environments.
The paper suggests that further speed gains are possible through improved STA implementations, pointing to future work on optimizing attention kernel designs. The discussion of optimizer and learning-rate choices also illustrates the flexibility of the ENA framework across different training setups.
Conclusion
ENA positions itself as a viable successor to traditional Transformers for sequence modeling tasks. By marrying linear recurrent networks with local attention, ENA improves efficiency while reducing architectural complexity. As AI systems demand more computational resources, architectures like ENA that balance performance with efficiency will gain prominence, encouraging further research into hybrid models and attention mechanisms.