Efficient Video Attention (EVA)

Updated 1 January 2026

The paper demonstrates that EVA techniques significantly reduce spatiotemporal complexity from O((ST)^2) to O(S+T), enabling efficient high-resolution video modeling.
EVA methods leverage axis-factorization, pooling-based context formulation, and sparse masking to achieve high throughput with minimal overhead across diverse video tasks.
Practical integration of EVA in architectures like RAPTOR and AIA blocks yields state-of-the-art performance in video prediction, classification, and generation.

Efficient Video Attention (EVA) encompasses a set of architectural and algorithmic innovations enabling video models to scale spatiotemporal context modeling with near-zero overhead, high throughput, and minimal computational cost. EVA approaches circumvent classic limitations of video Transformers and non-local attention by combining axis-factorization, pooling-based context formulation, sparse windowing, and nested or block-wise operations. They appear in diverse research domains including video prediction, long-form video generation, and large-scale video classification, significantly impacting state-of-the-art results across multiple tasks and hardware platforms.

1. Architectural Paradigms and Principle Designs

Efficient Video Attention has emerged along several distinct lines:

Axis-Factorized Attention: RAPTOR (Chen et al., 25 Dec 2025) introduces EVA blocks that alternately perform temporal (“TimeMix”) and spatial (“SpaceMix”) mixing in dense feature space, reducing the quadratic time and memory complexity $O((ST)^2)$ of canonical attention to linear $O(S+T)$ via sequential 1D operations along each axis. This enables direct modeling over dense feature maps at up to $1024^2$ resolution.
Pooling/Squeezing Contexts: Attention-in-Attention modules (Hao et al., 2022) perform global average and max pooling along each major axis (channel, time, spatial), yielding complementary low-dimensional “global” context vectors, which are then refined using lightweight 3D convolutions.
Sparse and Structured Masking: Compact Attention (Li et al., 18 Aug 2025) leverages empirical sparsity in video transformer attention maps, constructing adaptive block-wise masks that select only critical local, cross-shaped, or global patterns, combined with temporally varying windows for efficiency.
What-Where-When Factorization: The W³ module (Perez-Rua et al., 2020) factorizes video attention into distinct branches for “what” (channel), “where” (spatial), and “when” (temporal), each computed by lightweight pooling, MLP, or convolutional modules.

Common principles include mathematically reducing attention's order of complexity, preserving or re-injecting spatial resolution without patchification, and explicit context correlation modeling.

2. Mathematical Underpinnings and Implementation Details

Each EVA variant refines attention as follows:

Axis-Factorization (RAPTOR EVA Block)

Given latent map $E \in \mathbb{R}^{B \times T \times S}$ :

TimeMix ( $O(T)$ ): For each of $S$ spatial–channel locations, apply a learnable LGU (Linear Gated Unit) over the $T$ frames.
SpaceMix ( $O(S)$ ): For each time step, treat the $S$ spatial–channel locations as a sequence.

LGU operation: $k = \text{Mish}(U W_K), \quad v = U W_V, \quad r = \sigma(U W_R) \ \text{LGU}(U) = (r \odot (k \odot v)) W_O$

Pseudo-code for one EVA block:

def EVA_Block(E):  # shape: [B, T, S]
    U = LayerNorm(E)
    U_shifted = temporal_shift(U)
    T_out = LGU(U_shifted)
    E1 = E + T_out
    V = LayerNorm(E1)
    V_t = V.transpose(0,2,1)
    S_out_t = LGU(V_t)
    S_out = S_out_t.transpose(0,2,1)
    return E1 + S_out

Context Pooling and Nested Attention (AIA Module)

For $X \in \mathbb{R}^{C \times T \times H \times W}$ , construct context groups:

Channel: $G_C = \text{Concat}(\text{AvgPool}_{T,H,W}(X), \text{MaxPool}_{T,H,W}(X))$
Temporal: $G_T = \text{Concat}(\text{AvgPool}_{T}(X), \text{MaxPool}_{T}(X))$ , similarly for $G_H, G_W$

Attention-in-attention modules nest C-unit and ST-unit blocks:

CinST: Channel attention primes spatio-temporal contexts.
STinC: Spatio-temporal attention primes channel context.

Each uses tiny $3 \times 3 \times 3$ convolutions on pooled dimensions, costing $<0.02\%$ additional FLOPs per block.

Sparse Pattern Masking (Compact Attention)

Construct tile masks $M(q,k)$ for query $q, k$ keys via:

Block masks (local/cross-shaped) using scaling parameters $(\omega, \eta)$
Temporally varying windows where window sizes depend on temporal proximity
Automated boundary search algorithm selects mask configuration with minimal recall loss per unit cost.

No runtime overhead for pattern selection; masks are precomputed and reused for multiple denoising steps.

3. Computational Complexity and Performance Analysis

Comparative Costs (RAPTOR)

For $T=10$ , $S=16,384$ , $D=512$ :

Method	FLOPs	Memory	Time (Jetson Orin)
ViT	$2.75 \times 10^{13}$	$54$ GB	$\sim 2.6$ s/layer
RWKV	$3.44 \times 10^{11}$	$2.4$ GB	$32$ ms
EVA	$1.72 \times 10^{10}$	$1.0$ GB	$5.1$ ms

EVA is $1,600\times$ faster than quadratic attention and $20\times$ faster than linear RWKV (Chen et al., 25 Dec 2025).

AIA Blocks

For $C=256$ , $T=8$ , $H=W=28$ (ResNet-50 stage):

Backbone 3D Conv: $>10^9$ FLOPs
EVA block: $\sim 10^6$ FLOPs ( $<0.02\%$ overhead) (Hao et al., 2022)

Compact Attention Benchmarks

On Wan2.1 and Hunyuan:

Model	Method	Sparsity	PSNR	Speedup
Wan2.1	EVA (τ=0.9)	33.99%	23.73	1.65×
Hunyuan	EVA (τ=0.9)	62.36%	30.08	2.51×

Visual quality metrics (PSNR, SSIM, CLIPSIM) match full attention. No runtime cost for pattern selection (Li et al., 18 Aug 2025).

4. Integration into Video Models

Efficient Video Attention modules function as plug-and-play operations:

RAPTOR: EVA blocks are stacked in the “Translator,” preserving full spatial resolution and delivering single-pass, error-free video prediction at scale (Chen et al., 25 Dec 2025).
Classification Backbones: EVA/AIA modules are inserted after every residual block (e.g., 3D-ResNet, TSN, TSM), reweighting output features without altering backbone structure (Hao et al., 2022).
Video Diffusion Transformers: Compact Attention applies masking in core attention operations with precomputed spatial-temporal patterns (Li et al., 18 Aug 2025).
W³ Module: Integrated into I3D, TSM, TAM, and distillation/regularization layers for action recognition (Perez-Rua et al., 2020).

In most frameworks, EVA does not require changes to main-path convolutions, only local refinement.

5. Experimental Results and Benchmark Impact

EVA architectures demonstrate consistent improvements in throughput and prediction/classification quality:

RAPTOR (Chen et al., 25 Dec 2025):
- First to exceed 30 FPS on Jetson AGX Orin at $512^2$ resolution.
- Sets state-of-the-art on UAVid, KTH, and custom UAV datasets in PSNR (28.4 dB), SSIM (0.784), and LPIPS (0.095); ViT-style attentions are out of memory at comparable sizes.
- $+18\%$ mission success rate in real-world UAV navigation.
AIA Modules (Hao et al., 2022):
- Gains of $+28.8\%$ top-1 accuracy (e.g. 19.7 → 48.5 on Something-Something V1 TSN).
- Model size and FLOPs increase negligibly (23.86 M → 23.87 M, 32.88 G → 33.01 G).
Compact Attention (Li et al., 18 Aug 2025):
- $1.6{-}2.5\times$ speedup in long-form video generation at constant perceptual quality.
- GPU utilization and blockwise memory cost reduced proportional to estimated sparsity $(\alpha=0.3{-}0.6)$ .
W³ Module (Perez-Rua et al., 2020):
- $+5.4\%$ top-1 accuracy (TSM baseline $47.2\%$ → W³ $52.6\%$ ) at $+3.2\%$ FLOPs, outperforming non-local attention (+80% FLOPs).
- Ablation confirms gains from “what,” “where,” “when” branches and deep supervision.

EVA distinguishes itself from:

Quadratic Attention: Standard transformers or non-local video attention are intractable at scale ( $O(N^2)$ in both time and memory).
Linear/Performer/RWKV Attention: Linear approximations yield $O(ST)$ cost, but struggle with memory scalability when $S,T$ are large.
Patchification and Token-Reduction: EVA achieves dense per-pixel modeling without spatial downsampling or coarse patch aggregation.

Notably, axis-factorization, blockwise pooling, and sparse masking are employed synergistically to preserve information, reduce overfitting, and maximize hardware efficiency. The efficacy of nested attention blocks further highlights context correlation rarely captured in prior single-axis or serial attention models.

7. Limitations and Future Directions

While EVA techniques consistently yield substantial efficiency and accuracy improvements, a plausible implication is that their reliance on axis-factorization and structured sparsity presumes stability in attention distributions. This may limit expressivity for highly nonstationary or semantically complex scenes unless adaptively tuned. Extensions may include dynamic mask reconfiguration or hybridization with global models for rare long-range dependencies.

In summary, Efficient Video Attention defines a family of techniques for scalable, high-throughput video modeling, breaking the $O((ST)^2)$ spatiotemporal barrier and demonstrating versatility across prediction, classification, and generation tasks (Chen et al., 25 Dec 2025, Hao et al., 2022, Li et al., 18 Aug 2025, Perez-Rua et al., 2020).