
MSSAVT: Spatial Token Redundancy in Video Models

Updated 17 December 2025
  • MSSAVT is a token redundancy metric and aggregation scheme that quantifies local similarity among spatially adjacent video tokens to enable efficient token pruning.
  • By restricting similarity computations to neighboring tokens via a checkerboard mask and targeted merging, MSSAVT preserves critical spatial structure with minimal accuracy loss.
  • Integration of MSSAVT into spatio-temporal transformers yields significant computational speedups (e.g., 2.5×) while retaining competitive accuracy on benchmarks like Kinetics-400.

Maximum Similarity to Spatially Adjacent Video Tokens (MSSAVT) is a token redundancy metric and aggregation scheme developed for efficient video understanding within transformer-based architectures. MSSAVT quantifies local spatial redundancy by explicitly restricting similarity computations to spatially adjacent tokens, enhancing positional awareness and enabling computationally efficient token pruning and merging. By leveraging MSSAVT, video models can remove large amounts of spatially redundant information while preserving spatial structure and critical visual content, thereby significantly improving computational efficiency with minimal impact on downstream task metrics (Jin et al., 14 Dec 2025, Pollard et al., 4 Jun 2025).

1. Motivation and Problem Setting

Transformer-based video models require dense patchwise tokenization of each frame, resulting in hundreds or thousands of spatial tokens per frame and large computational demands. Many of these tokens are locally redundant—representing nearly identical visual content in adjacent spatial locations—but classical similarity-based pruning and token merging strategies treat all tokens equally regardless of spatial index. This can result in loss of crucial spatial context and output degradation, especially when globally similar tokens occupy distinct and semantically significant regions (e.g., different faces or objects). MSSAVT addresses two challenges simultaneously: (1) it quantifies redundancy among tokens sharing boundaries (true spatial neighbors); (2) it does so in a computationally efficient manner suitable for per-frame, online execution (Jin et al., 14 Dec 2025).

2. Formal Definition and Algorithmic Details

Let $V^n[i,j] \in \mathbb{R}^d$ denote the $d$-dimensional embedding of the token at row $i$, column $j$ in frame $n$. The MSSAVT spatial redundancy score is defined as:

$$R^n_s[i,j] = \max_{(\delta_i,\delta_j)\in\{(-1,0),\,(1,0),\,(0,-1),\,(0,1)\}} \mathrm{Sim}\left(V^n[i,j],\, V^n[i+\delta_i,\, j+\delta_j]\right)$$

where $\mathrm{Sim}(x,y) = \frac{x \cdot y}{\|x\|\,\|y\|}$ is the cosine similarity.

This restricted local maximum prevents the merging or dropping of globally similar but spatially distant tokens, thus maintaining spatial structure. The procedure is linear in the number of tokens and exploits the inherent grid connectivity of patch embeddings. Pseudocode for computing the redundancy map $R^n_s$ is as follows:

import numpy as np

# V: (H, W, d) array of patch-token embeddings for one frame.
# R_s: (H, W) map of MSSAVT redundancy scores.
R_s = np.empty((H, W))
for i in range(H):
    for j in range(W):
        max_sim = -np.inf
        for (δ_i, δ_j) in [(-1, 0), (1, 0), (0, -1), (0, 1)]:  # 4-connected neighbors
            if 0 <= i + δ_i < H and 0 <= j + δ_j < W:
                x, y = V[i, j], V[i + δ_i, j + δ_j]
                sim = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))  # cosine similarity
                if sim > max_sim:
                    max_sim = sim
        R_s[i, j] = max_sim
(Jin et al., 14 Dec 2025)
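Because the neighborhood is a fixed 4-connected stencil on a regular grid, the same map can also be computed without explicit loops by comparing shifted views of the token grid. A minimal vectorized sketch in NumPy (the function name mssavt_map is illustrative, not from the cited papers):

import numpy as np

def mssavt_map(V):
    # V: (H, W, d) token embeddings for one frame; returns the (H, W) map R_s.
    Vn = V / np.linalg.norm(V, axis=-1, keepdims=True)   # unit-normalize once
    R_s = np.full(V.shape[:2], -np.inf)
    vert = np.sum(Vn[:-1] * Vn[1:], axis=-1)             # cosine sim with vertical neighbor, (H-1, W)
    R_s[:-1] = np.maximum(R_s[:-1], vert)                # neighbor below
    R_s[1:] = np.maximum(R_s[1:], vert)                  # neighbor above
    horiz = np.sum(Vn[:, :-1] * Vn[:, 1:], axis=-1)      # cosine sim with horizontal neighbor, (H, W-1)
    R_s[:, :-1] = np.maximum(R_s[:, :-1], horiz)         # neighbor to the right
    R_s[:, 1:] = np.maximum(R_s[:, 1:], horiz)           # neighbor to the left
    return R_s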

3. Masked Pruning and Aggregation Strategies

Directly pruning all tokens with $R^n_s[i,j] > \tau_s$ can lead to cascading effects, where removing one token makes its neighbor a new redundancy candidate, potentially resulting in over-pruning. To resolve this, MSSAVT incorporates a masked pruning regime using a spatial checkerboard mask:

$$M_p[i,j] = \begin{cases} \text{True}, & \text{if } (i+j) \bmod 2 = 1 \\ \text{False}, & \text{otherwise} \end{cases}$$

Tokens are dropped only if $R^n_s[i,j] > \tau_s$ and $M_p[i,j]$ is True. This ensures that no two pruned tokens are adjacent, so spatial redundancy measurements remain valid throughout the pruning pass, avoiding undesired “cascade” effects. Symmetry in the mask construction ensures that all possible prune candidates are considered across alternating token sets (Jin et al., 14 Dec 2025).
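A minimal sketch of one masked pruning pass, assuming V is an (H, W, d) array of token embeddings and R_s the MSSAVT map from Section 2 (the wrapper function and names are illustrative):

import numpy as np

def masked_prune(V, R_s, tau_s):
    # A token is dropped only if it is redundant (R_s > tau_s) AND lies on an
    # "odd" checkerboard cell, so no two dropped tokens are spatially adjacent.
    H, W, _ = V.shape
    ii, jj = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    M_p = (ii + jj) % 2 == 1                 # checkerboard mask
    drop = (R_s > tau_s) & M_p
    keep = ~drop
    return V[keep], keep                     # surviving tokens and their keep mask

Flipping the mask parity to (i + j) % 2 == 0 on a later pass or layer exposes the complementary token set as prune candidates, consistent with the alternating token sets mentioned above.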

In aggregation or merging contexts, as in spatial aggregation modules (SAM), the tokens are partitioned into two disjoint sets (e.g., checkerboard A/B), and pairwise similarities are maximized across the partition. Merging is then performed greedily on the most similar pairs across the partition, up to a per-layer budget, further reducing redundancy while controlling the spatial structure (Ren et al., 2023, Pollard et al., 4 Jun 2025).
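As a rough sketch of this bipartite merging step, assume the two checkerboard sets have already been flattened into arrays tokens_a and tokens_b; the simple averaging rule and all names here are illustrative, not necessarily the exact weighting used in the cited modules:

import numpy as np

def checkerboard_merge(tokens_a, tokens_b, r):
    # Merge the r most similar (B -> A) pairs across the checkerboard partition.
    a = tokens_a / np.linalg.norm(tokens_a, axis=-1, keepdims=True)
    b = tokens_b / np.linalg.norm(tokens_b, axis=-1, keepdims=True)
    sim = b @ a.T                                 # (Nb, Na) cosine similarities
    best_a = sim.argmax(axis=1)                   # best A partner for each B token
    best_sim = sim.max(axis=1)
    merge_idx = np.argsort(-best_sim)[:r]         # the r most redundant B tokens
    merged_a = tokens_a.copy()
    for bi in merge_idx:                          # fold each merged B token into its A partner
        ai = best_a[bi]
        merged_a[ai] = (merged_a[ai] + tokens_b[bi]) / 2.0
    keep_b = np.setdiff1d(np.arange(len(tokens_b)), merge_idx)
    return np.concatenate([merged_a, tokens_b[keep_b]], axis=0)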

4. Integration with Spatio-Temporal Transformers

MSSAVT integrates naturally into both joint spatio-temporal (e.g., ViViT, VideoMAE) and divided space-time (e.g., TimeSformer, Motionformer) video transformer architectures. In joint models, MSSAVT can operate on the entire spatio-temporal token set, freely matching across both axes. For divided architectures, MSSAVT is applied independently within each frame’s grid of spatial tokens: the input $X \in \mathbb{R}^{B \times F \times S^I_i \times D}$ is reshaped to $(B \cdot F) \times S^I_i \times D$, subjected to MSSAVT per frame, and reassembled (Pollard et al., 4 Jun 2025).
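A minimal sketch of this per-frame reshaping (names are illustrative; reduce_fn stands in for any MSSAVT pruning or merging step that keeps the same number of tokens in every frame):

import numpy as np

def apply_per_frame(x, reduce_fn):
    # x: (B, F, S, D) spatial tokens from a divided space-time block.
    B, F, S, D = x.shape
    x = x.reshape(B * F, S, D)        # fold frames into the batch axis
    x = reduce_fn(x)                  # per-frame spatial pruning / merging
    return x.reshape(B, F, -1, D)     # reassemble the (batch, frame) structure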

MSSAVT is commonly injected after each multi-head self-attention block and before the feed-forward sub-layer. Merging or dropping decisions are made progressively layer-by-layer, with per-layer budgets or thresholds controlling token reduction schedules (constant, increasing, decreasing).
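The reduction schedules mentioned above can be expressed as simple per-layer budget lists; the following sketch is illustrative only (the exact shapes of the increasing and decreasing schedules are tuned empirically in the cited work):

def token_schedule(num_layers, s0, mode="constant"):
    # Per-layer token-reduction budgets (fraction of tokens removed per block).
    # All three modes keep the same average budget s0 across the network.
    if mode == "constant":
        return [s0] * num_layers
    ramp = [2 * s0 * (l + 1) / (num_layers + 1) for l in range(num_layers)]
    return ramp if mode == "increasing" else ramp[::-1]

For example, token_schedule(12, 0.10) yields a constant ~10% per-block budget of the kind referenced in Section 5.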

5. Comparative Analysis with Other Redundancy Reduction Methods

Empirical results consistently show that MSSAVT outperforms random patch dropout, attention-ranked patch dropout, and random token merging in terms of both speedup and accuracy preservation. For example, using a constant per-layer merge budget targeting ~10% token reduction per block yields a 2.5× inference speedup while maintaining a ≤0.55% drop in top-1 accuracy on ViViT for Kinetics-400; VideoMAE experiences slightly greater degradation but remains competitive compared to alternatives. Random merging leads to major accuracy collapse, while attention-based approaches, though better than dropout, lag MSSAVT by 3–5% absolute (Pollard et al., 4 Jun 2025).

The following table summarizes key comparative metrics (as reported):

| Method | Kinetics-400 Top-1 (ViViT) | Inference Speedup |
| --- | --- | --- |
| Baseline | 76.63% | 1× |
| MSSAVT | 76.08% | 2.46× |
| Attention-ranked Drop | ~73–74% | variable |
| Random Patch Drop | lower | variable |
| Random Merge | near-chance | variable |

(Pollard et al., 4 Jun 2025)

6. Extensions: Temporal Filtering and Joint Pruning

In complex, long-form and online video settings, spatial token redundancy rarely occurs in isolation. MSSAVT is therefore typically integrated with temporal redundancy removal (e.g., DTD from TimeChat-Online). The two-stage pipeline first applies a temporal filter to remove frames or patches recurring over time, followed by MSSAVT-based spatial pruning or merging. The resulting composite mask is:

$$M^n[i,j] = M^n_t[i,j] \lor M^n_s[i,j]$$

where $M^n_t$ is the temporal mask and $M^n_s$ the spatial mask. Only tokens surviving both are retained for subsequent inference. On challenging video QA and retrieval benchmarks (StreamingBench, OVO-Bench, Video-MME, LongVideoBench), this pipeline yields up to +4% absolute accuracy improvement over temporal-only baselines at 91.5–93.6% token drop rates, while adding less than 1 ms per-frame latency (Jin et al., 14 Dec 2025).
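A minimal sketch of the composite-mask step for one frame, assuming the temporal stage has already produced a boolean drop mask M_t and the MSSAVT map R_s is computed as in Section 2 (names are illustrative):

import numpy as np

def joint_prune(V, R_s, M_t, tau_s):
    # V: (H, W, d) token embeddings; R_s: (H, W) MSSAVT scores;
    # M_t: (H, W) boolean temporal drop mask (True = drop).
    H, W = R_s.shape
    ii, jj = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    M_s = (R_s > tau_s) & ((ii + jj) % 2 == 1)   # checkerboard-masked spatial mask
    M = M_t | M_s                                # composite mask M^n = M^n_t OR M^n_s
    return V[~M], M                              # only tokens surviving both are kept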

7. Empirical Impact and Best Practices

Across diverse model backbones (ViViT, VideoMAE, TimeSformer, Motionformer) and datasets (Kinetics-400, Something-Something v2, EPIC-KITCHENS-100), MSSAVT consistently delivers substantial computational savings with minimal or negligible accuracy loss. For example, approximately 75% spatial token reduction via spatial aggregation modules corresponds to only a 2.7 percentage point drop in R@1 for paragraph-to-video retrieval while halving GFLOPs on QuerYD (Ren et al., 2023).

Observed empirical best practices include favoring constant or increasing layerwise merge/drop schedules (to protect fine-grained features retained by early layers) and calibrating merge budgets to avoid over-reduction of small or semantically critical objects. Merged clusters reflect visual similarity alone; heavy merging can lead to loss of small or fine-grained objects, suggesting careful tuning when high fidelity is required (Pollard et al., 4 Jun 2025, Ren et al., 2023).
