MSSAVT: Spatial Token Redundancy in Video Models
- MSSAVT is a token redundancy metric and aggregation scheme that quantifies local similarity among spatially adjacent video tokens to enable efficient token pruning.
- By restricting similarity computations to neighboring tokens via a checkerboard mask and targeted merging, MSSAVT preserves critical spatial structure with minimal accuracy loss.
- Integration of MSSAVT into spatio-temporal transformers yields significant computational speedups (e.g., 2.5×) while retaining competitive accuracy on benchmarks like Kinetics-400.
Maximum Similarity to Spatially Adjacent Video Tokens (MSSAVT) is a token redundancy metric and aggregation scheme developed for efficient video understanding within transformer-based architectures. MSSAVT quantifies local spatial redundancy by explicitly restricting similarity computations to spatially adjacent tokens, enhancing positional awareness and enabling computationally efficient token pruning and merging. By leveraging MSSAVT, video models can remove large amounts of spatially redundant information while preserving spatial structure and critical visual content, thereby significantly improving computational efficiency with minimal impact on downstream task metrics (Jin et al., 14 Dec 2025, Pollard et al., 4 Jun 2025).
1. Motivation and Problem Setting
Transformer-based video models require dense patchwise tokenization of each frame, resulting in hundreds or thousands of spatial tokens per frame and large computational demands. Many of these tokens are locally redundant—representing nearly identical visual content in adjacent spatial locations—but classical similarity-based pruning and token merging strategies treat all tokens equally regardless of spatial index. This can result in loss of crucial spatial context and output degradation, especially when globally similar tokens occupy distinct and semantically significant regions (e.g., different faces or objects). MSSAVT addresses two challenges simultaneously: (1) it quantifies redundancy among tokens sharing boundaries (true spatial neighbors); (2) it does so in a computationally efficient manner suitable for per-frame, online execution (Jin et al., 14 Dec 2025).
2. Formal Definition and Algorithmic Details
Let $v^{(t)}_{i,j} \in \mathbb{R}^d$ denote the $d$-dimensional embedding of the token at row $i$, column $j$ in frame $t$. The MSSAVT spatial redundancy score is defined as:

$$
R_s(i,j) = \max_{(\delta_i,\,\delta_j)\,\in\,\{(-1,0),\,(1,0),\,(0,-1),\,(0,1)\}} \operatorname{sim}\!\left(v^{(t)}_{i,j},\; v^{(t)}_{i+\delta_i,\,j+\delta_j}\right),
$$

where $\operatorname{sim}(\cdot,\cdot)$ is the cosine similarity.
This restricted local maximum prevents the merging or dropping of globally similar but spatially distant tokens, thus maintaining spatial structure. The procedure is linear in the number of tokens and exploits the inherent grid connectivity of patch embeddings. Pseudocode for computing the redundancy map is as follows:
```python
import numpy as np

def mssavt_map(V):
    """Compute the MSSAVT redundancy map R_s for a (W, H, d) grid of token embeddings."""
    W, H, _ = V.shape
    R_s = np.full((W, H), -np.inf)
    for i in range(W):
        for j in range(H):
            # Maximum cosine similarity over the 4-connected spatial neighbours.
            for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
                ni, nj = i + di, j + dj
                if 0 <= ni < W and 0 <= nj < H:
                    a, b = V[i, j], V[ni, nj]
                    sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
                    R_s[i, j] = max(R_s[i, j], sim)
    return R_s
```
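As a quick illustration (the grid size, embedding dimension, threshold, and the helper name `mssavt_map` follow the sketch above and are not a published API), the redundancy map can be thresholded to flag prune candidates:

```python
import numpy as np

V = np.random.randn(14, 14, 768)   # hypothetical 14x14 grid of 768-d patch tokens
R_s = mssavt_map(V)
tau = 0.85                          # illustrative redundancy threshold
candidates = R_s > tau              # tokens whose nearest spatial neighbour is a near-duplicate
print(f"{candidates.sum()} of {candidates.size} tokens flagged as spatially redundant")
```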
3. Masked Pruning and Aggregation Strategies
Directly pruning all tokens with $R_s(i,j) > \tau$ can lead to cascading effects, where removing one token makes its neighbor a new redundancy candidate, potentially resulting in over-pruning. To resolve this, MSSAVT incorporates a masked pruning regime using a spatial checkerboard mask:

$$
M_{\mathrm{cb}}(i,j) = \mathbb{1}\big[(i + j) \bmod 2 = 0\big].
$$

Tokens are dropped only if $R_s(i,j) > \tau$ and $M_{\mathrm{cb}}(i,j)$ is True. This ensures that no two pruned tokens are adjacent, so spatial redundancy measurements remain valid throughout the pruning pass, avoiding undesired "cascade" effects. Symmetry in mask construction ensures that all possible prune candidates are considered across alternating token sets (Jin et al., 14 Dec 2025).
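A minimal sketch of this masked pruning pass, assuming the `mssavt_map` helper from the sketch in Section 2 and a single threshold `tau` (the exact drop criterion and threshold schedule in the cited work may differ):

```python
import numpy as np

def checkerboard_prune(V, tau=0.85):
    """Drop spatially redundant tokens, restricted to one checkerboard parity per pass."""
    W, H, _ = V.shape
    R_s = mssavt_map(V)                                  # per-token redundancy scores
    i_idx, j_idx = np.meshgrid(np.arange(W), np.arange(H), indexing="ij")
    M_cb = (i_idx + j_idx) % 2 == 0                      # checkerboard mask: prune candidates never touch
    drop = (R_s > tau) & M_cb                            # prune only masked, redundant tokens
    keep = ~drop
    return V[keep], keep                                 # surviving tokens (flattened) and the keep mask
```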
In aggregation or merging contexts, as in spatial aggregation modules (SAM), the tokens are partitioned into two disjoint sets (e.g., checkerboard sets A and B), and pairwise similarities are maximized across the partition. Merging is then performed greedily for the top-$k$ most similar pairs, further reducing redundancy while preserving spatial structure; a sketch of this bipartite merging step follows below (Ren et al., 2023, Pollard et al., 4 Jun 2025).
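A minimal sketch of such bipartite top-$k$ merging (the function name and the mean-pooling merge rule are illustrative assumptions; the cited modules may weight or project merged tokens differently):

```python
import numpy as np

def bipartite_merge(tokens_a, tokens_b, k):
    """Merge the k most similar (A, B) token pairs across a bipartite partition by mean pooling."""
    a = tokens_a / np.linalg.norm(tokens_a, axis=-1, keepdims=True)
    b = tokens_b / np.linalg.norm(tokens_b, axis=-1, keepdims=True)
    sim = a @ b.T                                    # (|A|, |B|) cosine similarities across the partition
    best_b = sim.argmax(axis=1)                      # each A token's most similar B partner
    best_sim = sim[np.arange(len(a)), best_b]
    top_a = np.argsort(-best_sim)[:k]                # the k most redundant A tokens
    merged = tokens_b.copy()
    for ia in top_a:                                 # average each selected A token into its B partner
        ib = best_b[ia]
        merged[ib] = 0.5 * (tokens_a[ia] + tokens_b[ib])
    keep_a = np.setdiff1d(np.arange(len(tokens_a)), top_a)
    return np.concatenate([tokens_a[keep_a], merged], axis=0)
```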
4. Integration with Spatio-Temporal Transformers
MSSAVT integrates naturally into both joint spatio-temporal (e.g., ViViT, VideoMAE) and divided space-time (e.g., TimeSformer, Motionformer) video transformer architectures. In joint models, MSSAVT can operate on the entire spatio-temporal token set, freely matching across both axes. For divided architectures, MSSAVT is applied independently within each frame's grid of spatial tokens: the input is reshaped to $(B \cdot T,\, HW,\, d)$, folding the frame index into the batch dimension, subjected to MSSAVT per frame, and reassembled (Pollard et al., 4 Jun 2025).
MSSAVT is commonly injected after each multi-head self-attention block and before the feed-forward sub-layer. Merging or dropping decisions are made progressively layer-by-layer, with per-layer budgets or thresholds controlling token reduction schedules (constant, increasing, decreasing).
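A minimal sketch of this placement inside one transformer block (PyTorch-style; `reduce_tokens` stands in for any MSSAVT-based prune/merge step, and the keep-rate parameter is an assumption rather than the cited configuration):

```python
import torch
import torch.nn as nn

class BlockWithMSSAVT(nn.Module):
    """Transformer block with token reduction injected between attention and the MLP."""
    def __init__(self, dim, heads, keep_rate=0.9):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.keep_rate = keep_rate                    # per-layer budget, e.g. drop ~10% of tokens per block

    def forward(self, x, reduce_tokens):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = reduce_tokens(x, self.keep_rate)          # MSSAVT prune/merge after attention, before the MLP
        x = x + self.mlp(self.norm2(x))
        return x
```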
5. Comparative Analysis with Other Redundancy Reduction Methods
Empirical results consistently show that MSSAVT outperforms random patch dropout, attention-ranked patch dropout, and random token merging in terms of both speedup and accuracy preservation. For example, a constant per-layer merge budget targeting 10% token reduction per block yields a 2.5× inference speedup while incurring only a 0.55 percentage-point drop in top-1 accuracy on ViViT for Kinetics-400; VideoMAE experiences slightly greater degradation but remains competitive compared to alternatives. Random merging leads to major accuracy collapse, while attention-based approaches, though better than dropout, lag MSSAVT by 3–5% absolute (Pollard et al., 4 Jun 2025).
The following table summarizes key comparative metrics (as reported):
| Method | Kinetics-400 Top-1 (ViViT) | Inference Speedup |
|---|---|---|
| Baseline | 76.63% | 1.00× |
| MSSAVT | 76.08% | 2.46× |
| Attention-ranked Drop | 73–74% | variable |
| Random Patch Drop | lower | variable |
| Random Merge | near-chance | variable |
6. Extensions: Temporal Filtering and Joint Pruning
In complex, long-form and online video settings, spatial token redundancy rarely occurs in isolation. MSSAVT is therefore typically integrated with temporal redundancy removal (e.g., DTD from TimeChat-Online). The two-stage pipeline first applies a temporal filter to remove frames or patches recurring over time, followed by MSSAVT-based spatial pruning or merging. The resulting composite mask is:

$$
M_{\mathrm{keep}} = M_t \wedge M_s,
$$

where $M_t$ is the temporal mask and $M_s$ the spatial mask. Only tokens surviving both are retained for subsequent inference. On challenging video QA and retrieval benchmarks (StreamingBench, OVO-Bench, Video-MME, LongVideoBench), this pipeline yields up to +4% absolute accuracy improvement over temporal-only baselines at 91.5–93.6% token drop rates, while adding less than 1 ms per-frame latency (Jin et al., 14 Dec 2025).
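A minimal sketch of composing the two stages, assuming per-token boolean masks of matching shape (the random masks below are placeholders for DTD-style temporal filtering and MSSAVT spatial pruning, not their actual implementations):

```python
import numpy as np

def two_stage_keep_mask(temporal_mask, spatial_mask):
    """Compose temporal and spatial keep-masks; a token survives only if both stages keep it."""
    return np.logical_and(temporal_mask, spatial_mask)

# Illustrative shapes: T frames of W x H token grids.
T, W, H = 8, 14, 14
M_t = np.random.rand(T, W, H) > 0.8    # placeholder: tokens kept by the temporal filter
M_s = np.random.rand(T, W, H) > 0.5    # placeholder: tokens kept by MSSAVT spatial pruning
M_keep = two_stage_keep_mask(M_t, M_s)
print(f"retained {M_keep.mean():.1%} of tokens")
```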
7. Empirical Impact and Best Practices
Across diverse model backbones (ViViT, VideoMAE, TimeSformer, Motionformer) and datasets (Kinetics-400, Something-Something v2, EPIC-KITCHENS-100), MSSAVT consistently delivers substantial computational savings with minimal or negligible accuracy loss. For example, approximately 75% spatial token reduction via spatial aggregation modules corresponds to only a 2.7 percentage point drop in R@1 for paragraph-to-video retrieval while halving GFLOPs on QuerYD (Ren et al., 2023).
Observed empirical best practices include favoring constant or increasing layerwise merge/drop schedules (to protect the fine-grained features retained by early layers) and calibrating merge budgets to avoid over-reduction of small or semantically critical objects. Because merged clusters reflect visual similarity alone, heavy merging can eliminate small or fine-grained objects, so careful budget tuning is advised when high fidelity is required (Pollard et al., 4 Jun 2025, Ren et al., 2023).
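As an illustration of these schedules (the 10%-per-block figure echoes the constant schedule reported above; the increasing and decreasing variants are simple linear ramps assumed here for concreteness, not the cited configurations):

```python
def token_budget_schedule(n_tokens, n_layers, mode="constant", rate=0.10):
    """Return the number of tokens to remove at each layer under a given schedule."""
    if mode == "constant":
        per_layer = [rate] * n_layers
    elif mode == "increasing":      # prune gently in early layers, more aggressively later
        per_layer = [rate * 2 * (l + 1) / (n_layers + 1) for l in range(n_layers)]
    else:                           # "decreasing": prune aggressively early, gently later
        per_layer = [rate * 2 * (n_layers - l) / (n_layers + 1) for l in range(n_layers)]
    budgets, remaining = [], n_tokens
    for r in per_layer:
        drop = int(remaining * r)
        budgets.append(drop)
        remaining -= drop
    return budgets

print(token_budget_schedule(n_tokens=1568, n_layers=12, mode="constant"))
```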
References
- StreamingAssistant: Efficient Visual Token Pruning for Accelerating Online Video Understanding (Jin et al., 14 Dec 2025)
- Video, How Do Your Tokens Merge? (Pollard et al., 4 Jun 2025)
- TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding (Ren et al., 2023)