
Gated Residual Tokenization for Dense Video Analysis

Updated 20 September 2025
  • Gated Residual Tokenization (GRT) is a framework that combines motion-aware gating with semantic merging to reduce redundant tokenization and computational overhead in video analysis.
  • GRT employs inter-frame pixel- and patch-level gating to filter static regions, utilizing SSIM-based thresholds and a pretrained Vision Transformer for dynamic content tokenization.
  • The framework demonstrates sub-linear token growth and efficient performance at high frame rates, enhancing applications like lecture comprehension and dense temporal QA.

Gated Residual Tokenization (GRT) is a framework for efficient, dense video understanding designed to alleviate the overhead of high-frame-rate video tokenization in video large language models (VLLMs). Unlike traditional methods that rely on uniform or keyframe sampling, GRT employs motion-aware gating and semantic scene merging to achieve sub-linear token-count growth while retaining the frame-by-frame temporal detail critical for tasks such as lecture comprehension and dense temporal QA.

1. Conceptual Foundation and Problem Motivation

GRT responds to two principal bottlenecks in high-FPS video modeling: the steep computational cost of tokenizing every frame (token count grows linearly with frame rate, and attention cost grows quadratically with token count), and the redundancy introduced by static or slowly varying content. Conventional VLLMs typically discard most frame-level information because processing it is prohibitively expensive, which restricts dense reasoning and fine-grained QA. GRT introduces a dual-stage solution, inter-frame motion gating followed by intra-scene semantic merging, which balances the need for temporal detail against practical compute and memory constraints (Zhang et al., 17 Sep 2025).

2. Motion-Compensated Inter-Gated Tokenization

The initial stage of GRT uses pixel-level and patch-level motion estimation to filter out redundant regions prior to tokenization.

  • Pixel-Level Residuals: For a scene $s$ with keyframe $f_{s,k}$ and subsequent frames $f_{s,k+1}, \ldots$, the residual information is captured as

$$\Delta f_{s,k+j} = M_{s,k+j} \odot (f_{s,k+j} - f_{s,k+j-1}),$$

where $M_{s,k+j}$ is a binary mask indicating dynamic regions and $\odot$ denotes element-wise multiplication.
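A minimal sketch of this residual step in NumPy, assuming frames arrive as float arrays in $[0, 1]$ with shape (H, W, 3). The threshold-on-absolute-difference mask used here is an illustrative stand-in, since the construction of the motion-compensated mask $M_{s,k+j}$ is not detailed in this summary:

```python
import numpy as np

def pixel_residual(prev_frame: np.ndarray,
                   curr_frame: np.ndarray,
                   motion_thresh: float = 0.05) -> np.ndarray:
    """Compute Delta f = M * (f_j - f_{j-1}) with a binary motion mask M."""
    diff = curr_frame - prev_frame                  # (H, W, 3) raw residual
    # Mark a pixel as dynamic when any channel changes beyond the threshold.
    mask = np.abs(diff).max(axis=-1, keepdims=True) > motion_thresh
    return mask * diff                              # element-wise (Hadamard) product
```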

  • Patch-Level Gating via SSIM: Each frame is divided into patches $P_{s,j}^{(n)}$. For each patch, the mask $M_{s,j}^{(n)}$ is set to 1 if its structural similarity (SSIM) with the corresponding patch in the previous frame falls below a predefined threshold $\tau$:

$$M_{s,j}^{(n)} = \begin{cases} 1, & \text{if } \mathrm{SSIM}\big(P_{s,j}^{(n)}, P_{s,j-1}^{(n)}\big) < \tau \\ 0, & \text{otherwise.} \end{cases}$$

Only patches with $M_{s,j}^{(n)} = 1$ are tokenized via the convolutional branch of a pretrained Vision Transformer (ViT); static patches are zeroed and retained only via positional encoding.
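A sketch of the patch-level gate under stated assumptions: grayscale frames in $[0, 1]$, a patch grid that divides the frame evenly, scikit-image's structural_similarity as the SSIM implementation, and illustrative values for the patch size and threshold $\tau$:

```python
import numpy as np
from skimage.metrics import structural_similarity

def patch_gate(prev_frame: np.ndarray,
               curr_frame: np.ndarray,
               patch: int = 16,
               tau: float = 0.9) -> np.ndarray:
    """Return a (H/patch, W/patch) mask: 1 where patch SSIM < tau, else 0."""
    H, W = curr_frame.shape
    mask = np.zeros((H // patch, W // patch), dtype=np.uint8)
    for i in range(0, H - patch + 1, patch):
        for j in range(0, W - patch + 1, patch):
            ssim = structural_similarity(
                prev_frame[i:i + patch, j:j + patch],
                curr_frame[i:i + patch, j:j + patch],
                data_range=1.0,
            )
            # Low structural similarity means the patch changed: tokenize it.
            mask[i // patch, j // patch] = ssim < tau
    return mask
```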

  • Sub-Linear Growth: By gating static regions before tokenization, both the total compute and resulting token sequence length grow sub-linearly with frame rate, enabling scalable dense video analysis.

3. Semantic-Scene Intra-Tokenization Merging

After gating, significant redundancy persists across scenes consisting of many static frames. GRT addresses this via:

  • Key-Token / P-Token Scheme: Each scene $s$ maintains a key-token set $\mathcal{T}_{s,k}$ for its keyframe, and a sequence of P-token sets for the dynamic patches in subsequent frames.
  • Semantic Distance-Based Merging: To determine the mergeability of consecutive scenes $(s, t)$, a semantic similarity metric (cosine distance or Jensen-Shannon divergence) is computed between their key-token distributions:

$$d(\mathcal{T}_{s,k}, \mathcal{T}_{t,k}) = 1 - \frac{\langle \mu(\mathcal{T}_{s,k}), \mu(\mathcal{T}_{t,k}) \rangle}{\|\mu(\mathcal{T}_{s,k})\| \, \|\mu(\mathcal{T}_{t,k})\|},$$

where $\mu(\cdot)$ denotes the mean embedding of a token set.

If $d < \delta$ (or $\mathrm{JSD}(\mathcal{T}_{s,k}, \mathcal{T}_{t,k}) < \delta$), scene $t$ is merged into scene $s$ and their key-token embeddings are averaged; a minimal sketch of this merging rule follows the list below.

  • Dynamic Semantics Preservation: Only frames or scenes with sufficient semantic novelty are retained independently in the token sequence, minimizing redundancy without loss of dynamic event information.
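To make the merging rule concrete, here is a minimal sketch of greedy consecutive-scene merging under the cosine-distance variant, assuming each scene is summarized by the mean of its key-token embeddings and that $\delta$ is the threshold from the text; the function names and the default value of $\delta$ are illustrative, not the paper's API:

```python
import numpy as np

def cosine_distance(mu_s: np.ndarray, mu_t: np.ndarray) -> float:
    """d = 1 - <mu_s, mu_t> / (||mu_s|| * ||mu_t||)."""
    return 1.0 - float(mu_s @ mu_t) / (
        np.linalg.norm(mu_s) * np.linalg.norm(mu_t))

def merge_scenes(key_tokens: list[np.ndarray],
                 delta: float = 0.1) -> list[np.ndarray]:
    """Greedily merge consecutive scenes whose key-token means are close."""
    merged = [key_tokens[0]]
    for toks in key_tokens[1:]:
        d = cosine_distance(merged[-1].mean(axis=0), toks.mean(axis=0))
        if d < delta:
            # Average the key-token embeddings of the merged scenes (shapes
            # match, since every keyframe yields the same ViT token grid).
            merged[-1] = (merged[-1] + toks) / 2.0
        else:
            merged.append(toks)
    return merged
```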

4. Mathematical Formulation

The overall tokenization and merging process can be summarized as:

  • At each frame, the gate vector $G_{s,j}$ determines which patches are embedded.
  • Each dynamic patch $p_n$ is embedded as $e_n = W_c p_n + b_c$ (where $W_c$, $b_c$ are learned weights), while static patches are carried only through their positional encoding, $\tilde{e}_n = \mathrm{PE}(e_n)$.
  • Scene-level token sequences are recursively merged as dictated by semantic similarity, with the merged key-token representing the average embedding and the P-tokens concatenated; a PyTorch sketch of the gated embedding step follows.
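The following PyTorch sketch ties the pieces together for a single frame. Here vit_conv, pos_emb, and gate are assumed inputs standing in for the pretrained ViT's patch-embedding convolution, its positional-embedding table, and the SSIM-stage mask:

```python
import torch

def gated_tokenize(frame: torch.Tensor,        # (3, H, W) image tensor
                   gate: torch.Tensor,         # (N,) binary mask, one per patch
                   vit_conv: torch.nn.Conv2d,  # ViT patch-embedding conv
                   pos_emb: torch.Tensor       # (N, D) positional embeddings
                   ) -> torch.Tensor:
    # e_n = W_c p_n + b_c, realized by the ViT's convolutional patch embedding.
    tokens = vit_conv(frame.unsqueeze(0))           # (1, D, H/P, W/P)
    tokens = tokens.flatten(2).transpose(1, 2)[0]   # (N, D) patch tokens
    # Zero out static patches (gate == 0); their position is still encoded.
    tokens = tokens * gate.unsqueeze(-1)
    return tokens + pos_emb                         # (N, D) gated tokens
```

For clarity this sketch embeds every patch and zeroes the static ones afterwards; the actual saving comes from running the embedding only on gated patches, which is what makes token and compute growth sub-linear in frame rate.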

5. Experimental Performance and Scalability

On the DIVE (Dense Information Video Evaluation) benchmark:

  • Model Efficiency: At 1 FPS, GRT cuts tokenization time to 46.4% of the baseline (0.0226 s vs. 0.0487 s); similar reductions hold at higher frame rates.
  • Quality Metrics: A 0.5B-parameter GRT model achieves a MOS of 2.50, outperforming the 7B LLaVA-Video baseline (MOS 1.47) despite using far less compute.
  • Positive Scaling: Both architectural stages of GRT become progressively more effective as frame rate increases, leading to larger token savings and maintained accuracy for dense QA demands.

6. Context and Broader Implications

GRT enables video LLMs to operate at frame-by-frame timescales without being hampered by token-sequence length or processing latency. This is crucial for applications such as instructional video analysis, surveillance, and lecture QA, where information is frequently encoded at the finest frame granularity. GRT’s use of motion gating and semantic merging also generalizes beyond video, providing a blueprint for modality-specific tokenization in other dense sequence domains.

7. Relationship to Other Gated Residual Architectures

GRT draws technical parallels with channel-wise gated residual techniques in binary neural networks (Shen et al., 2019), logic-gated residual blocks for hardware efficiency (Nguyen et al., 24 Jan 2025), and context-sensitive residual gating found in transformers (Dhayalkar, 22 May 2024). In all cases, the distinctive feature is the application of gating to control the flow of residual information—whether at token, patch, or channel granularity—for efficient, robust representation.


GRT represents a principled approach for high-FPS, dense video understanding, combining pixel-level gating with semantic token merging to achieve scalable tokenization and maintain fine-grained temporal precision. Its mechanisms are validated via comprehensive empirical evaluation and tightly integrate with broader developments in residual gating methodology.
