TransV: Efficient Token Transfer for Long Videos
- TransV is a token information transfer module that compresses and transfers redundant vision tokens into instruction tokens for efficient long-video understanding.
- It leverages gated cross-attention and staged token dropping to significantly reduce computational load while preserving multimodal reasoning across >10,000 frames.
- Experimental results demonstrate up to a 40.1% throughput increase with minimal accuracy loss, validating its effectiveness in large-scale vision-language models.
TransV is a token information transfer module designed for large vision-language models (VLMs), targeting efficient and accurate long-video understanding. Conceived within the TimeViper architecture, TransV addresses the marked redundancy of vision tokens that arises in deep layers of hybrid Mamba-Transformer LLMs by transferring and compressing vision-token information into the more computationally tractable instruction/text tokens. This mechanism enables processing of hour-long videos exceeding 10,000 frames without substantial losses in multimodal reasoning or overall task accuracy (Xu et al., 20 Nov 2025).
1. Design Motivations and Objectives
Long-video understanding for vision-language tasks imposes two principal requirements: handling extremely long token sequences (arising from frame-wise video tokenization) and sustaining high-level reasoning capacity through multimodal fusion. TimeViper's empirical findings reveal a "vision-to-text aggregation" phenomenon: with increasing LLM depth, the bulk of vision-token information is integrated into the instruction/text tokens, leaving severe vision-token redundancy in the deeper layers. TransV operationalizes this insight by:
- Exploiting vision-to-text aggregation for in-LLM compression rather than limiting reduction at the vision projection stage.
- Systematically transferring vision-token content into instruction tokens mid- and late-network, then aggressively dropping redundant vision tokens.
- Allowing efficient support for contexts spanning over 10,000 video frames (or one hour of video), with negligible multimodal understanding loss, thereby dramatically improving computational throughput.
The design further incorporates gating to regulate the influx of visual information into the textual stream, optimizing for both efficiency and multimodal fusion integrity.
2. Architecture and Integration within TimeViper
TimeViper's pipeline begins with raw video frames sampled at 1 frame per second and encoded via a Vision Transformer (ViT). Each frame’s 768 output tokens are compressed to 16 using a projector and ToMe (Token Merging), yielding a vision-token feature matrix $X_V \in \mathbb{R}^{16T \times d}$ for $T$ sampled frames. Instruction tokens $X_I$ are then concatenated, and the joint sequence is fed through a hybrid LLM backbone comprising 27 Mamba-2 (state-space) layers, interleaved with 4 full self-attention layers and 25 MLP layers.
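To make the sequence-length pressure concrete, here is a back-of-the-envelope token count under the numbers above (a minimal sketch; constant and function names are illustrative, not from the released code):

```python
# Rough token-count arithmetic for TimeViper's input pipeline (illustrative constants).
FPS_SAMPLED = 1          # frames sampled per second of video
TOKENS_PER_FRAME = 16    # 768 ViT tokens/frame compressed by the projector + ToMe

def vision_token_count(video_seconds: int) -> int:
    """Vision tokens entering the LLM for a video of the given duration."""
    frames = video_seconds * FPS_SAMPLED
    return frames * TOKENS_PER_FRAME

print(vision_token_count(3_600))    # one hour of video  -> 57,600 vision tokens
print(vision_token_count(10_000))   # 10,000 frames at 1 fps -> 160,000 vision tokens
```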
TransV is invoked twice within this backbone:
- Shallow depth (Layer 7): Uniformly drops 50% of vision tokens.
- Deep depth (Layer 39): Drops 90% of remaining vision tokens by attention-guided selection.
At each TransV site, the module compresses the pruned vision tokens into instruction tokens via gated cross-attention, passing only a reduced sequence into subsequent layers. By the end, as little as 5% of the original vision tokens persist in deep layers, substantially lowering downstream computational demands.
Specialized data packing accommodates variable batch lengths during training, and batching ensures uniform sampling and dropping rates across video instances.
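The paper does not spell out the packing scheme; as one plausible sketch (assuming a simple concatenate-and-record-cumulative-lengths layout, with illustrative names), variable post-drop lengths can be batched like this:

```python
import torch

def pack_variable_length(seqs):
    """Pack a list of [len_i, d] token tensors into one [sum(len_i), d] tensor plus
    cumulative sequence lengths (a common varlen-batching layout; illustrative,
    not necessarily TimeViper's actual data loader)."""
    lengths = torch.tensor([s.shape[0] for s in seqs])
    cu_seqlens = torch.cat([torch.zeros(1, dtype=torch.long), lengths.cumsum(0)])
    packed = torch.cat(seqs, dim=0)
    return packed, cu_seqlens

# Three samples whose vision tokens were dropped to different lengths.
d = 8
seqs = [torch.randn(120, d), torch.randn(64, d), torch.randn(200, d)]
packed, cu_seqlens = pack_variable_length(seqs)
print(packed.shape, cu_seqlens.tolist())  # torch.Size([384, 8]) [0, 120, 184, 384]
```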
3. Mathematical Formulation
TransV is composed of two principal operators: token dropping and gated cross-attention transfer.
- Token Dropping: For layer $l$, the current vision-token matrix $X_V^{l}$ undergoes dropping at a rate $p_l$:

$$
X_V^{\mathrm{drop}} =
\begin{cases}
\mathrm{UniformDrop}\left(X_V^{l},\ \lfloor p_l N_V \rfloor\right), & \text{if “uni” strategy} \\
\mathrm{TopKDrop}\left(X_V^{l},\ s,\ \lfloor p_l N_V \rfloor\right), & \text{if “attn” strategy}
\end{cases}
$$

where $N_V$ is the current number of vision tokens and $s$ denotes the instruction-to-vision attention scores used to rank token relevance (a short code sketch follows this list).
- Gated Cross-Attention Transfer: The transferred instruction update is computed as

$$
X_I^{l} \leftarrow X_I^{l} + \tanh(\alpha_l)\,\mathrm{CrossAttn}\left(Q = X_I^{l},\; K,V = X_V^{\mathrm{drop}}\right),
$$

where $\alpha_l$ is a learnable scalar gate, initialized to zero, explicitly regulating the influx of visual content into the instruction stream (sketched in code at the end of this section).
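A minimal PyTorch sketch of the two dropping strategies defined above (the function names, score computation, and token counts are illustrative assumptions, not the paper's implementation):

```python
import torch

def uniform_drop(x_v: torch.Tensor, n_drop: int) -> torch.Tensor:
    """Drop n_drop vision tokens at evenly spaced positions ("uni" strategy)
    and return the retained tokens."""
    n = x_v.shape[0]
    drop_idx = torch.linspace(0, n - 1, steps=n_drop).round().long()
    keep_mask = torch.ones(n, dtype=torch.bool)
    keep_mask[drop_idx] = False
    return x_v[keep_mask]

def attn_guided_drop(x_v: torch.Tensor, x_i: torch.Tensor, n_drop: int) -> torch.Tensor:
    """Drop the n_drop vision tokens receiving the least total attention from the
    instruction tokens ("attn" strategy) and return the retained tokens."""
    d = x_v.shape[-1]
    scores = torch.softmax(x_i @ x_v.T / d**0.5, dim=-1)  # [N_I, N_V]
    relevance = scores.sum(dim=0)                         # attention mass per vision token
    drop_idx = relevance.topk(n_drop, largest=False).indices
    keep_mask = torch.ones(x_v.shape[0], dtype=torch.bool)
    keep_mask[drop_idx] = False
    return x_v[keep_mask]

# Example: 50% uniform drop on 160 vision tokens, then a 90% attention-guided drop.
x_v, x_i = torch.randn(160, 64), torch.randn(12, 64)
x_v = uniform_drop(x_v, n_drop=80)           # 160 -> 80 tokens
x_v = attn_guided_drop(x_v, x_i, n_drop=72)  # 80 -> 8 tokens (~5% of the original)
print(x_v.shape)  # torch.Size([8, 64])
```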
No supplementary loss terms are employed: all parameters and transfer gating are implicitly learned via the end-to-end instructional-tuning cross-entropy objective. This design obviates the need for explicit information preservation or reconstruction penalties.
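A corresponding sketch of the gated cross-attention transfer, assuming a standard multi-head attention module and illustrative dimensions (not the released implementation):

```python
import torch
import torch.nn as nn

class GatedCrossAttentionTransfer(nn.Module):
    """Transfers information from the to-be-dropped vision tokens into the instruction
    tokens via cross-attention, gated by tanh(alpha) with alpha initialized to zero so
    the module starts as an identity on the instruction stream."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.alpha = nn.Parameter(torch.zeros(1))  # learnable scalar gate

    def forward(self, x_i: torch.Tensor, x_v_drop: torch.Tensor) -> torch.Tensor:
        # x_i:      [B, N_I, d]    instruction tokens (queries)
        # x_v_drop: [B, N_drop, d] vision tokens selected for dropping (keys/values)
        update, _ = self.cross_attn(query=x_i, key=x_v_drop, value=x_v_drop)
        return x_i + torch.tanh(self.alpha) * update

# Example usage with toy dimensions.
transfer = GatedCrossAttentionTransfer(d_model=64)
x_i = torch.randn(2, 12, 64)        # batch of 2, 12 instruction tokens
x_v_drop = torch.randn(2, 80, 64)   # 80 vision tokens marked for dropping
print(transfer(x_i, x_v_drop).shape)  # torch.Size([2, 12, 64])
```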
4. Algorithmic Workflow
A forward pass incorporating TransV proceeds as follows:
```
Inputs: vision tokens X0⁰, instruction tokens X1⁰
for l in 1…L:
    if l in {7, 39}:
        p ← (l == 7 ? 0.5 : 0.9)
        Td ← floor(p * size(X0))
        if l == 7:
            X0_drop ← UniformDrop(X0, Td)
        else:
            scores ← AttentionScores(query=X1, keys=X0)
            X0_drop ← TopKDrop(X0, scores, Td)
        # Cross-attend and update X1
        X1_tilde ← CrossAttn(Q=X1, KV=X0_drop)
        α_l ← learnable_scalar[l]
        X1 ← X1 + tanh(α_l) * X1_tilde
        # Drop X0 in place
        X0 ← X0 \ X0_drop
    end if
    # Hybrid Mamba/Attention/MLP block
    X ← HybridBlock_l([X0; X1])
end for
Output Y from the final X1 token positions
```
Uniform-drop is applied at the shallow site (Layer 7), while attention-guided drop (retaining tokens most relevant to instructions) is applied at the deeper site (Layer 39). Variable sequence lengths resulting from dropping are managed by tailored data loaders.
5. Computational Impact and Scaling
In the baseline pure-Transformer scenario, per-layer computational cost scales as $O(N^{2} d)$ for self-attention and $O(N d^{2})$ for MLPs, where $N$ is the joint sequence length and $d$ the hidden dimension.
TimeViper's hybrid backbone vastly improves upon this: each Mamba-2 layer is $O(N)$ (linear in sequence length), and only four layers use full self-attention.
TransV’s staged dropping achieves:
- At Layer 7: vision tokens reduced to 50% of the original count.
- At Layer 39: further reduced to roughly 5% of the original (a 90% drop of the 50% residual).
Downstream of Layer 39, all LLM blocks operate on as little as 5% of the original vision tokens. This reduces post-TransV computational cost to approximately 5% of baseline.
Empirically, throughput increases by 40.1% when using TransV, with the maximum supported video context extending from 5,000 frames (no compression) to above 10,000 frames (>1 hour at 1 fps).
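A back-of-the-envelope cost comparison consistent with these figures (illustrative only; constant factors and hidden dimensions cancel, and the instruction-token count is an assumption):

```python
# Per-layer cost ratios after TransV dropping, relative to no dropping.
N_V = 10_000 * 16                    # vision tokens: 10,000 frames x 16 tokens/frame
N_I = 256                            # assumed instruction-token count (not from the paper)

def linear_cost(n_v):  return n_v + N_I          # Mamba-2 / MLP layers: O(N)
def attn_cost(n_v):    return (n_v + N_I) ** 2   # self-attention layers: O(N^2)

for frac in (1.0, 0.5, 0.05):        # no drop, after layer 7, after layer 39
    n_v = int(N_V * frac)
    print(f"vision fraction {frac:>4}: "
          f"linear layers {linear_cost(n_v) / linear_cost(N_V):.3f}x, "
          f"attention layers {attn_cost(n_v) / attn_cost(N_V):.4f}x")
# After layer 39, linear-cost layers run at roughly 5% of baseline and the remaining
# self-attention layers at well under 1%, in line with the ~5% figure quoted above.
```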
6. Experimental Results and Ablation Analyses
Comprehensive ablation studies illuminate the utility and trade-offs of TransV:
| Setting | Max Frames | VideoMME Acc | VDC Score | Charades mIoU |
|---|---|---|---|---|
| None | 5,000 | 58.8% | 39.7 | 40.5 |
| TD uni_7_0.5 (drop only, no transfer) | 8,000 | 57.3% | 39.0 | 26.1 |
| uni_7_0.5 + update X1 only | 8,000 | 56.7% | 38.9 | 38.1 |
| uni_7_0.5 + uni_39_0.9 | >10,000 | 56.2% | 39.1 | 37.9 |
| uni_7_0.5 + attn_39_0.9 | >10,000 | 56.6% | 39.0 | 37.9 |
Ablations show that naïvely dropping vision tokens (without transfer) severely harms vision-centric metrics (e.g., Charades mIoU drops from 40.5 to 26.1). Integrating TransV’s gated cross-attention substantially recovers this performance, while allowing context extension and improved efficiency. Using both shallow (50%) and deep (90%) TransV stages, TimeViper accommodates >10,000 frames with only a marginal two-point drop in VideoMME accuracy.
Furthermore, performance-vs-compression plots show negligible degradation for up to 100% dropping of deep vision tokens, supporting the central hypothesis of vision-token redundancy in deep layers.
7. Interpretability, Redundancy, and Attention Patterns
TransV’s operation provides new insights into the flow and redundancy of multimodal information in hybrid LLMs:
- Vision-to-Text Aggregation: Layerwise attention-blocking experiments reveal that for instruction-centric tasks (QA/grounding), almost all vision content is aggregated into instruction tokens by mid-to-late depths. In contrast, vision-centric tasks (captioning) depend on direct shallow-layer access to vision tokens. (A minimal probe of this kind is sketched after this list.)
- Redundancy Metrics: Performance is highly sensitive to token dropping in early layers, but becomes robust to massive dropping after the first self-attention block (layer 14); thus deep vision tokens are almost entirely redundant.
- Hybrid Attention Patterns: Analysis of Mamba vs. Transformer layers demonstrates diverse implicit attention distributions ("sparse", "local", "global"), with hybrid models (e.g., TimeViper-Nano) sustaining stronger vision token-centric attention compared to pure-Transformer baselines.
- Qualitative Evaluation: TransV-equipped models achieve high mean Intersection-over-Union (IoU = 0.75) for event localization, robust multi-choice QA, and detailed captioning in hour-long videos.
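As a sketch of the layerwise attention-blocking probe referenced above (a hypothetical helper; the paper's probing code is not reproduced here), one can forbid instruction-to-vision attention from a chosen layer onward and measure the resulting accuracy change:

```python
import torch

def blocked_attn_mask(n_vision: int, n_instr: int, block: bool) -> torch.Tensor:
    """Boolean attention mask for one layer (True = disallowed). With block=True,
    instruction-token queries cannot attend to vision-token keys, cutting off any
    further vision-to-text aggregation at that layer."""
    n = n_vision + n_instr
    mask = torch.zeros(n, n, dtype=torch.bool)
    if block:
        mask[n_vision:, :n_vision] = True   # instruction rows x vision columns
    return mask

# Probe idea: sweep the first blocked layer over depth; if accuracy is unaffected once
# blocking starts beyond the mid layers, vision information has already been absorbed
# into the instruction tokens by that depth.
mask = blocked_attn_mask(n_vision=160, n_instr=12, block=True)
print(mask.shape, int(mask.sum()))  # torch.Size([172, 172]) 1920
```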
Summary
TransV is a stateless, in-LLM compress-and-transfer module that capitalizes on hybrid LLMs’ inherent vision-to-text information aggregation. By inserting gated cross-attention and aggressive token dropping at strategic depths, it enables deep models such as TimeViper to process previously infeasible video lengths (>10,000 frames) with only minimal task-specific degradation. The resulting model achieves up to 40.1% higher throughput relative to pure-Transformer backbones with negligible loss in core metrics. All details, mathematical formulations, ablation protocols, and interpretability analyses are sourced directly from the work of Xu et al. (2025) (Xu et al., 20 Nov 2025).