
STC-Pruner: Accelerating VideoLLMs

Updated 7 December 2025
  • STC-Pruner is a hierarchical mechanism that prunes spatially and temporally redundant tokens from video streams to optimize computational efficiency.
  • It employs a fully differentiable scoring strategy based on cosine distances between tokens and computed temporal and spatial anchors.
  • Empirical results demonstrate that STC-Pruner can lower LLM pre-fill latency by up to 46.3% while retaining nearly 99% accuracy compared to uncompressed baselines.

STC-Pruner is a hierarchical token pruning mechanism developed as the second stage within the Streaming Token Compression (STC) framework, designed specifically for accelerating Streaming Video LLMs (VideoLLMs). Following initial acceleration via the STC-Cacher component, which operates at the Vision Transformer (ViT) stage, STC-Pruner addresses inefficiencies in the LLM pre-filling phase by reducing both context length and associated computational loads. It operates by removing spatially and temporally redundant visual tokens from continuous video streams, thereby preserving only salient tokens necessary for downstream reasoning and minimizing pre-fill latency while maintaining accuracy. Its fully differentiable scoring mechanism is deterministic, query-agnostic, and optimized for real-time streaming scenarios (Wang et al., 30 Nov 2025).

1. Architectural Role and Workflow

STC-Pruner functions as the second token-level accelerator in STC, following feature extraction by STC-Cacher. The process begins with a set of dense visual token embeddings $\mathbf{Z}_t = \{z_1, \ldots, z_N\} \in \mathbb{R}^{N \times D}$ for the current video frame chunk. The history buffer $\mathcal{H}$, with fixed capacity $W$, holds summary (anchor) vectors encoding recent frame context.

The primary workflow of STC-Pruner is as follows:

  1. Anchor Computation: computes the temporal anchor $a_{\mathrm{temporal}} = \frac{1}{|\mathcal{H}|}\sum_{h \in \mathcal{H}} h$ (mean of historical anchors) and the spatial anchor $a_{\mathrm{spatial}} = \frac{1}{N}\sum_{j=1}^N z_j$ (mean of the current token set).
  2. Token Scoring: for each token $z_j$, calculates the dynamics score $S(z_j) = \alpha\, d_{\mathrm{cos}}(z_j, a_{\mathrm{temporal}}) + (1-\alpha)\, d_{\mathrm{cos}}(z_j, a_{\mathrm{spatial}})$, using the cosine distance $d_{\mathrm{cos}}(u, v) = 1 - \frac{u \cdot v}{\|u\|\,\|v\|}$.
  3. Pruning: retains the top-$k$ tokens with the highest $S(z_j)$, where $k = \lfloor N\,(1-R_{\mathrm{Pruner}})\rfloor$, controlled by the user-specified pruning ratio $R_{\mathrm{Pruner}} \in [0, 1]$.
  4. History Update: adds the new spatial anchor to $\mathcal{H}$, evicting the oldest entry if the buffer is full.

This sequence eliminates tokens semantically similar to the background (spatial redundancy) or previously observed content (temporal redundancy), optimizing for bandwidth and compute efficiency during context-building in the LLM (Wang et al., 30 Nov 2025).
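The four steps above can be sketched in NumPy. This is an illustrative sketch, not the authors' reference implementation; the function name, the list-based history buffer, and the default values are assumptions:

```python
import numpy as np

def stc_pruner_step(Z, history, alpha=0.5, r_pruner=0.75, window=8):
    """One STC-Pruner step for a frame chunk (illustrative sketch).

    Z       : (N, D) dense visual token embeddings for the current chunk.
    history : list of past spatial anchors, each of shape (D,).
    Returns the pruned tokens (k, D) and the updated history buffer.
    """
    N, _ = Z.shape

    # 1. Anchor computation: mean of history, mean of current tokens.
    a_spatial = Z.mean(axis=0)
    a_temporal = np.mean(history, axis=0) if history else a_spatial

    def cos_dist(tokens, anchor):
        # d_cos(u, v) = 1 - (u . v) / (||u|| ||v||), vectorized over tokens.
        num = tokens @ anchor
        den = np.linalg.norm(tokens, axis=1) * np.linalg.norm(anchor) + 1e-8
        return 1.0 - num / den

    # 2. Dynamics score: alpha-weighted temporal + spatial novelty.
    S = alpha * cos_dist(Z, a_temporal) + (1 - alpha) * cos_dist(Z, a_spatial)

    # 3. Prune: keep the k highest-scoring (most novel) tokens.
    k = int(np.floor(N * (1 - r_pruner)))
    keep = np.argpartition(S, -k)[-k:]   # O(N) average-case selection
    Z_pruned = Z[np.sort(keep)]          # preserve original token order

    # 4. History update: append new spatial anchor, evict oldest if full.
    history = (history + [a_spatial])[-window:]
    return Z_pruned, history

# Example: 16 random tokens, keep 25% (r_pruner = 0.75).
rng = np.random.default_rng(0)
Z = rng.normal(size=(16, 4))
Z_pruned, hist = stc_pruner_step(Z, history=[])
print(Z_pruned.shape)  # (4, 4)
```

With an empty history the temporal anchor falls back to the spatial one, so the first chunk is scored purely by spatial novelty.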

2. Inputs, Outputs, and Hyperparameters

The operational interface for STC-Pruner consists of:

  • Input: dense token sequence $\mathbf{Z}_t$ ($N$ tokens, $D$-dimensional) and history buffer $\mathcal{H}$ (size $W$) of summary anchors.
  • Hyperparameters:
    • Pruning ratio $R_{\mathrm{Pruner}}$: proportion of tokens removed per frame.
    • Balance factor $\alpha$: trade-off between temporal and spatial novelty in scoring.
  • Output: pruned token set $\mathbf{Z}_t' \in \mathbb{R}^{k \times D}$, where $k = \lfloor N(1-R_{\mathrm{Pruner}})\rfloor$, and the updated history buffer.

The design objective is to minimize $|\mathbf{Z}_t'|$ without compromising LLM prediction fidelity. Notably, STC-Pruner is strictly backward-looking, eschewing future context (lookahead), and thus satisfies streaming constraints.
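A quick check of the pruning-budget formula $k = \lfloor N(1-R_{\mathrm{Pruner}})\rfloor$; the token count $N = 196$ (a common ViT patch grid) is an illustrative assumption, not taken from the paper:

```python
import math

N = 196  # hypothetical number of patch tokens per frame
for r in (0.25, 0.5, 0.75):
    k = math.floor(N * (1 - r))
    print(f"R_Pruner={r}: keep k={k} of {N} tokens")
```

Raising $R_{\mathrm{Pruner}}$ shrinks the retained budget linearly, from 147 tokens at $R_{\mathrm{Pruner}}=0.25$ down to 49 at $0.75$.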

3. Scoring Mechanism and Mathematical Formalism

The central operation in STC-Pruner is the per-token dynamics (novelty) score computation, designed to capture both spatial deviation from the frame anchor and temporal deviation from historical anchors. Formally:

  • Anchor Computation:

$$a_{\mathrm{temporal}} = \frac{1}{|\mathcal{H}|} \sum_{h \in \mathcal{H}} h, \qquad a_{\mathrm{spatial}} = \frac{1}{N} \sum_{j=1}^N z_j$$

  • Cosine Distance:

$$d_{\mathrm{cos}}(u, v) = 1 - \frac{u \cdot v}{\|u\| \|v\|}$$

  • Dynamics Score:

$$S(z_j) = \alpha\, d_{\mathrm{cos}}(z_j, a_{\mathrm{temporal}}) + (1 - \alpha)\, d_{\mathrm{cos}}(z_j, a_{\mathrm{spatial}})$$

  • Pruning Budget:

$$k = \big\lfloor N (1 - R_{\mathrm{Pruner}}) \big\rfloor, \qquad \mathbf{Z}_t' = \mathrm{TopK}\big(\{z_j\},\, k,\, \mathrm{key} = S(z_j)\big)$$

Hyperparameter $\alpha$ modulates the prioritization between temporal and spatial information. A larger $R_{\mathrm{Pruner}}$ implies more aggressive compression. The top-$k$ procedure is implemented efficiently using $O(N)$ selection algorithms.
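A minimal worked example of the dynamics score, using hand-picked 2-D vectors so the cosine distances are exact (the vectors are illustrative, not from the paper):

```python
import numpy as np

def d_cos(u, v):
    """Cosine distance: 1 - (u . v) / (||u|| ||v||)."""
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

z = np.array([1.0, 0.0])           # current token
a_temporal = np.array([0.0, 1.0])  # orthogonal to z  -> d_cos = 1 (novel in time)
a_spatial = np.array([1.0, 0.0])   # identical to z   -> d_cos = 0 (redundant in frame)

alpha = 0.5
S = alpha * d_cos(z, a_temporal) + (1 - alpha) * d_cos(z, a_spatial)
print(S)  # 0.5
```

A token orthogonal to the temporal anchor but aligned with the spatial one lands exactly in the middle of the score range at $\alpha = 0.5$, illustrating how the two novelty signals trade off.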

4. Empirical Performance and Ablation

Systematic evaluations on OVO-Bench (EPM/STU/REC subsets, EgoSchema) and end-to-end tests with the ReKV framework substantiate the competitive performance of STC-Pruner. Key ablation results:

| Setting | EPM | STU | REC | Avg |
|---|---|---|---|---|
| Only $a_{\mathrm{spatial}}$ | 50.5 | 47.2 | 25.8 | 41.2 |
| Only $a_{\mathrm{temporal}}$ | 51.5 | 47.8 | 24.1 | 41.1 |
| Joint ($\alpha = 0.5$) | 51.2 | 48.9 | 25.9 | 42.0 |

A hyperparameter sweep on $\alpha$ reveals optimal accuracy for $\alpha$ near $0.5$ (joint weighting), or slightly favoring spatial information, corroborating that leveraging both spatial and temporal signals enhances downstream LLM robustness. For $R_{\mathrm{Pruner}} = 0.75$, LLM pre-fill latency drops by approximately $46.3\%$ with accuracy retention at about $99\%$ relative to uncompressed baselines (Wang et al., 30 Nov 2025).

5. Computational Efficiency and Implementation Aspects

All anchor computations and cosine distance calculations are performed in a vectorized manner over the $N$ tokens, with anchor–token pairwise similarities computed via normalized matrix–vector multiplications. Efficient TopK selection further guarantees that the compression overhead remains sublinear relative to the original context length. The history buffer $\mathcal{H}$, being of small fixed size (typically $W \approx 8$–$16$), incurs negligible memory impact.
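One way to realize the $O(N)$ selection (an implementation assumption, with synthetic scores) is NumPy's `argpartition`, which places the indices of the $k$ largest scores in the last $k$ slots without a full sort:

```python
import numpy as np

rng = np.random.default_rng(1)
S = rng.random(1000)  # synthetic dynamics scores for N = 1000 tokens
k = 250

# Average-case O(N) partial selection: no full O(N log N) sort needed.
keep = np.argpartition(S, -k)[-k:]
threshold = S[keep].min()  # smallest retained score
```

Every retained score is at least `threshold`, and every discarded score is at most `threshold`, which is all the TopK step requires.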

STC-Pruner is query-agnostic; it does not employ query information nor future frames for decision-making, strictly adhering to streaming compatibility. The fully differentiable scoring procedure facilitates extension to gradient-based learning or tuning, although in its current form all selection is non-learned and rule-based.

6. Practical Significance and Applications

By dramatically reducing visual token sequence length prior to LLM context building, STC-Pruner lowers both self-attention complexity ($O(n^2)$ in the uncompressed sequence length $n$) and key-value cache requirements. The mechanism is tailored to dense, redundant video streams typical in continuous video understanding tasks for VideoLLMs, where rapid adaptation to novel content and compute tractability are critical.
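Back-of-envelope arithmetic makes the scaling concrete; the context length here is hypothetical, not a figure from the paper:

```python
# Effect of pruning on self-attention and KV-cache cost (illustrative numbers).
n_full = 10_000  # hypothetical uncompressed visual context length
r_pruner = 0.75
n_pruned = int(n_full * (1 - r_pruner))

attn_ratio = (n_pruned / n_full) ** 2  # self-attention scales as O(n^2)
kv_ratio = n_pruned / n_full           # KV cache scales as O(n)
print(attn_ratio, kv_ratio)  # 0.0625 0.25
```

At a 75% pruning ratio the quadratic attention cost shrinks by 16x while the KV cache shrinks by 4x, which is why the savings compound as contexts grow.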

Its plug-and-play operation allows seamless integration with diverse streaming VideoLLM architectures and frameworks, and its deterministic, parameter-free core aids deployment in latency-sensitive and resource-constrained environments. End-to-end, the combined STC framework, chiefly through the contribution of STC-Pruner, delivers practical acceleration gains with negligible degradation in perceptual or task accuracy (Wang et al., 30 Nov 2025).
