STC-Pruner: Accelerating VideoLLMs
- STC-Pruner is a hierarchical mechanism that prunes spatially and temporally redundant tokens from video streams to optimize computational efficiency.
- It employs a fully differentiable scoring strategy based on cosine distances between tokens and computed temporal and spatial anchors.
- Empirical results demonstrate that STC-Pruner can lower LLM pre-fill latency by up to 46.3% while retaining nearly 99% accuracy compared to uncompressed baselines.
STC-Pruner is a hierarchical token pruning mechanism developed as the second stage within the Streaming Token Compression (STC) framework, designed specifically for accelerating Streaming Video LLMs (VideoLLMs). Following initial acceleration via the STC-Cacher component, which operates at the Vision Transformer (ViT) stage, STC-Pruner addresses inefficiencies in the LLM pre-filling phase by reducing both context length and associated computational loads. It operates by removing spatially and temporally redundant visual tokens from continuous video streams, thereby preserving only salient tokens necessary for downstream reasoning and minimizing pre-fill latency while maintaining accuracy. Its fully differentiable scoring mechanism is deterministic, query-agnostic, and optimized for real-time streaming scenarios (Wang et al., 30 Nov 2025).
1. Architectural Role and Workflow
STC-Pruner functions as the second token-level accelerator in STC, following feature extraction by STC-Cacher. The process begins with a set of dense visual token embeddings $X = \{x_1, \dots, x_N\}$ for the current video frame chunk. The history buffer $H$, with fixed capacity $B$, holds summary (anchor) vectors encoding recent frame context.
The primary workflow of STC-Pruner is as follows:
- Anchor Computation: Computes the temporal anchor $a_t$ (mean of the historical anchors in $H$) and the spatial anchor $a_s$ (mean of the current token set $X$).
- Token Scoring: For each token $x_i$, calculates the dynamics score $s_i = \alpha\, d_{\cos}(x_i, a_t) + (1-\alpha)\, d_{\cos}(x_i, a_s)$, employing the cosine distance $d_{\cos}$.
- Pruning: Retains the top-$k$ tokens with highest $s_i$, where $k = \lceil (1-\rho) N \rceil$ is controlled by the user-specified pruning ratio $\rho$.
- History Update: Adds the new spatial anchor $a_s$ to $H$, evicting the oldest entry if the buffer is full.
This sequence eliminates tokens semantically similar to the background (spatial redundancy) or previously observed content (temporal redundancy), optimizing for bandwidth and compute efficiency during context-building in the LLM (Wang et al., 30 Nov 2025).
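The four-step workflow above can be sketched in NumPy. This is a minimal illustration under assumed conventions (symbol names `alpha`, `rho`, a fixed-capacity `deque` as the history buffer, and a hypothetical `stc_pruner_step` helper), not the paper's reference implementation:

```python
from collections import deque
import numpy as np

def cosine_distance(tokens, anchor):
    """d_cos(x, a) = 1 - cos(x, a), computed for all tokens at once."""
    tokens_n = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    anchor_n = anchor / np.linalg.norm(anchor)
    return 1.0 - tokens_n @ anchor_n

def stc_pruner_step(tokens, history, alpha=0.5, rho=0.5):
    """One pruning step: score tokens, keep top-k, update the history buffer."""
    n = tokens.shape[0]
    # Spatial anchor: mean of the current token set. Temporal anchor: mean of
    # historical anchors (fall back to the spatial anchor on the first chunk).
    spatial_anchor = tokens.mean(axis=0)
    temporal_anchor = np.mean(history, axis=0) if history else spatial_anchor
    # Dynamics score: alpha-weighted blend of temporal and spatial novelty.
    scores = (alpha * cosine_distance(tokens, temporal_anchor)
              + (1 - alpha) * cosine_distance(tokens, spatial_anchor))
    k = int(np.ceil((1 - rho) * n))
    keep = np.argsort(-scores)[:k]      # indices of the k most novel tokens
    history.append(spatial_anchor)      # deque(maxlen=B) evicts the oldest
    return tokens[keep], history

history = deque(maxlen=8)               # fixed-capacity history buffer H
chunk = np.random.randn(16, 4).astype(np.float32)
pruned, history = stc_pruner_step(chunk, history, alpha=0.5, rho=0.5)
print(pruned.shape)                     # (8, 4)
```

Note that the history update appends the spatial anchor of the full (unpruned) token set, so the temporal reference reflects what the model actually observed in each frame chunk.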
2. Inputs, Outputs, and Hyperparameters
The operational interface for STC-Pruner consists of:
- Input: Dense token sequence $X \in \mathbb{R}^{N \times d}$ ($N$ tokens, $d$-dimensional), and a history buffer $H$ (size $B$) of summary anchors.
- Hyperparameters:
  - Pruning ratio $\rho$: proportion of tokens removed per frame.
  - Balance factor $\alpha$: trade-off between temporal and spatial novelty in scoring.
- Output: Pruned token set $X' \subset X$ with $|X'| = k = \lceil (1-\rho)N \rceil$, and the updated history buffer.
The design objective is minimization of the retained token count $k$ without compromising LLM prediction fidelity. Notably, STC-Pruner is strictly backward-looking, eschewing future context (lookahead), and thus respects streaming constraints.
3. Scoring Mechanism and Mathematical Formalism
The central operation in STC-Pruner is the per-token dynamics (novelty) score computation, designed to capture both spatial deviation from the frame anchor and temporal deviation from historical anchors. Formally:
- Anchor Computation: $a_t = \frac{1}{|H|} \sum_{h \in H} h$, $\quad a_s = \frac{1}{N} \sum_{i=1}^{N} x_i$
- Cosine Distance: $d_{\cos}(u, v) = 1 - \frac{u^\top v}{\|u\|\,\|v\|}$
- Dynamics Score: $s_i = \alpha\, d_{\cos}(x_i, a_t) + (1 - \alpha)\, d_{\cos}(x_i, a_s)$
- Pruning Budget: $k = \lceil (1 - \rho) N \rceil$
The hyperparameter $\alpha$ modulates prioritization between temporal and spatial information, while a larger pruning ratio $\rho$ implies more aggressive compression. The top-$k$ procedure is implemented efficiently using selection algorithms.
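A small worked example makes the formulas concrete. Here a toy token in $\mathbb{R}^2$ is scored against hand-picked orthogonal and identical anchors (all values are illustrative, not from the paper):

```python
import numpy as np

# Toy token and anchors in R^2; check d_cos and the blended score s_i.
x = np.array([1.0, 0.0])
a_t = np.array([0.0, 1.0])   # orthogonal to x: d_cos = 1 - 0 = 1.0
a_s = np.array([1.0, 0.0])   # identical to x:  d_cos = 1 - 1 = 0.0

def d_cos(u, v):
    """Cosine distance: 1 - (u.v) / (||u|| ||v||)."""
    return 1.0 - float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

alpha = 0.5
s = alpha * d_cos(x, a_t) + (1 - alpha) * d_cos(x, a_s)
print(d_cos(x, a_t), d_cos(x, a_s), s)   # 1.0 0.0 0.5

# Pruning budget for N = 100 tokens at pruning ratio rho = 0.75:
k = int(np.ceil((1 - 0.75) * 100))
print(k)                                  # 25
```

A token far from both anchors (high spatial and temporal novelty) scores near 1 and survives pruning; a token matching recent history and the frame mean scores near 0 and is dropped.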
4. Empirical Performance and Ablation
Systematic evaluation on OVO-Bench (EPM/STU/REC subsets) and EgoSchema, together with end-to-end tests in the ReKV framework, substantiates the competitive performance of STC-Pruner. Key ablation results:
| Setting | EPM | STU | REC | Avg |
|---|---|---|---|---|
| Temporal term only | 50.5 | 47.2 | 25.8 | 41.2 |
| Spatial term only | 51.5 | 47.8 | 24.1 | 41.1 |
| Joint ($\alpha = 0.5$) | 51.2 | 48.9 | 25.9 | 42.0 |
A hyperparameter sweep on $\alpha$ reveals optimal accuracy for $\alpha$ near 0.5 (joint weighting), or slightly favoring spatial information, corroborating that leveraging both spatial and temporal signals enhances downstream LLM robustness. With pruning enabled, LLM pre-fill latency drops by up to 46.3% with accuracy retention near 99% relative to uncompressed baselines (Wang et al., 30 Nov 2025).
5. Computational Efficiency and Implementation Aspects
All anchor computations and cosine distance calculations are performed in a vectorized manner over the $N$ tokens, with anchor–token pairwise similarities computed via normalized matrix–vector multiplications. Efficient top-$k$ selection further keeps the compression overhead negligible relative to the cost of LLM pre-filling over the original context length. The history buffer $H$, being of small fixed capacity (on the order of 16 entries), incurs negligible memory impact.
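One common way to realize the efficient top-$k$ selection is partial partitioning rather than a full sort; `np.argpartition` runs in O(N) average time versus O(N log N) for `argsort`. A short sketch (the actual STC implementation may use a different selection routine):

```python
import numpy as np

# Select the k highest-scoring token indices without a full sort.
rng = np.random.default_rng(0)
scores = rng.random(10_000)   # stand-in for per-token dynamics scores s_i
k = 2_500

top_k = np.argpartition(scores, -k)[-k:]   # indices of the k largest scores

# Every retained score is at least as large as the k-th largest overall.
assert top_k.shape == (k,)
assert scores[top_k].min() >= np.sort(scores)[-k]
```

Unlike `argsort`, `argpartition` does not order the retained indices among themselves, which is irrelevant here since the kept tokens are passed on as a set.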
STC-Pruner is query-agnostic; it does not employ query information nor future frames for decision-making, strictly adhering to streaming compatibility. The fully differentiable scoring procedure facilitates extension to gradient-based learning or tuning, although in its current form all selection is non-learned and rule-based.
6. Practical Significance and Applications
By dramatically reducing visual token sequence length prior to LLM context building, STC-Pruner lowers both self-attention complexity (quadratic, $O(L^2)$, in the uncompressed sequence length $L$) and key-value cache requirements. The mechanism is tailored to dense, redundant video streams typical in continuous video understanding tasks for VideoLLMs, where rapid adaptation to novel content and compute tractability are critical.
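A quick arithmetic check shows how the savings compose. With illustrative numbers (an assumed 8,192-token visual context and a 50% pruning ratio, not figures from the paper), halving the sequence quarters the attention compute and halves the KV cache:

```python
# Attention FLOPs scale as L^2; the KV cache scales as L. A pruning ratio
# rho shrinks the visual context from L to (1 - rho) * L tokens.
L = 8_192                             # hypothetical uncompressed context length
rho = 0.5                             # hypothetical pruning ratio
L_pruned = int((1 - rho) * L)

attn_ratio = (L_pruned / L) ** 2      # fraction of attention compute remaining
kv_ratio = L_pruned / L               # fraction of KV cache remaining
print(L_pruned, attn_ratio, kv_ratio)   # 4096 0.25 0.5
```

The quadratic term explains why even moderate pruning ratios translate into large pre-fill latency reductions.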
Its plug-and-play operation allows seamless integration with diverse streaming VideoLLM architectures and frameworks, and its deterministic, parameter-free core aids deployment in latency-sensitive and resource-constrained environments. End-to-end, the combined STC framework, chiefly through the contribution of STC-Pruner, delivers practical acceleration gains with negligible degradation in perceptual or task accuracy (Wang et al., 30 Nov 2025).