
VideoMiner: Tree-Based Video Understanding

Updated 8 October 2025
  • VideoMiner is an iterative, reinforcement learning–augmented framework for key frame extraction in long videos using adaptive, tree-based segmentation and captioning.
  • It employs tree-based group relative policy optimization (T-GRPO) to dynamically select, refine, and prune video segments based on user queries and temporal coherence.
  • The framework achieves enhanced efficiency and accuracy in applications such as video summarization, question answering, and anomaly detection by balancing depth and computational load.

VideoMiner is an iterative, reinforcement learning–augmented framework for grounding key frames of hour-long videos within multimodal LLM (MM-LLM) pipelines. Its design fundamentally reconsiders the problem of long-video understanding: rather than processing uniformly sampled frames or employing basic hierarchical key frame extraction, VideoMiner organizes analysis as a tree-based progression. This architecture supports scalable, temporally coherent navigation from coarse video segments to fine-grained frames, all under direct supervision of a structured policy optimization module. The system’s primary innovation is the application of tree-based group relative policy optimization (T-GRPO), which enables adaptive, query-guided frame selection and efficient information reduction in dense, extended video streams (Cao et al., 7 Oct 2025).

1. Hierarchical Decomposition of Long Videos

VideoMiner addresses the inherent inefficiency of uniform frame sampling in multi-hour video content. The pipeline begins by segmenting the long video into a series of events, each defined by significant temporal discontinuities. Specifically, frames are first uniformly sampled and represented by normalized grayscale histograms. Scene boundaries are detected via the Bhattacharyya distance:

$$D_t = -\ln \sum_{k=0}^{255} \sqrt{H_t(k) \cdot H_{t+1}(k)}$$

where $H_t(k)$ denotes the normalized grayscale histogram value at gray level $k$ for frame $t$. The $K-1$ highest $D_t$ values define change points, segmenting the video into $K$ events.
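As a minimal sketch of this boundary-detection step (assuming NumPy and already-normalized 256-bin histograms; the function names are illustrative, not from the paper's release):

```python
import numpy as np

def bhattacharyya_distances(histograms):
    """Compute D_t between consecutive normalized grayscale histograms."""
    H = np.asarray(histograms, dtype=float)        # shape (T, 256)
    bc = np.sum(np.sqrt(H[:-1] * H[1:]), axis=1)   # Bhattacharyya coefficient
    return -np.log(np.clip(bc, 1e-12, None))       # D_t for t = 0 .. T-2

def segment_events(histograms, k):
    """Return the K-1 frame indices with the largest D_t as change points."""
    d = bhattacharyya_distances(histograms)
    return np.sort(np.argsort(d)[-(k - 1):])
```

Identical consecutive histograms give a coefficient of 1 and hence $D_t = 0$; the larger the distributional shift between frames, the larger $D_t$.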

For each event, a vision-LLM (VLM) generates a descriptive caption, conditioned on a user’s query to ensure downstream relevance. Each caption is then embedded and clustered (e.g., via DBSCAN), yielding nodes in a hierarchical tree. The tree recursively nests from video-level, to event-level, to shot-level, and finally to frame-level, while preserving strict temporal coherence throughout all branches.
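The clustering step can be illustrated with a simplified, pure-NumPy stand-in for DBSCAN: a greedy grouping of caption embeddings by cosine distance. This is an assumption-laden sketch — the actual pipeline uses a learned caption encoder and a true density-based algorithm:

```python
import numpy as np

def cluster_captions(embeddings, eps=0.3):
    """Greedy density-style grouping of caption embeddings.

    Captions within `eps` cosine distance of a cluster's first member
    join that cluster; otherwise they seed a new tree node.
    """
    E = np.asarray(embeddings, dtype=float)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize
    labels, seeds = [], []
    for v in E:
        for lab, s in enumerate(seeds):
            if 1.0 - float(v @ s) <= eps:             # cosine distance test
                labels.append(lab)
                break
        else:
            seeds.append(v)                           # new cluster seed
            labels.append(len(seeds) - 1)
    return labels
```

Each resulting label corresponds to one child node in the hierarchical tree, grouping temporally ordered events with semantically similar captions.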

2. Policy Optimization: Tree-Based Group Relative Policy Optimization (T-GRPO)

VideoMiner employs T-GRPO, a bespoke reinforcement learning approach tailored to hierarchical, tree-based video structures. At every node, T-GRPO receives as input the current caption, the user query $Q$, and the node’s tree depth. It must issue one of three decisions:

  • Accept: designate the node as an informative key frame candidate
  • Continue: subdivide the node further, triggering deeper segmentation and captioning
  • Delete: prune the branch as irrelevant

Action selection is trained using node-level and tree-level rewards:

  • Format Reward: Ensures the output matches required answer structures
  • Length Reward: Encourages concise responses by comparing the output length $l_o$ to a target $l_t$ through a Gaussian function:

$$r_{\text{length}}(l_o) = \rho \cdot \exp\left(-\frac{(l_o - l_t)^2}{2\sigma^2}\right)$$

  • Action Reward: Applies distinct values for each possible action (accept/continue/delete) to balance efficiency and coverage.
  • Tree Growth Auxin ($\lambda_{\text{auxin}}$): Dynamically regulates expansion depth across the tree, further tuning the trade-off between thoroughness and computational load.
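A hedged sketch of how these reward components might combine at a single node — the constants and the exact composition rule below are illustrative assumptions, not values from the paper:

```python
import math

# Illustrative per-action reward values (not the paper's constants).
ACTION_REWARD = {"accept": 1.0, "continue": 0.2, "delete": 0.5}

def length_reward(l_o, l_t=64, rho=1.0, sigma=16.0):
    """Gaussian length reward: peaks when output length l_o hits target l_t."""
    return rho * math.exp(-((l_o - l_t) ** 2) / (2 * sigma ** 2))

def node_reward(action, output_len, format_ok, lam_auxin=1.0):
    """Combine format, length, and action rewards at one tree node.

    lam_auxin scales the 'continue' incentive, so larger values push the
    policy toward deeper tree expansion, smaller values toward pruning.
    """
    r = (1.0 if format_ok else 0.0) + length_reward(output_len)
    a = ACTION_REWARD[action]
    if action == "continue":
        a *= lam_auxin
    return r + a
```

Because each component is independently tunable, raising $\lambda_{\text{auxin}}$ or the `continue` reward shifts the policy toward thoroughness, while raising the `delete` reward shifts it toward computational economy.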

The total policy loss is formulated with clipped policy gradients and a KL-divergence regularization:

$$\mathcal{L}_{\text{T-GRPO}} = \mathbb{E}\left[\sum_{i,j} \min\left(c,\ \text{clip}(c,\, 1-\epsilon,\, 1+\epsilon)\right) \cdot \text{Adv}_{ij}\right] - \beta \cdot D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right)$$

where $c = \pi_\theta(o_{ij} \mid q) / \pi_{\theta_{\text{old}}}(o_{ij} \mid q)$ is the importance ratio and $\text{Adv}_{ij}$ is the normalized advantage.
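For illustration, a per-node version of this objective can be computed as follows — a standard PPO-style clipped surrogate with a k3-type KL estimator; the paper's exact batching and KL estimator may differ:

```python
import math

def t_grpo_loss(logp_new, logp_old, logp_ref, adv, eps=0.2, beta=0.01):
    """Clipped surrogate objective with a KL penalty toward a reference policy.

    Inputs are per-node log-probabilities and advantages over the tree.
    Returns a scalar loss to *minimize* (negative surrogate + KL term).
    """
    surrogate, kl = 0.0, 0.0
    for ln, lo, lr, a in zip(logp_new, logp_old, logp_ref, adv):
        c = math.exp(ln - lo)                      # importance ratio
        clipped = min(max(c, 1 - eps), 1 + eps)    # clip(c, 1-eps, 1+eps)
        surrogate += min(c * a, clipped * a)       # clipped policy gradient
        r = math.exp(lr - ln)
        kl += r - 1.0 - math.log(r)                # k3 estimator of KL(new||ref)
    return -surrogate + beta * kl
```

The clipping keeps each policy update within a trust region around the old policy, while the KL term penalizes drift away from the reference model.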

3. Iterative Segmentation, Captioning, and Clustering

The methodology progresses iteratively:

  • Scene Segmentation: Long videos are partitioned into events using the aforementioned histogram-based approach.
  • Event Captioning: Each video event is captioned by a VLM, explicitly conditioned on a task or user question to focus downstream clustering.
  • Semantic Clustering: Event captions are embedded and grouped using density-based algorithms (e.g., DBSCAN). This forms the next layer in the hierarchical tree, recursively driving further exploration as needed.
  • Hierarchical Navigation: T-GRPO policy determines, at each tree node, whether to accept a key frame, refine granularity further, or discard a branch. This allows for adaptive, query-sensitive frame selection at arbitrary temporal scales.
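The iterative loop above can be sketched as a policy-driven tree traversal; all callables here are placeholders for the learned components described in the text:

```python
from collections import deque

def mine_keyframes(root, policy, captioner, expand, max_depth=6):
    """Breadth-first traversal of the video tree driven by a T-GRPO-style policy.

    `policy(caption, query, depth)` returns 'accept', 'continue', or 'delete';
    `expand(node)` re-segments, captions, and clusters a node into children.
    """
    keyframes, queue = [], deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        action = policy(captioner(node), node["query"], depth)
        if action == "accept":
            keyframes.append(node)           # informative key frame candidate
        elif action == "continue" and depth < max_depth:
            for child in expand(node):       # deeper segmentation + captioning
                queue.append((child, depth + 1))
        # 'delete' (or hitting the depth limit) prunes the branch entirely
    return keyframes
```

Because pruned branches are never expanded, the cost of traversal scales with the number of query-relevant segments rather than with total video length.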

4. Performance Gains and Emergent Reasoning

VideoMiner demonstrates superior accuracy and efficiency in long-video understanding tasks, consistently outperforming prior end-to-end or hierarchical methods across benchmarks—including sports highlight detection and video question answering (Cao et al., 7 Oct 2025). An unanticipated finding is that T-GRPO incentivizes the model to generate chain-of-thought (CoT) reasoning paths during exploration, spontaneously yielding interpretable intermediate explanations as it traverses the tree.

The inclusion of tree growth auxin ($\lambda_{\text{auxin}}$) enables dynamic adjustment of search depth and breadth, conferring both computational efficiency (reducing unnecessary expansions) and improved key frame localization. The approach maintains temporal coherence and query-adaptivity at all levels of tree traversal.

5. Technical Framework and Mathematical Formulations

The technical core of VideoMiner is framed by precise mathematical definitions:

  • Scene Segmentation: Event boundaries are defined by maximizing Bhattacharyya distance between frame histograms.
  • Event Embedding: Captions are projected via a fixed VLM into a vector space for downstream clustering.
  • Policy Optimization: The reinforcement learning loss incorporates advantage estimation, length and format rewards, clipped policy ratios, and KL-divergence regularization to constrain policy drift.
  • Reward Structure: Action, length, and format rewards are independently tunable, enabling fine-grained reinforcement of desired output characteristics.

All steps are tightly integrated to maintain the balance between granularity, coverage of relevant content, and computational efficiency.

6. Applications, Impact, and Extensions

VideoMiner’s explicit targeting of redundant information and adaptive hierarchical reasoning renders it applicable to a wide range of long-video understanding scenarios:

  • Video Summarization and Highlight Detection: By focusing on events of interest and discarding redundancy, the system supports extraction of concise highlights in sports, entertainment, and surveillance.
  • Video Question Answering (QA): Query-driven navigation ensures that the frames and sub-events most relevant to a user’s question are prioritized, improving factual correctness and efficiency.
  • Automated Captioning and Anomaly Detection: Structured tree traversal supports both accurate summarization and discovery of anomalies, as irregular branches/events can be isolated and analyzed.
  • Human-Centered Interfaces: The preserved temporal coherence and chain-of-thought outputs facilitate explainability in educational, analytical, and interactive AI systems.

The public code release supports reproducibility and extension, enabling further research on MM-LLM–driven long-form video analysis.

7. Comparative Perspective and Future Work

VideoMiner’s tree-based, RL-guided adaptive key frame localization differentiates it from classical uniform sampling and prior hierarchical methods, which are vulnerable to redundancy and struggle with dynamic adaptation to content structure. By integrating VLM-based captioning, density-based clustering, and a policy model sensitive to temporal and semantic context, the framework achieves joint improvements in speed, accuracy, and interpretability.

A plausible implication is that further advances may emerge from more expressive reward structures, integration with self-supervised event boundary detection, and the extension of tree growth policies to additional multimodal cues (e.g., audio, motion). The observed emergence of chain-of-thought reasoning suggests fertile ground for future work at the intersection of video understanding and explainable AI.


In summary: VideoMiner is an advanced, modular framework for scalable hour-long video understanding. By leveraging tree-based iterative segmentation, captioning, clustering, and a reinforcement learning policy (T-GRPO) for node exploration, it substantially reduces information redundancy and adaptively grounds key frames in a temporally coherent, query-sensitive manner. These technical advances underpin its state-of-the-art performance across a range of demanding long-video analysis tasks (Cao et al., 7 Oct 2025).
