VGent: Modular Multimodal Video & Vision Framework

Updated 16 December 2025
  • VGent is a multimodal AI framework that employs modular architectures and graph-based retrieval to address long video question answering and visual grounding challenges.
  • Its video QA version constructs offline entity graphs and leverages subquery decomposition to filter relevant clips, achieving up to +5.4% accuracy improvements.
  • The visual grounding design separates high-level reasoning from low-level prediction using a frozen MLLM encoder and a detector-based decoder, enabling efficient inference.

VGent refers to two prominent frameworks in multimodal AI, both emphasizing modular architectures for complex video and vision-language tasks. One, “VGent: Graph-based Retrieval-Reasoning-Augmented Generation For Long Video Understanding,” introduces a graph-based method for scaling video question answering (QA) to long sequences. The other, “VGent: Visual Grounding via Modular Design for Disentangling Reasoning and Prediction,” presents a modular encoder–decoder paradigm to disentangle reasoning by Multimodal LLMs (MLLMs) from low-level prediction in visual grounding. Both leverage structured representations and decomposition of reasoning to enhance performance and efficiency over prior approaches (Shen et al., 15 Oct 2025, Kang et al., 11 Dec 2025).

1. Modular Architectures for Multimodal Video and Image Understanding

The first VGent system (Shen et al., 15 Oct 2025) addresses the problem of long-video QA, where context scaling, temporal dependencies, and retrieval noise limit Large Video LLM (LVLM) performance. The pipeline consists of four core modules:

  1. Offline Video Graph Construction: The input video is split into temporally indexed clips; each is passed (with subtitles if available) to an LVLM to extract entities and textual descriptions. These entities are merged into global prototypes, and a graph is constructed whose nodes are clips and whose undirected edges indicate shared prototype entities.
  2. Graph-based Retrieval: At query time, keywords extracted from the question are matched against global entities. Clips are selected by relevance on the entity–text graph using cosine similarity in the BAAI/bge-large-en-v1.5 embedding space, then scored and ranked in a way that preserves temporal and semantic coherence.
  3. Structured Reasoning Step: Retrieved clips are further filtered using an intermediate LVLM-backed verification: the query is decomposed into sub-questions (e.g., presence checks or counts), and each clip is scored/binarized with respect to each sub-question. Clips with positive matches are retained, supporting explicit aggregation.
  4. Multimodal Augmented Generation: The final LVLM prompt incorporates the original question, filtered clip evidence, and an aggregated reasoning summary, enabling context-aware generation.

The second VGent system (Kang et al., 11 Dec 2025) targets visual grounding by explicitly separating reasoning (handled by a frozen, powerful MLLM) from low-level box selection (handled by a parallel, detector-based decoder):

  • The encoder (MLLM) receives vision and text features (ViT or CNN outputs linearly mapped, concatenated with text embeddings) and produces hidden states.
  • The decoder embeds detector box proposals, processes them via a stack of transformer layers with cross-attention to encoder activations, and scores each proposal using parallel feed-forward heads for binary relevance.

This structure avoids autoregressive decoding, supports fast inference, and allows upgrades to either encoder or decoder independently.
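
A minimal tensor-level sketch of this disentanglement is shown below, using randomly initialized stand-ins for the frozen MLLM encoder and the lightweight scoring decoder; the dimensions, layer counts, and 0.5 threshold are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

d_model, seq_len, num_proposals = 1024, 256, 100

# Stand-in for the frozen MLLM encoder: in VGent this is a pretrained model
# (e.g., Qwen2.5-VL) whose hidden states are consumed as-is and never updated.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2
)
for p in encoder.parameters():
    p.requires_grad_(False)

# Lightweight trainable decoder: proposal embeddings cross-attend to encoder
# states, then a feed-forward head scores every proposal in one parallel pass.
cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
score_head = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1))

tokens = torch.randn(1, seq_len, d_model)            # fused vision+text inputs (dummy)
proposals = torch.randn(1, num_proposals, d_model)   # embedded detector box proposals (dummy)

with torch.no_grad():
    hidden = encoder(tokens)                         # frozen reasoning features

attended, _ = cross_attn(query=proposals, key=hidden, value=hidden)
logits = score_head(attended).squeeze(-1)            # (1, num_proposals): one relevance logit per box
selected = (logits.sigmoid() > 0.5).nonzero()        # no autoregressive decoding loop
print(selected.shape)
```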

2. Structured and Graph-based Video Representation

VGent for long video QA introduces a reusable, query-independent graph for video representation:

  • Node construction: Each video is partitioned into $m$ clips $V_i$ (each with $K$ frames), forming nodes $\mathcal{V} = \{v_1, \ldots, v_m\}$.
  • Edge construction: Entities extracted from each clip are merged into global prototypes $\mathcal{U}$ based on cosine similarity (threshold $\tau = 0.7$). Edges connect any clips sharing a prototype entity, yielding adjacency $A_{ij} = 1$ if $\mathcal{U}(v_i) \cap \mathcal{U}(v_j) \ne \emptyset$.
  • Feature encoding: Each node is described by the average embedding of its textual entities, descriptions, and subtitles ($f_{v_i} \in \mathbb{R}^d$). Edges may be described by shared entity embeddings or omitted.
  • Similarity metric: Cosine similarity is used both for entity merging and query–entity matching.

This sparse, semantically-structured graph allows for scalable, context-preserving retrieval that generalizes across queries and videos.
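
As a concrete illustration, the following sketch merges per-clip entity mentions into prototypes and builds the adjacency matrix, assuming the entity lists have already been extracted by the LVLM; it uses the BAAI/bge-large-en-v1.5 embeddings mentioned above but simplifies prototype naming and omits descriptions and subtitles.

```python
import numpy as np
from itertools import combinations
from sentence_transformers import SentenceTransformer

# Per-clip entities, assumed already extracted by the LVLM from frames/subtitles.
clip_entities = {
    0: ["red car", "parking lot"],
    1: ["red sports car", "driver"],
    2: ["kitchen", "chef"],
}

embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")
TAU = 0.7  # entity-merging threshold from the paper

# Merge entity mentions into global prototypes by cosine similarity.
prototypes, proto_vecs = [], []                 # prototype name, prototype embedding
clip_to_protos = {c: set() for c in clip_entities}
for clip, entities in clip_entities.items():
    vecs = embedder.encode(entities, normalize_embeddings=True)
    for name, v in zip(entities, vecs):
        sims = np.array([v @ p for p in proto_vecs]) if proto_vecs else np.array([])
        if sims.size and sims.max() >= TAU:
            idx = int(sims.argmax())            # reuse the closest existing prototype
        else:
            idx = len(prototypes)               # register a new prototype
            prototypes.append(name)
            proto_vecs.append(v)
        clip_to_protos[clip].add(idx)

# Undirected edge between any two clips that share at least one prototype entity.
m = len(clip_entities)
A = np.zeros((m, m), dtype=int)
for i, j in combinations(clip_entities, 2):
    if clip_to_protos[i] & clip_to_protos[j]:
        A[i, j] = A[j, i] = 1
print(prototypes, A, sep="\n")
```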

3. Retrieval, Intermediate Reasoning, and Generation

Retrieval

The query $Q$ is processed by the LVLM to extract keywords $\mathcal{K}$. For each keyword $k$ and each global prototype entity $u \in \mathcal{U}$, the similarity $s(k, u)$ is computed; if $s(k, u) > \theta$ (with $\theta = 0.5$), all clips sharing $u$ are added to the candidate set $\mathcal{R}$. The retrieved clips are ranked by average keyword–annotation similarity, and the top $N = 20$ are selected.
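
A minimal sketch of this retrieval step, assuming precomputed unit-normalized embeddings and a toy prototype-to-clip index, is shown below; the data are synthetic and the index structure is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
unit = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)

# Assumed inputs: unit-normalized embeddings for the extracted keywords, the global
# prototype entities, and each clip's textual annotations, plus an index from
# prototype entity to the clips that mention it (all synthetic here).
keyword_vecs = unit(rng.normal(size=(3, 64)))
proto_vecs = unit(np.vstack([keyword_vecs + 0.1 * rng.normal(size=(3, 64)),
                             rng.normal(size=(7, 64))]))
clip_vecs = unit(rng.normal(size=(40, 64)))
proto_to_clips = {u: [(2 * u) % 40, (3 * u + 1) % 40] for u in range(10)}

THETA, TOP_N = 0.5, 20

# Step 1: a clip becomes a candidate if any of its prototype entities matches a keyword.
candidates = set()
for k in keyword_vecs:
    for u, p in enumerate(proto_vecs):
        if float(k @ p) > THETA:
            candidates.update(proto_to_clips[u])

# Step 2: rank candidates by average keyword-annotation similarity and keep the top N.
scores = {c: float(np.mean(keyword_vecs @ clip_vecs[c])) for c in candidates}
retrieved = sorted(scores, key=scores.get, reverse=True)[:TOP_N]
print(retrieved)
```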

Intermediate Reasoning

As LVLMs can be easily confounded by hard negatives, VGent prompts the LVLM to decompose $Q$ and $\mathcal{K}$ into subqueries $\mathcal{Q}$ (e.g., existence or count of an entity/action). For each candidate clip $v_i$, the LVLM answers each subquery, producing $f(v_i, q_j)$, which is binary or integer-valued. Clips are retained in the pruned set $\mathcal{R}'$ if any subquery is answered positively, and this set is truncated to $r = 5$.

The LVLM then generates a structured reasoning summary by aggregating all $(q_j, f(v_i, q_j))$ pairs across retained clips.
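
The pruning and aggregation logic can be sketched as follows, with the LVLM verifier abstracted as a callable and replaced here by a toy keyword matcher; the query decomposition itself is assumed to have already produced the subqueries.

```python
from typing import Callable, Sequence

def prune_and_summarize(
    clips: Sequence[str],
    subqueries: Sequence[str],
    ask_lvlm: Callable[[str, str], int],   # (clip, subquery) -> binary/integer answer
    r: int = 5,
) -> tuple[list[str], str]:
    """Keep clips with at least one positive subquery answer, truncate to r,
    and aggregate all (subquery, answer) pairs into a textual summary."""
    answers = {clip: {q: ask_lvlm(clip, q) for q in subqueries} for clip in clips}
    kept = [c for c in clips if any(v > 0 for v in answers[c].values())][:r]
    summary = "\n".join(
        f"{c}: " + ", ".join(f"{q} -> {answers[c][q]}" for q in subqueries) for c in kept
    )
    return kept, summary

# Toy stand-in for the LVLM verifier: answers 1 if the queried entity word appears in the clip text.
toy_lvlm = lambda clip, q: int(q.split()[-1].rstrip("?") in clip)
clips = ["a red car enters the lot", "a chef plates pasta", "the car leaves at night"]
kept, summary = prune_and_summarize(clips, ["is there a car?", "is there a chef?"], toy_lvlm)
print(kept, summary, sep="\n")
```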

Generation

The final answer is produced by prompting the LVLM with $Q$, the filtered evidence $\mathcal{R}'$, and the structured summary. Generation is modeled autoregressively as:

$$p(a \mid Q, \mathcal{R}', \text{summary}) = \prod_{t=1}^{T} p\big(a_t \mid a_{<t}, Q, \{\text{clip\_context}\}, \{\text{reasoning}\}\big)$$
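
For concreteness, a minimal sketch of how the final prompt could be assembled from these three ingredients is given below; the template wording is illustrative and not the paper's actual prompt.

```python
def build_generation_prompt(question: str, clip_evidence: list[str], reasoning_summary: str) -> str:
    """Assemble the final LVLM prompt from the question, the filtered clip
    evidence, and the aggregated reasoning summary (wording is illustrative)."""
    evidence = "\n".join(f"[Clip {i}] {c}" for i, c in enumerate(clip_evidence))
    return (
        "You are answering a question about a long video.\n"
        f"Question: {question}\n"
        f"Relevant clips:\n{evidence}\n"
        f"Structured reasoning summary:\n{reasoning_summary}\n"
        "Answer:"
    )

print(build_generation_prompt("How many times does the red car appear?",
                              ["a red car enters the lot", "the car leaves at night"],
                              "car present in clips 0 and 2; count = 2"))
```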

4. Modular Reasoning–Prediction Disentanglement for Visual Grounding

VGent for visual grounding (Kang et al., 11 Dec 2025) separates high-level reasoning (frozen MLLM encoder) from low-level prediction (detector-based parallel decoder). Key components include:

  • Encoder: A pretrained MLLM (e.g., Qwen2.5-VL-7B) optionally fine-tuned with policy-gradient RL for multi-target tasks (QuadThinker), then frozen.
  • Decoder: $N$ box queries (from an object detector such as UPN, GLEE, or SAM) are projected and cross-attended to encoder features across $L$ layers. Each box is scored for targetness using a binary classifier.
  • Loss Functions: Supervision is provided by binary cross-entropy against IoU-based positive/negative labels, and, when segmentation information is available, mask-aware labels (intersection-over-area with GT mask union).
  • Global Target Recognition: $M$ additional learnable queries capture global cues (total and positive target counts), propagating holistic information via decoder self-attention.
  • No Autoregression: All proposals are processed in parallel, enabling constant inference time regardless of the number of targets.
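
The decoder structure described above can be sketched as a small PyTorch module in which $N$ projected box proposals and $M$ learnable global queries share self-attention and cross-attend to the (frozen) encoder states, with separate heads for per-proposal relevance and count regression; the widths, layer counts, and head designs below are placeholders rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class GroundingDecoder(nn.Module):
    """Illustrative parallel decoder: N proposal queries plus M learnable global
    queries attend jointly; proposals receive binary relevance scores, global
    queries regress target counts. All sizes are placeholders."""

    def __init__(self, d_model: int = 256, n_layers: int = 3, n_global: int = 2):
        super().__init__()
        self.box_proj = nn.Linear(4, d_model)                   # embed (x1, y1, x2, y2) proposals
        self.global_queries = nn.Parameter(torch.randn(n_global, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.layers = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.relevance_head = nn.Linear(d_model, 1)             # per-proposal targetness logit
        self.count_head = nn.Linear(d_model, 1)                 # per-global-query count estimate

    def forward(self, boxes: torch.Tensor, enc_states: torch.Tensor):
        # boxes: (B, N, 4), enc_states: (B, T, d_model) from the frozen encoder
        B, N, _ = boxes.shape
        queries = torch.cat(
            [self.box_proj(boxes), self.global_queries.expand(B, -1, -1)], dim=1
        )
        h = self.layers(tgt=queries, memory=enc_states)         # self-attn mixes proposals and global cues
        logits = self.relevance_head(h[:, :N]).squeeze(-1)      # (B, N) parallel relevance scores
        counts = self.count_head(h[:, N:]).squeeze(-1)          # (B, M) e.g. total / positive counts
        return logits, counts

dec = GroundingDecoder()
logits, counts = dec(torch.rand(2, 100, 4), torch.randn(2, 300, 256))
print(logits.shape, counts.shape)   # torch.Size([2, 100]) torch.Size([2, 2])
```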

5. Specialized Training Paradigms and Labeling Strategies

QuadThinker (RL-based Encoder Tuning)

To enhance the encoder's multi-target reasoning before freezing, QuadThinker employs a customized reward structure:

  • Format rewards for proper answer structure, valid counts, and well-formed JSON output.
  • Accuracy rewards for correct quadrant/global counts and fine-grained match of predicted boxes to ground truth (using IoU, L1 distance, and center-point proximity).
  • The policy-gradient objective maximizes expected total reward over output sequences:

$$J(\theta) = \mathbb{E}_{a \sim \pi_\theta}\!\left[R_{\text{total}}(a)\right], \qquad \nabla_\theta J \approx \mathbb{E}_{a \sim \pi_\theta}\!\left[R_{\text{total}}(a)\, \nabla_\theta \log \pi_\theta(a \mid s)\right]$$
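
A toy REINFORCE-style sketch of this objective is shown below, with a small categorical policy standing in for the MLLM and a simplified composite reward (format + count + accuracy terms); it illustrates the shape of the update only and is not the authors' QuadThinker training code.

```python
import torch

# Toy stand-in for the policy pi_theta: logits over a small discrete "answer" space.
logits = torch.zeros(4, requires_grad=True)
optimizer = torch.optim.SGD([logits], lr=0.1)

def total_reward(action: int) -> float:
    # Simplified composite reward: formatting + count validity + accuracy terms.
    r_format = 1.0                                 # well-formed JSON / answer template
    r_count = 0.5 if action in (1, 2) else 0.0     # plausible quadrant/global counts
    r_acc = 2.0 if action == 2 else 0.0            # box match (IoU / L1 / center terms)
    return r_format + r_count + r_acc

for step in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()
    reward = total_reward(int(action))
    # REINFORCE estimator: grad J ~ R_total * grad log pi(a|s); minimize the negative.
    loss = -reward * dist.log_prob(action)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(torch.softmax(logits, dim=-1))   # the highest-reward action (2) should gain probability
```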

Mask-aware Label Assignment

To address detection–segmentation ambiguity:

  • Assign each proposal both a box-aware label $y_i^{\text{box}}$ (IoU threshold $\tau_{\text{box}} = 0.6$) and a mask-aware label $y_i^{\text{mask}}$ (IoA threshold $\tau_{\text{mask}} = 0.6$, computed from predicted/GT masks).
  • The decoder outputs dual scores, each supervised by a BCE loss; the total loss is $L_{\text{dec}} = L_{\text{box}} + L_{\text{mask}}$.
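
The dual label assignment and loss can be sketched as follows on toy boxes and masks, using an IoU threshold for the box-aware labels and intersection-over-area against the GT mask union for the mask-aware labels; the decoder scores here are random stand-ins.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import box_iou

TAU_BOX = TAU_MASK = 0.6

def mask_ioa(pred_masks: torch.Tensor, gt_union: torch.Tensor) -> torch.Tensor:
    """Intersection-over-area of each predicted mask w.r.t. the union of GT masks."""
    inter = (pred_masks & gt_union).flatten(1).sum(-1).float()
    area = pred_masks.flatten(1).sum(-1).clamp(min=1).float()
    return inter / area

# Toy inputs: 3 proposals, 1 GT box, binary masks on a small grid.
prop_boxes = torch.tensor([[0., 0., 10., 10.], [0., 0., 9., 9.], [20., 20., 30., 30.]])
gt_boxes = torch.tensor([[0., 0., 10., 10.]])
prop_masks = torch.zeros(3, 32, 32, dtype=torch.bool)
prop_masks[0, :10, :10] = prop_masks[1, :9, :9] = prop_masks[2, 20:30, 20:30] = True
gt_mask_union = torch.zeros(32, 32, dtype=torch.bool)
gt_mask_union[:10, :10] = True

# Box-aware and mask-aware binary labels.
y_box = (box_iou(prop_boxes, gt_boxes).max(dim=1).values >= TAU_BOX).float()
y_mask = (mask_ioa(prop_masks, gt_mask_union) >= TAU_MASK).float()

# Dual decoder scores (random stand-ins) supervised by two BCE losses: L_dec = L_box + L_mask.
box_logits, mask_logits = torch.randn(3), torch.randn(3)
loss_dec = (F.binary_cross_entropy_with_logits(box_logits, y_box)
            + F.binary_cross_entropy_with_logits(mask_logits, y_mask))
print(y_box, y_mask, loss_dec.item())
```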

Global Count Supervision

Learnable queries regress the total number of targets and the number of positive boxes, regularized by an $\ell_1$ loss, promoting holistic selection.

6. Empirical Performance and Computational Efficiency

  • On MLVU with base LVLMs, VGent yields +3.0%–5.4% accuracy improvements (e.g., LongVU +5.4%, LLaVA-Video +3.0%).
  • Outperforms prior state-of-the-art RAG methods (Video-RAG and others) by +8.6% on MLVU.
  • Offline graph construction: 20.13 seconds per video minute. Online retrieval+reasoning+generation: 3.93 seconds per video minute. Total system is 1.7× faster than Video-RAG for multi-question inference.
| Model (MLVU, no subtitles) | Base | +VGent | Δ |
|---|---|---|---|
| InternVL2.5 (2B) | 56.7 | 61.1 | +4.4 |
| Qwen2.5-VL (3B) | 66.2 | 70.4 | +4.2 |
| LongVU (7B) | 65.4 | 70.8 | +5.4 |
| Qwen2-VL (7B) | 65.7 | 70.3 | +4.6 |
| LLaVA-Video (7B) | 69.5 | 72.5 | +3.0 |
| Qwen2.5-VL (7B) | 68.8 | 72.1 | +3.3 |
  • ORES/MaskGroups-HQ: VGent achieves an F1 of 71.47% vs. 50.89% for RAS-13B (+20.58%), with gIoU/cIoU improvements of +8.22/+5.83 points.
  • RefCOCO/+/g: mean accuracy of 90.1% vs. 86.6% for Qwen2.5-VL-7B (+3.5 points).
  • Inference time is constant with respect to number of proposals; encoder+decoder: 0.696s per sample (8×A100), detector proposals: 0.213–1.154s, depending on the detector.

7. Limitations and Future Directions

Both versions of VGent retain some limitations:

  • The video graph construction is text-centric (entity and subtitle based), with no direct use of frame-level visual features. Integrating visual embeddings could refine the graph but increases computational cost (Shen et al., 15 Oct 2025).
  • Performance upper-bound is set by the underlying base LVLM or MLLM; as foundational models improve, VGent frameworks are expected to yield further gains.
  • In detection–segmentation tasks, mask-aware supervision helps but does not fully close the gap implied by full pixel-level reasoning.
  • VGent’s modularity enables independent upgrading of reasoning or prediction components, suggesting rapid adaptation to future advances in detectors or MLLMs (Kang et al., 11 Dec 2025).

Across these regimes, VGent demonstrates that injecting structure—via graphs for video retrieval/reasoning, and explicit reasoning–prediction separation for grounding—allows scalability, consistent efficiency, and accuracy improvements, substantially raising the bar in long video and multi-target visual language understanding.
