VGent: Modular Multimodal Video & Vision Framework
- VGent is a multimodal AI framework that employs modular architectures and graph-based retrieval to address long video question answering and visual grounding challenges.
- Its video QA version constructs offline entity graphs and leverages subquery decomposition to filter relevant clips, achieving up to +5.4% accuracy improvements.
- The visual grounding design separates high-level reasoning from low-level prediction using a frozen MLLM encoder and a detector-based decoder, enabling efficient inference.
VGent refers to two prominent frameworks in multimodal AI, both emphasizing modular architectures for complex video and vision-language tasks. One, “VGent: Graph-based Retrieval-Reasoning-Augmented Generation For Long Video Understanding,” introduces a graph-based method for scaling video question answering (QA) to long sequences. The other, “VGent: Visual Grounding via Modular Design for Disentangling Reasoning and Prediction,” presents a modular encoder–decoder paradigm to disentangle reasoning by Multimodal LLMs (MLLMs) from low-level prediction in visual grounding. Both leverage structured representations and decomposition of reasoning to enhance performance and efficiency over prior approaches (Shen et al., 15 Oct 2025, Kang et al., 11 Dec 2025).
1. Modular Architectures for Multimodal Video and Image Understanding
The first VGent system (Shen et al., 15 Oct 2025) addresses the problem of long-video QA, where context scaling, temporal dependencies, and retrieval noise limit Large Video LLM (LVLM) performance. The pipeline consists of four core modules:
- Offline Video Graph Construction: The input video is split into temporally indexed clips; each is passed (with subtitles if available) to an LVLM to extract entities and textual descriptions. These entities are merged into global prototypes, and a graph $G = (V, E)$ is constructed, where nodes are clips and undirected edges indicate shared prototype entities.
- Graph-based Retrieval: At query time, keywords extracted from the question are matched against the global prototype entities. Candidate clips are selected on the entity–text graph, scored and ranked by cosine similarity in BAAI/bge-large-en-v1.5 embedding space, preserving temporal and semantic coherence.
- Structured Reasoning Step: Retrieved clips are further filtered using an intermediate LVLM-backed verification: the query is decomposed into sub-questions (e.g., presence checks or counts), and each clip is scored/binarized with respect to each sub-question. Clips with positive matches are retained, supporting explicit aggregation.
- Multimodal Augmented Generation: The final LVLM prompt incorporates the original question, filtered clip evidence, and an aggregated reasoning summary, enabling context-aware generation.
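The composition of these four stages can be summarized in a short sketch. Everything below is illustrative: the stage functions are hypothetical placeholders standing in for the LVLM-backed modules described above, not the released implementation.

```python
# Illustrative composition of the four VGent stages for long-video QA.
# The stage callables are hypothetical stand-ins for the LVLM-backed modules.

from typing import Callable, List, Tuple


def answer_long_video_question(
    clips: List[str],                       # per-clip textual descriptions (entities, captions, subtitles)
    question: str,
    build_graph: Callable[[List[str]], dict],
    retrieve: Callable[[dict, str], List[int]],
    verify: Callable[[List[int], str], Tuple[List[int], str]],
    generate: Callable[[str, List[int], str], str],
) -> str:
    graph = build_graph(clips)                     # 1. offline, query-independent entity graph
    candidates = retrieve(graph, question)         # 2. graph-based retrieval of candidate clips
    kept, summary = verify(candidates, question)   # 3. subquery verification + reasoning summary
    return generate(question, kept, summary)       # 4. augmented generation with filtered evidence
```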
The second VGent system (Kang et al., 11 Dec 2025) targets visual grounding by explicitly separating reasoning (handled by a frozen, powerful MLLM) from low-level box selection (handled by a parallel, detector-based decoder):
- The encoder (MLLM) receives vision and text features (ViT or CNN outputs linearly mapped, concatenated with text embeddings) and produces hidden states.
- The decoder embeds detector box proposals, processes them via a stack of transformer layers with cross-attention to encoder activations, and scores each proposal using parallel feed-forward heads for binary relevance.
This structure avoids autoregressive decoding, supports fast inference, and allows upgrades to either encoder or decoder independently.
2. Structured and Graph-based Video Representation
VGent for long video QA introduces a reusable, query-independent graph for video representation:
- Node construction: Each video is partitioned into $N$ temporally ordered clips $\{c_1, \dots, c_N\}$ (each spanning a fixed number of frames), which form the graph nodes.
- Edge construction: Entities extracted from each clip are merged into global prototypes based on cosine similarity above a merging threshold. Edges connect any clips sharing a prototype entity, yielding adjacency $A_{ij} = 1$ if clips $c_i$ and $c_j$ mention at least one common prototype.
- Feature encoding: Each node is described by the average embedding of its textual entities, descriptions, and subtitles. Edges may be described by shared entity embeddings or omitted.
- Similarity metric: Cosine similarity is used both for entity merging and query–entity matching.
This sparse, semantically-structured graph allows for scalable, context-preserving retrieval that generalizes across queries and videos.
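A minimal sketch of this construction under the assumptions above: entities are merged into prototypes by cosine similarity, and clips sharing a prototype are connected. The embedding source and the threshold value are placeholders, not the paper's exact settings.

```python
# Toy entity-graph construction: nodes are clips, edges link clips sharing a prototype entity.
# Entity embeddings are assumed precomputed (e.g., with a sentence encoder such as bge-large-en-v1.5).

import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def build_entity_graph(clip_entities, entity_embs, merge_thresh=0.8):
    """clip_entities: per-clip list of entity names; entity_embs: name -> embedding vector."""
    prototypes = []                      # list of (canonical name, embedding)
    clip_to_protos = []                  # per clip, the set of prototype indices it mentions

    for entities in clip_entities:
        protos = set()
        for name in entities:
            emb = entity_embs[name]
            # Merge into an existing prototype if similar enough, otherwise create a new one.
            match = next((i for i, (_, p) in enumerate(prototypes) if cosine(emb, p) >= merge_thresh), None)
            if match is None:
                prototypes.append((name, emb))
                match = len(prototypes) - 1
            protos.add(match)
        clip_to_protos.append(protos)

    n = len(clip_entities)
    adjacency = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1, n):
            if clip_to_protos[i] & clip_to_protos[j]:   # shared prototype entity -> edge
                adjacency[i, j] = adjacency[j, i] = True
    return prototypes, clip_to_protos, adjacency
```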
3. Retrieval, Intermediate Reasoning, and Generation
Retrieval
The query $Q$ is processed by the LVLM to extract keywords $\{k_1, \dots, k_m\}$. For each keyword $k_j$ and each global prototype entity $e_p$, the cosine similarity $\mathrm{sim}(k_j, e_p)$ is computed in the shared embedding space; if it exceeds a retrieval threshold, all clips sharing $e_p$ are added to the candidate set. The retrieved clips are ranked by average keyword–annotation similarity, and the top-$K$ are selected.
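A rough illustration of this retrieval scoring, assuming keyword, prototype, and clip-annotation embeddings come from the same encoder; the threshold and top-K values are placeholders.

```python
# Keyword-to-prototype matching and clip ranking (illustrative values, not the paper's).

import numpy as np


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def retrieve_clips(keyword_embs, prototype_embs, clip_to_protos, clip_embs, sim_thresh=0.6, top_k=10):
    # Collect all clips that share any prototype matched by any keyword.
    candidates = set()
    for k in keyword_embs:
        for p_idx, p in enumerate(prototype_embs):
            if cosine(k, p) >= sim_thresh:
                candidates.update(c for c, protos in enumerate(clip_to_protos) if p_idx in protos)

    # Rank candidates by average keyword-to-clip-annotation similarity and keep the top K.
    def score(c):
        return np.mean([cosine(k, clip_embs[c]) for k in keyword_embs])

    return sorted(candidates, key=score, reverse=True)[:top_k]
```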
Intermediate Reasoning
As LVLMs can be easily confounded by hard negatives, VGent prompts the LVLM to decompose the query into subqueries (e.g., existence or count of an entity/action). For each candidate clip, the LVLM answers each subquery, producing a binary or integer-valued response. Clips are retained in the pruned set if any subquery is answered positively, and this set is truncated to a fixed clip budget.
The LVLM then generates a structured reasoning summary by aggregating all subquery–answer pairs across retained clips.
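A schematic of this verification step, with the LVLM abstracted as a callable that answers a subquery about a single clip; the retention rule (keep a clip if any subquery is answered positively) follows the description above, and the clip budget is a placeholder.

```python
# Subquery-based clip verification (schematic). `ask` stands in for an LVLM call that
# returns a boolean or non-negative integer answer for a (clip, subquery) pair.

from typing import Callable, List, Sequence, Tuple


def verify_clips(
    candidates: Sequence[int],
    subqueries: Sequence[str],
    ask: Callable[[int, str], int],
    max_clips: int = 8,
) -> Tuple[List[int], List[tuple]]:
    kept, records = [], []
    for clip in candidates:
        answers = [int(ask(clip, sq)) for sq in subqueries]
        if any(a > 0 for a in answers):          # retain clips with at least one positive match
            kept.append(clip)
            records.append((clip, dict(zip(subqueries, answers))))
    return kept[:max_clips], records[:max_clips]  # truncate the pruned set to a fixed budget
```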
Generation
The final answer is produced by prompting the LVLM with the question $Q$, the filtered evidence clips, and the structured reasoning summary $R$; the generation step models $\hat{a} = \mathrm{LVLM}(Q, \mathcal{C}_{\text{pruned}}, R)$, where $\mathcal{C}_{\text{pruned}}$ is the pruned clip set.
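The prompt assembly itself is conceptually simple; a hypothetical sketch of how the question, filtered evidence, and reasoning summary might be packed into a single LVLM prompt (format and wording are assumed, not taken from the paper):

```python
# Hypothetical prompt assembly for the augmented-generation step.

def build_generation_prompt(question, clip_descriptions, reasoning_summary):
    # clip_descriptions: iterable of (clip_index, textual description) pairs
    evidence = "\n".join(f"[Clip {i}] {d}" for i, d in clip_descriptions)
    return (
        "You are given evidence clips from a long video and a reasoning summary.\n"
        f"Evidence:\n{evidence}\n\n"
        f"Reasoning summary:\n{reasoning_summary}\n\n"
        f"Question: {question}\nAnswer:"
    )
```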
4. Modular Reasoning–Prediction Disentanglement for Visual Grounding
VGent for visual grounding (Kang et al., 11 Dec 2025) separates high-level reasoning (frozen MLLM encoder) from low-level prediction (detector-based parallel decoder). Key components include:
- Encoder: A pretrained MLLM (e.g., Qwen2.5-VL-7B) optionally fine-tuned with policy-gradient RL for multi-target tasks (QuadThinker), then frozen.
- Decoder: N box queries (from an object detector such as UPN, GLEE, or SAM) are projected and cross-attended to encoder features across layers. Each box is scored for targetness using a binary classifier.
- Loss Functions: Supervision is provided by binary cross-entropy against IoU-based positive/negative labels, and, when segmentation information is available, mask-aware labels (intersection-over-area with GT mask union).
- Global Target Recognition: Additional learnable queries aid with global cues (total and positive counts), propagating holistic information via decoder self-attention.
- No Autoregression: All proposals are processed in parallel, enabling constant inference time regardless of the number of targets.
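A compact PyTorch sketch of the decoder side under the assumptions above: box proposals are embedded, cross-attend to frozen encoder hidden states, and are scored in parallel by a binary targetness head. Dimensions, layer counts, and the box-embedding scheme are illustrative, not the paper's exact configuration.

```python
# Illustrative parallel grounding decoder: proposals cross-attend to frozen MLLM
# hidden states and are scored jointly by a binary "targetness" head (no autoregression).

import torch
import torch.nn as nn


class GroundingDecoder(nn.Module):
    def __init__(self, d_model=1024, n_layers=3, n_heads=8):
        super().__init__()
        self.box_embed = nn.Linear(4, d_model)          # (x1, y1, x2, y2) -> query embedding
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.score_head = nn.Linear(d_model, 1)         # binary relevance per proposal

    def forward(self, proposals, encoder_states):
        # proposals: (B, N, 4) detector boxes; encoder_states: (B, L, d_model) frozen MLLM features
        queries = self.box_embed(proposals)
        hidden = self.decoder(tgt=queries, memory=encoder_states)   # self- and cross-attention
        return self.score_head(hidden).squeeze(-1)                  # (B, N) targetness logits


# Tiny smoke test with random tensors.
dec = GroundingDecoder()
logits = dec(torch.rand(2, 16, 4), torch.randn(2, 77, 1024))
print(logits.shape)  # torch.Size([2, 16])
```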
5. Specialized Training Paradigms and Labeling Strategies
QuadThinker (RL-based Encoder Tuning)
To enhance the encoder's multi-target reasoning before freezing, QuadThinker employs a customized reward structure:
- Format rewards for proper answer structure, valid counts, and well-formed JSON output.
- Accuracy rewards for correct quadrant/global counts and fine-grained match of predicted boxes to ground truth (using IoU, L1 distance, and center-point proximity).
- The policy-gradient objective maximizes the expected total reward over output sequences, $J(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[R(y)\big]$.
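A toy sketch of such a composite reward, with illustrative weights and a simplified box-matching term (IoU coverage only); the exact reward decomposition, matching criteria, and coefficients are not reproduced here.

```python
# Toy composite reward for RL-based encoder tuning: format + count + box-matching terms.
# Weights and matching criteria are illustrative, not the paper's exact reward design.

import json


def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def reward(output_text, gt_boxes, gt_count, w_fmt=0.2, w_cnt=0.3, w_box=0.5):
    # Format reward: the output must be valid JSON with "count" and "boxes" fields.
    try:
        pred = json.loads(output_text)
        boxes, count = pred["boxes"], pred["count"]
    except (ValueError, KeyError, TypeError):
        return 0.0
    r_cnt = 1.0 if count == gt_count else 0.0
    # Simplified matching: count GT boxes covered by at least one prediction with IoU >= 0.5.
    matched = sum(1 for g in gt_boxes if any(iou(g, b) >= 0.5 for b in boxes))
    r_box = matched / max(len(gt_boxes), 1)
    return w_fmt * 1.0 + w_cnt * r_cnt + w_box * r_box
```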
Mask-aware Label Assignment
To address detection–segmentation ambiguity:
- Assign both a box-aware label (positive when the proposal's IoU with a ground-truth box exceeds a threshold) and a mask-aware label (positive when the intersection-over-area, computed from predicted/GT masks, exceeds a threshold).
- The decoder outputs dual scores, each supervised by a BCE loss; the total loss is the (weighted) sum of the two terms.
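A simplified sketch of this dual assignment, scoring each proposal box by IoU against GT boxes and by intersection-over-area against the union of GT masks (a simplification of the mask-aware rule above); thresholds and the loss weight are placeholders.

```python
# Dual label assignment: box-aware (IoU vs. GT boxes) and mask-aware (fraction of the proposal
# covered by the union of GT masks). Thresholds and weights are illustrative.

import torch
import torch.nn.functional as F


def assign_labels(proposals, gt_boxes, gt_mask_union, iou_thresh=0.5, ioa_thresh=0.5):
    # proposals: (N, 4); gt_boxes: (M, 4); gt_mask_union: (H, W) boolean union of GT masks
    x1 = torch.max(proposals[:, None, 0], gt_boxes[None, :, 0])
    y1 = torch.max(proposals[:, None, 1], gt_boxes[None, :, 1])
    x2 = torch.min(proposals[:, None, 2], gt_boxes[None, :, 2])
    y2 = torch.min(proposals[:, None, 3], gt_boxes[None, :, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (proposals[:, 2] - proposals[:, 0]) * (proposals[:, 3] - proposals[:, 1])
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    iou = inter / (area_p[:, None] + area_g[None, :] - inter + 1e-6)
    box_labels = (iou.max(dim=1).values >= iou_thresh).float()

    # Mask-aware label: fraction of each proposal's area covered by the GT mask union.
    ioa = torch.stack([
        gt_mask_union[int(b[1]):int(b[3]), int(b[0]):int(b[2])].float().mean()
        if int(b[3]) > int(b[1]) and int(b[2]) > int(b[0]) else torch.tensor(0.0)
        for b in proposals
    ])
    mask_labels = (ioa >= ioa_thresh).float()
    return box_labels, mask_labels


def dual_bce_loss(box_logits, mask_logits, box_labels, mask_labels, lam=1.0):
    return (F.binary_cross_entropy_with_logits(box_logits, box_labels)
            + lam * F.binary_cross_entropy_with_logits(mask_logits, mask_labels))
```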
Global Count Supervision
Learnable queries regress the total number of targets and the number of positive boxes, supervised by a regression loss, promoting holistic selection.
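As a minimal illustration, count regression from the global queries could be supervised as follows; the L1 formulation is an assumption for illustration, not the paper's stated loss.

```python
# Illustrative count-regression supervision for learnable global queries (loss choice assumed).

import torch.nn.functional as F


def count_loss(pred_total, pred_positive, gt_total, gt_positive):
    # pred_*: (B,) counts regressed from the global queries; gt_*: (B,) ground-truth counts
    return F.l1_loss(pred_total, gt_total) + F.l1_loss(pred_positive, gt_positive)
```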
6. Empirical Performance and Computational Efficiency
Long Video QA (VGent, (Shen et al., 15 Oct 2025))
- On MLVU with base LVLMs, VGent yields +3.0%–5.4% accuracy improvements (e.g., LongVU +5.4%, LLaVA-Video +3.0%).
- Outperforms prior state-of-the-art RAG methods (Video-RAG and others) by +8.6% on MLVU.
- Offline graph construction: 20.13 seconds per video minute. Online retrieval+reasoning+generation: 3.93 seconds per video minute. Total system is 1.7× faster than Video-RAG for multi-question inference.
| Model (MLVU, no subtitles) | Base | +VGent | Δ |
|---|---|---|---|
| InternVL2.5 (2B) | 56.7 | 61.1 | +4.4 |
| Qwen2.5-VL (3B) | 66.2 | 70.4 | +4.2 |
| LongVU (7B) | 65.4 | 70.8 | +5.4 |
| Qwen2-VL (7B) | 65.7 | 70.3 | +4.6 |
| LLaVA-Video (7B) | 69.5 | 72.5 | +3.0 |
| Qwen2.5-VL (7B) | 68.8 | 72.1 | +3.3 |
Visual Grounding (VGent, (Kang et al., 11 Dec 2025))
- ORES/MaskGroups-HQ: VGent achieves F1 71.47% vs. RAS 50.89% (+20.58%); gIoU/cIoU improvements of +8.22/+5.83%.
- RefCOCO/+g: mean accuracy 90.1% vs. Qwen2.5-VL-7B’s 86.6% (+3.5%).
- Inference time is constant with respect to number of proposals; encoder+decoder: 0.696s per sample (8×A100), detector proposals: 0.213–1.154s, depending on the detector.
7. Limitations and Future Directions
Both versions of VGent retain some limitations:
- The video graph construction is text-centric (entity and subtitle based), with no direct use of frame-level visual features. Integrating visual embeddings could refine the graph but increases computational cost (Shen et al., 15 Oct 2025).
- Performance upper-bound is set by the underlying base LVLM or MLLM; as foundational models improve, VGent frameworks are expected to yield further gains.
- In detection–segmentation tasks, mask-aware supervision helps but does not fully close the gap implied by full pixel-level reasoning.
- VGent’s modularity enables independent upgrading of reasoning or prediction components, suggesting rapid adaptation to future advances in detectors or MLLMs (Kang et al., 11 Dec 2025).
Across these regimes, VGent demonstrates that injecting structure—via graphs for video retrieval/reasoning, and explicit reasoning–prediction separation for grounding—allows scalability, consistent efficiency, and accuracy improvements, substantially raising the bar in long video and multi-target visual language understanding.