SenseNova-MARS: Agentic Vision-Language Framework
Updated 2 January 2026
- SenseNova-MARS is a multimodal agentic vision-language system enabling interleaved reasoning and dynamic tool use for complex, high-resolution tasks.
- It couples a frozen vision encoder with a reasoning core and a tool-invocation module, learning through reinforcement learning to sequentially invoke text search, image search, and image cropping.
- The framework utilizes BN-GSPO for optimized tool selection and demonstrates state-of-the-art performance on benchmarks like HR-MMSearch and V* Bench.
SenseNova-MARS is a multimodal agentic vision-language framework designed to seamlessly interleave high-level reasoning with dynamic tool use in vision-language models (VLMs). The system empowers a pretrained VLM to strategically coordinate image search, text search, and image cropping tools via reinforcement learning, targeting complex, knowledge-intensive, and visually demanding tasks that require coordinated tool use beyond text-oriented reasoning or isolated tool invocation (Chng et al., 30 Dec 2025).
## 1. System Architecture
SenseNova-MARS is constructed on top of a frozen vision encoder ("vision backbone") and a multimodal projector, which together produce joint image–text embeddings. This foundation supports a two-tiered system:
- Reasoning Core: An LLM (exemplar: Qwen3-VL-8B) autoregressively processes the cumulative interaction state, including prompts, model-generated reasoning steps, and tool outputs, to sequentially generate both natural-language reasoning and structured tool calls.
- Tool-Invocation Module: Provides a lightweight interface for invoking precisely one tool per reasoning turn, using a JSON-parameter schema for reproducibility and structured action tracing. The available toolset comprises:
- `text_search(query: string)` → top-5 summarized web snippets
- `image_search(image_index: int)` → set of related image thumbnails and captions
- `image_crop(bbox: [x1, y1, x2, y2], image_index: int)` → cropped image region

After every tool action, the resulting observation (textual or visual) is appended to the interaction history, incrementally informing subsequent reasoning and tool selection. The interaction protocol enforces a strict cycle: exactly one `<think>` → `<tool_call>` pair per non-final step, and a single reasoning step plus the answer in the terminal step.

## 2. Batch-Normalized Group Sequence Policy Optimization (BN-GSPO)

The RL optimization component of SenseNova-MARS is based on BN-GSPO, which extends Group Sequence Policy Optimization (GSPO) for improved stability and effectiveness on heterogeneous, multi-tool trajectories.

For each trajectory in a batch, BN-GSPO introduces two levels of advantage normalization:

1. Group (Prompt-Level) Normalization: For each prompt $x$ with $G$ sampled trajectories $\{y_i\}_{i=1}^{G}$ and rewards $r_i$, normalize the reward within the group:
   $$\hat{A}_i = \frac{r_i - \operatorname{mean}\left(\{r_j\}_{j=1}^{G}\right)}{\operatorname{std}\left(\{r_j\}_{j=1}^{G}\right)}$$
2. Batch-Wide Normalization: Normalize the group-normalized advantages across all trajectories in the batch $\mathcal{B}$:
   $$\tilde{A}_i = \frac{\hat{A}_i - \operatorname{mean}_{j \in \mathcal{B}}\big(\hat{A}_j\big)}{\operatorname{std}_{j \in \mathcal{B}}\big(\hat{A}_j\big)}$$

The length-normalized importance ratio adjusts for variable-length sequences:

$$s_i(\theta) = \left( \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i \mid x)} \right)^{1/|y_i|}$$

The clipped training objective, penalized by $\beta$ times the KL divergence from a reference policy $\pi_{\mathrm{ref}}$, is:

$$\mathcal{J}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \min\Big( s_i(\theta)\,\tilde{A}_i,\; \operatorname{clip}\big(s_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\,\tilde{A}_i \Big) \right] - \beta\, D_{\mathrm{KL}}\big( \pi_\theta \,\|\, \pi_{\mathrm{ref}} \big)$$

The composite sequence-level reward combines:

- $r_{\mathrm{acc}} = 1$ if the answer is correct, as judged by a GPT-4o LLM judge; 0 otherwise.
- $r_{\mathrm{fmt}} = 1$ if the format protocol (`<think>` → `<tool_call>`, JSON-conforming) is strictly obeyed; 0 otherwise.

## 3. Dynamic Tool Integration and Agentic Reasoning

At each reasoning step $t$, the agent processes the complete multimodal history and chooses an action from the discrete action space:

- `text_search`, `image_search`, `image_crop`, or produce the final answer

The enforced protocol requires each non-terminal step to output a reasoning thought and exactly one tool action, while the final step must generate a reasoning thought and produce the answer. Correct adherence is rewarded via $r_{\mathrm{fmt}}$. Reinforcement learning with BN-GSPO lets the agent learn effective policies for tool selection and invocation order, such as resolving pixel-level queries with `image_crop` or retrieving external information via the search tools. This interleaved tool-reasoning cycle is needed for knowledge-intensive and visually complex questions that cannot be solved by isolated tool calls or static reasoning alone.
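To make the interaction protocol concrete, the following is a minimal illustrative sketch, not the released implementation, of a one-tool-per-turn rollout loop around the reasoning core. The `policy` callable, the stub tool backends, and the `max_turns` budget are hypothetical placeholders; only the tag names, tool signatures, and JSON-parameter schema follow the protocol described above.

```python
import json
from typing import Callable, Optional

# Stub tool backends (hypothetical; the real system calls web text/image search
# services and crops the stored high-resolution images).
def text_search(query: str) -> str:
    return f"[top-5 summarized snippets for: {query}]"

def image_search(image_index: int) -> str:
    return f"[related thumbnails and captions for image {image_index}]"

def image_crop(bbox: list, image_index: int) -> str:
    x1, y1, x2, y2 = bbox
    return f"[crop of image {image_index} at ({x1}, {y1}, {x2}, {y2})]"

TOOLS: dict[str, Callable[..., str]] = {
    "text_search": text_search,
    "image_search": image_search,
    "image_crop": image_crop,
}

def agentic_rollout(policy: Callable, question: str, max_turns: int = 8) -> Optional[str]:
    """One `<think>` -> `<tool_call>` cycle per non-final step; the terminal step
    emits a reasoning thought plus the final answer.

    `policy` is a hypothetical wrapper around the VLM reasoning core: given the
    interaction history it returns (thought, tool_call_json_or_None, answer_or_None).
    """
    history = [("user", question)]
    for _ in range(max_turns):
        thought, tool_call_json, answer = policy(history)
        history.append(("assistant", thought))
        if answer is not None:                      # terminal step: reasoning + answer
            return answer
        # Non-final step: exactly one JSON-schema tool call, e.g.
        # {"name": "image_crop", "arguments": {"bbox": [10, 20, 300, 400], "image_index": 0}}
        call = json.loads(tool_call_json)
        observation = TOOLS[call["name"]](**call["arguments"])
        history.append(("tool", observation))       # observation informs the next step
    return None                                     # turn budget exhausted without an answer
```

Trajectories generated by such a loop are the sequences that BN-GSPO scores with $r_{\mathrm{acc}}$ and $r_{\mathrm{fmt}}$ and optimizes at the sequence level.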
## 4. HR-MMSearch Benchmark

HR-MMSearch is a high-resolution, knowledge-intensive benchmark created to evaluate search-driven multimodal reasoning agents. Core characteristics include:

- 305 images at native 4K resolution, covering 8 thematic domains (Sports, Culture, Science & Technology, Business & Finance, Games, Academic Research, Geography & Travel, Others).
- Each image is paired with a single manually crafted question that is search-driven, knowledge-intensive, and targets small or inconspicuous content (under 5% of the image area) that is unanswerable from the pixels alone.
- At least one tool invocation is required for every query; quality control and difficulty assessment are performed by master's-level professionals.
- Difficulty is split by agent pass rates: a "Hard" set (0/8 successes for Qwen2.5-VL-7B) and an "Easy" set (≥1/8 successes).
- Evaluation uses Agentic Pass@1 with zero-temperature decoding, with correctness judgements from GPT-4o.

## 5. Experimental Results and Comparative Performance

SenseNova-MARS-8B demonstrates state-of-the-art results across major multimodal and search-oriented benchmarks. The table below summarizes Pass@1 (%) on MMSearch and HR-MMSearch, comparing SenseNova-MARS-8B to relevant baseline and proprietary models:

| Model | MMSearch | HR-MMSearch |
|-------|----------|-------------|
| Qwen3-VL-8B (direct) | 11.7 | 12.1 |
| Qwen3-VL-8B (zero-shot agent) | 47.4 | 27.9 |
| MMSearch-R1 (open-source) | 53.8 | 20.3 |
| GPT-5 (proprietary agent) | 52.6 | 38.4 |
| Gemini-3-Flash (agent) | 62.6 | 41.6 |
| SenseNova-MARS-8B (agent) | 67.8 | 41.6 |

Across high-resolution perception tasks (Exact Match, %):

| Model | V* Bench | HR-4K | HR-8K | MME-RealWorld | Avg. |
|-------|----------|-------|-------|---------------|------|
| Mini o3 | 88.2 | 77.5 | 73.3 | 65.5 | 76.1 |
| DeepEyesV2 | 81.8 | 77.9 | 73.8 | 64.9 | 74.6 |
| SenseNova-MARS-8B | 92.2 | 83.1 | 78.4 | 67.9 | 80.4 |

These results reflect substantial gains over direct and zero-shot agentic VLM baselines: SenseNova-MARS-8B surpasses the strongest open-source agentic baseline (MMSearch-R1) by 14.0 points on MMSearch and exceeds GPT-5 on HR-MMSearch by roughly 3 points (41.6 vs. 38.4).

## 6. Ablation Studies, Limitations, and Prospective Directions

- BN-GSPO comparison: On the 7B model variant, BN-GSPO yields higher Pass@1 on MMSearch (56.7%) than GRPO (50.9%) and GSPO (53.8%), and raises V* Bench scores by 12–25 points.
- RL Data Mixing: A hybrid RL dataset (search-oriented plus perception data) surpasses specialized training; training exclusively on perception data is detrimental to MMSearch, whereas the hybrid mix provides +5.2 points.
- Tool-Use Patterns: The framework adapts tool selection to the task: cropping is used almost exclusively on V* Bench, search dominates on MMSearch, and a mixed pattern emerges on HR-MMSearch. Reinforcement learning further reduces average tool invocations per query from ~4 to ~2.
- Identified Limitations: Failure analysis identifies two leading sources of error:
  - Retrieval noise (misinterpretation of similar information, e.g., "born in" vs. "based in") causes hallucinations.
  - Tool parameter misconfiguration (imprecise cropping, overly generic queries) hinders fine-grained retrieval.
- Future Work: Directions include stronger alignment strategies to counter distractor information, an enriched tool suite (e.g., OCR, knowledge-base access), and self-supervised strategies to reduce dependence on costly human-verified data.

SenseNova-MARS substantiates the potential of end-to-end RL with BN-GSPO to produce multimodal agentic systems that emulate human-like proficiency in dynamic tool use and multimodal reasoning, establishing new standards on complex, high-resolution, search-driven visual tasks (Chng et al., 30 Dec 2025).