SenseNova-MARS: Agentic Vision-Language Framework
Updated 2 January 2026
- SenseNova-MARS is a multimodal agentic vision-language system enabling interleaved reasoning and dynamic tool use for complex, high-resolution tasks.
- It couples a frozen vision encoder with a reasoning core and a tool-invocation module, learning through reinforcement learning to sequentially invoke text search, image search, and image cropping.
- The framework utilizes BN-GSPO for optimized tool selection and demonstrates state-of-the-art performance on benchmarks like HR-MMSearch and V* Bench.
SenseNova-MARS is a multimodal agentic vision-language framework designed to seamlessly interleave high-level reasoning with dynamic tool use in vision-language models (VLMs). The system empowers a pretrained VLM to strategically coordinate image search, text search, and image cropping tools via reinforcement learning, targeting complex, knowledge-intensive, and visually demanding tasks that require coordinated tool use beyond text-oriented reasoning or isolated tool invocation (Chng et al., 30 Dec 2025).
## 1. System Architecture
SenseNova-MARS is constructed on top of a frozen vision encoder ("vision backbone") and a multimodal projector, which together produce joint image–text embeddings. This foundation supports a two-tiered system:
- Reasoning Core: An LLM (exemplar: Qwen3-VL-8B) autoregressively processes the cumulative interaction state, including prompts, model-generated reasoning steps, and tool outputs, to sequentially generate both natural-language reasoning and structured tool calls.
- Tool-Invocation Module: Provides a lightweight interface for invoking precisely one tool per reasoning turn, using a JSON-parameter schema for reproducibility and structured action tracing. The available toolset comprises:
- `text_search(query: string)` → top-5 summarized web snippets
- `image_search(image_index: int)` → set of related image thumbnails and captions
- `image_crop(bbox: [x1, y1, x2, y2], image_index: int)` → cropped image region

After every tool action, the resulting observation (textual or visual) is appended to the interaction history, incrementally informing subsequent reasoning and tool selection. The interaction protocol enforces a strict cycle: exactly one `<think>` → `<tool_call>` pair per non-final step, and a single reasoning step plus the answer in the terminal step.

## 2. Batch-Normalized Group Sequence Policy Optimization (BN-GSPO)

The RL optimization component of SenseNova-MARS is based on BN-GSPO, which extends Group Sequence Policy Optimization (GSPO) for improved stability and effectiveness on heterogeneous, multi-tool trajectories.

For each trajectory in a batch, BN-GSPO introduces two levels of advantage normalization:

1. Group (Prompt-Level) Normalization: For each prompt $x$ with $G$ sampled trajectories $\{y_i\}_{i=1}^{G}$ and rewards $r_i$, normalize the reward within the group:
   $$\hat{A}_i = \frac{r_i - \operatorname{mean}\left(\{r_j\}_{j=1}^{G}\right)}{\operatorname{std}\left(\{r_j\}_{j=1}^{G}\right)}$$
2. Batch-Wide Normalization: Normalize the group-normalized advantages across all trajectories in the batch $\mathcal{B}$:
   $$\tilde{A}_i = \frac{\hat{A}_i - \operatorname{mean}_{j \in \mathcal{B}}\big(\hat{A}_j\big)}{\operatorname{std}_{j \in \mathcal{B}}\big(\hat{A}_j\big)}$$

The length-normalized importance ratio adjusts for variable-length sequences:

$$s_i(\theta) = \left( \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i \mid x)} \right)^{1/|y_i|}$$

The clipped training objective, penalized by $\beta$ times the KL divergence from a reference policy $\pi_{\mathrm{ref}}$, is:

$$\mathcal{J}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \min\Big( s_i(\theta)\,\tilde{A}_i,\; \operatorname{clip}\big(s_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\,\tilde{A}_i \Big) \right] - \beta\, D_{\mathrm{KL}}\big( \pi_\theta \,\|\, \pi_{\mathrm{ref}} \big)$$

The composite sequence-level reward combines:

- $r_{\mathrm{acc}} = 1$ if the answer is correct, as judged by a GPT-4o LLM judge; 0 otherwise.
- $r_{\mathrm{fmt}} = 1$ if the format protocol (`<think>` → `<tool_call>`, JSON-conforming) is strictly obeyed; 0 otherwise.

## 3. Dynamic Tool Integration and Agentic Reasoning

At each reasoning step $t$, the agent processes the complete multimodal history and chooses an action from the discrete action space:

- `text_search`, `image_search`, `image_crop`, or produce the final answer

The enforced protocol requires each non-terminal step to output a reasoning thought and exactly one tool action, while the final step must generate a reasoning thought and produce the answer. Correct adherence is rewarded via $r_{\mathrm{fmt}}$. Reinforcement learning with BN-GSPO lets the agent learn effective policies for tool selection and invocation order, such as resolving pixel-level queries with `image_crop` or retrieving external information via the search tools. This interleaved tool-reasoning cycle is needed for knowledge-intensive and visually complex questions that cannot be solved by isolated tool calls or static reasoning alone.
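To make the interaction protocol concrete, the following is a minimal illustrative sketch, not the released implementation, of a one-tool-per-turn rollout loop around the reasoning core. The `policy` callable, the stub tool backends, and the `max_turns` budget are hypothetical placeholders; only the tag names, tool signatures, and JSON-parameter schema follow the protocol described above.

```python
import json
from typing import Callable, Optional

# Stub tool backends (hypothetical; the real system calls web text/image search
# services and crops the stored high-resolution images).
def text_search(query: str) -> str:
    return f"[top-5 summarized snippets for: {query}]"

def image_search(image_index: int) -> str:
    return f"[related thumbnails and captions for image {image_index}]"

def image_crop(bbox: list, image_index: int) -> str:
    x1, y1, x2, y2 = bbox
    return f"[crop of image {image_index} at ({x1}, {y1}, {x2}, {y2})]"

TOOLS: dict[str, Callable[..., str]] = {
    "text_search": text_search,
    "image_search": image_search,
    "image_crop": image_crop,
}

def agentic_rollout(policy: Callable, question: str, max_turns: int = 8) -> Optional[str]:
    """One `<think>` -> `<tool_call>` cycle per non-final step; the terminal step
    emits a reasoning thought plus the final answer.

    `policy` is a hypothetical wrapper around the VLM reasoning core: given the
    interaction history it returns (thought, tool_call_json_or_None, answer_or_None).
    """
    history = [("user", question)]
    for _ in range(max_turns):
        thought, tool_call_json, answer = policy(history)
        history.append(("assistant", thought))
        if answer is not None:                      # terminal step: reasoning + answer
            return answer
        # Non-final step: exactly one JSON-schema tool call, e.g.
        # {"name": "image_crop", "arguments": {"bbox": [10, 20, 300, 400], "image_index": 0}}
        call = json.loads(tool_call_json)
        observation = TOOLS[call["name"]](**call["arguments"])
        history.append(("tool", observation))       # observation informs the next step
    return None                                     # turn budget exhausted without an answer
```

Trajectories generated by such a loop are the sequences that BN-GSPO scores with $r_{\mathrm{acc}}$ and $r_{\mathrm{fmt}}$ and optimizes at the sequence level.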
## 4. HR-MMSearch Benchmark

HR-MMSearch is a high-resolution, knowledge-intensive benchmark created to evaluate search-driven multimodal reasoning agents. Core characteristics include:

- 305 images at native 4K resolution, covering 8 thematic domains (Sports, Culture, Science & Technology, Business & Finance, Games, Academic Research, Geography & Travel, Others).
- Each image is paired with a single manually crafted question that is search-driven, knowledge-intensive, and targets small or inconspicuous content (under 5% of the image area) that is unanswerable from the pixels alone.
- At least one tool invocation is required for every query; quality control and difficulty assessment are performed by master's-level professionals.
- Difficulty is split by agent pass rates: a "Hard" set (0/8 successes for Qwen2.5-VL-7B) and an "Easy" set (≥1/8 successes).
- Evaluation uses Agentic Pass@1 with zero-temperature decoding, with correctness judgements from GPT-4o.

## 5. Experimental Results and Comparative Performance

SenseNova-MARS-8B demonstrates state-of-the-art results across major multimodal and search-oriented benchmarks. The table below summarizes Pass@1 (%) on MMSearch and HR-MMSearch, comparing SenseNova-MARS-8B to relevant baseline and proprietary models:

| Model | MMSearch | HR-MMSearch |
|-------|----------|-------------|
| Qwen3-VL-8B (direct) | 11.7 | 12.1 |
| Qwen3-VL-8B (zero-shot agent) | 47.4 | 27.9 |
| MMSearch-R1 (open-source) | 53.8 | 20.3 |
| GPT-5 (proprietary agent) | 52.6 | 38.4 |
| Gemini-3-Flash (agent) | 62.6 | 41.6 |
| SenseNova-MARS-8B (agent) | 67.8 | 41.6 |

Across high-resolution perception tasks (Exact Match, %):

| Model | V* Bench | HR-4K | HR-8K | MME-RealWorld | Avg. |
|-------|----------|-------|-------|---------------|------|
| Mini o3 | 88.2 | 77.5 | 73.3 | 65.5 | 76.1 |
| DeepEyesV2 | 81.8 | 77.9 | 73.8 | 64.9 | 74.6 |
| SenseNova-MARS-8B | 92.2 | 83.1 | 78.4 | 67.9 | 80.4 |

These results reflect substantial gains over direct and zero-shot agentic VLM baselines: SenseNova-MARS-8B surpasses the strongest open-source agentic baseline (MMSearch-R1) by 14.0 points on MMSearch and exceeds GPT-5 on HR-MMSearch by roughly 3 points (41.6 vs. 38.4).

## 6. Ablation Studies, Limitations, and Prospective Directions

- BN-GSPO comparison: On the 7B model variant, BN-GSPO yields higher Pass@1 on MMSearch (56.7%) than GRPO (50.9%) and GSPO (53.8%), and raises V* Bench scores by 12–25 points.
- RL Data Mixing: A hybrid RL dataset (search-oriented plus perception data) surpasses specialized training; training exclusively on perception data is detrimental to MMSearch, whereas the hybrid mix provides +5.2 points.
- Tool-Use Patterns: The framework adapts tool selection to the task: cropping is used almost exclusively on V* Bench, search dominates on MMSearch, and a mixed pattern emerges on HR-MMSearch. Reinforcement learning further reduces average tool invocations per query from ~4 to ~2.
- Identified Limitations: Failure analysis identifies two leading sources of error:
  - Retrieval noise (misinterpretation of similar information, e.g., "born in" vs. "based in") causes hallucinations.
  - Tool parameter misconfiguration (imprecise cropping, overly generic queries) hinders fine-grained retrieval.
- Future Work: Directions include stronger alignment strategies to counter distractor information, an enriched tool suite (e.g., OCR, knowledge-base access), and self-supervised strategies to reduce dependence on costly human-verified data.

SenseNova-MARS substantiates the potential of end-to-end RL with BN-GSPO to produce multimodal agentic systems that emulate human-like proficiency in dynamic tool use and multimodal reasoning, establishing new standards on complex, high-resolution, search-driven visual tasks (Chng et al., 30 Dec 2025).