
Seg-ReSearch: Agentic Segmentation Framework

Updated 5 February 2026
  • Seg-ReSearch is a novel agentic segmentation paradigm that interleaves multimodal language model reasoning with external search to overcome static knowledge limitations.
  • It decomposes user instructions into multi-hop queries and integrates live external data to generate precise segmentation masks for up-to-date or domain-specific challenges.
  • The framework employs a hierarchical reward mechanism in reinforcement learning to balance process guidance with segmentation accuracy, validated on benchmarks like OK-VOS.

Seg-ReSearch defines a novel agentic paradigm for segmentation in computer vision by interleaving multimodal LLM (MLLM) reasoning with external search, directly addressing the knowledge bottleneck faced by previous approaches that relied solely on the frozen internal knowledge of their backbone LLMs. This framework enables segmentation systems to handle dynamic, open-world queries—including those involving up-to-date events or domain-specific knowledge—by decomposing user instructions, querying external sources, integrating retrieved evidence, and ultimately generating segmentation masks. The framework introduces a hierarchical reward mechanism to reconcile sparse outcome signals with flexible, process-level reinforcement, and is evaluated on the newly constructed OK-VOS benchmark as well as existing segmentation datasets (Liang et al., 4 Feb 2026).

1. Motivation and Challenge of Knowledge Bottlenecks

Recent MLLM-based segmentation architectures, such as LISA and VideoSeg-R1, have advanced multi-modal reasoning capabilities for instruction-based segmentation. However, these systems are fundamentally limited by their static knowledge bases, and fail when queries reference facts or objects absent from their training data—for example, contemporary events or niche product launches. Real-world image- and video-based queries frequently require access to dynamic, external knowledge, such as segmenting newly released objects or actors who received specific accolades in recent timeframes. Answering such queries requires not only decomposing multi-hop instructions but also issuing live searches and integrating the retrieved evidence into the segmentation workflow, which prior state-of-the-art methods cannot accomplish.

2. System Architecture and Workflow

Seg-ReSearch implements an interleaved process between policy-driven reasoning and external search, optimizing segmentation for open-world, dynamically evolving queries.

2.1 System Components

  • Segmentation Backbone (“Mask Generator”): A frozen, high-accuracy segmentation model (e.g., SAM2), accepting bounding-box and point prompts to generate final masks.
  • Policy MLLM (Reasoning Module): A trainable multimodal LLM that parses user queries and visual input, generates multi-turn chain-of-thought (MCoT) reasoning, and dynamically decides when and how to interact with external sources.
  • External Search Interface: Text and image search APIs returning the top-k relevant passages or images, providing the necessary external evidence for multi-hop or knowledge-intensive queries.
  • Reward Manager: Orchestrates a hierarchical reward structure for reinforcement learning.
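The division of labor among these components can be made concrete as a set of interfaces. The sketch below is illustrative only—class and method names (`MaskPrompts`, `segment`, `step`, `query`) are assumptions, not the paper's actual API:

```python
from dataclasses import dataclass
from typing import Any, Protocol


@dataclass
class MaskPrompts:
    """Prompts handed to the frozen mask generator (hypothetical schema)."""
    box: tuple[float, float, float, float]   # (x1, y1, x2, y2) in pixels
    points: list[tuple[float, float]]        # positive click points


class MaskGenerator(Protocol):
    """Frozen segmentation backbone (e.g., SAM2); excluded from training."""
    def segment(self, frame: Any, prompts: MaskPrompts) -> Any: ...


class PolicyMLLM(Protocol):
    """Trainable reasoning module: consumes context, emits the next action."""
    def step(self, context: list) -> dict: ...


class SearchInterface(Protocol):
    """Text/image search returning the top-k passages or images."""
    def query(self, q: str, k: int = 5) -> list: ...
```

Keeping the mask generator behind a narrow prompt interface is what lets the policy MLLM be trained while the segmentation backbone stays frozen.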

2.2 Stepwise Interaction

At each interaction stage:

  1. Inputs consist of $N$ low-resolution frames $\{f_1, \dots, f_N\}$ and a natural-language query.
  2. For timesteps $t = 1, \dots, T$:
    • The policy MLLM analyzes the available context (past information blocks and frames).
    • When needed, it emits ⟨search⟩ blocks containing text or image search queries.
    • Results are returned and appended to the context as ⟨information⟩ blocks.
    • If external information is unnecessary, the model performs visual-only reasoning.
  3. Upon sufficient confidence, or after $T$ steps, a keyframe index $k$ is output via ⟨keyframe⟩.
  4. The chosen high-resolution frame $f_k$ is appended to the context, and the model outputs an ⟨answer⟩ JSON containing 2D bounding-box and point prompts.
  5. These prompts drive the frozen backbone to produce the final segmentation mask.

This interleaved design enables the system to flexibly handle multi-hop, temporally sensitive, or domain-specific segmentation challenges.
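The stepwise interaction above can be sketched as a single episode loop. This is a minimal reconstruction under stated assumptions: the policy is modeled as a callable returning simple action dictionaries (the actual system emits ⟨search⟩/⟨keyframe⟩/⟨answer⟩ blocks as text), and all component names here are hypothetical stand-ins:

```python
def seg_research_episode(frames, query, policy_step, run_search,
                         mask_from_prompts, max_steps=4):
    """One Seg-ReSearch episode: interleave policy reasoning with external
    search, select a keyframe, then prompt the frozen mask generator.
    All callables are hypothetical stand-ins for the paper's components."""
    context = [("query", query), ("frames", frames)]
    for _ in range(max_steps):
        action = policy_step(context)          # next action from the policy MLLM
        if action["type"] == "search":
            # <search> block -> external APIs -> <information> block in context
            context.append(("information", run_search(action["queries"])))
        elif action["type"] == "keyframe":
            k = action["index"]                # <keyframe> selection
            context.append(("keyframe", frames[k]))  # high-res frame appended
            prompts = policy_step(context)     # final <answer>: box + points
            return mask_from_prompts(frames[k], prompts["box"], prompts["points"])
        # otherwise: visual-only reasoning turn; continue the loop
    raise RuntimeError("no keyframe selected within the step budget")
```

The budget `max_steps` plays the role of $T$: the loop terminates either when the policy commits to a keyframe or when the step budget is exhausted.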

3. Hierarchical Reward Mechanism and Formal Definitions

Seg-ReSearch is trained via reinforcement learning, employing a hierarchical reward $R$ that combines initial, process, and outcome-based signals:

$$R = \alpha \cdot (R_{\mathrm{IGR}} + R_{\mathrm{TPR}}) + R_{\mathrm{OR}}$$

where $\alpha \in [0,1]$ balances process and outcome signals.

  • Initial Guidance Reward ($R_{\mathrm{IGR}}$): Binary signal for a “warm-start” search:

$$R_{\mathrm{IGR}} = \mathbb{1}\{ \max_{s \in S} \mathrm{Sim}(a_0, s) > 0.5 \}$$

where $a_0$ is the first policy search query, $S$ the set of expert queries, and $\mathrm{Sim}$ a semantic similarity function.

  • Tapering Process Reward ($R_{\mathrm{TPR}}$): Saturating bonus that encourages valid exploration while discouraging unnecessary actions:

$$R_{\mathrm{TPR}} = 1 - (1 - p)^{\min(k, M)}$$

where $p \in [0,1]$ is the base per-action reward, $k$ the number of valid actions, and $M$ the bonus cap. Additional actions yield diminishing returns as $R_{\mathrm{TPR}} \rightarrow 1$.

  • Outcome Reward ($R_{\mathrm{OR}}$): Composite of downstream segmentation accuracy and keyframe selection:

$$R_{\mathrm{OR}} = R_{\mathrm{iou}} + R_{\mathrm{l1}} + R_{\mathrm{point}} + R_{\mathrm{frame}}$$

where $R_{\mathrm{iou}} = 1$ if IoU $> 0.5$, $R_{\mathrm{l1}} = 1$ if the $L_1$ distance is $< 10$ px, $R_{\mathrm{point}} = 1$ if the predicted point lies inside the box with Euclidean distance $< 100$ px, and $R_{\mathrm{frame}}$ scales with the largest
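The reward components above can be sketched in a few lines. This is a minimal sketch under stated assumptions: the default values of $p$, $M$, and $\alpha$ below are illustrative (the source excerpt does not give them), the semantic similarity function $\mathrm{Sim}$ is passed in as a callable since its implementation is unspecified, and $R_{\mathrm{frame}}$ is taken as an input because its exact definition is truncated in the source:

```python
def initial_guidance_reward(a0, expert_queries, sim, tau=0.5):
    """R_IGR: 1 if the first search query a0 is semantically close
    (similarity > tau) to at least one expert query in S."""
    return 1.0 if max(sim(a0, s) for s in expert_queries) > tau else 0.0


def tapering_process_reward(k, p=0.3, M=5):
    """R_TPR = 1 - (1 - p)^min(k, M): each valid action adds a bonus with
    diminishing returns, capped after M actions."""
    return 1.0 - (1.0 - p) ** min(k, M)


def outcome_reward(iou, l1_px, point_in_box, point_dist_px, r_frame):
    """R_OR: thresholded box/point accuracy plus the keyframe term.
    r_frame is supplied by the caller (its definition is truncated above)."""
    r_iou = 1.0 if iou > 0.5 else 0.0
    r_l1 = 1.0 if l1_px < 10 else 0.0
    r_point = 1.0 if (point_in_box and point_dist_px < 100) else 0.0
    return r_iou + r_l1 + r_point + r_frame


def hierarchical_reward(r_igr, r_tpr, r_or, alpha=0.5):
    """Top-level combination: R = alpha * (R_IGR + R_TPR) + R_OR."""
    return alpha * (r_igr + r_tpr) + r_or
```

Note the structure this encodes: the process terms ($R_{\mathrm{IGR}}$, $R_{\mathrm{TPR}}$) are bounded and discounted by $\alpha$, so the dense process guidance can never dominate the sparse but decisive outcome signal.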
