Set-of-Mark (SoM): Visual Grounding Method

Updated 7 April 2026

Set-of-Mark (SoM) is a visual prompting method that overlays distinct, speakable marks on segmented image regions to enable precise visual grounding.
The approach preprocesses images using interactive segmentation tools and an optimized mark allocation algorithm without requiring model retraining.
Empirical results show that SoM significantly boosts zero-shot performance on tasks like referring expression comprehension and segmentation, rivaling specialist models.

Set-of-Mark (SoM) is a visual prompting method designed to enable precise visual grounding in large multimodal models (LMMs), notably GPT-4V. SoM operates by partitioning an input image into semantically meaningful regions at arbitrary granularity via interactive segmentation models and overlaying these regions with distinct, speakable marks such as alphanumeric labels, boxes, or mask boundaries. The marked image is then provided to the multimodal model for downstream tasks requiring fine-grained or unambiguous visual reference, including referring expression comprehension, segmentation, and phrase grounding. SoM does not require any model retraining or architecture modifications, leveraging existing latent capabilities within GPT-4V and similar systems to reference marked regions directly in their textual outputs (Yang et al., 2023).

1. Mathematical Formulation and Algorithmic Workflow

Let $I \in \mathbb{R}^{H \times W \times 3}$ denote the original RGB image, $T^i = [t_1^i, \dots, t_{l_i}^i]$ the input text query, and $T^o = [t_1^o, \dots, t_{l_o}^o]$ the text output, with the LMM's forward call $T^o = \mathcal{F}(I, T^i)$. SoM preprocesses $I$ to yield a marked image $I^m$ in the following stages:

Segmentation: An off-the-shelf tool (e.g., SEEM, SAM, MaskDINO) decomposes $I$ at the selected granularity into $K$ binary masks $R_i \in \{0,1\}^{H \times W}$ for $i = 1, \dots, K$, which may or may not overlap and are chosen to delineate whole objects, parts, or subregions as required. Denote the region set $R = \{R_1, \dots, R_K\}$.
Marking: Each region $R_i$ is assigned a distinct, speakable mark $m_i$ from a mark set $\mathcal{M}$ (e.g., digits, letters, boundaries, colored outlines). The marking function overlays each $m_i$ at a computed anchor $p_i$ (chosen for salience and non-overlap):

$I^m = \mathrm{Mark}\big(I, \{(R_i, m_i, p_i)\}_{i=1}^{K}\big)$

Mark Allocation: Placement employs the following algorithm: regions are ordered by ascending area, an occupancy mask tracks used pixels, and for each region, the mark is placed at the point of maximal distance to the boundary among available pixels, with offsets as needed to ensure visibility.
Prompting: The LMM receives $I^m$ and the text query $T^i$, i.e., $T^o = \mathcal{F}(I^m, T^i)$, thereby enabling the model to reference regions via their marks in its textual output.
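
The mark allocation stage can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes masks are nested lists of 0/1, treats the occupancy mask as the union of regions already processed, and substitutes a breadth-first Manhattan distance for a true Euclidean distance transform. The names `distance_to_edge` and `allocate_marks` are hypothetical.

```python
from collections import deque

STEPS = ((1, 0), (-1, 0), (0, 1), (0, -1))

def distance_to_edge(free):
    """Manhattan distance from each free pixel to the nearest non-free pixel
    or image border (multi-source BFS). Non-free pixels keep distance 0."""
    h, w = len(free), len(free[0])
    dist = [[0] * w for _ in range(h)]
    queue = deque()
    # Seed the search with free pixels that touch the border or a non-free pixel.
    for y in range(h):
        for x in range(w):
            if free[y][x] and any(
                not (0 <= y + dy < h and 0 <= x + dx < w) or not free[y + dy][x + dx]
                for dy, dx in STEPS
            ):
                dist[y][x] = 1
                queue.append((y, x))
    while queue:
        y, x = queue.popleft()
        for dy, dx in STEPS:
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and free[ny][nx] and dist[ny][nx] == 0:
                dist[ny][nx] = dist[y][x] + 1
                queue.append((ny, nx))
    return dist

def allocate_marks(masks):
    """Return {region_index: (row, col)} anchors, smallest regions first."""
    h, w = len(masks[0]), len(masks[0][0])
    occupied = [[False] * w for _ in range(h)]
    order = sorted(range(len(masks)), key=lambda i: sum(map(sum, masks[i])))
    anchors = {}
    for i in order:
        # Only pixels of this region not claimed by earlier regions are available.
        free = [[bool(masks[i][y][x]) and not occupied[y][x] for x in range(w)]
                for y in range(h)]
        dist = distance_to_edge(free)
        best, anchor = 0, None
        for y in range(h):
            for x in range(w):
                if dist[y][x] > best:
                    best, anchor = dist[y][x], (y, x)
        if anchor is not None:  # skip regions fully covered by earlier ones
            anchors[i] = anchor
        for y in range(h):
            for x in range(w):
                if masks[i][y][x]:
                    occupied[y][x] = True
    return anchors

# Toy input: a 5x5 region (index 0) that contains a 3x3 region (index 1).
big = [[1] * 5 for _ in range(5)]
small = [[1 if y < 3 and x < 3 else 0 for x in range(5)] for y in range(5)]
print(allocate_marks([big, small]))
```

On this toy input the small region is processed first and receives its own center, while the mark for the large region is pushed into the part that the small region does not cover.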

2. Integration with Large Multimodal Models

In SoM's framework, the LMM's vision encoder processes the marked pixels as salient features, while the model's OCR capability reads the overlaid marks. During autoregressive decoding, the model can produce tokens referencing marks (e.g., "7", "B", "(mask 3)"), directly and unambiguously linking natural language to spatial regions.

A region-to-text alignment arises: each model reference to a mark $m_i$ maps deterministically back to its region $R_i$, enabling region selection, object segmentation, referring comprehension, phrase grounding, and tracking. Crucially, this mechanism bypasses the need for direct coordinate or mask prediction by the LMM, using the mark mapping as an explicit grounding function.
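
The mapping from mark references back to regions can be sketched as a simple lookup. This is an illustrative sketch only: it assumes numeric marks, represents each region by a placeholder string, and uses a hypothetical helper `resolve_marks` and a made-up model answer. A real pipeline would also need to separate marks from ordinary numbers in the text.

```python
import re

def resolve_marks(answer, regions_by_mark):
    """Return (mark, region) pairs for known numeric marks cited in the answer,
    in order of first mention. Numbers that are not known marks are ignored."""
    resolved = []
    seen = set()
    for token in re.findall(r"\b\d+\b", answer):
        mark = int(token)
        if mark in regions_by_mark and mark not in seen:
            seen.add(mark)
            resolved.append((mark, regions_by_mark[mark]))
    return resolved

# Hypothetical mark-to-region table and model answer.
regions_by_mark = {7: "mask_7", 9: "mask_9", 12: "mask_12"}
answer = "The laptop labeled '9' sits on the table labeled '12'."
print(resolve_marks(answer, regions_by_mark))  # [(9, 'mask_9'), (12, 'mask_12')]
```

Because the lookup table is built when the marks are drawn, the model never has to emit coordinates: the grounding result is whatever region the cited mark was attached to.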

3. Evaluation Metrics and Experimental Outcomes

Performance is assessed using standard segmentation and grounding metrics. For masks $P$ (prediction) and $G$ (ground truth):

Intersection-over-Union (IoU): $\mathrm{IoU}(P, G) = \frac{|P \cap G|}{|P \cup G|}$
Mean IoU (mIoU): instance-averaged IoU, $\mathrm{mIoU} = \frac{1}{N} \sum_{n=1}^{N} \mathrm{IoU}(P_n, G_n)$
Acc@0.5: proportion of predicted boxes with IoU $\geq 0.5$ to ground truth
Precision/Recall: fraction of correct region labels among predictions / recall of ground-truth regions
DAVIS J&F: mean of region similarity (J) and boundary accuracy (F)
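
As a concrete illustration of the overlap-based metrics in this list, the following Python sketch computes mask IoU, mean IoU, and the thresholded box accuracy. It assumes masks are nested lists of 0/1 and boxes are `(x1, y1, x2, y2)` tuples; the function names are illustrative and not taken from any benchmark toolkit.

```python
def mask_iou(pred, gt):
    """IoU of two binary masks given as nested lists of 0/1."""
    inter = sum(p & g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    union = sum(p | g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    return inter / union if union else 0.0

def mean_iou(mask_pairs):
    """Average mask IoU over (prediction, ground truth) pairs."""
    return sum(mask_iou(p, g) for p, g in mask_pairs) / len(mask_pairs)

def box_iou(a, b):
    """IoU of two axis-aligned boxes (x1, y1, x2, y2)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def accuracy_at(box_pairs, threshold=0.5):
    """Fraction of predicted boxes whose IoU with ground truth meets the threshold."""
    return sum(box_iou(p, g) >= threshold for p, g in box_pairs) / len(box_pairs)

pred = [[1, 1], [0, 0]]
gt = [[1, 0], [0, 0]]
print(mask_iou(pred, gt))  # 0.5: one shared pixel out of two in the union
```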

Main Experimental Results

Task	Baseline	Specialist (fully finetuned)	GPT-4V + SoM (zero-shot)
RefCOCOg Referring Expression Comprehension (Acc@0.5)	Direct coordinate output, 25.7%	PolyFormer, 85.8%	86.4%
RefCOCOg Referring Expression Segmentation (mIoU)	~0	PolyFormer, 67.2%	75.6%
COCO Open-Vocab Segmentation (precision)	-	MaskDINO, 80.7%	75.7%
ADE20K Open-Vocab Segmentation (precision)	-	OpenSeeD zero-shot, 23.4%	63.4%
Flickr30K Phrase Grounding (Recall@1)	-	GLIPv2, 87.7%	89.2%
DAVIS2017 Video Object Segmentation (J&F)	-	SegGPT, 75.6%	78.8%

SoM prompting lifts GPT-4V's RefCOCOg comprehension performance from 25.7% to 86.4% Acc@0.5 and segmentation mIoU from near zero to 75.6%, surpassing the fully finetuned PolyFormer on both metrics despite operating zero-shot. GPT-4V + SoM also exceeds the listed specialists on ADE20K open-vocabulary segmentation, Flickr30K phrase grounding, and DAVIS2017 video object segmentation, while on COCO open-vocabulary segmentation it remains below MaskDINO (75.7% vs. 80.7%) (Yang et al., 2023).

4. Empirical Examples and Visualization Cases

Illustrative qualitative examples highlight SoM's practical impact:

In a complex table scene, standard GPT-4V cannot unambiguously ground a query such as "Which is the laptop on?" With SoM, the overlay adds numeric region marks (e.g., "7", "9", "12"), so the model can answer by direct reference ("The laptop labeled '9'").
In dense, multi-object environments (e.g., a kitchen), a query that references a mark ("What is in 3?") maps precisely to the correct mask ("a bowl of sliced fruits"), overcoming the limitations of standard prompting for overlapping or small objects.

Ablation studies further demonstrate performance dependence on mark types: combining numbers with mask boundaries yields 84.4% Recall@1 on Flickr30K phrase grounding, increasing to 89.2% when box overlays are added. Use of ground-truth segmentation masks raises RefCOCOg mIoU from 75.6% (predicted masks) to 90.1% (Yang et al., 2023).

5. Technical Advantages and Broad Applicability

The SoM framework confers several benefits:

Marks are specifically chosen to be speakable, ensuring easy recognition and reference by GPT-4V’s OCR and language generation. This enables natural, precise textual grounding of visual regions.
SoM requires no model retraining or gradient updates. It leverages latent visual grounding skills inherent in the LMM and generalizes across tasks.
The method applies uniformly to a spectrum of fine-grained vision challenges, including open-vocabulary segmentation, referring expression comprehension, phrase grounding, visual navigation, and tool instruction. Its model-agnostic design extends beyond segmentation masks and is amenable to future adaptation for keypoints, 3D volumes, or continuous domains.

6. Discussion of Limitations and Open Research Questions

While SoM establishes strong empirical performance and practical versatility, several unresolved challenges persist:

When multiple regions’ optimal mark locations coincide, visual overlap may render marks ambiguous. The adopted distance-transform heuristic for placement reduces, but does not eliminate, such conflicts.
Some benchmarking datasets exhibit ambiguous annotations or segmentation masks, occasionally causing discordant model outputs; SoM can sometimes surface such annotation noise.
Automated, content-aware selection of mark types (beyond numerics), particularly for special cases like arithmetic visual puzzles or text-only screenshots, remains an open problem.
Extending SoM’s paradigm from 2D segmentation masks to more complex structures such as keypoints, 3D shapes, or continuous-space models will require further research.
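
The mark-overlap limitation listed first can at least be detected after placement. The sketch below is a hypothetical check, not part of SoM: it assumes every mark is drawn as a rectangle of the same size centered on its anchor, and the name `colliding_marks` and the label dimensions are illustrative.

```python
from itertools import combinations

def colliding_marks(anchors, label_h, label_w):
    """Return pairs of marks whose label rectangles would overlap.

    anchors: {mark: (row, col)} centers; every label is label_h x label_w pixels.
    """
    clashes = []
    for (a, (ya, xa)), (b, (yb, xb)) in combinations(sorted(anchors.items()), 2):
        # Two equal-sized, center-anchored rectangles overlap when their
        # centers are closer than one label size along both axes.
        if abs(ya - yb) < label_h and abs(xa - xb) < label_w:
            clashes.append((a, b))
    return clashes

anchors = {1: (10, 10), 2: (12, 14), 3: (40, 40)}
print(colliding_marks(anchors, label_h=8, label_w=12))  # [(1, 2)]
```

Pairs flagged this way could then be offset or rendered with a different mark type, which is the kind of adjustment the placement heuristic only partially provides.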

Set-of-Mark prompting serves as a lightweight, general-purpose interface layer that unlocks the zero-shot visual grounding capacity latent within large multimodal models, matching or exceeding task-specific specialist models on most of the reported fine-grained visual understanding benchmarks (Yang et al., 2023).


References (1)

Yang et al. (2023). Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V.
