
LLM-Seg40K: Benchmark for Reasoning Segmentation

Updated 27 March 2026
  • LLM-Seg40K is a large-scale dataset for training and benchmarking vision-language models on reasoning-guided, open-vocabulary image segmentation tasks.
  • It is produced via a three-stage pipeline that combines high-quality image and mask preparation from LVIS and EgoObjects, detailed scene descriptions using LLaVA, and question generation through GPT-4.
  • Evaluation using metrics like IoU, gIoU, and cIoU shows enhanced zero-shot and fine-tuned performance, demonstrating its potential for robotics and complex segmentation applications.

LLM-Seg40K is a large-scale, automatically generated dataset for reasoning-oriented image segmentation. It targets the training and evaluation of vision–language models that must resolve open-vocabulary, multi-object, and intention-driven segmentation tasks from free-form language prompts. Developed alongside the LLM-Seg framework, LLM-Seg40K constitutes a new benchmark for reasoning segmentation, with a focus on grounding implicit user intentions through LLM reasoning (Wang et al., 2024).

1. Automatic Data Generation Pipeline

LLM-Seg40K is synthesized using a three-stage pipeline that integrates foundational computer vision corpora, advanced segmentation models, and state-of-the-art LLMs:

(a) Image & Mask Preparation

  • Data sources consist of two major datasets:
    • LVIS v1.0: 10,000 images (split into 6,000 “simple” images with 2–5 LVIS categories and 4,000 “complex” images with ≥6 categories).
    • EgoObjects: 3,000 egocentric frames each with ≥2 annotated bounding-box categories.
  • Mask acquisition leverages high-quality, human-verified pixel-level instance masks from LVIS, while EgoObjects bounding boxes are converted to instance masks with a frozen Segment Anything Model (SAM). Each bounding box is passed as a prompt to SAM; SAM’s IoU predictor removes candidates below a default threshold (τ_SAM ≈ 0.5), and non-maximum suppression eliminates duplicates.
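The filtering step of the EgoObjects conversion can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes SAM has already been run once per box prompt and has returned a candidate binary mask together with its predicted-IoU score, and the NMS overlap cutoff (0.7) is an assumed value.

```python
import numpy as np

TAU_SAM = 0.5   # predicted-IoU threshold (tau_SAM ~ 0.5, per the pipeline description)
NMS_IOU = 0.7   # assumed overlap cutoff for duplicate removal

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two binary masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union else 0.0

def filter_candidates(candidates):
    """candidates: list of (mask, predicted_iou) pairs, one per box prompt.

    1. Drop candidates whose predicted IoU falls below TAU_SAM.
    2. Greedy non-maximum suppression: keep the highest-scoring mask,
       discard any remaining mask overlapping a kept one above NMS_IOU.
    """
    survivors = [c for c in candidates if c[1] >= TAU_SAM]
    kept = []
    for mask, score in sorted(survivors, key=lambda c: -c[1]):
        if all(mask_iou(mask, k) < NMS_IOU for k, _ in kept):
            kept.append((mask, score))
    return [m for m, _ in kept]
```

In practice the candidate masks and scores would come from SAM's predictor given each EgoObjects box; the pure-numpy filtering above is independent of how they were produced.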

(b) Image Description via LLaVA

  • Each image is passed to LLaVA-v1.5-13B with the instruction: “Please describe the content in this image within 10 sentences.”
  • The resulting description (“summary”) contextualizes the scene for subsequent prompt engineering.

(c) Question Generation via GPT-4

  • For each image, GPT-4 receives a single-turn, in-context prompt specifying its role as a “professional image annotator,” together with the LLaVA-generated summary and a list of ground-truth object categories.
  • A fixed exemplary question is provided as a template.
  • GPT-4 produces 3–5 distinct reasoning questions per image. Each question requires open-vocabulary, compositional reasoning (e.g., spatial, attribute-based, functional queries).
  • For supervision, each question is mapped to the union of one or more ground-truth segmentation masks (derived from referenced categories in the question).
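The single-turn prompt described above can be assembled as in the following sketch. The exact wording used by the authors is not reproduced here; the exemplar question, field phrasing, and message layout are hypothetical placeholders following the OpenAI chat-message convention.

```python
# Hypothetical prompt assembly mirroring the pipeline description; the
# exemplar question below is a placeholder, not the paper's template.
EXEMPLAR = "Which object on the desk would you use to cut the paper?"

def build_question_prompt(summary: str, categories: list[str],
                          n_min: int = 3, n_max: int = 5) -> list[dict]:
    """Return an OpenAI-style chat message list for one image."""
    system = "You are a professional image annotator."
    user = (
        f"Image description:\n{summary}\n\n"
        f"Objects present: {', '.join(categories)}\n\n"
        f"Example question: {EXEMPLAR}\n\n"
        f"Write {n_min}-{n_max} distinct reasoning questions about this image. "
        "Each question must be answerable by selecting one or more of the "
        "listed objects, and may use spatial, attribute-based, or functional cues."
    )
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]
```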

Because existing LVIS masks are reused and only SAM-based conversion is applied to EgoObjects, the pipeline does not include custom loss functions or additional mask refinement steps.

2. Dataset Structure and Statistics

LLM-Seg40K encompasses 14,000 images and approximately 55,000 reasoning question–mask pairs. The distribution by dataset split is as follows:

Split       # Images   Avg. Questions/Image   Estimated # Pairs   Distinct Categories
Train       11,000     3.95                   ~43,450             1,458
Validation  1,000      ~4.0                   ~4,000
Test        2,000      ~4.0                   ~8,000
Total       14,000                            ~55,450
  • Image resolution: LVIS images range from 640×480 to 1280×720; EgoObjects are 1280×720, resized to match preprocessing.
  • Annotation format: Each split is represented by a JSON file storing per-image entries including image_id, file_name, and a list of questions, where each question contains question_id, free-form text, and a list of associated mask_id_list.
  • Segmentation masks are encoded in COCO-compatible PNG files (one channel per mask instance).

3. Annotation Schema and Prompts

Questions in LLM-Seg40K (denoted x_text) reference objects or object-unions using:

  • Category labels (“the mug”),
  • Relative attributes (“the larger apple”),
  • Spatial descriptors (“to the left of the laptop”),
  • Functional definitions (“that you can drink from without spilling”).

Answers are given as unions of corresponding ground-truth masks indicated via mask_id_list.

Example pairs:

  • Q: “Which electronic device in the scene lets you type text without looking away from the screen?” A: mask_id_list = 27
  • Q: “Among all cups on the table, which one is half-filled and has a straw?” A: mask_id_list = 14, 15
  • Q: “Find the animal standing closest to the red bicycle.” A: mask_id_list = 102

This schema enables annotation of prompts requiring spatial, attribute-mediated, and functional reasoning, as well as multi-instance selection.
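Since the supervision target for a question is the union of the instance masks in its mask_id_list, building that target is a one-liner over binary arrays. A minimal sketch, assuming masks are available as a dict from mask id to binary (H, W) array:

```python
import numpy as np

def target_mask(mask_id_list, masks_by_id):
    """Union the ground-truth instance masks referenced by one question.

    mask_id_list: ids from the annotation (e.g. [14, 15])
    masks_by_id:  dict mapping mask_id -> binary np.ndarray of shape (H, W)
    """
    ids = iter(mask_id_list)
    target = masks_by_id[next(ids)].astype(bool).copy()  # copy: don't mutate input
    for i in ids:
        target |= masks_by_id[i].astype(bool)
    return target
```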

4. Evaluation Metrics

Performance on LLM-Seg40K is assessed using segmentation overlap metrics:

\mathrm{IoU} = \frac{|P \cap G|}{|P \cup G|}

where P is the predicted mask and G the ground-truth mask.

  • Generalized IoU (gIoU): sample-averaged IoU,

\mathrm{gIoU} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{IoU}_i

  • Cumulative IoU (cIoU): global-pixel IoU across the dataset,

\mathrm{cIoU} = \frac{\sum_{i=1}^{N} |P_i \cap G_i|}{\sum_{i=1}^{N} |P_i \cup G_i|}

  • Normalized cIoU (ncIoU) is omitted, as image sizes are normalized per split.
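The two aggregate metrics differ only in where the averaging happens: gIoU averages per-sample ratios, while cIoU pools pixels across the whole dataset before dividing. A direct numpy implementation of the definitions above:

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Per-sample IoU between binary masks (1.0 if both are empty)."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union else 1.0

def g_iou(preds, gts) -> float:
    """gIoU: mean of per-sample IoUs over N samples."""
    return float(np.mean([iou(p, g) for p, g in zip(preds, gts)]))

def c_iou(preds, gts) -> float:
    """cIoU: global intersection over global union across all samples."""
    inter = sum(np.logical_and(p, g).sum() for p, g in zip(preds, gts))
    union = sum(np.logical_or(p, g).sum() for p, g in zip(preds, gts))
    return float(inter) / float(union) if union else 1.0
```

Because cIoU pools pixels, it weights large objects more heavily than gIoU does; the two agree only when per-sample unions are equal.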

5. Baseline Performance and Model Comparison

LLM-Seg40K serves as a benchmark for vision–language reasoning segmentation models. Five methods were evaluated on the validation split:

Method    LLM   Fine-tuned on LLM-Seg40K   gIoU (%)   cIoU (%)
GRES      ×     ×                          14.16      15.90
LISA      ✔     ×                          33.19      37.97
LISA      ✔     ✔                          37.59      48.49
LLM-Seg   ✔     ×                          36.02      39.42
LLM-Seg   ✔     ✔                          45.47      54.18

(× in the “Fine-tuned” column denotes zero-shot evaluation.)

Key findings:

  • After fine-tuning, the two-stage LLM-Seg method (frozen-SAM masks, LLM reasoning) outperforms fine-tuned LISA by a large margin (45.5 vs. 37.6 gIoU); it also leads in the zero-shot setting (36.0 vs. 33.2 gIoU).
  • Fine-tuning on LLM-Seg40K yields a roughly +9 gIoU improvement for LLM-Seg, compared to roughly +4 for LISA, indicating that dataset noise affects end-to-end mask decoders more significantly.
  • Notable error modes include false positives caused by visually similar objects and ambiguous question grounding when question text insufficiently disambiguates targets relative to the image description (Wang et al., 2024).

6. Applications, Strengths, and Limitations

LLM-Seg40K addresses several use cases:

  • Training open-vocabulary reasoning segmentation models for robotics tasks such as object fetching and table setting.
  • Benchmarking the ability of vision–language pipelines to resolve spatial and functional queries from free-form language.
  • Investigating models’ generalization to novel attribute–property compositions.

Strengths include:

  • Dataset scale (14,000 images, ~55,000 questions) exceeds prior reasoning-segmentation corpora by an order of magnitude.
  • Diversity: mixed photographic (LVIS) and egocentric (EgoObjects) images.
  • Fully automatic pipeline minimizes human annotation cost and enables extensibility.

Limitations and proposed mitigations:

  • Category distribution reflects LVIS/EgoObjects bias (frequent indoor objects); can be addressed via stratified sampling or augmentation with alternative image sources.
  • GPT-4 question styles may predominantly follow certain template patterns, leading to linguistic under-diversification; this can be mitigated through prompt-template variation and inclusion of curated human-written questions.
  • The mask-conversion process for EgoObjects may introduce boundary artifacts; light morphological post-processing or spot human validation can reduce such errors.

LLM-Seg40K thus provides scalable infrastructure for future research in reasoning-guided segmentation: its automatic pipeline makes it adaptable to new domains, and its diverse challenge set moves beyond closed-vocabulary and referring-expression baselines (Wang et al., 2024).

References

  • Wang, J., & Ke, L. (2024). LLM-Seg: Bridging Image Segmentation and Large Language Model Reasoning.
