
SceneParser-Bench Overview

Updated 17 May 2026
  • SceneParser-Bench is a large-scale benchmark for explicit hierarchical scene parsing, producing structured scene→object→part→affordance representations.
  • The benchmark uses a multi-stage annotation pipeline with GPT-5 and SAM3 to generate high-quality labels and enforce cross-level binding.
  • It employs tailored evaluation metrics, including multi-level F1 scores and ParseRate, to support downstream tasks like interaction-oriented reasoning and planning.

SceneParser-Bench is a large-scale benchmark for explicit hierarchical scene parsing. It evaluates models’ ability to produce structured scene → object → part → affordance representations with tightly coupled cross-level bindings. Unlike previous benchmarks focused on object detection or segmentation, SceneParser-Bench emphasizes the structured dependencies required for interaction-oriented visual understanding, yielding actionable scene representations and explicit support for downstream reasoning tasks (Xu et al., 14 May 2026).

1. Dataset Construction and Hierarchical Annotation Pipeline

SceneParser-Bench is generated via a multi-stage hierarchical data engine that annotates raw RGB images with scene hierarchies comprising object, part, and affordance labels.

  • Stage 1: Scene-level Object Grounding.
    • GPT-5 generates candidate object names and referring expressions per image.
    • Object localization is performed by ensembling outputs from Grounding DINO, Rex-Omni, and SAM3. Detections are confidence-thresholded and merged for high-recall object bounding boxes.
  • Stage 2: Object-centric Part Parsing.
    • Each object bounding box is cropped; GPT-5 proposes functional part names (e.g., “handle,” “lid”).
    • SAM3 segments part masks, converts them to tight boxes, and attaches each to its parent object.
  • Stage 3: Affordance Parsing.
    • GPT-5 generates affordance descriptions for each object crop and associates region text.
    • SAM3 segments affordance regions; low-confidence masks are filtered, boxes are produced, and an interaction point is sampled within each box.
  • Hierarchy Reconstruction and QC.
    • Textual matching (when part names are referenced) or geometric containment (via overlap of affordance points and boxes) links each affordance box to exactly one part of one object.
    • Automated consistency checks remove duplicate or invalid annotations, preserving a rooted four-level hierarchy.
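
The hierarchy-reconstruction step can be illustrated with a minimal sketch. It is not the benchmark's actual implementation: the `PartCandidate` type, the `link_affordance` function, the substring-based text match, and the smallest-box tie-break are all assumptions introduced here, and the geometric test is simplified to point-in-box containment.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Point = Tuple[float, float]              # (x, y)


@dataclass
class PartCandidate:
    """Hypothetical container for one part of a single parent object."""
    name: str
    box: Box


def point_in_box(point: Point, box: Box) -> bool:
    x, y = point
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2


def box_area(box: Box) -> float:
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def link_affordance(region_text: str, point: Point,
                    parts: Sequence[PartCandidate]) -> Optional[int]:
    """Return the index of the single part an affordance binds to, or None."""
    text = region_text.lower()
    # 1) Textual matching: the region text mentions exactly one part name.
    named = [i for i, p in enumerate(parts) if p.name.lower() in text]
    if len(named) == 1:
        return named[0]
    # 2) Geometric containment: the interaction point lies inside a part box.
    pool = named if named else range(len(parts))
    inside = [i for i in pool if point_in_box(point, parts[i].box)]
    if not inside:
        return None  # assumption: unlinked affordances are discarded later
    # Tie-break (assumption): prefer the smallest enclosing part box.
    return min(inside, key=lambda i: box_area(parts[i].box))
```

Returning at most one index per affordance mirrors the constraint that each affordance is linked to exactly one part of one object.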

The dataset includes:

  • 110,000 training images and 5,000 validation images (no separately released test set; all evaluation is on validation).
  • Total annotations: 777,000 objects, 1,140,000 parts, 1,740,000 affordances, and 1,740,000 valid object–part–affordance chains.

2. Formal Hierarchical Representation

The annotation schema is a rooted four-level directed hierarchy: scene → objects → parts → affordances. All bindings between hierarchy levels are explicit.

Let an image parse be

$$\mathcal{H} = \{ O_i \}_{i=1}^{N},$$

where each object node is

$$O_i = \left( c_i,\, b_i,\, \mathcal{P}_i \right)$$

with $c_i$ the object category name, $b_i = (x_1, y_1, x_2, y_2)$ the bounding box, and $\mathcal{P}_i = \{ P_{ij} \}_{j=1}^{M_i}$ the part set.

Each part is

$$P_{ij} = \left( q_{ij},\, b_{ij},\, \mathcal{A}_{ij} \right)$$

with $q_{ij}$ the part name, $b_{ij}$ the part bounding box, and $\mathcal{A}_{ij} = \{ A_{ijk} \}_{k=1}^{K_{ij}}$ its affordances.

Each affordance is

$$A_{ijk} = \left( a_{ijk},\, u_{ijk} \right)$$

with $a_{ijk}$ the action label (e.g., “open”), and $u_{ijk}$ a 2D interaction point.

This hierarchical design enforces that every $P_{ij}$ belongs to exactly one $O_i$, and every $A_{ijk}$ is associated with a specific $P_{ij}$, ensuring clear cross-level bindings for downstream processing.
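
As a rough illustration, the nested definition can be mirrored by three small data classes. The class and field names (`SceneObject`, `Part`, `Affordance`, `count_nodes`) are hypothetical and chosen only for this sketch; the nesting itself is what guarantees that each part has one parent object and each affordance one parent part.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Point = Tuple[float, float]              # (x, y)


@dataclass
class Affordance:
    action: str   # action label, e.g. "open"
    point: Point  # 2D interaction point


@dataclass
class Part:
    name: str  # part name
    box: Box   # part bounding box
    affordances: List[Affordance] = field(default_factory=list)


@dataclass
class SceneObject:
    category: str  # object category name
    box: Box       # object bounding box
    parts: List[Part] = field(default_factory=list)


def count_nodes(scene: List[SceneObject]) -> Tuple[int, int, int]:
    """Count objects, parts, and affordances in one image parse."""
    n_parts = sum(len(o.parts) for o in scene)
    n_affs = sum(len(p.affordances) for o in scene for p in o.parts)
    return len(scene), n_parts, n_affs
```

Because an affordance can only be reached through its part and object, every affordance corresponds to exactly one object–part–affordance chain.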

3. Structure-Aware Evaluation Metrics

Evaluation in SceneParser-Bench is conducted using both hierarchical and holistic completeness measures.

  • Three-Level Conditional Metrics: For each level $\ell \in \{1, 2, 3\}$ and an IoU threshold $\tau$,
    • True positives ($\mathrm{TP}_\ell$), false positives ($\mathrm{FP}_\ell$), and false negatives ($\mathrm{FN}_\ell$) are accumulated.
    • Precision, recall, and F1 are computed:

    $$\mathrm{Prec}_\ell = \frac{\mathrm{TP}_\ell}{\mathrm{TP}_\ell + \mathrm{FP}_\ell}, \qquad \mathrm{Rec}_\ell = \frac{\mathrm{TP}_\ell}{\mathrm{TP}_\ell + \mathrm{FN}_\ell}$$

    $$\mathrm{F1}_\ell = \frac{2\, \mathrm{Prec}_\ell\, \mathrm{Rec}_\ell}{\mathrm{Prec}_\ell + \mathrm{Rec}_\ell}$$

  • Level-Specific Matching:

    • Level-1: Objects are matched by identical class and box IoU ≥ τ.
    • Level-2: Parts are matched within correctly paired objects (by part name, IoU ≥ τ).
    • Level-3: Affordances are matched within correctly paired parts (by identical action label and predicted point falling within the ground-truth affordance or fallback part box).
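  • ParseRate: Denotes the fraction of “parse-eligible” ground-truth objects (those with any annotated part/affordance) whose matched predictions include all required parts/affordances. Writing $\mathcal{E}$ for the set of parse-eligible ground-truth objects:

$$\mathrm{ParseRate} = \frac{\left| \{\, O_i \in \mathcal{E} : O_i \text{ is matched with all required parts and affordances} \,\} \right|}{\left| \mathcal{E} \right|}$$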

ParseRate is reported both at the scene level (over all objects in the image set) and at the object-class level.
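
The metrics above can be sketched in a few lines. This is an illustrative reading only: the greedy one-to-one assignment in `match_level1` is an assumption (the matching algorithm is not specified here), and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Labeled = Tuple[str, Box]                # (class name, box)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_level1(preds: Sequence[Labeled], gts: Sequence[Labeled],
                 tau: float = 0.5) -> List[Tuple[int, int]]:
    """Greedy one-to-one matching: identical class and IoU >= tau."""
    candidates = sorted(
        ((iou(pb, gb), pi, gi)
         for pi, (pl, pb) in enumerate(preds)
         for gi, (gl, gb) in enumerate(gts)
         if pl == gl),
        reverse=True,
    )
    used_p, used_g, pairs = set(), set(), []
    for score, pi, gi in candidates:
        if score < tau:
            break  # candidates are sorted, so no later pair can qualify
        if pi in used_p or gi in used_g:
            continue
        used_p.add(pi)
        used_g.add(gi)
        pairs.append((pi, gi))
    return pairs


def prf1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from accumulated counts."""
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1


def parse_rate(fully_parsed: Sequence[bool]) -> float:
    """fully_parsed[i]: parse-eligible GT object i has all required children matched."""
    return sum(fully_parsed) / len(fully_parsed) if fully_parsed else 0.0
```

Level-2 and Level-3 matching would apply the same idea inside each matched pair, which is what makes the metrics conditional: a part can only count as a true positive if its parent object was matched first.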

4. Benchmarking Protocol and Training Procedures

Training and evaluation protocols in SceneParser-Bench are tailored for explicit hierarchy decoding:

  • Model/Architecture: Rex-Omni backbone with autoregressive JSON-style decoding of scene hierarchies.
  • Token Serialization: Each bounding box is serialized to four 1,000-bin coordinate tokens, each point to two tokens.
  • Loss: Cross-entropy applied to the full hierarchy token stream.
  • Structural-Completion Pseudo Labels: Tree completeness is enforced through placeholder nodes where parts or affordances are missing (ignored in evaluation).
  • Curriculum Learning: A three-phase regime in which the initial epochs use only real annotations, followed by increasing proportions of pseudo-completed data, ensuring gradual exposure to incomplete structures in training.
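
The coordinate serialization can be sketched as uniform quantization into 1,000 bins. The token spelling (`<500>`), the normalization by image width and height, and the function names are assumptions made for illustration; only the bin count and the four-token/two-token layout come from the description above.

```python
from typing import List, Tuple

NUM_BINS = 1000  # bins per coordinate, as described above


def quantize(value: float, extent: float, num_bins: int = NUM_BINS) -> int:
    """Map a pixel coordinate in [0, extent] to a bin index in [0, num_bins - 1]."""
    frac = min(max(value / extent, 0.0), 1.0)
    return min(int(frac * num_bins), num_bins - 1)


def box_to_tokens(box: Tuple[float, float, float, float],
                  width: float, height: float) -> List[str]:
    """Serialize (x1, y1, x2, y2) into four coordinate tokens."""
    x1, y1, x2, y2 = box
    bins = [quantize(x1, width), quantize(y1, height),
            quantize(x2, width), quantize(y2, height)]
    return [f"<{b}>" for b in bins]  # token spelling is hypothetical


def point_to_tokens(point: Tuple[float, float],
                    width: float, height: float) -> List[str]:
    """Serialize (x, y) into two coordinate tokens."""
    x, y = point
    return [f"<{quantize(x, width)}>", f"<{quantize(y, height)}>"]


def dequantize(bin_index: int, extent: float, num_bins: int = NUM_BINS) -> float:
    """Recover the pixel coordinate at the centre of a bin."""
    return (bin_index + 0.5) / num_bins * extent
```

With this scheme the round-trip error is at most half a bin, i.e. 0.05% of the image extent per coordinate.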

Inference prompts the model with an image and optionally a target object name, producing a JSON-formatted hierarchy post-processed into the expected object–part–affordance structure.
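
A post-processing step of this kind might look like the sketch below. The JSON schema (keys `objects`, `name`, `box`, `parts`, `affordances`, `action`, `point`) and the `<none>` placeholder marker are hypothetical, since the model's output format is not specified beyond being JSON-style; dropping placeholder nodes here is likewise an assumption.

```python
import json
from typing import Any, Dict, List

PLACEHOLDER = "<none>"  # hypothetical marker for structural-completion nodes


def _is_number(v: Any) -> bool:
    return isinstance(v, (int, float)) and not isinstance(v, bool)


def _valid_box(box: Any) -> bool:
    return (isinstance(box, list) and len(box) == 4
            and all(_is_number(v) for v in box)
            and box[0] < box[2] and box[1] < box[3])


def _valid_point(point: Any) -> bool:
    return (isinstance(point, list) and len(point) == 2
            and all(_is_number(v) for v in point))


def _named(node: Any, key: str) -> bool:
    """True if node is a dict with a real (non-empty, non-placeholder) label."""
    return (isinstance(node, dict) and isinstance(node.get(key), str)
            and node[key] not in ("", PLACEHOLDER))


def _children(node: Dict[str, Any], key: str) -> List[Any]:
    value = node.get(key)
    return value if isinstance(value, list) else []


def postprocess(raw: str) -> List[Dict[str, Any]]:
    """Parse decoded text into a cleaned object -> part -> affordance list."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []  # unparseable output yields an empty hierarchy
    if not isinstance(data, dict):
        return []
    objects = []
    for obj in _children(data, "objects"):
        if not (_named(obj, "name") and _valid_box(obj.get("box"))):
            continue
        parts = []
        for part in _children(obj, "parts"):
            if not (_named(part, "name") and _valid_box(part.get("box"))):
                continue
            affs = [{"action": a["action"], "point": a["point"]}
                    for a in _children(part, "affordances")
                    if _named(a, "action") and _valid_point(a.get("point"))]
            parts.append({"name": part["name"], "box": part["box"],
                          "affordances": affs})
        objects.append({"name": obj["name"], "box": obj["box"], "parts": parts})
    return objects
```

Validating top-down means a malformed object discards its whole subtree, which keeps the result a well-formed rooted hierarchy.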

5. Baselines and Comparative Results

SceneParser-Bench provides a rigorous evaluation of multiple baselines:

| Method | L1 F1 | L2 F1 | L3 F1 | ParseRate |
|---|---|---|---|---|
| MLLMs (closed/open-source) | ~15–35% | <7% | <3% | <25% |
| Perception-stitching | ~37% | ~22.4% | ~0% | ~15% |
| SceneParser (main model) | ~54.6% | ~37.5% | ~26.3% | ~53.2% |

Experiments indicate that while multimodal LLMs and perception-stitching baselines reach moderate F1 at the object level (L1), they deteriorate sharply in part and affordance structure completion (L2, L3) and in ParseRate. Relative to stitching, SceneParser’s unified hierarchical decoding gains about 15 pp in F1 at L2 (~37.5% vs. ~22.4%) and about 26 pp at L3 (~26.3% vs. ~0%), and more than triples ParseRate (~53.2% vs. ~15%), demonstrating improved cross-level binding.

Ablation studies highlight the importance of explicit hierarchy, structural pseudo labels, and curriculum learning:

  • Nested vs. Flat Triplets: Nested hierarchies reach an L3 F1 of 29.3%, versus 17.8% for flat triplet output (editor’s term).
  • Affordance Context: Adding object and part context to point grounding increases F1 from 40.6% to 42.8%.
  • Curriculum Learning: The curriculum raises ParseRate to 43.2%, versus 40.5% when pseudo labels are always used and 39.6% with no pseudo labels.

6. Transferability and Downstream Evaluation

SceneParser-Bench facilitates evaluation on both classic and novel tasks:

  • COCO Object Detection: SceneParser achieves F1@IoU 0.5 of 66.8%, competitive with standard detectors (e.g., DINO-R50 at ∼68.8%).
  • AGD20K Affordance Grounding: SceneParser yields point-in-mask accuracy of 87.7% on seen and 82.8% on unseen objects, outperforming prior affordance models (e.g., Affordance-R1 at 60.8% and 57.5%).
  • Downstream Planning Probe: Parsing with SceneParser hierarchies enables explicit multi-step action chains (e.g., “drawer→handle→pull”), providing a decision-ready interface for interaction-oriented reasoning, as opposed to the incomplete, ambiguous localizations from task-only prompts.

This suggests that unified hierarchical scene parsing provides a robust substrate for higher-level visual reasoning and task planning.
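
To make the planning interface concrete, a parsed hierarchy can be flattened into object→part→action chains that a planner can query. The sketch assumes a nested-dictionary layout with hypothetical keys (`name`, `parts`, `affordances`, `action`, `point`); it illustrates the idea and is not the probe used in the paper.

```python
from typing import Any, Dict, List

Hierarchy = List[Dict[str, Any]]


def action_chains(objects: Hierarchy) -> List[Dict[str, Any]]:
    """Flatten a nested parse into one record per object→part→action chain."""
    chains = []
    for obj in objects:
        for part in obj.get("parts", []):
            for aff in part.get("affordances", []):
                chains.append({
                    "chain": "→".join((obj["name"], part["name"], aff["action"])),
                    "object": obj["name"],
                    "action": aff["action"],
                    "point": tuple(aff["point"]),  # where to interact
                })
    return chains


def candidates_for(objects: Hierarchy, target_object: str,
                   action: str) -> List[Dict[str, Any]]:
    """Select the chains that realize `action` on `target_object`."""
    return [c for c in action_chains(objects)
            if c["object"] == target_object and c["action"] == action]
```

Each returned record names the part to act on and carries the interaction point, so a request such as "pull the drawer" resolves to a specific location rather than a whole-object box.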

7. Context and Significance in Structured Scene Understanding

SceneParser-Bench addresses critical gaps in visual semantics understanding where isolated object or part predictions fail to capture interaction-oriented relationships. Its structured, cross-level evaluation paradigm encourages models to produce representations that are both semantically and structurally actionable, rather than disconnected lists of visual elements. With its explicit emphasis on hierarchy, progressive curriculum, and completeness metrics, SceneParser-Bench represents a foundational resource for the development and assessment of unified visual perception systems, supporting both low-level recognition and high-level, task-driven reasoning (Xu et al., 14 May 2026).
