SceneParser-Bench Overview
- SceneParser-Bench is a large-scale benchmark for explicit hierarchical scene parsing, producing structured scene→object→part→affordance representations.
- The benchmark uses a multi-stage annotation pipeline with GPT-5 and SAM3 to generate high-quality labels and enforce cross-level binding.
- It employs tailored evaluation metrics, including multi-level F1 scores and ParseRate, to support downstream tasks like interaction-oriented reasoning and planning.
SceneParser-Bench is a large-scale benchmark for explicit hierarchical scene parsing. It evaluates a model’s ability to produce structured scene → object → part → affordance representations with tightly coupled cross-level bindings. Unlike previous benchmarks focused on object detection or segmentation, SceneParser-Bench emphasizes the structured dependencies required for interaction-oriented visual understanding, yielding actionable scene representations and explicit support for downstream reasoning tasks (Xu et al., 14 May 2026).
1. Dataset Construction and Hierarchical Annotation Pipeline
SceneParser-Bench is generated via a multi-stage hierarchical data engine that annotates raw RGB images with scene hierarchies comprising object, part, and affordance labels.
- Stage 1: Scene-level Object Grounding.
- Stage 2: Object-centric Part Parsing.
  - Each object bounding box is cropped; GPT-5 proposes functional part names (e.g., “handle,” “lid”).
  - SAM3 segments part masks, converts them to tight boxes, and attaches each to its parent object.
- Stage 3: Affordance Parsing.
  - GPT-5 generates affordance descriptions for each object crop and associates region text.
  - SAM3 segments affordance regions; low-confidence masks are filtered, boxes are produced, and an interaction point is sampled within each box.
- Hierarchy Reconstruction and QC.
  - Textual matching (when part names are referenced) or geometric containment (via overlap of affordance points and boxes) links each affordance box to exactly one part of one object.
  - Automated consistency checks remove duplicate or invalid annotations, preserving a rooted four-level hierarchy.
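The hierarchy-reconstruction step can be illustrated with a minimal sketch. It assumes axis-aligned boxes in `(x1, y1, x2, y2)` form and tries textual matching first, then geometric containment of the interaction point. The names `PartBox` and `link_affordance`, the substring test, and the smallest-box tie-break are all hypothetical choices made for illustration; the source does not specify them.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed convention


@dataclass
class PartBox:
    object_id: int  # index of the parent object
    name: str       # part name, e.g. "handle"
    box: Box


def contains(box: Box, point: Tuple[float, float]) -> bool:
    """True if the point lies inside the box (borders included)."""
    x1, y1, x2, y2 = box
    px, py = point
    return x1 <= px <= x2 and y1 <= py <= y2


def area(box: Box) -> float:
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def link_affordance(region_text: str,
                    point: Tuple[float, float],
                    parts: List[PartBox]) -> Optional[PartBox]:
    """Link one affordance to at most one part.

    1) Textual matching: if exactly one part name occurs in the text, use it.
    2) Geometric containment: otherwise pick the smallest candidate box
       containing the interaction point (tie-break is an assumption).
    """
    text = region_text.lower()
    named = [p for p in parts if p.name.lower() in text]
    if len(named) == 1:
        return named[0]
    candidates = named or parts
    hits = [p for p in candidates if contains(p.box, point)]
    if not hits:
        return None  # unlinked affordances would be dropped by QC
    return min(hits, key=lambda p: area(p.box))
```

Returning `None` for an affordance that matches no part mirrors the idea that invalid annotations are removed rather than force-linked.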
The dataset includes:
- 110,000 training images and 5,000 validation images (no separately released test set; all evaluation is on validation).
- Total annotations: 777,000 objects, 1,140,000 parts, 1,740,000 affordances, and 1,740,000 valid object–part–affordance chains.
2. Formal Hierarchical Representation
The annotation schema is a rooted four-level directed hierarchy: scene → objects → parts → affordances. All bindings between hierarchy levels are explicit.
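Let an image parse be
$$\mathcal{S} = \{o_i\}_{i=1}^{N},$$
where each object node is
$$o_i = (c_i, b_i, \mathcal{P}_i),$$
with $c_i$ the object category name, $b_i$ the bounding box, and $\mathcal{P}_i = \{p_{ij}\}$ the part set.
Each part is
$$p_{ij} = (n_{ij}, b_{ij}, \mathcal{A}_{ij}),$$
with $n_{ij}$ the part name, $b_{ij}$ the part bounding box, and $\mathcal{A}_{ij} = \{a_{ijk}\}$ its affordances.
Each affordance is
$$a_{ijk} = (\alpha_{ijk}, q_{ijk}),$$
with $\alpha_{ijk}$ the action label (e.g., “open”), and $q_{ijk}$ a 2D interaction point.
This hierarchical design enforces that every $p_{ij}$ belongs to exactly one $o_i$, and every $a_{ijk}$ is associated with a specific $p_{ij}$, ensuring clear cross-level bindings for downstream processing.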
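As a rough illustration of this schema, the four levels map naturally onto nested containers. The sketch below uses Python dataclasses; the class and field names (`SceneParse`, `SceneObject`, `Part`, `Affordance`, `count_nodes`) are hypothetical and not taken from the benchmark's release format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed convention


@dataclass
class Affordance:
    action: str                  # action label, e.g. "open"
    point: Tuple[float, float]   # 2D interaction point


@dataclass
class Part:
    name: str
    box: Box
    affordances: List[Affordance] = field(default_factory=list)


@dataclass
class SceneObject:
    category: str
    box: Box
    parts: List[Part] = field(default_factory=list)


@dataclass
class SceneParse:
    objects: List[SceneObject] = field(default_factory=list)


def count_nodes(scene: SceneParse) -> Tuple[int, int, int]:
    """Return (#objects, #parts, #affordances) in the hierarchy."""
    n_parts = sum(len(o.parts) for o in scene.objects)
    n_affs = sum(len(p.affordances) for o in scene.objects for p in o.parts)
    return len(scene.objects), n_parts, n_affs
```

Because each part lives inside exactly one object's `parts` list and each affordance inside exactly one part's `affordances` list, the one-parent binding is enforced by construction rather than by separate ID fields.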
3. Structure-Aware Evaluation Metrics
Evaluation in SceneParser-Bench is conducted using both hierarchical and holistic completeness measures.
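- Three-Level Conditional Metrics: For each level $\ell \in \{1, 2, 3\}$ and an IoU threshold $\tau$,
  - True positives ($\mathrm{TP}_\ell$), false positives ($\mathrm{FP}_\ell$), and false negatives ($\mathrm{FN}_\ell$) are accumulated.
  - Precision, recall, and F1 are computed:
$$P_\ell = \frac{\mathrm{TP}_\ell}{\mathrm{TP}_\ell + \mathrm{FP}_\ell}, \qquad R_\ell = \frac{\mathrm{TP}_\ell}{\mathrm{TP}_\ell + \mathrm{FN}_\ell}$$
$$\mathrm{F1}_\ell = \frac{2\, P_\ell\, R_\ell}{P_\ell + R_\ell}$$
- Level-Specific Matching:
  - Level-1: Objects are matched by identical class and box IoU ≥ τ.
  - Level-2: Parts are matched within correctly paired objects (by part name, IoU ≥ τ).
  - Level-3: Affordances are matched within correctly paired parts (by identical action label and predicted point falling within the ground-truth affordance box or, as a fallback, the part box).
- ParseRate ($\mathrm{PR}$): Denotes the fraction of “parse-eligible” ground-truth objects (those with any annotated part/affordance) whose matched predictions include all required parts/affordances:
$$\mathrm{PR} = \frac{\left|\{\, o \in \mathcal{O}_{\mathrm{elig}} : o \text{ is matched with all required parts and affordances} \,\}\right|}{\left|\mathcal{O}_{\mathrm{elig}}\right|}$$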
ParseRate is reported both at the scene-level (all objects per image set) and at the object class level.
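The metric definitions above can be sketched in a few functions. This is a simplified illustration, not the benchmark's evaluation code: it covers only Level-1 matching, assumes greedy one-to-one assignment by descending IoU (the source does not state the assignment strategy), and takes per-object completeness flags as given for ParseRate. All function names are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_objects(preds, gts, tau=0.5):
    """Greedy Level-1 matching: same class and IoU >= tau.

    preds and gts are lists of (class_name, box). Returns (pred_idx, gt_idx) pairs.
    """
    candidates = sorted(
        ((iou(p[1], g[1]), i, j)
         for i, p in enumerate(preds)
         for j, g in enumerate(gts)
         if p[0] == g[0]),
        reverse=True,
    )
    used_p, used_g, pairs = set(), set(), []
    for score, i, j in candidates:
        if score < tau:
            break  # remaining candidates have even lower IoU
        if i in used_p or j in used_g:
            continue
        used_p.add(i)
        used_g.add(j)
        pairs.append((i, j))
    return pairs


def prf1(tp, fp, fn):
    """Precision, recall, and F1 from accumulated counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def parse_rate(complete_flags):
    """Fraction of parse-eligible ground-truth objects that are fully parsed."""
    return sum(complete_flags) / len(complete_flags) if complete_flags else 0.0
```

Levels 2 and 3 would reuse the same pattern, restricted to children of already matched parents, which is what makes the metrics conditional.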
4. Benchmarking Protocol and Training Procedures
Training and evaluation protocols in SceneParser-Bench are tailored for explicit hierarchy decoding:
- Model/Architecture: Rex-Omni backbone with autoregressive JSON-style decoding of scene hierarchies.
- Token Serialization: Each bounding box is serialized as four coordinate tokens, each quantized into 1,000 bins; each point is serialized as two such tokens.
- Loss: Cross-entropy applied to the full hierarchy token stream.
- Structural-Completion Pseudo Labels: Tree completeness is enforced through placeholder nodes where parts or affordances are missing (ignored in evaluation).
- Curriculum Learning: A three-phase regime in which the initial epochs use only real annotations, followed by increasing proportions of pseudo-completed data, so that the model is exposed to incomplete structures gradually during training.
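The coordinate serialization described in the list above can be sketched as follows. The exact binning convention (normalizing by image width and height, flooring, clamping to the last bin, and decoding to bin centers) is an assumption made for illustration, as are the function names.

```python
def quantize(value, size, bins=1000):
    """Map a pixel coordinate in [0, size] to an integer bin in [0, bins - 1]."""
    idx = int(value / size * bins)
    return min(max(idx, 0), bins - 1)  # clamp so value == size lands in the last bin


def dequantize(idx, size, bins=1000):
    """Map a bin index back to the pixel coordinate at the bin center."""
    return (idx + 0.5) / bins * size


def box_to_tokens(box, width, height, bins=1000):
    """Serialize an (x1, y1, x2, y2) box as four coordinate tokens."""
    x1, y1, x2, y2 = box
    return [
        quantize(x1, width, bins),
        quantize(y1, height, bins),
        quantize(x2, width, bins),
        quantize(y2, height, bins),
    ]


def point_to_tokens(point, width, height, bins=1000):
    """Serialize an (x, y) point as two coordinate tokens."""
    x, y = point
    return [quantize(x, width, bins), quantize(y, height, bins)]
```

With 1,000 bins, the round-trip error is at most half a bin, i.e. 0.05% of the image side, which is small relative to typical IoU matching tolerances.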
At inference, the model is prompted with an image and, optionally, a target object name. It produces a JSON-formatted hierarchy that is post-processed into the expected object–part–affordance structure.
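A post-processing step of this kind might look like the sketch below. The JSON schema shown (keys `objects`, `category`, `box`, `parts`, `name`, `affordances`, `action`, `point`) is hypothetical, since the source does not give the actual output format; the sketch only shows the general idea of decoding the model output and discarding malformed nodes.

```python
import json


def _is_numeric_list(value, n):
    """True if value is a list of exactly n numbers."""
    return (isinstance(value, list) and len(value) == n
            and all(isinstance(v, (int, float)) for v in value))


def _as_list(value):
    """Treat missing or non-list children as empty."""
    return value if isinstance(value, list) else []


def parse_hierarchy(raw):
    """Decode a JSON string into a cleaned list of object dicts.

    Nodes missing a name or a well-formed box/point are dropped together
    with their children. Invalid JSON yields an empty list.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, dict):
        return []
    objects = []
    for obj in _as_list(data.get("objects")):
        if not (isinstance(obj, dict) and obj.get("category")
                and _is_numeric_list(obj.get("box"), 4)):
            continue
        parts = []
        for part in _as_list(obj.get("parts")):
            if not (isinstance(part, dict) and part.get("name")
                    and _is_numeric_list(part.get("box"), 4)):
                continue
            affordances = [
                {"action": a["action"], "point": a["point"]}
                for a in _as_list(part.get("affordances"))
                if isinstance(a, dict) and a.get("action")
                and _is_numeric_list(a.get("point"), 2)
            ]
            parts.append({"name": part["name"], "box": part["box"],
                          "affordances": affordances})
        objects.append({"category": obj["category"], "box": obj["box"],
                        "parts": parts})
    return objects
```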
5. Baselines and Comparative Results
SceneParser-Bench provides a rigorous evaluation of multiple baselines:
| Method | L1 F1 | L2 F1 | L3 F1 | ParseRate |
|---|---|---|---|---|
| MLLMs (closed/open-source) | ~15–35% | <7% | <3% | <25% |
| Perception-stitching | ~37% | ~22.4% | ~0% | ~15% |
| SceneParser (main model) | ~54.6% | ~37.5% | ~26.3% | ~53.2% |
Experiments indicate that multimodal LLMs and perception-stitching baselines perform best at the object level (L1, roughly 15–37% F1) but deteriorate sharply on part and affordance structure (L2, L3) and on ParseRate. Relative to stitching, SceneParser’s unified hierarchical decoding gains about 15 pp in F1 at L2 (37.5% vs. 22.4%) and about 26 pp at L3 (26.3% vs. ~0%), and more than triples ParseRate (53.2% vs. ~15%), demonstrating improved cross-level binding.
Ablation studies highlight the importance of explicit hierarchy, structural pseudo labels, and curriculum learning:
- Nested vs. Flat Triplets: Nested hierarchies yield an L3 F1 of 29.3%, versus 17.8% for flat triplet output (editor’s term).
- Affordance Context: Adding object and part context to point grounding increases F1 from 40.6% to 42.8%.
- Curriculum Learning: The curriculum raises ParseRate to 43.2%, compared with 40.5% when pseudo labels are always used and 39.6% when they are never used.
6. Transferability and Downstream Evaluation
SceneParser-Bench facilitates evaluation on both classic and novel tasks:
- COCO Object Detection: SceneParser achieves F1@IoU 0.5 of 66.8%, competitive with standard detectors (e.g., DINO-R50 at ∼68.8%).
- AGD20K Affordance Grounding: SceneParser yields point-in-mask accuracy of 87.7% on seen and 82.8% on unseen objects, outperforming prior affordance models (e.g., Affordance-R1 at 60.8% and 57.5%).
- Downstream Planning Probe: Parsing with SceneParser hierarchies enables explicit multi-step action chains (e.g., “drawer→handle→pull”), providing a decision-ready interface for interaction-oriented reasoning, as opposed to the incomplete, ambiguous localizations from task-only prompts.
This suggests that unified hierarchical scene parsing provides a robust substrate for higher-level visual reasoning and task planning.
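To make the planning-probe idea concrete, the sketch below flattens a parsed hierarchy into explicit object→part→action chains and looks up the interaction point for a requested action. It assumes a plain nested-dict representation with hypothetical keys (`category`, `parts`, `name`, `affordances`, `action`, `point`); the function names are illustrative only.

```python
def action_chains(objects):
    """List every 'object→part→action' chain present in the hierarchy."""
    chains = []
    for obj in objects:
        for part in obj.get("parts", []):
            for aff in part.get("affordances", []):
                chains.append("→".join([obj["category"], part["name"], aff["action"]]))
    return chains


def find_interaction(objects, category, action):
    """Return (part_name, point) for the first part of `category` that affords `action`."""
    for obj in objects:
        if obj["category"] != category:
            continue
        for part in obj.get("parts", []):
            for aff in part.get("affordances", []):
                if aff["action"] == action:
                    return part["name"], aff["point"]
    return None  # the requested action is not grounded in this scene
```

A planner can consume the returned part name and point directly, whereas an object-only detection would leave the choice of where and how to interact unresolved.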
7. Context and Significance in Structured Scene Understanding
SceneParser-Bench addresses critical gaps in visual semantics understanding where isolated object or part predictions fail to capture interaction-oriented relationships. Its structured, cross-level evaluation paradigm encourages models to produce representations that are both semantically and structurally actionable, rather than disconnected lists of visual elements. With its explicit emphasis on hierarchy, progressive curriculum, and completeness metrics, SceneParser-Bench represents a foundational resource for the development and assessment of unified visual perception systems, supporting both low-level recognition and high-level, task-driven reasoning (Xu et al., 14 May 2026).