
UNOBench: Obstruction Reasoning for Robotic Grasping

Updated 4 July 2026
  • UNOBench is a benchmark for obstruction reasoning in robotic grasping: given a cluttered RGB-D scene and a language cue, a model must determine whether the target is accessible.
  • It integrates synthetic and real subsets with detailed annotations such as occlusion ratios, contact points, and structured object-centric obstruction graphs.
  • Its two settings, Oracle and Natural Language Prompting, separate pure obstruction reasoning from language grounding, so that action sequencing can be evaluated with and without grounding ambiguity.

UNOBench is a benchmark and dataset for obstruction reasoning for robotic grasping in cluttered environments: given a cluttered RGB-D scene and a language instruction that refers to a target object, the system must determine whether the target is directly graspable or whether other objects must first be removed, and if so, which ones. Introduced in "Obstruction reasoning for robotic grasping" (Jiao et al., 28 Nov 2025), it was designed to address a gap between ordinary visual grounding and action-centric accessibility planning. In the authors’ formulation, the central issue is target accessibility rather than mere visibility: a model may correctly identify a target object yet still fail because it cannot infer the chain of objects that obstruct the manipulator’s access path. UNOBench therefore centers evaluation on obstruction paths, target-centric obstruction graphs, and the selection of top-level obstructors that can be removed immediately.

1. Conceptual scope and problem definition

UNOBench studies a form of embodied reasoning in which the relevant question is not only which object is the target, but which objects must be cleared first so that the target becomes graspable (Jiao et al., 28 Nov 2025). The benchmark was introduced because cluttered grasping is not reducible to visual grounding alone. A target such as “the white iphone box” may be visible and correctly localized, yet still be inaccessible because other objects lie above it or press against it.

The paper formulates this as a problem of obstruction reasoning. Starting from a referred target object, one must infer the obstruction paths that proceed upward through the objects that must be cleared. These paths determine the necessary action sequencing: the robot should remove only objects that are currently accessible and lie at the top of the relevant obstruction paths, then re-evaluate until the target becomes graspable. The benchmark therefore evaluates accessibility judgment, dependency reasoning, and next-step action selection in a unified setting.
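The remove-then-re-evaluate loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `clearing_order` and the `blocked_by` dictionary (object ID mapped to the set of IDs directly obstructing it) are hypothetical, and the sketch assumes the obstruction relation is acyclic.

```python
def clearing_order(target, blocked_by):
    """Return the removal order that frees the target, ending with the target.

    blocked_by is a hypothetical input format: object ID -> set of IDs that
    directly obstruct it. The relation is assumed to be acyclic.
    """
    remaining = {obj: set(obs) for obj, obs in blocked_by.items()}
    order = []
    while remaining.get(target):
        # Collect every object the target currently depends on.
        stack, needed = [target], set()
        while stack:
            for obs in remaining.get(stack.pop(), ()):
                if obs not in needed:
                    needed.add(obs)
                    stack.append(obs)
        # Top-level obstructors: needed objects that nothing else blocks.
        tops = sorted(o for o in needed if not remaining.get(o))
        if not tops:
            raise ValueError("cyclic obstruction: no removable object")
        # Remove them, then re-evaluate on the next iteration.
        for top in tops:
            order.append(top)
            for obs in remaining.values():
                obs.discard(top)
    order.append(target)
    return order
```

For example, if object 1 is blocked by objects 2 and 3, and object 2 is in turn blocked by object 4, the sketch removes 3 and 4 first, then 2, and finally grasps 1.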

A recurring misconception is that UNOBench is simply a visual grounding benchmark with robotic language prompts. The paper argues otherwise. Ordinary visual grounding benchmarks test whether a model can localize a named object, while UNOBench tests whether a model can reason about the physical dependencies that determine whether that object can be grasped. Likewise, the benchmark is not equivalent to a generic spatial-reasoning dataset: the emphasis is on action-centric reasoning about which object to remove first.

Another important distinction is between UNOBench and UNOGrasp. UNOBench is the benchmark and data resource; UNOGrasp is the model trained on it. This separation matters because many of the paper’s claims concern the benchmark’s representational structure and evaluation protocol, not only the performance of the proposed model.

2. Dataset construction and composition

UNOBench is built on top of MetaGraspNetV2, which provides amodal segmentation and object geometry but does not provide the language grounding, object-centric obstruction structures, or action-oriented supervision required by the task (Jiao et al., 28 Nov 2025). The benchmark includes both synthetic and real subsets.

Subset | Scale | Annotations
Synthetic | 6,255 scenes, 25,020 view images | 97,066 object instances annotated with names, 108,174 reasoning paths
Real | 520 scenes, 1 view per scene | 2,232 object instances annotated with names, 2,552 obstruction paths

The synthetic subset is derived from 8,007 MetaGraspNetV2 scenes, each with 37 viewpoints. The construction pipeline filters out three categories of cases: empty scenes or scenes with only one object; physically implausible obstructions caused by simulation artifacts such as object penetration; and bidirectional or cyclic obstruction patterns. After filtering, 6,255 scenes are retained, and four viewpoints per scene are randomly sampled to produce 25,020 images.
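A scene-level filter of this kind could look like the following sketch. It is an illustration under stated assumptions rather than the benchmark's actual pipeline: the names `has_cycle` and `keep_scene`, the `blocked_by` dictionary format, and the `has_penetration` flag (standing in for whatever check detects simulation artifacts) are all hypothetical.

```python
def has_cycle(blocked_by):
    """Detect bidirectional or cyclic obstruction with a three-color DFS.

    blocked_by (hypothetical format): object ID -> set of IDs obstructing it.
    """
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}

    def visit(node):
        color[node] = GRAY
        for nxt in blocked_by.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GRAY:
                return True  # back edge: nxt is already on the current path
            if state == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in list(blocked_by))


def keep_scene(object_ids, blocked_by, has_penetration=False):
    """Apply the three exclusion rules described in the text."""
    if len(object_ids) < 2:
        return False  # empty scene or single-object scene
    if has_penetration:
        return False  # physically implausible obstruction (assumed external check)
    return not has_cycle(blocked_by)
```

A mutual relation such as "1 is blocked by 2 and 2 is blocked by 1" is the smallest cycle and is rejected by the same check that catches longer loops.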

Target objects are also filtered. Objects are excluded if the obstruction ratio is below 1%, because such light obstruction is considered visually ambiguous and not operationally relevant, or above 95%, because such heavy occlusion makes the object too difficult to recognize. The real subset follows the same construction pipeline, but all real scenes are used only for testing.

UNOBench is organized into two benchmark settings:

Setting | Target reference | Purpose
Oracle (with Set-of-Mark, SoM) | Objects referred to by numeric ID | Isolates reasoning performance because grounding ambiguity is removed
Natural Language Prompting (NLP) | Objects referred to by free-form language | Evaluates both obstruction reasoning and language-based grounding

This dual design is methodologically important. The Oracle setting controls for language-grounding ambiguity by explicitly marking objects with IDs, whereas the NLP setting requires the model to resolve free-form referring expressions such as “grasp the white box on the leftmost” or “To grasp right sugar box, which object is on top of it?” A plausible implication is that the two settings separate failures of reasoning from failures of grounding more cleanly than a single-format benchmark would.

3. Annotation pipeline and symbolic structure

UNOBench augments MetaGraspNetV2 with two ingredients absent from the base resource: human-usable language references and structured obstruction graphs (Jiao et al., 28 Nov 2025). The symbolic representation is constructed in four stages.

First, the benchmark performs Set-of-Mark (SoM) preparation. For each scene image, unique numeric markers are overlaid on object instances in the ground-truth masks. Each object receives an integer ID and a centroid coordinate $(x, y)$.

Second, the pipeline extracts obstruction information from amodal masks. This includes contact points, obstruction ratios, and obstruction degree words such as “slightly,” “partially,” “mostly,” and “heavily obstructed.” These later serve as obstruction-aware cues during supervision.
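As a rough illustration of how an obstruction ratio and a degree word might be derived from masks, the sketch below uses one common definition: the hidden fraction of the amodal (full) object area. The mask representation (sets of pixel coordinates), the function names, and especially the cut-offs in `DEGREE_BINS` are assumptions; the source names the four degree words but does not give their thresholds.

```python
def obstruction_ratio(amodal_mask, visible_mask):
    """Fraction of the amodal object area that is hidden from view.

    Masks are sets of (row, col) pixel coordinates (hypothetical format).
    """
    if not amodal_mask:
        raise ValueError("amodal mask is empty")
    hidden = amodal_mask - visible_mask
    return len(hidden) / len(amodal_mask)


# Hypothetical cut-offs: upper bound of each bin and the word it maps to.
DEGREE_BINS = [
    (0.25, "slightly"),
    (0.50, "partially"),
    (0.75, "mostly"),
    (1.00, "heavily"),
]


def degree_word(ratio):
    """Map a ratio in [0, 1] to a degree word using the assumed bins."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    for upper, word in DEGREE_BINS:
        if ratio <= upper:
            return word
```

With a 20-pixel amodal mask of which 12 pixels are visible, the ratio is 0.4, which the assumed bins label "partially".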

Third, the benchmark constructs an object-centric obstruction graph for each potential target object. This is a directed graph in which nodes are object IDs and edges encode the relation “obstructed $\rightarrow$ obstructing.” If object $A$ is blocked by object $B$, the edge points from $A$ to $B$. This graph yields one or more obstruction paths beginning at the target and ending at accessible top-level obstructors.
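Path extraction from such a graph can be sketched as a depth-first walk from the target along "obstructed to obstructing" edges, stopping at objects that nothing blocks. The function name `obstruction_paths` and the `blocked_by` dictionary format are hypothetical, and the sketch assumes cyclic graphs have already been filtered out.

```python
def obstruction_paths(target, blocked_by):
    """Enumerate all paths from the target to top-level obstructors.

    blocked_by (hypothetical format): object ID -> set of IDs obstructing it.
    An unobstructed target yields the single trivial path [target].
    """
    paths = []

    def walk(node, path):
        blockers = sorted(blocked_by.get(node, ()))
        if not blockers:
            paths.append(path)  # node is unobstructed: the path ends here
            return
        for blocker in blockers:
            if blocker in path:
                raise ValueError("cyclic obstruction")
            walk(blocker, path + [blocker])

    walk(target, [target])
    return paths
```

For a target 1 blocked by 2 and 3, with 2 blocked by 4, this returns the two paths `[1, 2, 4]` and `[1, 3]`, whose endpoints 4 and 3 are the top-level obstructors.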

Fourth, the benchmark establishes ID-name-coordinate association. GPT-4o generates object names from the SoM-labeled scene, producing triplets of the form $(\text{id}, \text{name}, (x, y))$, after which the names are human-refined.

The natural-language annotation pipeline is substantial. GPT-4o first generates short referring expressions, instructed to use spatial specifiers such as left, right, top, and bottom when necessary. Human annotators on Prolific then revise those names for accuracy and uniqueness. The reported annotation details are: 196 annotators; native or primary English speakers; from six English-speaking countries; historical approval rate $\ge 99\%$; each allowed to annotate up to 45 scenes; and an annotation task designed for 80 minutes to reduce fatigue. The paper reports 5,400 challenging images reviewed, 41,193 object names examined, 4,678 corrected images, and 17,261 revised object names. Annotators could mark an object as “indescribable object” if it was too heavily obstructed to support a reliable name; such samples are removed from the NLP setting.

A target-centric sample contains scene and view identity, target identity, obstruction graph information, one or more obstruction paths, the set of top-level obstructors, the set of objects the target depends on, graph depth and path count, difficulty label, pairwise obstruction relations with metadata, and, in the NLP setting, object names, coordinates, free-form target references, and expected reasoning and answer traces. A supplementary example includes a target object, obstruction paths, top_objects, depends_on, k_min, num_paths, new_difficulty, and pairwise metadata such as a relation like "3 occludes 4", a mask_ratio, and a contact point.
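To make the relationship between these fields concrete, the sketch below checks a sample record for internal consistency. The field semantics are assumptions inferred from the description above, not a documented schema: the key `paths` is a hypothetical name for the obstruction paths, `top_objects` is taken to be the set of path endpoints, `depends_on` the set of all non-target objects on any path, and `num_paths` the path count. The fields `k_min` and `new_difficulty` are left out because their exact definitions are not given here.

```python
def check_sample(sample):
    """Return a list of consistency problems for one target-centric sample.

    Assumed (hypothetical) keys: target, paths, top_objects, depends_on,
    num_paths. Each path is a list of object IDs starting at the target.
    """
    target = sample["target"]
    paths = sample["paths"]
    problems = []
    if any(path[0] != target for path in paths):
        problems.append("every path should start at the target")
    if sample["num_paths"] != len(paths):
        problems.append("num_paths does not match the number of paths")
    # Endpoints of non-trivial paths are the immediately removable objects.
    tops = {path[-1] for path in paths if len(path) > 1}
    if set(sample["top_objects"]) != tops:
        problems.append("top_objects should equal the path endpoints")
    # Everything on a path except the target must be cleared at some point.
    deps = {obj for path in paths for obj in path[1:]}
    if set(sample["depends_on"]) != deps:
        problems.append("depends_on should cover all objects on the paths")
    return problems
```

An empty list means the record is consistent under these assumptions; each string otherwise names the field that disagrees with the paths.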

4. Formal task formulation, difficulty taxonomy, and evaluation

The paper formalizes obstruction reasoning as a target-centric directed graph over the visible objects $\mathcal{O} = \{o_1, o_2, \ldots, o_N\}$ (Jiao et al., 28 Nov 2025). Given an image observation $I = (rgb, d)$ and a language instruction $T$, the goal is to reason about a referred target $o_t \in \mathcal{O}$. The target-centric obstruction graph is

$$G_t = (V_t, E_t)$$

where nodes include the target and all direct or indirect obstructors, and an edge $(o_i \rightarrow o_j) \in E_t$ means that object $o_i$ is obstructed by object $o_j$.

The set of ancestor objects of the target, namely all objects on obstruction paths originating at the target, is

$$\mathcal{A}(o_t) = \{\, o_j \in V_t \setminus \{o_t\} : G_t \text{ contains a directed path from } o_t \text{ to } o_j \,\}$$

The set of top-level obstructors, namely ancestor objects that are themselves unobstructed and hence immediately removable, is

$$\mathcal{T}(o_t) = \{\, o_j \in \mathcal{A}(o_t) : \nexists\, o_k \in V_t \text{ such that } (o_j \rightarrow o_k) \in E_t \,\}$$

The benchmark objective is then written as

$$f : (I, T) \mapsto \mathcal{T}(o_t)$$
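The ancestor set and the top-level obstructor set defined above translate directly into graph code. The following is a minimal sketch under stated assumptions: the graph is given as a hypothetical `blocked_by` dictionary (object ID mapped to the set of IDs obstructing it), the function names are illustrative, and the fallback to the target itself follows the benchmark's convention that an unobstructed target is its own answer.

```python
def ancestors(target, blocked_by):
    """All objects reachable from the target along obstruction edges."""
    seen, stack = set(), [target]
    while stack:
        for blocker in blocked_by.get(stack.pop(), ()):
            if blocker not in seen:
                seen.add(blocker)
                stack.append(blocker)
    return seen


def top_level_obstructors(target, blocked_by):
    """Ancestors that are themselves unobstructed, hence removable now."""
    return {o for o in ancestors(target, blocked_by) if not blocked_by.get(o)}


def expected_answer(target, blocked_by):
    """Top-level obstructors, or the target itself if nothing blocks it."""
    tops = top_level_obstructors(target, blocked_by)
    return sorted(tops) if tops else [target]
```

Objects that obstruct something else in the scene but do not lie on a path from the target (object 6 blocking object 5 in the test below, for instance) are correctly ignored, which is what makes the graph target-centric.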

The expected output has two parts: a <think>...</think> section containing the reasoning trace over obstruction paths, and an <answer>...</answer> section containing the final top-level removable object or objects, or the target itself if it is unobstructed. In the Oracle setting, answers are JSON lists of object IDs. In the NLP setting, each reasoning step should mention explicit coordinates, and the final answer follows a setting-specific format [format string not recoverable].
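For the Oracle setting, where the answer is a JSON list of object IDs inside <answer> tags preceded by a <think> reasoning section, a response parser might look like the sketch below. The function name `parse_oracle_output` and the choice to return `None` on any malformed response are assumptions, not part of the benchmark's published tooling.

```python
import json
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_oracle_output(text):
    """Return (reasoning, ids) for a well-formed response, else None."""
    think = THINK_RE.search(text)
    answer = ANSWER_RE.search(text)
    if think is None or answer is None:
        return None
    try:
        ids = json.loads(answer.group(1))
    except json.JSONDecodeError:
        return None
    # Accept only a list of integers (bool is excluded explicitly because
    # it is a subclass of int in Python).
    if not isinstance(ids, list):
        return None
    if not all(isinstance(i, int) and not isinstance(i, bool) for i in ids):
        return None
    return think.group(1).strip(), ids
```

Rejecting malformed output outright, rather than attempting repair, keeps the downstream metrics from silently crediting answers the model did not actually produce in the required format.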

UNOBench divides targets into four difficulty levels according to graph depth and number of paths:

Level | Definition | Interpretation
No-Occ | [not recoverable] | No obstruction
Easy | [not recoverable] | Single-path reasoning
Medium | [not recoverable] | Multi-path or shallow-depth reasoning
Hard | [not recoverable] | Deep or structurally complex reasoning

This taxonomy is central to the benchmark’s design because it distinguishes unobstructed targets from shallow obstruction cases and from genuinely multi-step dependency structures.

The synthetic scenes are split 7:1:2 into train, validation, and test; the real scenes are test only. The supplementary material reports the following counts by setting: Synthetic Oracle (SoM) has 67,945 train objects, 9,539 validation objects, and 8,863 test objects; Synthetic NLP has 61,690 train objects, 8,785 validation objects, and 8,526 test objects; the Real test set contains 2,232 objects for Oracle and 2,232 objects for NLP. The synthetic NLP test split further contains 1,993 No-Occ, 4,477 Easy, 1,936 Medium, and 180 Hard cases. The relatively small Hard split matters when interpreting reported results.

Evaluation proceeds at three levels. Outcome-level metrics compare the predicted top-obstructor set with the ground-truth set using SR-P, SR-R, and SR-F1. Object-level reasoning metrics measure whether the model recovers correct pairwise obstruction triplets using OP, OR, and OR-F1; a true positive requires both the correct objects and the correct obstruction direction. Path-level reasoning is evaluated with Multi-Path Normalized Edit Distance (MP_NED), which compares the set of predicted paths with the set of ground-truth paths.

[MP_NED defining equations not recoverable]

The supplementary material clarifies that dummy paths with a fixed penalty are added when the predicted and ground-truth path sets differ in size. Lower MP_NED is better. This metric is distinctive because it scores the fidelity of multi-step chain reconstruction rather than only the correctness of the final next-step action.
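The three evaluation levels can be illustrated with the sketch below. The set-level precision, recall, and F1 are standard. The MP_NED part is only one plausible reading and should not be taken as the paper's exact definition: it assumes each path pair is scored by Levenshtein distance divided by the longer path length, that predicted and ground-truth paths are matched one-to-one at minimum total cost, that unmatched paths are paired with a dummy at penalty 1.0, and that the total is averaged over the larger path count. All function names are hypothetical.

```python
from itertools import permutations


def prf(predicted, truth):
    """Set-level precision, recall, and F1.

    Works for top-obstructor IDs and for directed relation tuples
    (obstructed, obstructing), where direction must match to count.
    """
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


def edit_distance(a, b):
    """Levenshtein distance between two sequences of object IDs."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def ned(a, b):
    """Edit distance normalized by the longer path (assumed normalization)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def mp_ned(pred_paths, gt_paths, dummy_penalty=1.0):
    """Assumed multi-path score: best one-to-one matching, dummy-padded."""
    size = max(len(pred_paths), len(gt_paths))
    if size == 0:
        return 0.0
    pred = list(pred_paths) + [None] * (size - len(pred_paths))
    gt = list(gt_paths) + [None] * (size - len(gt_paths))

    def cost(p, g):
        return dummy_penalty if p is None or g is None else ned(p, g)

    # Brute-force matching; adequate for the handful of paths per target.
    best = min(sum(cost(p, g) for p, g in zip(perm, gt)) for perm in permutations(pred))
    return best / size
```

The brute-force search over permutations grows factorially with the number of paths, so a real evaluator would more likely use an assignment solver; it is used here only to keep the sketch dependency-free.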

5. Use in UNOGrasp and empirical findings

UNOGrasp is the model introduced to exploit UNOBench’s supervision. It is based on Qwen2.5-VL-3B, trained only on a portion of the synthetic UNOBench data, and designed around a target-centric obstruction graph rather than a full scene graph (Jiao et al., 28 Nov 2025). Its reasoning process is benchmark-aligned: ground the target from language in the RGB image, determine whether it is obstructed, trace one or more obstruction paths outward from the target, identify the accessible top-level obstructors, and output them as the next removal candidates.

UNOBench provides obstruction-aware visual cues such as contact points, obstruction ratios, and obstruction degree words. The paper highlights obstruction ratio as especially useful. In supervised fine-tuning, each sample consists of the RGB observation, the free-form target instruction, the reasoning chain over the target's obstruction paths, and the final answer listing the top-level obstructors. Reinforcement fine-tuning then uses a reward that combines a formatting reward with a task reward: the formatting reward checks the <think> and <answer> structure, and the task reward is the set-level IoU between predicted and ground-truth top-obstructor sets. This is a benchmark-driven training design because the annotations directly support verifiable rewards without human preference labeling.
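A verifiable reward of this shape can be sketched as follows. The set-level IoU is the standard intersection-over-union of two sets. Everything else is an assumption made for illustration: the binary 0/1 formatting score, the exact regular expression, the weighted-sum combination with default weights of 1.0, and the convention that two empty sets score 1.0 are not confirmed by the source.

```python
import re

# Assumed format: a <think> block followed by an <answer> block, nothing else.
FORMAT_RE = re.compile(r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*", re.DOTALL)


def format_reward(text):
    """1.0 if the response has the expected two-section structure, else 0.0."""
    return 1.0 if FORMAT_RE.fullmatch(text) else 0.0


def set_iou(predicted, truth):
    """Intersection over union of predicted and ground-truth ID sets."""
    predicted, truth = set(predicted), set(truth)
    union = predicted | truth
    return len(predicted & truth) / len(union) if union else 1.0


def total_reward(text, predicted, truth, format_weight=1.0, task_weight=1.0):
    """Hypothetical combination: weighted sum of the two reward terms."""
    return format_weight * format_reward(text) + task_weight * set_iou(predicted, truth)
```

Because IoU penalizes both missing and superfluous objects, a model that lists every visible object to guarantee recall is not rewarded, which fits a task where only immediately removable obstructors are valid answers.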

The reported baselines are Gemini Robotics-ER 1.5 with base prompting and a 3-shot ICL variant, Qwen2.5-VL-3B with ICL, Qwen SFT trained on UNOBench but without <think> reasoning supervision, and UNOGrasp itself. On the synthetic test set in the Oracle (SoM) setting, UNOGrasp achieves 94.8 No-Occ SR, 83.3 Easy SR-F1, 69.1 Medium SR-F1, and 54.5 Hard SR-F1. Its MP_NED is 0.03 for No-Occ, 0.11 for Easy, 0.37 for Medium, and 0.51 for Hard. In the synthetic NLP setting, UNOGrasp achieves 92.5 No-Occ SR, 74.9 Easy SR-F1, 59.7 Medium SR-F1, and 37.2 Hard SR-F1, again with the lowest MP_NED among models with reasoning traces. The paper explicitly notes that on the synthetic hard split, UNOGrasp improves over Qwen2.5-VL (SFT) by +20.2% SR-F1.

Object-level reasoning results reinforce the same pattern. On synthetic Oracle, overall OR-F1 is 75.3 for UNOGrasp, compared with 50.6 for Gemini, 62.3 for Gemini ICL, and 21.1 for Qwen ICL. By difficulty, UNOGrasp records 82.6 on Easy, 62.0 on Medium, and 48.7 on Hard. On synthetic NLP, overall OR-F1 is 57.2 for UNOGrasp, whereas the generalist baselines are near zero. The paper treats this as evidence that generalist models may sometimes predict a plausible final object while failing at grounded multi-step reasoning with names and coordinates.

The real UNOBench subset is a generalization test because UNOGrasp is trained only on synthetic data. In Oracle, UNOGrasp achieves 72.5 No-Occ SR, 77.2 Easy SR-F1, 64.4 Medium SR-F1, and 63.9 Hard SR-F1. The paper highlights the real hard split, where UNOGrasp exceeds Qwen SFT by +38.0% SR-F1. In real NLP, UNOGrasp reaches 70.0 No-Occ SR, 71.3 Easy SR-F1, 47.3 Medium SR-F1, and 40.3 Hard SR-F1. Object-level reasoning on the real set shows overall OR-F1 of 71.4 in Oracle and 49.1 in NLP for UNOGrasp.

The ablation studies are especially informative about the benchmark’s annotation design. On the synthetic set overall, baseline SFT gives SR-F1 74.7, OR-F1 71.9, and MP_NED 0.220. Adding contact point yields 75.3, 72.5, and 0.216; adding a degree word yields 75.1, 72.5, and 0.217; adding occlusion ratio yields the best values, SR-F1 76.4, OR-F1 73.3, and MP_NED 0.210. On Hard Oracle cases, the supplementary reports +5.8% SR-F1 from the occlusion-ratio cue over the no-obstruction-information baseline. Starting from the best SFT model, reinforcement fine-tuning further improves Easy SR-F1 from 81.8 to 83.3, Medium from 67.1 to 69.1, Hard from 50.1 to 54.5, and overall from 76.4 to 78.2. Even though the reward supervises only the final answer, OR-F1 improves from 73.3 to 75.3, and MP_NED improves from 0.210 to 0.201.

The paper also connects benchmark performance to physical execution. In 30 real-world scenarios with 25 distinct objects, using a UR5e robot, a top-down ZED 2 camera, GroundedSAM, and GraspNet, the reported real-robot success rates are: Gemini Robotics-ER 1.5 at 80% / 30% / 10% / 40% for Easy / Medium / Hard / Average; Qwen2.5-VL at 10% / 0% / 0% / 3%; and UNOGrasp at 80% / 30% / 40% / 50%. UNOGrasp matches Gemini on Easy and Medium but exceeds it by 30 percentage points on Hard (40% versus 10%). This suggests that the largest gains from benchmark-aligned training appear in the deep, multi-path cases that UNOBench was explicitly designed to measure.

6. Limitations, misconceptions, and benchmarking context

The paper implies several limitations of UNOBench (Jiao et al., 28 Nov 2025). The benchmark is largely single-view, and obstruction is defined “when viewed from the camera viewpoint,” so some accessibility judgments are viewpoint-dependent. The authors list multi-view perception as future work. The task also focuses on the existence of obstruction rather than its manipulation difficulty. The action plan is based on whether an obstruction exists, not on how severe or mechanically difficult it is to remove, and the authors note that different obstructions may impede grasping differently.

The NLP setting introduces an annotation bottleneck. Large-scale human correction was required to produce reliable object descriptions, and heavily occluded objects could be marked “indescribable object” and removed from the language-grounding setting. This means the NLP subset excludes some extremely difficult cases. The benchmark also filters out cyclic or implausible obstruction patterns and excludes objects with extremely low or extremely high obstruction ratios. These choices improve cleanliness, but they also limit coverage of edge cases.

Even the benchmark-trained model still fails in scenes with objects that touch without truly obstructing, dense clusters of visually similar items, severe object similarity, and high-contrast imaging conditions. More generally, UNOBench does not directly evaluate low-level grasp execution inside the benchmark itself. It evaluates target grounding, accessibility judgment, obstruction graph reconstruction, and next-object selection; the robotic execution results are a separate validation layer.

Within the broader literature on benchmarking, UNOBench occupies a different role from general benchmark systems such as BenchBench, Omnibenchmark, and OpenPerf. BenchBench evaluates the ability of models to design benchmarks through domain cards, quota-controlled generation, panel-based validation, and designer–answerer matrices (Zheng et al., 21 Mar 2026). Omnibenchmark is an alpha-stage benchmarking system for bioinformatics built around YAML configuration, dynamic workflow generation with Snakemake, reproducible environments, and S3-compatible storage (Mallona et al., 2024). OpenPerf is a framework for the open-source ecosystem that combines task benchmarks, index benchmarks, and standard benchmarks across time series, text, and graph data (Bi et al., 2023). By contrast, UNOBench is a domain-specific benchmark and training resource centered on a single embodied reasoning problem: unobstructing-and-grasping under language-conditioned target selection. This suggests that UNOBench belongs to the benchmark-instance layer rather than the benchmark-platform layer.

The benchmark’s novelty lies in the combination of several elements that the paper treats as jointly necessary: language-grounded object references, target-centric obstruction graphs, multi-step obstruction paths tied to action sequencing, rich obstruction metadata such as ratios and contact points, three-level evaluation spanning final actions, pairwise relations, and full paths, and the inclusion of both synthetic and real benchmark components. Its broader significance follows from that design. Rather than treating cluttered grasping as a localization problem or a generic planning problem, UNOBench formalizes it as reasoning over target-centric obstruction paths to determine the immediate valid removal actions. For work on visually grounded reasoning, clutter-aware planning, and sim-to-real transfer, that formulation is the benchmark’s central contribution.
