
SpelkeBench: Motion-Grounded Segmentation

Updated 14 April 2026
  • SpelkeBench is a motion-grounded evaluation benchmark that defines 'Spelke objects' as image regions with physically coherent co-movement.
  • It comprises 500 curated natural-image scenes with hand-annotated segments, enabling both point-prompted and automatic segmentation tasks.
  • The benchmark provides rigorous metrics and baseline comparisons with models like SpelkeNet and SAM2, facilitating research in physical object manipulation and scene understanding.

SpelkeBench is a motion-grounded evaluation benchmark for object segmentation, oriented around the concept of “Spelke objects”: image regions defined by physically coherent co-movement under external forces. Distinct from conventional semantic- or category-based segmentation datasets, SpelkeBench comprises 500 natural-image scenes with hand-annotated “Spelke segments” based on physical cohesion rather than visual or textual criteria. The benchmark supports the evaluation of algorithms that aim to recover physical object boundaries as understood from developmental psychology and robotics, providing mathematically defined metrics, competitive baselines, and illustrative applications for both scene understanding and interactive manipulation tasks (Venkatesh et al., 21 Jul 2025).

1. Construction of the SpelkeBench Dataset

SpelkeBench’s image set is sourced from two distinct corpora: (i) EntitySeg [Qi et al., ICCV ’23], which contains densely annotated internet images, and (ii) OpenX-Embodiment [O’Neill et al., arXiv ’23], which provides egocentric robot-interaction frames. For EntitySeg, a three-stage filtering procedure removes “stuff” (e.g., sky, terrain), excludes non-movable “thing” categories (e.g., fixtures), and ensures coverage diversity via manual curation, producing a core of 500 “Spelke-consistent” natural images.

Additional hand annotations provide Spelke segments for 50 robot-in-the-loop frames from OpenX-Embodiment, reserved for future evaluations. Annotators are instructed to identify pixel groups that would move together as a rigid body if poked, disregarding semantic labels and relying instead on an intuition for physical co-movement. Each image's masks are reviewed by two independent annotators, with any conflicts resolved by a third expert. Formally, a Spelke segment $S \subset \Omega$ (where $\Omega$ is the image grid) is a maximal set of pixels satisfying:

  • Cohesion: pixels in $S$ move together under external force;
  • Continuity: $S$ is a single connected component;
  • Solidity: $S$ does not leak through solid obstacles;
  • Contact: $S$ excludes pixels remaining fixed under all plausible local pokes.
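
Of the four criteria, continuity is the one that can be checked mechanically on a binary mask. The following is a minimal sketch of such a check, not part of the benchmark tooling; the function name `is_single_component` and the choice of 8-connectivity are assumptions made for illustration.

```python
import numpy as np
from scipy import ndimage


def is_single_component(mask: np.ndarray) -> bool:
    """Return True if the binary mask forms exactly one connected region."""
    # 8-connectivity is an assumption; the source does not specify it.
    structure = np.ones((3, 3), dtype=int)
    _, num_components = ndimage.label(mask.astype(bool), structure=structure)
    return num_components == 1


# A solid 3x3 block is one component.
connected = np.zeros((5, 5), dtype=bool)
connected[1:4, 1:4] = True

# Two pixels in opposite corners form two components.
split = np.zeros((5, 5), dtype=bool)
split[0, 0] = True
split[4, 4] = True

print(is_single_component(connected), is_single_component(split))
```

An empty mask yields zero components and is therefore rejected as well.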

2. Benchmarking Tasks and Evaluation Metrics

SpelkeBench encompasses two segmentation tasks and a motion-based co-movement assessment:

2.1. Point-Prompted Segmentation

Given an image $I$ and a single “poke” point $p \in \Omega$, the objective is to recover the ground-truth Spelke mask $M^\star$ containing $p$. The key metrics are:

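  • Average Recall (AR):

$$\mathrm{AR} = \frac{1}{|T|} \sum_{\tau \in T} \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\mathrm{IoU}(\hat{M}_i, M_i^\star) \ge \tau\right]$$

averaged over IoU thresholds $\tau \in T$ ranging from 0.50 to 0.95, where $N$ is the number of prompted points.

  • Mean IoU (mIoU):

$$\mathrm{mIoU} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{IoU}(\hat{M}_i, M_i^\star)$$

where

$$\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

and $\hat{M}_i$ is the prediction with the highest IoU against $M_i^\star$.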
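
The two point-prompted metrics can be sketched in a few lines of NumPy. This is an illustrative implementation rather than the benchmark's official evaluation code: the function names are hypothetical, and the 0.05 threshold step is an assumption borrowed from the COCO convention (the text above only gives the 0.50 to 0.95 range).

```python
import numpy as np


def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection over union of two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(a, b).sum() / union)


def point_prompt_metrics(pred_masks, gt_masks, thresholds=None):
    """Return (AR, mIoU) over paired predicted and ground-truth masks."""
    if thresholds is None:
        # Assumed step of 0.05, giving 10 thresholds from 0.50 to 0.95.
        thresholds = np.linspace(0.50, 0.95, 10)
    ious = np.array([iou(p, g) for p, g in zip(pred_masks, gt_masks)])
    # Recall at each threshold: share of prompts whose IoU reaches it.
    recall_per_threshold = [(ious >= t).mean() for t in thresholds]
    return float(np.mean(recall_per_threshold)), float(ious.mean())


# Toy example: one perfect prediction and one covering half the target.
gt = np.zeros((4, 4), dtype=bool)
gt[0:2, :] = True
half = np.zeros((4, 4), dtype=bool)
half[0:1, :] = True

ar, miou = point_prompt_metrics([gt.copy(), half], [gt, gt])
print(ar, miou)
```

In the toy example the second prediction has an IoU of exactly 0.5, so it counts as recalled only at the lowest threshold.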

2.2. Automatic (Unprompted) Segmentation

Here, models must identify all Spelke segments in an image without interaction. Evaluation uses:

  • Average Precision (AP): Fraction of predicted segments $\hat{M}$ that match a ground-truth (GT) segment with $\mathrm{IoU} \ge \tau$, averaged over the thresholds $\tau$.
  • Average Recall (AR): Fraction of GT segments matched by any predicted segment.
  • $F_1$-Score: $F_1 = \frac{2PR}{P + R}$, the harmonic mean of precision $P$ and recall $R$.
  • mIoU: Defined as above.

Hungarian algorithm-based matching is applied for GT-predicted mask assignment.
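
A minimal sketch of Hungarian matching at a single IoU threshold follows, using SciPy's `linear_sum_assignment`. The helper names and the single-threshold simplification are assumptions for illustration; the benchmark averages over thresholds, and its exact matching code is not reproduced here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection over union of two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0


def match_and_score(pred_masks, gt_masks, tau=0.5):
    """One-to-one matching, then precision, recall and F1 at threshold tau."""
    if len(pred_masks) == 0 or len(gt_masks) == 0:
        return 0.0, 0.0, 0.0
    iou_matrix = np.array(
        [[mask_iou(p, g) for g in gt_masks] for p in pred_masks]
    )
    # Hungarian assignment that maximizes the total IoU of matched pairs.
    rows, cols = linear_sum_assignment(iou_matrix, maximize=True)
    true_positives = int((iou_matrix[rows, cols] >= tau).sum())
    precision = true_positives / len(pred_masks)
    recall = true_positives / len(gt_masks)
    if true_positives == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy scene: two GT segments, two exact predictions, one spurious pixel.
gt_top = np.zeros((4, 4), dtype=bool)
gt_top[0:2, :] = True
gt_bottom = np.zeros((4, 4), dtype=bool)
gt_bottom[2:4, :] = True
spurious = np.zeros((4, 4), dtype=bool)
spurious[3, 3] = True

precision, recall, f1 = match_and_score(
    [gt_top.copy(), gt_bottom.copy(), spurious], [gt_top, gt_bottom]
)
print(precision, recall, f1)
```

One-to-one assignment prevents a single ground-truth segment from being credited to several overlapping predictions, which is why the spurious mask lowers precision but not recall.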

2.3. Motion-Based Co-Movement Analysis

The dataset also supports analysis of co-movement via internal SpelkeNet maps:

  • Motion-affordance map $A(x)$: Probability that pixel $x$ moves under a poke, computed via summation over flow tokens with motion vectors above threshold $\delta$.
  • Expected-displacement map $D(x)$: Average predicted displacement at pixel $x$.
  • Dot-correlation for co-movement: For $K$ virtual-poke experiments at location $p$, with poke vectors $d_k$ and resulting expected-displacement maps $D_k$,

$$C_p(x) = \frac{1}{K} \sum_{k=1}^{K} d_k \cdot D_k(x)$$

Thresholding $C_p$ produces co-movement-segment regions.
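
The dot-correlation step can be illustrated with synthetic data. In the sketch below the displacement maps are hand-built stand-ins for model output, and the function name, array layout, and fixed threshold of 0.5 are all assumptions; none of it reflects SpelkeNet's actual interface.

```python
import numpy as np


def comovement_map(poke_vectors, displacement_maps):
    """Average, over pokes, the dot product of poke vector and displacement.

    poke_vectors: array of shape (K, 2).
    displacement_maps: array of shape (K, H, W, 2), one map per poke.
    Returns an (H, W) co-movement score map.
    """
    pokes = np.asarray(poke_vectors, dtype=float)
    disp = np.asarray(displacement_maps, dtype=float)
    # Per-pixel dot product between poke k and the displacement it induces.
    dots = np.einsum("kc,khwc->khw", pokes, disp)
    return dots.mean(axis=0)


# Synthetic scene: a rigid object fills the left half of a 4x4 image.
object_mask = np.zeros((4, 4), dtype=bool)
object_mask[:, :2] = True

# Four unit pokes; the object follows each poke, the background stays still.
pokes = np.array([[1, 0], [0, 1], [-1, 0], [0, -1]], dtype=float)
maps = np.zeros((len(pokes), 4, 4, 2))
for k, poke in enumerate(pokes):
    maps[k][object_mask] = poke

scores = comovement_map(pokes, maps)
segment = scores > 0.5  # fixed threshold chosen for this toy example
print(segment.astype(int))
```

Pixels that consistently move along the poke direction score near the squared poke magnitude, while static pixels score near zero, so a threshold separates the two groups.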

3. Baselines, SpelkeNet, and Quantitative Results

SpelkeBench evaluates several models across all tasks:

  • Supervised reference: SAM2 (Hiera-Large) point-prompted segmenter [Ravi et al., arXiv ’24].
  • Self-supervised: DINOv1, DINOv2 (attention map thresholding) [Caron et al., ICCV ’21; Oquab et al., NeurIPS ’23].
  • Counterfactual World Models (CWM): Patch-copy interventions with RAFT flow and clustering [Bear et al., ECCV ’24].
  • Unprompted methods: CutLER [Wang et al., CVPR ’23], ProMerge [Li & Shin, ECCV ’24].

Key results:

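| Model | AR (point) | mIoU (point) | AP (auto) | AR (auto) | F₁ (auto) | mIoU (auto) |
|---|---|---|---|---|---|---|
| DINOv2-G/14 | 0.2254 | 0.4553 | - | - | - | - |
| CWM | 0.3271 | 0.4807 | - | - | - | - |
| SAM2 | 0.4816 | 0.6225 | 0.11 | 0.62 | 0.17 | 0.68 |
| ProMerge | - | - | 0.42 | 0.34 | 0.36 | 0.43 |
| SpelkeNet | 0.5411 | 0.6811 | 0.35 | 0.46 | 0.38 | 0.57 |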

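On point-prompted segmentation, SpelkeNet achieves the highest AR and mIoU, ahead of SAM2 and all tested self-/unsupervised baselines. On unprompted segmentation it attains the highest F₁ (0.38), although ProMerge reports higher AP (0.42 vs. 0.35) and SAM2 reports higher AR (0.62 vs. 0.46) and mIoU (0.68 vs. 0.57). Qualitative evaluation (see Figs. 7–9 of Venkatesh et al., 21 Jul 2025) demonstrates more physically coherent segmentations.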

In a downstream object-editing benchmark (3DEditBench), plugging SpelkeNet masks into multiple pipelines yields improvements in Edit-Adherence (EA) of 10–20% absolute, as well as superior MSE, PSNR, LPIPS, and SSIM compared to SAM2.

4. Practical Usage and Methodological Considerations

SpelkeBench is released with COCO-style JSON annotations and accompanying images, accessible via https://neuroailab.github.io/spelke_net. A standard workflow uses pycocotools for mask loading; a code snippet is provided in the dataset documentation.

Because annotation is labor-intensive and the dataset is correspondingly small, SpelkeBench is intended strictly for evaluation; training end-to-end segmentation models directly on its images and masks is not recommended. There is no official train/test split; for development purposes such as threshold selection, users typically apply k-fold cross-validation or a random 80/20 split. When applying data augmentations for training downstream models, mask-image alignment should be preserved by applying each transformation identically to the image and its masks.
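
The split and alignment advice above can be sketched as follows. Both helpers are hypothetical and use only the standard library and NumPy; the image IDs are placeholders (`range(500)`), not the identifiers used in the released annotations.

```python
import random

import numpy as np


def split_ids(image_ids, dev_fraction=0.8, seed=0):
    """Deterministically shuffle IDs and split them into two disjoint lists."""
    ids = sorted(image_ids)
    random.Random(seed).shuffle(ids)
    cut = int(round(len(ids) * dev_fraction))
    return ids[:cut], ids[cut:]


def flip_pair(image: np.ndarray, mask: np.ndarray):
    """Apply the same horizontal flip to an (H, W, C) image and (H, W) mask."""
    return image[:, ::-1].copy(), mask[:, ::-1].copy()


# Placeholder IDs standing in for the 500 benchmark images.
dev_ids, held_out_ids = split_ids(range(500))
print(len(dev_ids), len(held_out_ids))

# The flipped mask still covers the same image content as before.
image = np.arange(2 * 3 * 3).reshape(2, 3, 3)
mask = np.zeros((2, 3), dtype=bool)
mask[0, 0] = True
flipped_image, flipped_mask = flip_pair(image, mask)
```

Fixing the seed keeps the split reproducible across runs, and routing image and mask through one function makes it harder to transform one without the other.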

5. Downstream Applications and Limitations

5.1. Applications

  • Physical object manipulation and planning: Spelke segments better align with plausible 3D editing targets and yield more effective grasp and manipulation proposals in simulated robotics.
  • Support-relation inference: Virtual poke experiments on base objects can identify all physically supported segments, directly reflecting support hierarchies.
  • Material property cues: Variation in the spatial spread of motion-affordance maps — for example, between rigid and cloth objects — suggests the utility of Spelke segment statistics for inferring deformability.

5.2. Limitations

  • Unprompted segment discovery is less reliable than point-prompted variants, with a tendency toward false merges if motion rollouts are noisy or the aggregation threshold is miscalibrated.
  • The definition of Spelke segments assumes object rigidity; articulated or liquid objects can yield non-cohesive “co-movement” statistics.
  • The current benchmark focuses on binary co-movement rather than support for hierarchical grouping or articulation, suggesting future research directions.

6. Significance and Context

SpelkeBench represents a shift from semantic- or label-driven segmentation to segmentation based on quantifiable, category-agnostic physical properties, formalizing the concept of “Spelke objects” in computer vision and robotics evaluation. It supplies rigorous, mathematically grounded metrics accompanied by extensive baseline comparisons, establishing causal counterfactual probing (as implemented in SpelkeNet) as a state-of-the-art method for discovering physically meaningful segmentation units. The benchmark thereby supports research into vision models aligning more closely with human physical reasoning and is foundational for downstream tasks such as manipulation, planning, and relational reasoning (Venkatesh et al., 21 Jul 2025).
