SpelkeBench: Motion-Grounded Segmentation

Updated 14 April 2026

SpelkeBench is a motion-grounded evaluation benchmark that defines 'Spelke objects' as image regions with physically coherent co-movement.
It comprises 500 curated natural-image scenes with hand-annotated segments, enabling both point-prompted and automatic segmentation tasks.
The benchmark provides rigorous metrics and baseline comparisons with models like SpelkeNet and SAM2, facilitating research in physical object manipulation and scene understanding.

SpelkeBench is a motion-grounded evaluation benchmark for object segmentation, oriented around the concept of “Spelke objects”: image regions defined by physically coherent co-movement under external forces. Distinct from conventional semantic- or category-based segmentation datasets, SpelkeBench comprises 500 natural-image scenes with hand-annotated “Spelke segments” based on physical cohesion rather than visual or textual criteria. The benchmark supports the evaluation of algorithms that aim to recover physical object boundaries as understood from developmental psychology and robotics, providing mathematically defined metrics, competitive baselines, and illustrative applications for both scene understanding and interactive manipulation tasks (Venkatesh et al., 21 Jul 2025).

1. Construction of the SpelkeBench Dataset

SpelkeBench’s image set is sourced from two distinct corpora: (i) EntitySeg [Qi et al., ICCV ’23] containing dense semantic internet images, and (ii) OpenX-Embodiment [O’Neill et al., arXiv ’23] providing egocentric robot-interaction frames. From EntitySeg, a stringent three-stage filtering removes “stuff” (e.g., sky, terrain), excludes non-movable “thing” categories (e.g., fixtures), and ensures coverage diversity via manual curation to produce a core of 500 “Spelke-consistent” natural images.

Additional hand-annotations provide Spelke segments for 50 robot-in-the-loop frames in OpenX-Embodiment, reserved for future evaluations. Annotators are instructed to identify pixel groups that would move together as a rigid body if poked, disregarding semantic labels and instead applying an intuition for physical co-movement. Each image mask group is reviewed by two independent annotators with any conflicts resolved by a third expert. Formally, a Spelke segment $S \subset \Omega$ (where $\Omega$ is the image grid) is a maximal set of pixels satisfying:

Cohesion: pixels in $S$ move together under external force;
Continuity: $S$ is a single connected component;
Solidity: $S$ does not leak through solid obstacles;
Contact: $S$ excludes pixels remaining fixed under all plausible local pokes.
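Of the four criteria, Continuity is the one that can be checked mechanically on an annotated mask. The following is a minimal sketch of such a check, not part of the benchmark tooling. It assumes masks are given as 2D grids of 0/1 values and uses 4-connectivity, which the source does not specify; the function name `is_single_component` is hypothetical.

```python
from collections import deque

def is_single_component(mask):
    """Return True if the truthy cells of a 2D mask form exactly one
    4-connected region (4-connectivity is an assumption of this sketch)."""
    height, width = len(mask), len(mask[0])
    pixels = {(r, c) for r in range(height) for c in range(width) if mask[r][c]}
    if not pixels:
        return False  # an empty mask is not a segment
    start = next(iter(pixels))
    seen = {start}
    queue = deque([start])
    # Breadth-first flood fill from an arbitrary foreground pixel.
    while queue:
        r, c = queue.popleft()
        for neighbour in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if neighbour in pixels and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    # Connected iff the fill reached every foreground pixel.
    return len(seen) == len(pixels)
```

A mask that fails this check would need to be split into separate segments before it could satisfy the Continuity criterion.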

2. Benchmarking Tasks and Evaluation Metrics

SpelkeBench encompasses two segmentation tasks and a motion-based co-movement assessment:

2.1. Point-Prompted Segmentation

Given an image $I$ and a single “poke” point $p\in\Omega$, the objective is to recover the ground-truth Spelke mask $M^\star$ containing $p$. The key metrics are:

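Average Recall (AR):

$$\mathrm{AR} = \frac{1}{|\mathcal{T}|}\sum_{\tau\in\mathcal{T}} \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\left[\mathrm{IoU}(\hat{M}_i, M_i^\star) \ge \tau\right]$$

averaged over IoU thresholds $\tau\in\mathcal{T}$ ranging from 0.50 to 0.95, where $N$ is the number of evaluated prompts.

Mean Intersection over Union (mIoU):

$$\mathrm{mIoU} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{IoU}(\hat{M}_i, M_i^\star)$$

where

$$\mathrm{IoU}(A, B) = \frac{|A\cap B|}{|A\cup B|}$$

and $\hat{M}_i$ is the prediction with the highest IoU against $M_i^\star$.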
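The point-prompted metrics can be sketched in a few lines. This is an illustrative implementation, not the official evaluation code. It assumes masks are represented as sets of `(row, col)` pixels and that the thresholds follow the COCO convention of 0.05 steps; the source only states the 0.50 to 0.95 range. All function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two masks given as sets of (row, col) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def best_iou(candidates, gt):
    """Highest IoU of any candidate prediction against the ground-truth mask."""
    return max((iou(c, gt) for c in candidates), default=0.0)

# Assumption: COCO-style thresholds 0.50, 0.55, ..., 0.95.
THRESHOLDS = [k / 100 for k in range(50, 100, 5)]

def average_recall(best_ious, thresholds=THRESHOLDS):
    """Fraction of prompts whose best IoU clears each threshold, averaged over thresholds."""
    hits = sum(1 for t in thresholds for v in best_ious if v >= t)
    return hits / (len(thresholds) * len(best_ious))

def mean_iou(best_ious):
    """Mean of the per-prompt best IoU values."""
    return sum(best_ious) / len(best_ious)

# Usage: two prompts with best IoUs of 0.75 and 0.40.
scores = [0.75, 0.40]
print(average_recall(scores), mean_iou(scores))
```

In this example the first prompt clears six of the ten thresholds and the second clears none, which is why AR is much lower than mIoU.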

2.2. Automatic (Unprompted) Segmentation

Here, models must identify all Spelke segments in an image without interaction. Evaluation uses:

Average Precision (AP): Fraction of predicted segments $\hat{M}$ that match a ground-truth (GT) segment with $\mathrm{IoU}\ge\tau$, averaged over IoU thresholds $\tau$.
Average Recall (AR): Fraction of GT segments matched by any predicted segment.
$F_1$-Score: $F_1 = 2PR/(P+R)$, the harmonic mean of precision $P$ and recall $R$.
mIoU: Defined as above.

Hungarian algorithm-based matching is applied for GT-predicted mask assignment.
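The matching step and the resulting precision, recall, and $F_1$ at a single IoU threshold can be illustrated as follows. To stay dependency-free, this sketch replaces the Hungarian algorithm with an exhaustive search over assignments. It finds the same optimal one-to-one matching but is only practical for a handful of masks; a real evaluation would use a proper solver such as `scipy.optimize.linear_sum_assignment`. Masks are assumed to be sets of `(row, col)` pixels, and all function names are hypothetical.

```python
from itertools import permutations

def iou(a, b):
    """Intersection over union of two masks given as sets of (row, col) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def optimal_matching(preds, gts):
    """One-to-one assignment maximising total IoU, found by exhaustive search.
    Returns a list of (pred_index, gt_index, iou) triples."""
    swap = len(preds) > len(gts)
    short, long_ = (gts, preds) if swap else (preds, gts)
    best_total, best_pairs = -1.0, []
    # Try every way of assigning each item of the shorter list to a distinct
    # item of the longer list.
    for perm in permutations(range(len(long_)), len(short)):
        pairs = list(enumerate(perm))
        total = sum(iou(short[i], long_[j]) for i, j in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    matches = []
    for i, j in best_pairs:
        p, g = (j, i) if swap else (i, j)
        matches.append((p, g, iou(preds[p], gts[g])))
    return matches

def precision_recall_f1(preds, gts, threshold=0.5):
    """Precision, recall, and F1 at one IoU threshold after optimal matching."""
    matches = optimal_matching(preds, gts)
    true_pos = sum(1 for _, _, v in matches if v >= threshold)
    precision = true_pos / len(preds) if preds else 0.0
    recall = true_pos / len(gts) if gts else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Averaging these quantities over the threshold range would give benchmark-style AP and AR figures under the assumptions above.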

2.3. Motion-Based Co-Movement Analysis

The dataset also supports analysis of co-movement via internal SpelkeNet maps:

Motion-affordance map $A(x)$: Probability that pixel $x$ moves under a poke, computed via summation over flow tokens with motion vectors above threshold $\delta$.
Expected-displacement map $D(x)$: Average predicted displacement at pixel $x$.
Dot-correlation for co-movement: For $K$ virtual-poke experiments at location $p$,

$$C_p(x) = \frac{1}{K}\sum_{k=1}^{K} \langle v_k, D_k(x)\rangle$$

where $v_k$ is the $k$-th poke vector applied at $p$ and $D_k$ is the resulting expected-displacement map.

Thresholding $C_p$ produces co-movement-segment regions.
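One plausible reading of the dot-correlation described above is the average, over virtual pokes, of the dot product between the applied poke vector and the predicted displacement at each pixel. The sketch below illustrates that reading on plain Python lists. The displacement maps are hypothetical stand-ins for SpelkeNet outputs, and the function names and threshold value are illustrative assumptions rather than the paper's settings.

```python
def comovement_map(poke_vectors, displacement_maps):
    """Average dot product between each poke vector and the predicted
    displacement at every pixel.

    poke_vectors: list of K (dx, dy) tuples applied at the probe location.
    displacement_maps: list of K grids; displacement_maps[k][r][c] is the
        (dx, dy) displacement predicted at pixel (r, c) for poke k.
    """
    num_pokes = len(poke_vectors)
    height, width = len(displacement_maps[0]), len(displacement_maps[0][0])
    corr = [[0.0] * width for _ in range(height)]
    for (px, py), dmap in zip(poke_vectors, displacement_maps):
        for r in range(height):
            for c in range(width):
                dx, dy = dmap[r][c]
                corr[r][c] += (px * dx + py * dy) / num_pokes
    return corr

def threshold_map(corr, tau):
    """Binary co-movement mask: True where the correlation exceeds tau."""
    return [[value > tau for value in row] for row in corr]

# Usage: a 1x2 image where the left pixel follows both pokes and the right stays still.
pokes = [(1.0, 0.0), (0.0, 1.0)]
maps = [[[(1.0, 0.0), (0.0, 0.0)]], [[(0.0, 1.0), (0.0, 0.0)]]]
mask = threshold_map(comovement_map(pokes, maps), tau=0.5)
```

Pixels that consistently move in the direction of the poke accumulate high correlation, while pixels that stay still or move in unrelated directions average toward zero.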

3. Baselines, SpelkeNet, and Quantitative Results

SpelkeBench evaluates several models across all tasks:

Supervised reference: SAM2 (Hiera-Large) point-prompted segmenter [Ravi et al., arXiv ’24].
Self-supervised: DINOv1, DINOv2 (attention map thresholding) [Caron et al., ICCV ’21; Oquab et al., TMLR ’24].
Counterfactual World Models (CWM): Patch-copy interventions with RAFT flow and clustering [Bear et al., ECCV ’24].
Unprompted methods: CutLER [Wang et al., CVPR ’23], ProMerge [Li & Shin, ECCV ’24].

Key results:

Model	AR (point)	mIoU (point)	AP (auto)	AR (auto)	F₁ (auto)	mIoU (auto)
DINOv2-G/14	0.2254	0.4553	—	—	—	—
CWM	0.3271	0.4807	—	—	—	—
SAM2	0.4816	0.6225	0.11	0.62	0.17	0.68
ProMerge	—	—	0.42	0.34	0.36	0.43
SpelkeNet	0.5411	0.6811	0.35	0.46	0.38	0.57

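On point-prompted segmentation, SpelkeNet outperforms SAM2 and all tested self-/unsupervised baselines in both AR and mIoU. On automatic segmentation it achieves the highest F₁ (0.38), although SAM2 attains higher AR (0.62) and mIoU (0.68) and ProMerge attains higher AP (0.42). Qualitative evaluation (see Fig. 7–9 of (Venkatesh et al., 21 Jul 2025)) demonstrates more physically coherent segmentations.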

In a downstream object-editing benchmark (3DEditBench), plugging SpelkeNet masks into multiple pipelines yields improvements in Edit-Adherence (EA) of 10–20% absolute, as well as superior MSE, PSNR, LPIPS, and SSIM compared to SAM2.

4. Practical Usage and Methodological Considerations

SpelkeBench is released with COCO-style JSON annotations and accompanying images, accessible via https://neuroailab.github.io/spelke_net. A standard workflow uses pycocotools for mask loading; a code snippet is provided in the dataset documentation.
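Before decoding masks with pycocotools, it is often useful to index the annotation file by image. The sketch below does this with the standard library only and is not the snippet from the dataset documentation. It assumes the standard COCO keys (`images`, `annotations`, `id`, `image_id`, `file_name`, `segmentation`); the file names and polygon values in the example are made up for illustration.

```python
import json
from collections import defaultdict

def index_annotations(coco):
    """Map each image id to (file_name, list of its annotation records)
    for a COCO-style dictionary."""
    by_image = defaultdict(list)
    for ann in coco["annotations"]:
        by_image[ann["image_id"]].append(ann)
    return {
        img["id"]: (img["file_name"], by_image.get(img["id"], []))
        for img in coco["images"]
    }

# Hypothetical stand-in for the contents of the released annotation file.
example = json.loads("""
{
  "images": [
    {"id": 1, "file_name": "scene_0001.jpg"},
    {"id": 2, "file_name": "scene_0002.jpg"}
  ],
  "annotations": [
    {"id": 10, "image_id": 1, "segmentation": [[0, 0, 4, 0, 4, 4, 0, 4]]},
    {"id": 11, "image_id": 1, "segmentation": [[5, 5, 9, 5, 9, 9]]}
  ]
}
""")

index = index_annotations(example)
file_name, segments = index[1]
print(file_name, len(segments))
```

With the real file, `json.load` on the opened annotation path would replace the inline string, and each record's `segmentation` field would then be passed to the mask decoder.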

As annotation is labor-intensive, SpelkeBench is intended strictly for evaluation; training end-to-end segmentation models directly on the images and masks is not recommended. There is no official train/test split; users typically apply k-fold cross-validation or a random 80/20 split for development. When applying data augmentations for training downstream models, mask-image alignment should be preserved by applying transformations identically.
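A reproducible random split and a paired augmentation can be sketched as follows. This is an illustrative example using only the standard library. The 80/20 ratio comes from the text above, while the seed, the function names, and the list-of-rows image representation are assumptions of the sketch.

```python
import random

def split_ids(image_ids, dev_fraction=0.8, seed=0):
    """Shuffle image ids with a fixed seed and cut them into two disjoint parts."""
    ids = sorted(image_ids)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(ids)
    cut = round(len(ids) * dev_fraction)
    return ids[:cut], ids[cut:]

def hflip_pair(image, mask):
    """Flip an image and its mask left-right together so they stay aligned."""
    return [row[::-1] for row in image], [row[::-1] for row in mask]

# Usage: split ten ids, then flip a 1x3 image with its mask.
dev_ids, held_out_ids = split_ids(range(10))
flipped_image, flipped_mask = hflip_pair([[10, 20, 30]], [[1, 0, 0]])
```

Fixing the seed keeps the split stable across runs, and routing the image and mask through the same function is the simplest way to guarantee that a transformation is applied identically to both.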

5. Downstream Applications and Limitations

5.1. Applications

Physical object manipulation and planning: Spelke segments better align with plausible 3D editing targets and yield more effective grasp and manipulation proposals in simulated robotics.
Support-relation inference: Virtual poke experiments on base objects can identify all physically supported segments, directly reflecting support hierarchies.
Material property cues: Variation in the spatial spread of motion-affordance maps — for example, between rigid and cloth objects — suggests the utility of Spelke segment statistics for inferring deformability.

5.2. Limitations

Unprompted segment discovery is less reliable than point-prompted variants, with a tendency toward false merges if motion rollouts are noisy or the aggregation threshold is miscalibrated.
The definition of Spelke segments assumes object rigidity; articulated or liquid objects can yield non-cohesive “co-movement” statistics.
The current benchmark focuses on binary co-movement and does not yet support hierarchical grouping or articulation; both remain directions for future research.

6. Significance and Context

SpelkeBench represents a shift from semantic- or label-driven segmentation to segmentation based on quantifiable, category-agnostic physical properties, formalizing the concept of “Spelke objects” in computer vision and robotics evaluation. It supplies rigorous, mathematically grounded metrics accompanied by extensive baseline comparisons, establishing causal counterfactual probing (as implemented in SpelkeNet) as a state-of-the-art method for discovering physically meaningful segmentation units. The benchmark thereby supports research into vision models aligning more closely with human physical reasoning and is foundational for downstream tasks such as manipulation, planning, and relational reasoning (Venkatesh et al., 21 Jul 2025).

References

Venkatesh et al. (21 Jul 2025). Discovering and using Spelke segments.
