SpelkeBench: Motion-Grounded Segmentation
- SpelkeBench is a motion-grounded evaluation benchmark that defines 'Spelke objects' as image regions with physically coherent co-movement.
- It comprises 500 curated natural-image scenes with hand-annotated segments, enabling both point-prompted and automatic segmentation tasks.
- The benchmark provides rigorous metrics and baseline comparisons with models like SpelkeNet and SAM2, facilitating research in physical object manipulation and scene understanding.
SpelkeBench is a motion-grounded evaluation benchmark for object segmentation, oriented around the concept of “Spelke objects”: image regions defined by physically coherent co-movement under external forces. Distinct from conventional semantic- or category-based segmentation datasets, SpelkeBench comprises 500 natural-image scenes with hand-annotated “Spelke segments” based on physical cohesion rather than visual or textual criteria. The benchmark supports the evaluation of algorithms that aim to recover physical object boundaries as understood from developmental psychology and robotics, providing mathematically defined metrics, competitive baselines, and illustrative applications for both scene understanding and interactive manipulation tasks (Venkatesh et al., 21 Jul 2025).
1. Construction of the SpelkeBench Dataset
SpelkeBench’s image set is sourced from two corpora: (i) EntitySeg [Qi et al., ICCV ’23], which contains densely annotated internet images, and (ii) OpenX-Embodiment [O’Neill et al., arXiv ’23], which provides egocentric robot-interaction frames. The EntitySeg images pass through a three-stage filter that removes “stuff” regions (e.g., sky, terrain), excludes non-movable “thing” categories (e.g., fixtures), and uses manual curation to ensure diverse coverage, producing a core of 500 “Spelke-consistent” natural images.
Additional hand-annotations provide Spelke segments for 50 robot-in-the-loop frames in OpenX-Embodiment, reserved for future evaluations. Annotators are instructed to identify pixel groups that would move together as a rigid body if poked, disregarding semantic labels and instead applying an intuition for physical co-movement. Each image mask group is reviewed by two independent annotators with any conflicts resolved by a third expert. Formally, a Spelke segment $S \subseteq \Omega$ (where $\Omega$ is the image grid) is a maximal set of pixels satisfying:
- Cohesion: pixels in $S$ move together under external force;
- Continuity: $S$ is a single connected component;
- Solidity: $S$ does not leak through solid obstacles;
- Contact: $S$ excludes pixels remaining fixed under all plausible local pokes.
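Of the four conditions, continuity is the one that can be checked mechanically on an annotated mask. The sketch below tests whether a mask forms a single connected component. It assumes masks are boolean NumPy arrays and uses 4-connectivity; the source specifies neither, and `is_single_component` is a hypothetical helper name.

```python
from collections import deque

import numpy as np


def is_single_component(mask: np.ndarray) -> bool:
    """Return True if the True pixels of `mask` form one 4-connected region."""
    mask = np.asarray(mask, dtype=bool)
    seeds = np.argwhere(mask)
    if len(seeds) == 0:
        return False  # an empty mask is not treated as a segment
    height, width = mask.shape
    visited = np.zeros_like(mask)
    start = tuple(seeds[0])
    visited[start] = True
    queue = deque([start])
    reached = 1
    # Breadth-first flood fill from the first foreground pixel
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < height and 0 <= nc < width
            if inside and mask[nr, nc] and not visited[nr, nc]:
                visited[nr, nc] = True
                reached += 1
                queue.append((nr, nc))
    # Connected iff the fill reached every foreground pixel
    return reached == int(mask.sum())
```

Switching to 8-connectivity would only require adding the four diagonal offsets to the neighbour list.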
2. Benchmarking Tasks and Evaluation Metrics
SpelkeBench encompasses two segmentation tasks and a motion-based co-movement assessment:
2.1. Point-Prompted Segmentation
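Given an image $I$ and a single “poke” point $p$, the objective is to recover the ground-truth Spelke mask $S^{*}$ containing $p$. The key metrics are:
- Average Recall (AR):
$$\mathrm{AR} = \frac{1}{|\mathcal{T}|} \sum_{\tau \in \mathcal{T}} \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\mathrm{IoU}(\hat{S}_i, S_i^{*}) \ge \tau\right]$$
computed over the $N$ evaluated prompts and averaged over IoU thresholds $\tau \in \mathcal{T}$ ranging from 0.50 to 0.95.
- Mean Intersection over Union (mIoU):
$$\mathrm{mIoU} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{IoU}(\hat{S}_i, S_i^{*})$$
where
$$\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
and $\hat{S}_i$ is the prediction with the highest IoU against $S_i^{*}$.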
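The two point-prompted metrics can be sketched in a few lines of NumPy. This is a minimal illustration, not the benchmark's official evaluation code: it assumes one predicted mask per prompt, boolean arrays of equal shape, and a COCO-style threshold grid with step 0.05 (the text above gives only the 0.50 to 0.95 range). The function names are hypothetical.

```python
import numpy as np


def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over union of two masks of equal shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum() / union) if union else 0.0


def point_prompt_metrics(preds, gts, thresholds=None):
    """Return (AR, mIoU) over paired predicted and ground-truth masks."""
    if thresholds is None:
        # Assumed grid: 0.50, 0.55, ..., 0.95
        thresholds = np.linspace(0.50, 0.95, 10)
    ious = np.array([iou(p, g) for p, g in zip(preds, gts)])
    # Recall at a threshold = share of prompts whose IoU clears it
    recall_per_threshold = [(ious >= t).mean() for t in thresholds]
    return float(np.mean(recall_per_threshold)), float(ious.mean())
```

If a model returns several candidate masks per prompt, the candidate with the highest IoU against the ground truth would be selected before calling `point_prompt_metrics`.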
2.2. Automatic (Unprompted) Segmentation
Here, models must identify all Spelke segments in an image without interaction. Evaluation uses:
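- Average Precision (AP): Fraction of predicted segments $\hat{S}$ that match a ground-truth (GT) segment with $\mathrm{IoU} \ge \tau$, averaged over IoU thresholds $\tau$ from 0.50 to 0.95.
- Average Recall (AR): Fraction of GT segments matched by any predicted segment.
- $F_1$-Score: $F_1 = \frac{2PR}{P + R}$, the harmonic mean of precision $P$ and recall $R$.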
- mIoU: Defined as above.
Hungarian algorithm-based matching is applied for GT-predicted mask assignment.
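The matching step can be illustrated with SciPy's `linear_sum_assignment`. The sketch below scores a single image at one IoU threshold; it assumes the assignment maximizes total IoU and that masks are passed as lists of boolean arrays, neither of which the text above confirms. `match_and_score` is a hypothetical name, and averaging its outputs over thresholds and images would give AP- and AR-style numbers.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_and_score(preds, gts, tau=0.5):
    """One-to-one matching of predicted to GT masks, scored at threshold tau.

    Returns (precision, recall, f1) for one image.
    """
    if len(preds) == 0 or len(gts) == 0:
        return 0.0, 0.0, 0.0
    iou = np.zeros((len(preds), len(gts)))
    for i, p in enumerate(preds):
        for j, g in enumerate(gts):
            union = np.logical_or(p, g).sum()
            iou[i, j] = np.logical_and(p, g).sum() / union if union else 0.0
    # Hungarian assignment; each prediction is paired with at most one GT mask
    rows, cols = linear_sum_assignment(iou, maximize=True)
    hits = int((iou[rows, cols] >= tau).sum())
    precision = hits / len(preds)
    recall = hits / len(gts)
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1
```

One-to-one matching means a model that over-segments is penalized in precision even when every ground-truth segment is recovered, which is the pattern of low AP with high AR.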
2.3. Motion-Based Co-Movement Analysis
The dataset also supports analysis of co-movement via internal SpelkeNet maps:
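- Motion-affordance map $A(x)$: Probability that pixel $x$ moves under a poke, computed via summation over flow tokens with motion vectors above threshold $\delta$.
- Expected-displacement map $D(x)$: Average predicted displacement at pixel $x$.
- Dot-correlation for co-movement: For $K$ virtual-poke experiments at location $p$,
$$C_p(x) = \frac{1}{K} \sum_{k=1}^{K} \left\langle \mathbf{d}_k, D_k(x) \right\rangle$$
where $\mathbf{d}_k$ is the $k$-th poke vector and $D_k$ is the expected-displacement map it induces.
Thresholding $C_p$ produces co-movement-segment regions.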
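As a rough illustration of the dot-correlation idea, the sketch below averages, per pixel, the dot product between each poke vector and the displacement field predicted for that poke, then thresholds the result. It is one reading of the description above rather than SpelkeNet's implementation: the array layout, the rescaling by the maximum, the 0.5 threshold, and the name `comovement_segment` are all assumptions, and the displacement fields are taken as given inputs from some model.

```python
import numpy as np


def comovement_segment(poke_vectors, flows, threshold=0.5):
    """Average poke-response dot product per pixel, then threshold.

    poke_vectors: (K, 2) displacement applied at the poke location.
    flows:        (K, H, W, 2) predicted displacement field for each poke.
    Returns (correlation_map, boolean_mask), each of shape (H, W).
    """
    poke_vectors = np.asarray(poke_vectors, dtype=float)
    flows = np.asarray(flows, dtype=float)
    # Dot product of each poke vector with the flow at every pixel: (K, H, W)
    dots = np.einsum("kc,khwc->khw", poke_vectors, flows)
    corr = dots.mean(axis=0)
    # Assumed normalization: rescale so the strongest response equals 1
    peak = corr.max()
    if peak > 0:
        corr = corr / peak
    return corr, corr >= threshold
```

Pixels that consistently move in the direction of the poke accumulate a high score, while pixels that stay still or move incoherently average out toward zero.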
3. Baselines, SpelkeNet, and Quantitative Results
SpelkeBench evaluates several models across all tasks:
- Supervised reference: SAM2 (Hiera-Large) point-prompted segmenter [Ravi et al., arXiv ’24].
- Self-supervised: DINOv1, DINOv2 (attention map thresholding) [Caron et al., ICCV ’21; Oquab et al., arXiv ’23].
- Counterfactual World Models (CWM): Patch-copy interventions with RAFT flow and clustering [Bear et al., ECCV ’24].
- Unprompted methods: CutLER [Wang et al., CVPR ’23], ProMerge [Li & Shin, ECCV ’24].
Key results:
| Model | AR (point) | mIoU (point) | AP (auto) | AR (auto) | F₁ (auto) | mIoU (auto) |
|---|---|---|---|---|---|---|
| DINOv2-G/14 | 0.2254 | 0.4553 | — | — | — | — |
| CWM | 0.3271 | 0.4807 | — | — | — | — |
| SAM2 | 0.4816 | 0.6225 | 0.11 | 0.62 | 0.17 | 0.68 |
| ProMerge | — | — | 0.42 | 0.34 | 0.36 | 0.43 |
| SpelkeNet | 0.5411 | 0.6811 | 0.35 | 0.46 | 0.38 | 0.57 |
SpelkeNet achieves the highest AR and mIoU on point-prompted segmentation and the highest F₁ on automatic segmentation, although in the automatic setting SAM2 attains higher AR and mIoU and ProMerge attains higher AP. Qualitative evaluation (see Figs. 7–9 of Venkatesh et al., 21 Jul 2025) shows more physically coherent segmentations.
In a downstream object-editing benchmark (3DEditBench), plugging SpelkeNet masks into multiple editing pipelines improves Edit-Adherence (EA) by 10–20 percentage points and yields better MSE, PSNR, LPIPS, and SSIM than using SAM2 masks.
4. Practical Usage and Methodological Considerations
SpelkeBench is released with COCO-style JSON annotations and accompanying images, accessible via https://neuroailab.github.io/spelke_net. A standard workflow uses pycocotools for mask loading; a code snippet is provided in the dataset documentation.
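A mask-loading loop with pycocotools might look like the sketch below. It is not the snippet from the dataset documentation: it assumes the annotation file follows the standard COCO layout, and `load_spelke_masks` and any file path passed to it are hypothetical. Only standard `COCO` methods are used (`getImgIds`, `loadImgs`, `getAnnIds`, `loadAnns`, `annToMask`).

```python
def load_spelke_masks(annotation_path):
    """Yield (image_info, masks) for every image in a COCO-style annotation file."""
    # Imported lazily so the module loads even where pycocotools is not installed
    from pycocotools.coco import COCO

    coco = COCO(annotation_path)
    for image_id in coco.getImgIds():
        image_info = coco.loadImgs([image_id])[0]
        annotations = coco.loadAnns(coco.getAnnIds(imgIds=[image_id]))
        # annToMask decodes a polygon or RLE segmentation to an (H, W) uint8 array
        masks = [coco.annToMask(annotation) for annotation in annotations]
        yield image_info, masks
```

Each `image_info` dictionary carries the `file_name`, `height`, and `width` fields of the COCO format, so the matching image can be opened with any image library.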
As annotation is labor-intensive, SpelkeBench is intended strictly for evaluation; training end-to-end segmentation models directly on the images and masks is not recommended. There is no official train/test split; users typically apply k-fold cross-validation or a random 80/20 split for development. When applying data augmentations for training downstream models, mask-image alignment should be preserved by applying transformations identically.
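The split and alignment advice can be made concrete with a short sketch. It assumes a seeded random 80/20 split over image indices and uses a horizontal flip as the example augmentation; the function names and the channel-last image layout are illustrative choices, not part of the benchmark.

```python
import numpy as np


def dev_split(num_images, holdout_fraction=0.2, seed=0):
    """Shuffle image indices reproducibly and split them into (dev, holdout)."""
    order = np.random.default_rng(seed).permutation(num_images)
    n_holdout = int(round(num_images * holdout_fraction))
    return order[n_holdout:], order[:n_holdout]


def paired_hflip(image, mask, rng, p=0.5):
    """Flip an (H, W, C) image and its (H, W) mask together to keep them aligned."""
    if rng.random() < p:
        return image[:, ::-1].copy(), mask[:, ::-1].copy()
    return image, mask
```

Fixing the seed keeps the held-out indices stable across runs, and drawing a single random decision per image-mask pair is what guarantees that both receive the same transformation.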
5. Downstream Applications and Limitations
5.1. Applications
- Physical object manipulation and planning: Spelke segments better align with plausible 3D editing targets and yield more effective grasp and manipulation proposals in simulated robotics.
- Support-relation inference: Virtual poke experiments on base objects can identify all physically supported segments, directly reflecting support hierarchies.
- Material property cues: Variation in the spatial spread of motion-affordance maps — for example, between rigid and cloth objects — suggests the utility of Spelke segment statistics for inferring deformability.
5.2. Limitations
- Unprompted segment discovery is less reliable than point-prompted variants, with a tendency toward false merges if motion rollouts are noisy or the aggregation threshold is miscalibrated.
- The definition of Spelke segments assumes object rigidity; articulated or liquid objects can yield non-cohesive “co-movement” statistics.
- The current benchmark focuses on binary co-movement and does not capture hierarchical grouping or articulation, which suggests directions for future research.
6. Significance and Context
SpelkeBench represents a shift from semantic- or label-driven segmentation to segmentation based on quantifiable, category-agnostic physical properties, formalizing the concept of “Spelke objects” in computer vision and robotics evaluation. It supplies rigorous, mathematically grounded metrics accompanied by extensive baseline comparisons, establishing causal counterfactual probing (as implemented in SpelkeNet) as a state-of-the-art method for discovering physically meaningful segmentation units. The benchmark thereby supports research into vision models aligning more closely with human physical reasoning and is foundational for downstream tasks such as manipulation, planning, and relational reasoning (Venkatesh et al., 21 Jul 2025).