SpelkeBench: Physically Grounded Segmentation
- SpelkeBench is a curated dataset focused on segmenting physically grounded objects based on coherent responses to virtual interventions.
- It employs statistical counterfactual probing with SpelkeNet, an autoregressive self-supervised model, to predict expected displacement maps.
- Benchmarking shows that its motion-based segmentation outperforms appearance-focused models in both point-prompted and automatic discovery tasks.
SpelkeBench is a curated evaluation dataset designed to advance the study of segmentation grounded in physical objecthood, specifically operationalizing the developmental psychology notion of "Spelke objects"—entities defined by their motion coherence under external forces rather than by semantic or category-specific features. SpelkeBench provides a benchmark for evaluating models that aim to discover and segment such physically meaningful groupings, supporting empirical progress at the intersection of perception, physics modeling, and manipulation planning.
1. Dataset Composition and Segment Definition
SpelkeBench contains 500 natural images sourced from two complementary domains: high-resolution internet images (derived from EntitySeg) and real-world, egocentric scenes encountered in robotics (sourced from OpenX-Embodiment). The dataset undergoes a multi-stage filtering pipeline to ensure all included scenes contain physically grounded, movable entities:
- Amorphous "stuff" regions (e.g., sky, terrain) are suppressed.
- Irremovable fixtures (e.g., sinks, built-in appliances) are excluded.
- The curation process adheres to principles central to Spelke objects, such as cohesion, continuity, solidity, and contact.
Spelke segments are not defined by semantic class or visual appearance. A segment is valid if, under a range of localized simulated interventions ("virtual pokes"), the constituent pixels exhibit consistent and coherent motion—i.e., they move as a single unit under applied physical forces. Ground-truth segments are manually annotated or curated via filtering to ensure compliance with this motion-coherence criterion.
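To make the motion-coherence criterion concrete, the following minimal sketch scores a candidate mask by how coherently its pixels move under a single predicted poke response. The array shapes, the coherence score, and the acceptance threshold are illustrative assumptions, not the dataset's exact curation procedure:

```python
import numpy as np

def motion_coherence(flow: np.ndarray, mask: np.ndarray) -> float:
    """Mean cosine similarity between each in-mask flow vector and the
    segment's mean flow. Values near 1 indicate the pixels move as a unit.

    flow: (H, W, 2) displacement field predicted for one virtual poke.
    mask: (H, W) boolean candidate segment.
    """
    vecs = flow[mask]                  # (N, 2) flow vectors inside the mask
    mean = vecs.mean(axis=0)           # the segment's average motion
    num = vecs @ mean
    den = np.linalg.norm(vecs, axis=1) * np.linalg.norm(mean) + 1e-8
    return float((num / den).mean())

# A candidate segment would pass if it responds coherently to every poke,
# e.g. motion_coherence(flow, mask) > 0.9 (threshold is illustrative).
```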
| Source Domain | # Images | Filtering Criteria |
|---|---|---|
| EntitySeg (internet) | ~250 | Movable, non-amorphous entities |
| OpenX-Embodiment (robotics) | ~250 | Movable, physically plausible entities |
2. Methodology: Statistical Counterfactual Probing
Evaluation of Spelke objecthood is grounded in statistical counterfactual probing, a process that characterizes segmentation by predicted physical response rather than visual similarity:
- For each image, the approach simulates a set of "virtual pokes" at locations determined by a motion affordance map $A$ (defined below).
- The model predicts, for each poke, a distribution over the expected optical flow, yielding an expected displacement map $D$ with a value for each pixel.
- SpelkeNet, the underlying model, infers these distributions via a self-supervised, autoregressive architecture.
Mathematically, after appending a poke token to the tokenized input sequence $s$, SpelkeNet outputs a distribution over flow tokens for each pixel $i$:

$$p(f_i = f \mid s)$$

The motion affordance at pixel $i$ is given by:

$$A_i = \sum_{f \in \mathcal{M}} p(f_i = f \mid s)$$

where $\mathcal{M} = \{ f : \lVert v_f \rVert > \tau \}$, with $v_f$ the 2D flow vector for token $f$ and $\tau$ a magnitude threshold. The expected displacement is:

$$D_i = \sum_{f} p(f_i = f \mid s)\, v_f$$

For consolidating across $K$ pokes with poke vectors $u_k$, an average dot product between the poke vector and the expected displacement vectors at each pixel is computed:

$$S_i = \frac{1}{K} \sum_{k=1}^{K} u_k \cdot D_i^{(k)}$$
This dot map is thresholded (via Otsu’s method) to segment the pixels most causally coupled to the intervention.
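As a concrete illustration, the following numpy sketch computes the affordance map, the expected displacement map, and the Otsu-thresholded dot map, assuming SpelkeNet's per-pixel output is available as an (H, W, K) probability array over a K-entry flow-token codebook; the function and variable names are hypothetical:

```python
import numpy as np
from skimage.filters import threshold_otsu  # pip install scikit-image

def expected_displacement(probs: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """D_i = sum_f p(f_i = f | s) * v_f.
    probs: (H, W, K) per-pixel distribution over flow tokens.
    codebook: (K, 2) flow vector v_f for each token. Returns (H, W, 2)."""
    return probs @ codebook

def motion_affordance(probs: np.ndarray, codebook: np.ndarray, tau: float = 0.5) -> np.ndarray:
    """A_i: total probability mass on tokens whose flow magnitude exceeds tau."""
    moving = np.linalg.norm(codebook, axis=1) > tau   # (K,) mask over tokens
    return probs[..., moving].sum(axis=-1)            # (H, W)

def segment_from_pokes(probs_per_poke, poke_vectors, codebook) -> np.ndarray:
    """Average dot product between each poke vector u_k and the expected
    displacement map, then Otsu-threshold into a binary segment."""
    dot = np.zeros(probs_per_poke[0].shape[:2])
    for probs, u in zip(probs_per_poke, poke_vectors):
        dot += expected_displacement(probs, codebook) @ u   # (H, W)
    dot /= len(poke_vectors)
    return dot > threshold_otsu(dot)
```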
A pixel-to-pixel affinity matrix $W$, defined as $W_{ij} = d_i \cdot d_j$ with $d_i$ a motion descriptor for pixel $i$ (its expected-displacement responses collected across pokes), enables clustering-based fully automatic segmentation; one possible realization is sketched below.
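The specific clustering algorithm is an assumption here (the text only specifies an affinity over motion descriptors); a minimal sketch using spectral clustering over a cosine affinity:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def automatic_segments(descriptors: np.ndarray, n_segments: int) -> np.ndarray:
    """Cluster pixels by the similarity of their motion descriptors.

    descriptors: (N, d) per-pixel motion descriptors, e.g. each pixel's
    expected-displacement responses stacked across pokes.
    Returns integer segment labels of shape (N,)."""
    unit = descriptors / (np.linalg.norm(descriptors, axis=1, keepdims=True) + 1e-8)
    W = np.clip(unit @ unit.T, 0.0, 1.0)   # cosine affinity W_ij = d_i . d_j
    return SpectralClustering(
        n_clusters=n_segments, affinity="precomputed", random_state=0
    ).fit_predict(W)
```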
3. SpelkeNet: Visual World Model and Segmentation Engine
SpelkeNet is an autoregressive, self-supervised visual world model based on the Local Random Access Sequence Modeling (LRAS) architecture. It is trained on video data with a next-token prediction objective to model the distribution of plausible future motions in response to both image content and explicit intervention prompts:
- The input is a sequence $s$: RGB tokens for the image, a zero camera-pose token (indicating no camera motion), and a sparse flow token encoding the poke.
- Given $s$, SpelkeNet completes the flow field by decoding token-by-token.
- This process is inherently stochastic, supporting uncertainty and multimodality in predicted physical responses (see the sketch after this list).
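The decoding loop can be pictured with the following PyTorch-style sketch; the model interface (`zero_camera_token_id`, the call signature) is hypothetical, since the text specifies only the sequence layout and token-by-token sampling:

```python
import torch

def predict_flow_field(model, rgb_tokens, poke_token, n_flow_tokens):
    """Build the conditioning sequence [RGB tokens, zero-camera token,
    poke token] and sample the remaining flow tokens one at a time."""
    zero_camera = torch.tensor([model.zero_camera_token_id])  # hypothetical id
    seq = torch.cat([rgb_tokens, zero_camera, poke_token])
    flow_tokens = []
    for _ in range(n_flow_tokens):
        logits = model(seq.unsqueeze(0))[0, -1]          # next-token logits
        nxt = torch.multinomial(logits.softmax(-1), 1)   # stochastic sample
        flow_tokens.append(nxt)
        seq = torch.cat([seq, nxt])                      # autoregressive feed
    return torch.cat(flow_tokens)                        # one token per patch
```

Repeating this sampling loop yields multiple plausible flow completions per poke, which is what makes the per-pixel distributions over flow tokens (and hence the affordance and displacement maps) well defined.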
SpelkeNet underpins both the generation of motion affordance and expected displacement maps and serves as the reference model for discovering causal (physically meaningful) segment boundaries, since only pixels exhibiting highly correlated responses to pokes are grouped.
4. Quantitative Benchmarking Against Alternative Segmentation Models
Evaluation on SpelkeBench is performed in two distinct settings: single point-prompted segmentation and fully automatic object discovery.
In the point-prompted regime:
- Each model receives a single point prompt (typically the centroid of the ground-truth object).
- Metrics include Average Recall (AR) and mean Intersection-over-Union (mIoU).
- SpelkeNet achieves AR ≈ 0.5411 and mIoU ≈ 0.6811.
- SpelkeNet outperforms supervised baselines such as Segment Anything (SAM), which tends to over-segment and often includes non-movable elements such as logos or shadows; self-supervised approaches such as DINO and Counterfactual World Models (CWM) also underperform SpelkeNet on these metrics (computed as sketched below).
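For reference, the two point-prompted metrics can be computed as follows; the IoU thresholds follow the common COCO-style convention, which is an assumption about the benchmark's exact protocol:

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-Union of two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 0.0

def average_recall(ious, thresholds=np.arange(0.5, 1.0, 0.05)) -> float:
    """Fraction of predictions counted correct, averaged over IoU cutoffs."""
    ious = np.asarray(ious)
    return float(np.mean([(ious >= t).mean() for t in thresholds]))

# mIoU is simply np.mean(ious) over all (prediction, ground truth) pairs.
```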
In automatic discovery (no point prompts):
- SpelkeNet’s affinity-based clustering attains superior Average Precision and F1, surpassing other self-supervised models (CutLER, ProMerge) and, notably, outperforms SAM on physically grounded metrics.
- A key advantage is SpelkeNet’s causally defined segment boundaries, which align with physical coherence rather than visual similarity.
| Model | AR (Point) | mIoU (Point) | AP (Auto) | F1 (Auto) | Failure Modes |
|---|---|---|---|---|---|
| SpelkeNet | 0.5411 | 0.6811 | High | High | - |
| SAM | Lower | Lower | Lower | Lower | Over-segmentation; includes non-movables |
| DINO, CWM, CutLER | Lower | Lower | Lower | Lower | Appearance-focused, less causal |
5. Downstream Impact: Physical Object Manipulation (3DEditBench)
In the context of object manipulation, segmentation quality critically determines the fidelity of intended edits. The 3DEditBench benchmark evaluates manipulation models by the quality of mask-guided edits:
- Effective manipulation requires segments that correspond to objects that truly move together when acted upon.
- SpelkeNet-derived segments deliver higher Edit Adherence (EA), measured as the IoU between the transformed segment and the ground-truth post-edit mask (see the sketch after this list).
- Use of SpelkeBench-style segmentation yields more physically plausible edits than models relying on purely visual grouping (e.g., SAM), which may blend object with non-movable regions or texture details.
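Conceptually, EA can be computed as below for a 2D affine edit; the in-plane transform parameters `A` and `t` stand in for the benchmark's actual 3D object edits, so this is an illustrative simplification rather than the benchmark's exact protocol:

```python
import numpy as np
from scipy.ndimage import affine_transform

def edit_adherence(mask: np.ndarray, gt_post_edit: np.ndarray,
                   A: np.ndarray, t: np.ndarray) -> float:
    """EA = IoU(transform(mask), gt_post_edit) for an in-plane edit
    y = A @ x + t.  scipy's affine_transform maps output coordinates back
    through its matrix/offset, so we pass the inverse transform."""
    A_inv = np.linalg.inv(A)
    warped = affine_transform(mask.astype(float), A_inv,
                              offset=-A_inv @ t, order=0) > 0.5
    inter = np.logical_and(warped, gt_post_edit).sum()
    union = np.logical_or(warped, gt_post_edit).sum()
    return inter / union if union else 0.0
```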
A plausible implication is that motion-coherence-based segmentation provides more reliable primitives for planning and robotic interaction, as opposed to appearance-based alternatives.
6. Context and Significance
SpelkeBench addresses limitations of traditional segmentation benchmarks that prioritize semantic or categorical consistency at the expense of physical coherence. By anchoring objecthood in motion predictability under external input, SpelkeBench and the associated SpelkeNet model support research in causal perception, self-supervised physical understanding, and interactive scene manipulation. This conceptual shift has practical ramifications for robotics, graphics, and simulation pipelines where physically grounded object modeling is essential. The benchmark thus encourages the development of models that can bridge the perceptual gap between human physical intuition and visual learning systems in artificial agents.