Table30: Real-Robot Manipulation Benchmark
- Table30 is a real-robot manipulation benchmark that evaluates vision-language-action models on complex, multi-stage tabletop tasks using four distinct robotic platforms.
- It employs dual metrics like Success Rate and Progress Score to capture both complete task execution and partial progress on challenges such as precise 3D manipulation and softbody handling.
- Baseline models like DM0 leverage unified embodied-native pretraining and hybrid gradient strategies, highlighting emerging trends and challenges in embodied AI.
Table30 is a large-scale real-robot manipulation benchmark designed to rigorously evaluate vision-language-action (VLA) models on long-horizon, multi-stage tabletop tasks. Introduced in the RoboChallenge suite, Table30 presents 30 distinct manipulation challenges distributed across four real-robot platforms: UR5, Franka Panda, ARX5, and Cobot Magic Aloha. The benchmark addresses the need for systematic, reproducible, and scalable evaluation mechanisms for VLA policies, facilitating the transparent comparison of both specialist and generalist embodied control algorithms on a broad spectrum of physical reasoning problems (Yakefu et al., 20 Oct 2025, Yu et al., 16 Feb 2026).
1. Task Set and Robotic Platforms
Table30’s task set encompasses a spectrum of manipulation and reasoning demands, requiring models to address core embodied AI challenges:
- Precise 3D manipulation: Several tasks demand sub-centimeter placement precision or accurate insertion, such as "plug in network cable" and "stack color blocks."
- Handling deformable/softbody objects: Tasks like "fold dishcloth" and "clean dining table" necessitate interaction with non-rigid items.
- Long-horizon and multi-stage sequencing: Many tasks (e.g., "make vegetarian sandwich," "arrange paper cups") decompose into up to five stages, including object search, grasp, classification, and destination-based logistics.
- Complex temporal dependencies: In some tasks a single visual frame admits multiple valid actions, so the policy must condition its next-step prediction on latent temporal context.
- Multi-view perception, occlusion robustness, and bimanual coordination: Required by tasks such as "arrange flowers" and "sort electronic products."
Representative tasks, with their platforms and challenge categories:
| Task Example | Platform | Challenge Category |
|---|---|---|
| arrange paper cups | ARX5 | stacking, sequential placement |
| move objects into box | Franka | multi-item grasping |
| plug in network cable | ALOHA | fine manipulation, multi-stage |
| set the plates | UR5 | multi-object spatial arrangement |
Each robotic platform is equipped with only modest vision sensors (e.g., RGB), accentuating the challenge of perceptual precision (Yakefu et al., 20 Oct 2025).
2. Metrics and Evaluation Protocol
Table30 adopts dual evaluation metrics, offering granular and holistic views of model performance:
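- Success Rate (SR):
$$\mathrm{SR} = \frac{\text{number of successful rollouts}}{10}$$
SR measures the fraction of ten rollouts per task in which the robot executes all task stages without critical failure.
- Progress Score: The per-rollout score is computed from the stage points $p_k$ and the retry count $r$, where $p_k$ is the point value for stage $k$, $r$ counts retries, and $K$ is the number of stages per task (maximum 10 points per rollout).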
- Composite Task Score (CTS): In some generalist settings, CTS weights partial completions, reporting holistic progress across 10 rollouts.
This protocol penalizes inefficiency and early termination, capturing both success/failure and skillful partial completion.
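The two per-task metrics can be illustrated with a minimal sketch. The source does not state how retries enter the Progress Score, so the subtractive `RETRY_PENALTY` below is an assumption made purely for illustration, as are the `Rollout` class, the function names, and the example stage points; only the ten-rollout protocol and the 10-point maximum come from the text.

```python
from dataclasses import dataclass
from typing import List

MAX_POINTS_PER_ROLLOUT = 10.0  # from the text: at most 10 points per rollout
RETRY_PENALTY = 0.5            # ASSUMPTION: illustrative value, not from the source


@dataclass
class Rollout:
    stage_points: List[float]  # points earned at each stage (0 for a failed stage)
    retries: int               # number of retries used during the rollout
    success: bool              # True if every stage finished without critical failure


def progress_score(rollout: Rollout, retry_penalty: float = RETRY_PENALTY) -> float:
    """Sum the stage points, subtract the assumed retry penalty, clamp to [0, 10]."""
    raw = sum(rollout.stage_points) - retry_penalty * rollout.retries
    return max(0.0, min(MAX_POINTS_PER_ROLLOUT, raw))


def success_rate(rollouts: List[Rollout]) -> float:
    """Fraction of rollouts that completed all stages."""
    return sum(1 for r in rollouts if r.success) / len(rollouts)


# Ten hypothetical rollouts of a three-stage task worth 2 + 3 + 5 points.
rollouts = (
    [Rollout([2, 3, 5], retries=0, success=True)] * 4
    + [Rollout([2, 3, 0], retries=1, success=False)] * 3
    + [Rollout([0, 0, 0], retries=0, success=False)] * 3
)

sr = success_rate(rollouts)
mean_score = sum(progress_score(r) for r in rollouts) / len(rollouts)
print(f"SR = {sr:.0%}, mean progress score = {mean_score:.2f}")
```

In this made-up example SR is 40% while the mean progress score is 5.35 out of 10, which shows how the second metric credits the three rollouts that completed two of three stages even though they count as failures for SR.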
3. Model Variants and Baseline Approaches
The Table30 benchmark has been systematically utilized to compare several leading embodied VLA model classes:
- π₀, π₀.₅: Task-specific fine-tuned models differing in supervision or training detail.
- CogACT: Task-specific model with alternative control architecture.
- π₀/multi, π₀.₅/multi: "Machine-generalist" models trained on multiple tasks for a single robot.
- Spirit-v1.5, GigaBrain-0.1: Scalable open-source VLA baselines (3–4B parameters), trained with diverse web and embodied corpora (Yu et al., 16 Feb 2026).
- DM0: An Embodied-Native VLA model (2B parameters), jointly pretrained on vision-language, driving, and embodied data, employing hybrid gradient insulation and spatial chain-of-thought scaffolding (Yu et al., 16 Feb 2026).
Baseline comparisons rely chiefly on the success rate (SR) and, in the generalist regime, on the composite task score (CTS).
4. Benchmark Results and Comparative Performance
Empirical results on Table30 demonstrate substantial performance stratification among model classes. Representative averages and highlights:
| Model | Params | Specialist SR (%) | Generalist SR / Score |
|---|---|---|---|
| DM0 | 2B | 62.00 | 37.3 / 49.08 |
| GigaBrain-0.1 | 3B | 51.67 | – |
| Spirit-v1.5 | 4B | 51.00 | – |
| π₀.₅ | 3B | 42.67 | 17.7 / 31.3 |
| π₀ | – | – | 9.0 / 20.2 |
Selected per-task Specialist SR (DM0 vs. best baseline):
| Task | DM0 | Best Baseline |
|---|---|---|
| arrange fruits in basket | 100 | 80 |
| stack bowls | 100 | 100 |
| plug in network cable | 80 | 20 |
| fold dishcloth | 20 | 20 |
| clean dining table | 0 | 40 |
Overall, specialist fine-tuned models outperform generalists, but DM0 substantially narrows this gap: its generalist SR of 37.3% is roughly double that of π₀.₅ (17.7%), and its generalist score of 49.08 is about 1.6 times that of π₀.₅ (31.3) (Yu et al., 16 Feb 2026, Yakefu et al., 20 Oct 2025). π₀.₅ shows a consistent advantage over both π₀ and CogACT, with "multi-task" variants achieving competitive performance from limited cross-task data (Yakefu et al., 20 Oct 2025).
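The size of these gaps can be checked directly from the averages in the results table. The sketch below is only arithmetic on the reported numbers; the dictionary layout, the ASCII model keys (`pi0.5`, `pi0`), and the helper name `ratio` are illustrative choices, not part of the benchmark.

```python
# Generalist (SR %, score) pairs copied from the results table above.
generalist = {
    "DM0": (37.3, 49.08),
    "pi0.5": (17.7, 31.3),
    "pi0": (9.0, 20.2),
}

# Specialist SR (%) from the same table.
specialist_sr = {
    "DM0": 62.00,
    "GigaBrain-0.1": 51.67,
    "Spirit-v1.5": 51.00,
    "pi0.5": 42.67,
}


def ratio(model: str, baseline: str, column: int) -> float:
    """Ratio of one generalist metric (0 = SR, 1 = score) between two models."""
    return generalist[model][column] / generalist[baseline][column]


sr_ratio = ratio("DM0", "pi0.5", 0)
score_ratio = ratio("DM0", "pi0.5", 1)

# Margin in percentage points between DM0 and the best other specialist model.
best_other = max(v for k, v in specialist_sr.items() if k != "DM0")
specialist_margin = specialist_sr["DM0"] - best_other

print(f"generalist SR ratio:    {sr_ratio:.2f}x")
print(f"generalist score ratio: {score_ratio:.2f}x")
print(f"specialist SR margin:   {specialist_margin:.2f} points")
```

On the reported averages this gives about 2.11x for generalist SR, about 1.57x for the generalist score, and a specialist margin of about 10.33 percentage points over GigaBrain-0.1.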
5. Task Taxonomy and Difficulty Analysis
Task tagging in Table30 exposes granular failure patterns and robustness across embodiment dimensions. Tag-based SR and Score (mean over tasks):
| Tag | #Tasks | Mean SR (%) | Mean Score |
|---|---|---|---|
| softbody | 3 | 8 | 27 |
| temporal dependence | 3 | 5 | 14 |
| precise3d | 12 | 18 | 38 |
| simple-pick | 4 | 42 | 47 |
| all tasks | 30 | 22 | 37 |
Softbody and temporal tasks are least tractable (≤10% SR), while precise 3D manipulation consistently depresses performance (≈18% SR across all models). "Simple" pick-and-place tasks are nearly saturated by top-performing specialist models (SR≈90%). Empirical analysis also indicates that, despite weak spatial reasoning in hard tasks, VLA models are robust to out-of-distribution perturbations in lighting, backgrounds, and even severe occlusion (Yakefu et al., 20 Oct 2025).
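A tag-level summary of this kind can be produced by grouping per-task results by tag and averaging. The sketch below assumes a simple record layout (`tags`, `sr`, `score`) and uses placeholder task names and made-up numbers; it does not reproduce the benchmark's actual per-task data or its exact aggregation procedure.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-task results; a task may carry more than one tag.
results = [
    {"task": "task_a", "tags": ["softbody"], "sr": 10.0, "score": 30.0},
    {"task": "task_b", "tags": ["softbody", "precise3d"], "sr": 0.0, "score": 20.0},
    {"task": "task_c", "tags": ["precise3d"], "sr": 20.0, "score": 40.0},
    {"task": "task_d", "tags": ["simple-pick"], "sr": 50.0, "score": 60.0},
]


def aggregate_by_tag(records):
    """Return {tag: {"n_tasks", "mean_sr", "mean_score"}}, plus an "all tasks" row."""
    buckets = defaultdict(list)
    for rec in records:
        # Every task also contributes to the overall "all tasks" bucket.
        for tag in rec["tags"] + ["all tasks"]:
            buckets[tag].append(rec)
    return {
        tag: {
            "n_tasks": len(recs),
            "mean_sr": mean(r["sr"] for r in recs),
            "mean_score": mean(r["score"] for r in recs),
        }
        for tag, recs in buckets.items()
    }


summary = aggregate_by_tag(results)
for tag, row in summary.items():
    print(f"{tag:12s} n={row['n_tasks']}  SR={row['mean_sr']:.1f}  score={row['mean_score']:.1f}")
```

Because a task can carry several tags, the per-tag task counts need not sum to the total number of tasks.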
6. Architectural Advances and Insights from DM0
The DM0 model, evaluated on Table30, introduces several design strategies identified as critical to its superior performance:
- Unified embodied-native pretraining: Acquiring both semantic and physical priors from language, driving, and interaction corpora enhances generalization and physical task grounding.
- Hybrid gradient flow: Decoupling VLM backbone updates from embodied data prevents catastrophic forgetting, maintaining transferable representations across data regimes.
- Embodied Spatial Scaffolding: Enforcing spatial chain-of-thought reasoning narrows the permissible hypothesis space, improving efficiency on long-horizon and compositional tasks.
DM0 achieves higher sample efficiency and SR with 2B parameters than baselines with 3B–4B parameters, especially in complex multi-stage and generalist settings (Yu et al., 16 Feb 2026). A plausible implication is that embodied-native pretraining and architectural insulation are emerging as central motifs for scalable, generalist physical AI models.
7. Outlook and Open Challenges
Table30 exposes enduring deficiencies in current VLA models, particularly on tasks that require fine-grained spatial control, deformable object handling, temporal disambiguation, and long-horizon reasoning. The marked performance gap on softbody and multi-stage temporal tasks suggests a need for richer perceptual encodings, structured memory, and more powerful compositional policy representations. Multi-task and embodied-native pretraining offer promise in narrowing the generalization gap. Table30's task diversity and fine-grained difficulty tagging make it a durable benchmark for future progress in embodied policy research (Yakefu et al., 20 Oct 2025, Yu et al., 16 Feb 2026).