Table30: Real-Robot Manipulation Benchmark

Updated 3 July 2026
  • Table30 is a real-robot manipulation benchmark that evaluates vision-language-action models on complex, multi-stage tabletop tasks using four distinct robotic platforms.
  • It uses two metrics, Success Rate and Progress Score, to capture both complete task execution and partial progress on challenges such as precise 3D manipulation and softbody handling.
  • Baseline models like DM0 leverage unified embodied-native pretraining and hybrid gradient strategies, highlighting emerging trends and challenges in embodied AI.

Table30 is a large-scale real-robot manipulation benchmark designed to rigorously evaluate vision-language-action (VLA) models on long-horizon, multi-stage tabletop tasks. Introduced in the RoboChallenge suite, Table30 presents 30 distinct manipulation challenges distributed across four real-robot platforms: UR5, Franka Panda, ARX5, and Cobot Magic Aloha. The benchmark addresses the need for systematic, reproducible, and scalable evaluation mechanisms for VLA policies, facilitating the transparent comparison of both specialist and generalist embodied control algorithms on a broad spectrum of physical reasoning problems (Yakefu et al., 20 Oct 2025, Yu et al., 16 Feb 2026).

1. Task Set and Robotic Platforms

Table30’s task set encompasses a spectrum of manipulation and reasoning demands, requiring models to address core embodied AI challenges:

  • Precise 3D manipulation: Several tasks demand sub-centimeter placement precision or accurate insertion, such as "plug in network cable" and "stack color blocks."
  • Handling deformable/softbody objects: Tasks like "fold dishcloth" and "clean dining table" necessitate interaction with non-rigid items.
  • Long-horizon and multi-stage sequencing: Many tasks (e.g., "make vegetarian sandwich," "arrange paper cups") decompose into up to five stages, including object search, grasp, classification, and destination-based logistics.
  • Complex temporal dependencies: In some tasks a single visual frame admits multiple valid actions, so the policy must condition its next-step prediction on latent temporal context.
  • Multi-view perception, occlusion robustness, and bimanual coordination: Required by tasks such as "arrange flowers" and "sort electronic products."

Representative tasks, the platforms they run on, and their challenge categories:

| Task Example | Platform | Challenge Category |
| --- | --- | --- |
| arrange paper cups | ARX5 | stacking, sequential placement |
| move objects into box | Franka | multi-item grasping |
| plug in network cable | ALOHA | fine manipulation, multi-stage |
| set the plates | UR5 | multi-object spatial arrangement |

Each robotic platform is equipped with only modest vision sensors (e.g., $224 \times 224$ RGB), accentuating the challenge of perceptual precision (Yakefu et al., 20 Oct 2025).

2. Metrics and Evaluation Protocol

Table30 adopts dual evaluation metrics, offering granular and holistic views of model performance:

  • Success Rate (SR):

$$\text{SR} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\bigl[\text{rollout}_i \text{ is a complete success}\bigr], \qquad N = 10$$

SR measures the fraction of ten rollouts per task in which the robot executes all task stages without critical failure.

  • Progress Score:

$$S_i = \sum_{j=1}^{J}\max\{0,\,p_j - 0.5\,r_{i,j}\},$$

$$\text{Progress Score} = \sum_{i=1}^{N} S_i,$$

where $p_j$ is the point value for stage $j$, $r_{i,j}$ counts retries, and $J$ is the number of stages per task (maximum 10 points per rollout).

  • Composite Task Score (CTS): In some generalist settings, CTS weights partial completions, reporting holistic progress across 10 rollouts.

This protocol penalizes inefficiency and early termination, capturing both success/failure and skillful partial completion.
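The two metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's reference scorer: the function names, the stage point values `[2, 3, 5]`, and the sample rollouts are all hypothetical. The formula as stated does not say how a stage that was never completed is scored, so the sketch assumes such a stage contributes zero and marks it with `None`.

```python
from typing import Optional, Sequence


def success_rate(successes: Sequence[bool]) -> float:
    """SR: fraction of rollouts that were complete successes."""
    return sum(successes) / len(successes)


def rollout_score(stage_points: Sequence[float],
                  retries: Sequence[Optional[int]]) -> float:
    """S_i = sum_j max(0, p_j - 0.5 * r_ij) for a single rollout.

    Assumption (not stated in the text): a stage that was never
    completed is passed as None and contributes 0 points.
    """
    total = 0.0
    for points, n_retries in zip(stage_points, retries):
        if n_retries is None:  # stage not completed (assumed convention)
            continue
        total += max(0.0, points - 0.5 * n_retries)
    return total


def progress_score(stage_points: Sequence[float],
                   retries_per_rollout: Sequence[Sequence[Optional[int]]]) -> float:
    """Progress Score: sum of S_i over all rollouts of one task."""
    return sum(rollout_score(stage_points, r) for r in retries_per_rollout)


# Hypothetical three-stage task whose stage points sum to 10.
stage_points = [2, 3, 5]
# Three illustrative rollouts (the benchmark uses N = 10).
retries_per_rollout = [
    [0, 0, 0],     # clean run: 2 + 3 + 5 = 10
    [1, 0, None],  # one retry, last stage never completed: 1.5 + 3 = 4.5
    [0, 2, 0],     # two retries on stage 2: 2 + 2 + 5 = 9
]
successes = [True, False, True]

print(success_rate(successes))
print(progress_score(stage_points, retries_per_rollout))  # 23.5
```

Because each retry costs half a point and the per-stage contribution is clamped at zero, a rollout that eventually completes every stage can still score well below 10, which is how the protocol separates clean executions from laboured ones.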

3. Model Variants and Baseline Approaches

The Table30 benchmark has been systematically utilized to compare several leading embodied VLA model classes:

  • π₀, π₀.₅: Task-specific fine-tuned models differing in supervision or training detail.
  • CogACT: Task-specific model with alternative control architecture.
  • π₀/multi, π₀.₅/multi: "Machine-generalist" models trained on multiple tasks for a single robot.
  • Spirit-v1.5, GigaBrain-0.1: Scalable open-source VLA baselines (3–4B parameters), trained with diverse web and embodied corpora (Yu et al., 16 Feb 2026).
  • DM0: An Embodied-Native VLA model (2B parameters), jointly pretrained on vision-language, driving, and embodied data, employing hybrid gradient insulation and spatial chain-of-thought scaffolding (Yu et al., 16 Feb 2026).

Baseline comparisons rely chiefly on the success rate (SR) and, in the generalist regime, the composite task score (CTS).

4. Benchmark Results and Comparative Performance

Empirical results on Table30 demonstrate substantial performance stratification among model classes. Representative averages and highlights:

| Model | Params | Specialist SR (%) | Generalist SR / Score |
| --- | --- | --- | --- |
| DM0 | 2B | 62.00 | 37.3 / 49.08 |
| GigaBrain-0.1 | 3B | 51.67 | |
| Spirit-v1.5 | 4B | 51.00 | |
| π₀.₅ | 3B | 42.67 | 17.7 / 31.3 |
| π₀ | | | 9.0 / 20.2 |

Selected per-task Specialist SR (DM0 vs. best baseline):

| Task | DM0 | Best Baseline |
| --- | --- | --- |
| arrange fruits in basket | 100 | 80 |
| stack bowls | 100 | 100 |
| plug in network cable | 80 | 20 |
| fold dishcloth | 20 | 20 |
| clean dining table | 0 | 40 |

Overall, specialist fine-tuned models outperform generalists, but models like DM0 retain more of their specialist performance in the generalist setting: relative to π₀.₅, DM0 roughly doubles generalist SR (37.3 vs. 17.7) and raises the generalist score from 31.3 to 49.08 (Yu et al., 16 Feb 2026, Yakefu et al., 20 Oct 2025). π₀.₅ shows a consistent advantage over both π₀ and CogACT, with "multi-task" variants achieving competitive performance from limited cross-task data (Yakefu et al., 20 Oct 2025).
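The comparisons in this section reduce to simple arithmetic on the reported averages. The sketch below copies the numbers from the results table and derives a few ratios; the dictionary layout, the ASCII model keys (`pi0.5`, `pi0`), and the "retention" ratio (generalist SR divided by specialist SR) are illustrative choices made here, not quantities defined by the benchmark.

```python
# Averages copied from the results table; keys use ASCII spellings.
specialist_sr = {
    "DM0": 62.00,
    "GigaBrain-0.1": 51.67,
    "Spirit-v1.5": 51.00,
    "pi0.5": 42.67,
}
# (generalist SR, generalist score) pairs, where reported.
generalist = {
    "DM0": (37.3, 49.08),
    "pi0.5": (17.7, 31.3),
    "pi0": (9.0, 20.2),
}

# Margin of DM0 over the best other specialist model, in SR points.
best_other = max(v for k, v in specialist_sr.items() if k != "DM0")
specialist_margin = specialist_sr["DM0"] - best_other

# Generalist improvement of DM0 over pi0.5, as ratios.
sr_ratio = generalist["DM0"][0] / generalist["pi0.5"][0]
score_ratio = generalist["DM0"][1] / generalist["pi0.5"][1]

# Illustrative "retention": share of specialist SR kept as a generalist.
retention = {
    name: generalist[name][0] / specialist_sr[name]
    for name in generalist
    if name in specialist_sr
}

print(round(specialist_margin, 2))  # 10.33
print(round(sr_ratio, 2))           # 2.11
print(round(score_ratio, 2))        # 1.57
print({k: round(v, 2) for k, v in retention.items()})
```

On these numbers the generalist SR slightly more than doubles while the generalist score rises by a little over half, so the two metrics do not improve by the same factor.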

5. Task Taxonomy and Difficulty Analysis

Task tagging in Table30 exposes granular failure patterns and robustness across embodiment dimensions. Tag-based SR and Score (mean over tasks):

| Tag | #Tasks | Mean SR (%) | Mean Score |
| --- | --- | --- | --- |
| softbody | 3 | 8 | 27 |
| temporal dependence | 3 | 5 | 14 |
| precise3d | 12 | 18 | 38 |
| simple-pick | 4 | 42 | 47 |
| all tasks | 30 | 22 | 37 |

Softbody and temporal tasks are least tractable (≤10% SR), while precise 3D manipulation consistently depresses performance (≈18% SR across all models). "Simple" pick-and-place tasks are nearly saturated by top-performing specialist models (SR≈90%). Empirical analysis also indicates that, despite weak spatial reasoning in hard tasks, VLA models are robust to out-of-distribution perturbations in lighting, backgrounds, and even severe occlusion (Yakefu et al., 20 Oct 2025).
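A tag-level summary like the one above can be produced by grouping per-task results by tag and averaging. The following is a minimal sketch under stated assumptions: the task names and all numeric values are made up for illustration, and it assumes a task may carry more than one tag, which the text does not confirm.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-task records: (task, tags, SR %, score).
# All values are invented for illustration only.
results = [
    ("task_a", {"softbody"}, 10.0, 30.0),
    ("task_b", {"softbody", "precise3d"}, 0.0, 20.0),
    ("task_c", {"precise3d"}, 30.0, 45.0),
    ("task_d", {"simple-pick"}, 80.0, 85.0),
]


def aggregate_by_tag(records):
    """Return {tag: {"n_tasks", "mean_sr", "mean_score"}}.

    A task with several tags is counted once under each of them
    (an assumption about how multi-tag tasks are handled).
    """
    buckets = defaultdict(list)
    for _task, tags, sr, score in records:
        for tag in tags:
            buckets[tag].append((sr, score))
    return {
        tag: {
            "n_tasks": len(rows),
            "mean_sr": mean(sr for sr, _ in rows),
            "mean_score": mean(score for _, score in rows),
        }
        for tag, rows in buckets.items()
    }


summary = aggregate_by_tag(results)
for tag in sorted(summary):
    print(tag, summary[tag])
```

If tasks can carry multiple tags, the per-tag task counts need not sum to the total number of tasks, so an "all tasks" row should be computed directly from the full record list rather than from the tag rows.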

6. Architectural Advances and Insights from DM0

The DM0 model, evaluated on Table30, introduces several design strategies identified as critical to its superior performance:

  • Unified embodied-native pretraining: Acquiring both semantic and physical priors from language, driving, and interaction corpora enhances generalization and physical task grounding.
  • Hybrid gradient flow: Decoupling VLM backbone updates from embodied data prevents catastrophic forgetting, maintaining transferable representations across data regimes.
  • Embodied Spatial Scaffolding: Enforcing spatial chain-of-thought reasoning narrows the permissible hypothesis space, improving efficiency on long-horizon and compositional tasks.

DM0 achieves higher sample efficiency and SR with 2B parameters than baselines with 3B–4B parameters, especially in complex multi-stage and generalist settings (Yu et al., 16 Feb 2026). A plausible implication is that embodied-native pretraining and architectural insulation are emerging as central motifs for scalable, generalist physical AI models.

7. Outlook and Open Challenges

Table30 exposes enduring deficiencies in current VLA models, particularly for tasks necessitating fine-grained spatial control, deformable object handling, temporal disambiguation, and long-horizon reasoning. The marked performance gap on softbody and multi-stage temporal tasks suggests a need for richer perceptual encodings, structured memory, and more powerful compositional policy representations. Multi-task and embodied-native pretraining offer promise in narrowing the generalization gap. Table30’s persistent task diversity and fine-grained difficulty tagging make it a durable benchmark for future progress in embodied policy research (Yakefu et al., 20 Oct 2025, Yu et al., 16 Feb 2026).
