RobotSmith: Automated Robotic Tool Design
- RobotSmith is an end-to-end generative framework that designs robotic tools using iterative VLM-based proposals and physics simulation refinements.
- It combines vision-language critique with parameterized geometric assembly and joint optimization of tool shape and trajectory to enhance manipulation performance.
- Experimental results show a 50.0% success rate, more than double that of the strongest baseline (21.4%), and the optimized tools transfer from simulation to real hardware in diverse manipulation scenarios.
RobotSmith is an end-to-end generative framework for automated robotic tool design and usage policy synthesis. It explicitly integrates vision-language models (VLMs) with physics-based simulation and joint optimization. The system enables robots to both invent and use domain- and task-specific tools for complex manipulation scenarios, addressing the limitations of template-based and generic 3D generation approaches that lack physical realism or task-awareness. RobotSmith combines iterative VLM agent design, collision-aware manipulation planning, and simulation-driven geometric and behavioral refinement, resulting in notable gains in manipulation performance and sim-to-real transfer capability (Lin et al., 17 Jun 2025).
1. Modular Pipeline and System Architecture
RobotSmith is structured as a three-stage pipeline:
- Critic Tool Designer: A VLM-based inner loop iteratively proposes, critiques, and refines a parameterized tool representation. The process employs a Proposer–Critic structure: the Proposer (o3-mini), provided with the task description, scene, geometry APIs, and a JSON assembly scheme, generates candidate tools. The Critic evaluates rendered multi-view images and the design rationale, enforcing semantic and geometric constraints (connectivity, graspability, reach).
- Tool Use Planner: The Tool User, also VLM-based, infers a high-level program for tool usage. Three domain-specific APIs—grasp, move, and release—allow for abstract policy synthesis compatible with conventional motion planning and inverse kinematics modules.
- Joint Optimizer: Using CMA-ES, tool geometry and trajectory waypoints are optimized together in simulation. The optimizer samples parameter vectors, evaluates performance, and updates the search, balancing simulation-based task reward, VLM-derived penalties for semantic or geometric invalidity, and regularization for manufacturability and useful grasp points.
The workflow is summarized in the following pseudocode:
```
Input: task T, scene S₀
G ← Proposer(T, S₀)                      # JSON with parts/assembly
repeat
    render_views ← Render(G, S₀)
    feedback ← Critic(T, S₀, render_views)
    if feedback ≠ "DONE":
        G ← Proposer.refine(G, feedback)
until feedback == "DONE"

q₀ ← ToolUser.generate_trajectory(T, S₀ ∪ G)
s₀ ← shape parameters of G

initialize CMA-ES on θ₀ = (s₀, q₀)       # tool & trajectory parameters
for iter = 1…50 do
    {θ₁,…,θ_λ} ← CMA_ES.sample()
    for each θᵢ = (sᵢ, qᵢ):
        Mᵢ ← Simulator.evaluate(S₀, sᵢ, qᵢ)
    CMA_ES.update({θᵢ, Mᵢ})

return best (s*, q*)
```
Each tool is structured as a parameterized assembly of geometric primitives or 3D-generated parts, specified for manufacturability and semantic validity.
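A minimal sketch of such a parameterized assembly is given below. The class names, field names, and the example hook tool are hypothetical and chosen only for illustration; they are not the JSON scheme used by RobotSmith. The sketch shows how an assembly of primitives can be flattened into a single shape vector suitable for an optimizer.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Part:
    """One geometric primitive in the tool (all fields are illustrative)."""
    name: str
    primitive: str          # e.g. "box" or "cylinder"
    size: List[float]       # primitive dimensions in metres
    position: List[float]   # xyz offset in the tool frame
    graspable: bool = False


@dataclass
class ToolAssembly:
    parts: List[Part] = field(default_factory=list)

    def to_vector(self) -> List[float]:
        """Flatten all continuous parameters into one shape vector."""
        vec: List[float] = []
        for p in self.parts:
            vec.extend(p.size)
            vec.extend(p.position)
        return vec

    def graspable_count(self) -> int:
        """Number of parts flagged as graspable."""
        return sum(p.graspable for p in self.parts)


# Hypothetical hook tool: a cylindrical handle with a box-shaped tip.
hook = ToolAssembly([
    Part("handle", "cylinder", [0.01, 0.30], [0.0, 0.0, 0.0], graspable=True),
    Part("tip", "box", [0.08, 0.02, 0.02], [0.0, 0.0, 0.30]),
])
s0 = hook.to_vector()
```

Keeping the discrete structure (which parts exist) separate from the continuous parameters (sizes and offsets) is one way to let a language model edit the former while a numerical optimizer tunes the latter.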
2. Joint Optimization Formulation
RobotSmith jointly optimizes the tool shape (parameter vector $s$) and the usage policy (trajectory $q$) to maximize task-specific performance, as evaluated by a composite objective with three terms:
- A task performance metric (the score $M$ returned by the physics simulation).
- A penalty for disconnected parts, poor graspability, or failed semantic constraints (as per the Critic).
- Regularizers for mesh complexity, part volume bounds, and the requirement that precisely one part is graspable.
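One plausible way to combine the three terms above into a single scalar objective is sketched below. The symbols $J$, $P_{\mathrm{VLM}}$, $R$ and the weights $\lambda_c$, $\lambda_r$ are assumed notation introduced here for illustration and may differ from the authors' formulation; $s$, $q$, and $M$ follow the pseudocode convention $\theta = (s, q)$.

```latex
\max_{s,\,q}\; J(s, q) \;=\; M(s, q) \;-\; \lambda_c\, P_{\mathrm{VLM}}(s) \;-\; \lambda_r\, R(s)
```

Under this reading, the task reward is maximized while the penalty and regularization terms are subtracted with non-negative weights.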
Typical hyperparameters: . Sampling constraints include scale (), translation ( m), and rotation ().
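The joint search over shape and trajectory can be illustrated with a toy example. The sketch below uses a simplified evolution strategy (Gaussian sampling around a mean, elite averaging, decaying step size) as a stand-in for CMA-ES; it does not adapt a covariance matrix. The reward, penalty, and regularizer functions, the weights `lam_c` and `lam_r`, and the 0.9 m target are all hypothetical placeholders for the simulator and the Critic.

```python
import random


def task_reward(s, q):
    """Toy simulator: arm reach q plus tool length s should hit 0.9 m."""
    return -abs(s + q - 0.9)


def validity_penalty(s):
    """Toy Critic-style penalty: non-positive tool lengths are invalid."""
    return 1.0 if s <= 0.0 else 0.0


def regularizer(s):
    """Toy manufacturability term: prefer shorter tools."""
    return abs(s)


def objective(theta, lam_c=10.0, lam_r=0.01):
    s, q = theta
    q = min(max(q, 0.0), 0.6)  # arm reach is limited to [0, 0.6] m
    return task_reward(s, q) - lam_c * validity_penalty(s) - lam_r * regularizer(s)


def evolve(theta0, iters=50, pop=16, elite=4, sigma=0.2, seed=0):
    """Simplified evolution strategy; returns (best_score, best_theta)."""
    rng = random.Random(seed)
    mean = list(theta0)
    best = (objective(mean), mean)
    for _ in range(iters):
        samples = [[m + rng.gauss(0.0, sigma) for m in mean] for _ in range(pop)]
        scored = sorted(((objective(t), t) for t in samples), reverse=True)
        elites = [t for _, t in scored[:elite]]
        mean = [sum(col) / elite for col in zip(*elites)]  # move toward elites
        best = max(best, scored[0])
        sigma *= 0.95  # shrink the search radius over time
    return best


start = [0.1, 0.1]  # initial (tool length, reach waypoint)
best_score, best_theta = evolve(start)
```

Because the tool length and the waypoint are sampled together, the search can trade a longer tool against a shorter reach, which is the coupling that joint optimization is meant to exploit.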
3. Vision–Language Model Roles and Constraints
The integration of VLM agents is central in RobotSmith. The Proposer agent is furnished with structured prompts and scene context to propose assemblies and design rationales, leveraging both API-based primitive geometry and text-to-3D model outputs. The Critic accepts rendered images and textual rationales and provides either a terminal "DONE" or specific edit instructions to correct geometric or functional defects.
The Critic loop encodes rules for:
- Connectivity (no floating or disconnected parts)
- Graspability (enforced uniqueness of graspable part)
- Clearance (component reachability and collision avoidance)
These priors allow the pipeline to prune infeasible or semantically invalid designs early, reducing optimizer dead-ends and accelerating convergence to manufacturable, robot-usable tools.
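Two of the rules above, connectivity and unique graspability, can be expressed as simple programmatic checks. The sketch below is a rule-based stand-in, not the VLM Critic itself, which reasons over rendered images; the function names, the joint-list representation, and the instruction strings are hypothetical. Clearance is omitted because it requires scene geometry.

```python
from collections import deque


def is_connected(parts, joints):
    """True if every part is reachable from the first via the joint graph."""
    if not parts:
        return False
    adj = {p: set() for p in parts}
    for a, b in joints:
        adj[a].add(b)
        adj[b].add(a)
    seen = {parts[0]}
    queue = deque([parts[0]])
    while queue:  # breadth-first traversal over joined parts
        for n in adj[queue.popleft()]:
            if n not in seen:
                seen.add(n)
                queue.append(n)
    return len(seen) == len(parts)


def check_design(parts, joints, graspable):
    """Return a list of edit instructions, or ["DONE"] if all rules pass."""
    issues = []
    if not is_connected(parts, joints):
        issues.append("connect floating parts")
    if len(graspable) != 1:
        issues.append("exactly one part must be graspable")
    return issues or ["DONE"]


ok = check_design(["handle", "tip"], [("handle", "tip")], ["handle"])
bad = check_design(["handle", "tip"], [], ["handle"])
```

Returning either "DONE" or a list of concrete edits mirrors the terminal-or-instructions behaviour described for the Critic.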
4. Task and Experimental Coverage
RobotSmith was validated on 9 manipulation tasks spanning rigid, deformable, and fluid domains:
- Reach (cube out-of-reach)
- Hold a Phone (upright stabilization)
- Lift a Bowl (without inner contact)
- Lift a Piggy (container retrieval)
- Dough Calabash (deform shaping)
- Flatten Dough
- Cut Dough
- Fill a Bottle (liquid transfer)
- Transport Water (tank-to-cup)
Each task was evaluated in 8 trials, each yielding a normalized task score. Principal metrics:
- Best score: best score per task
- Success Rate: fraction of trials that meet the task's success criterion
| Method | Best score | Success Rate |
|---|---|---|
| No tool | 0.24 | 2.8% |
| Retrieval | 0.53 | 11.1% |
| Meshy (3D Gen) | 0.72 | 21.4% |
| RobotSmith (Ours) | 0.94 | 50.0% |
RobotSmith more than doubles the success rate of the strongest baseline (50.0% vs. 21.4% for Meshy) and exceeds the retrieval and no-tool baselines by wider margins. Statistical reporting is based on 8 trials per task (Lin et al., 17 Jun 2025).
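The improvement factors implied by the table can be checked with a few lines of arithmetic. The dictionary below simply copies the reported success rates; the variable names are illustrative.

```python
# Success rates (%) as reported in the table above.
success_rate = {
    "No tool": 2.8,
    "Retrieval": 11.1,
    "Meshy (3D Gen)": 21.4,
    "RobotSmith": 50.0,
}

# Ratio of RobotSmith's success rate to each baseline, rounded to 2 decimals.
ratios = {
    method: round(success_rate["RobotSmith"] / rate, 2)
    for method, rate in success_rate.items()
    if method != "RobotSmith"
}
```

The ratios come out to roughly 17.9 for the no-tool baseline, 4.5 for retrieval, and 2.3 for Meshy.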
5. Trajectory Synthesis and Execution
The Tool User agent scripts high-level usage plans employing three core APIs: grasp(obj, euler), move(pos, euler), and release(). Grasp candidates are evaluated via farthest-point pair sampling and filtered by gripper orientation, with physical trial lifts in simulation to ensure feasibility. Moves are solved using inverse kinematics and collision-aware planners. The policy is jointly optimized with tool geometry, removing the need for separate low-level controllers such as PD or MPC.
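A high-level usage plan built from these three calls might look like the sketch below. `PlanRecorder` is a hypothetical stub that only records the calls; the real system dispatches them to grasp sampling, inverse kinematics, and motion planning. The poses and the object name are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class PlanRecorder:
    """Records grasp/move/release calls instead of driving a robot."""
    steps: List[tuple] = field(default_factory=list)

    def grasp(self, obj: str, euler: Vec3) -> None:
        self.steps.append(("grasp", obj, euler))

    def move(self, pos: Vec3, euler: Vec3) -> None:
        self.steps.append(("move", pos, euler))

    def release(self) -> None:
        self.steps.append(("release",))

    def waypoints(self) -> List[float]:
        """Flatten all move targets (position + euler) into one vector."""
        return [x for s in self.steps if s[0] == "move" for x in (*s[1], *s[2])]


# Hypothetical plan: pick up the tool, approach, push down, let go.
plan = PlanRecorder()
plan.grasp("tool", (0.0, 3.14, 0.0))
plan.move((0.4, 0.0, 0.2), (0.0, 3.14, 0.0))
plan.move((0.6, 0.0, 0.05), (0.0, 3.14, 0.0))
plan.release()
q0 = plan.waypoints()
```

Flattening the move targets into a single vector is one way the waypoints could be exposed to the optimizer alongside the tool's shape parameters.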
6. Sim-to-Real Transfer and Physical Deployment
RobotSmith's sim-to-real pipeline uses an XArm7 robotic arm with a parallel gripper and 3D-printed PLA tools. The process involves exporting the optimized mesh, fabricating it, and placing it in a calibrated robot workspace. Optimized trajectories are replayed on hardware without domain randomization.
Empirical results demonstrate that for Hold a Phone and Dough Calabash, the robotic setup achieves 10/10 successful real-world task completions. In a long-horizon pancake experiment (flatten, scoop, spread, sprinkle; one tool per subtask), the robot succeeded end-to-end in 3/5 runs, with failures attributed to slight misalignment during sauce spreading.
7. Performance Analysis, Limitations, and Outlook
RobotSmith’s key strengths stem from its combination of VLM-informed semantic priors and physically grounded CMA-ES optimization. The modular parameterization (geometric primitives, assembly code, text-to-3D) admits both diversity and editability, while Critic-based constraint propagation accelerates convergence and robustness.
Observed limitations and failure modes include:
- Design misalignment: Divergence in interpreted geometry from Proposer intent in text-to-3D outputs (e.g., size/orientation).
- Orientation ambiguity: Crude rotational specifiers can degrade precision in manipulation subtasks.
- Grasp failures: Heavier or awkward shapes are susceptible to slippage in fast motions.
- Optimization complexity: High-dimensional coupled tool-trajectory space can impede CMA-ES progression.
Proposed future research directions include richer geometric editing (topological changes, reconfiguration), the adoption of differentiable simulators for faster, gradient-driven optimization, and active learning of Critic heuristics from physical deployment experience.
In sum, RobotSmith is the first system to tightly couple large-scale vision-language design priors with end-to-end simulation-based optimization, delivering a success rate on robotic tool-use tasks more than double that of the strongest baseline (50.0% vs. 21.4%) and effecting smooth transfer from simulation to physical execution (Lin et al., 17 Jun 2025).