
TiPToP: A Modular Open-Vocabulary Planning System for Robotic Manipulation

Published 10 Mar 2026 in cs.RO | (2603.09971v1)

Abstract: We present TiPToP, an extensible modular system that combines pretrained vision foundation models with an existing Task and Motion Planner (TAMP) to solve multi-step manipulation tasks directly from input RGB images and natural-language instructions. Our system aims to be simple and easy-to-use: it can be installed and run on a standard DROID setup in under one hour and adapted to new embodiments with minimal effort. We evaluate TiPToP -- which requires zero robot data -- over 28 tabletop manipulation tasks in simulation and the real world and find it matches or outperforms $π_{0.5}\text{-DROID}$, a vision-language-action (VLA) model fine-tuned on 350 hours of embodiment-specific demonstrations. TiPToP's modular architecture enables us to analyze the system's failure modes at the component level. We analyze results from an evaluation of 173 trials and identify directions for improvement. We release TiPToP open-source to further research on modular manipulation systems and tighter integration between learning and planning. Project website and code: https://tiptop-robot.github.io

Summary

  • The paper demonstrates that a modular integration of foundation models with GPU-parallelized task and motion planning achieves high success rates in multi-step robotic manipulation without robot-specific training data.
  • It details a three-part architecture—perception, planning, and execution—that efficiently combines zero-shot scene interpretation with concrete trajectory generation to handle open-vocabulary tasks.
  • Empirical evaluations reveal up to 100% success in distractor-rich scenarios and faster task completions, underscoring TiPToP's potential for scalable, cross-platform deployment.

System Architecture and Pipeline

TiPToP is proposed as a fully modular, extensible system for language-conditioned robotic manipulation that unifies contemporary foundation models for perception with efficient task and motion planning (TAMP) to execute complex, multi-step tasks over everyday objects and open-vocabulary language instructions. The system forgoes any robot training data or embodiment-specific fine-tuning, instead leveraging inference-time foundation models as perception frontends, while planning is centralized in a GPU-parallelized TAMP backend. The architecture is partitioned into three principal modules: (1) a Perception Module integrating vision foundation models and a vision-language model (VLM) for scene interpretation and goal grounding, (2) a cuTAMP-based Planning Module for discrete-continuous task synthesis, and (3) an Execution Module responsible for accurate trajectory tracking. Figure 1

Figure 1: System overview demonstrating the flow from stereo RGB images and language to planned robot trajectories.

Perception: Foundation Models and Open-World Semantics

The perception stack operates at t = 0 (single-shot, open-loop), taking as input a calibrated stereo wrist-camera image pair and an unconstrained natural-language command. The 3D Vision Branch uses FoundationStereo for dense, robust zero-shot depth estimation, which outperforms device-specific stereo matching, especially on specular, transparent, or low-texture surfaces, followed by unprojection to a world-frame-aligned scene point cloud.
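The unprojection step is standard pinhole geometry; a minimal sketch, assuming known intrinsics (fx, fy, cx, cy) and a camera-to-world homogeneous transform, none of which are spelled out in the paper:

```python
import numpy as np

def unproject_depth(depth, fx, fy, cx, cy, cam_to_world):
    """Unproject a dense depth map (H, W) into a world-frame point cloud (N, 3)."""
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.reshape(-1)
    valid = np.isfinite(z) & (z > 0)          # drop holes in the depth estimate
    x = (us.reshape(-1) - cx) / fx * z
    y = (vs.reshape(-1) - cy) / fy * z
    pts_cam = np.stack([x, y, z], axis=1)[valid]   # points in the camera frame
    # Homogeneous transform into the world frame.
    pts_h = np.concatenate([pts_cam, np.ones((len(pts_cam), 1))], axis=1)
    return (pts_h @ cam_to_world.T)[:, :3]
```

With FoundationStereo supplying `depth`, the resulting cloud is what the downstream segmentation and grasp modules consume.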

Grasp prediction is performed via M2T2 on the full scene point cloud, providing 6-DoF candidate grasps, which are subsequently associated with object-level masks. The Semantic Branch queries a VLM (Gemini Robotics-ER 1.5) to extract precise 2D bounding boxes, object labels, and a formal goal graph (currently over On(a, b) predicates), integrating vision and language for task-relevant object grounding and semantic disambiguation. Segment Anything Model v2 (SAM-2) refines bounding boxes to pixel-level segmentation, extracting individual object geometries. Figure 2

Figure 2: Perception results with depth estimation (left), neural grasping confidence (middle), and semantic goal specification (right).
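The paper does not detail the Semantic Branch's output format; a hypothetical parser, assuming the VLM replies with JSON containing detections and a goal graph (this schema is invented for illustration), might look like:

```python
import json

def parse_vlm_response(raw):
    """Parse a (hypothetical) VLM JSON reply into detections and On(a, b) goal predicates."""
    reply = json.loads(raw)
    detections = {d["label"]: tuple(d["box"]) for d in reply["detections"]}  # label -> (x0, y0, x1, y1)
    goal = [(g["above"], g["below"]) for g in reply["goal"] if g["predicate"] == "On"]
    # Sanity check: every object referenced in the goal must be grounded in a detection.
    for a, b in goal:
        assert a in detections and b in detections, f"ungrounded goal object: {a} or {b}"
    return detections, goal

raw = json.dumps({
    "detections": [{"label": "crackers", "box": [10, 20, 60, 80]},
                   {"label": "tray", "box": [100, 40, 220, 160]}],
    "goal": [{"predicate": "On", "above": "crackers", "below": "tray"}],
})
dets, goal = parse_vlm_response(raw)
```

The returned boxes would then seed SAM-2 prompts, and the goal tuples feed the planner's symbolic goal.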

Object-centric scene construction combines instance masks with point cloud geometry, reconstructing watertight convex hull object meshes for downstream collision checks, and assigns predicted grasps using spatial proximity filtering. Figure 3

Figure 3: SAM-2 segmentation masks generated from VLM-detected bounding boxes, supporting mesh extraction and perception-planning integration.
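The grasp-to-object association by spatial proximity can be sketched in a few lines of NumPy; the centroid-based distance and the 0.10 m threshold are assumptions, not values from the paper:

```python
import numpy as np

def assign_grasps(grasp_positions, object_centroids, max_dist=0.10):
    """Assign each 6-DoF grasp (by its position) to the nearest object centroid,
    dropping grasps farther than max_dist metres from every object."""
    # (G, K) pairwise distances between grasp positions and object centroids.
    d = np.linalg.norm(grasp_positions[:, None, :] - object_centroids[None, :, :], axis=-1)
    nearest = d.argmin(axis=1)
    keep = d.min(axis=1) <= max_dist
    # Map each object index to the indices of its surviving grasps.
    return {k: np.flatnonzero(keep & (nearest == k)) for k in range(len(object_centroids))}
```

In the full system, M2T2's scene-level grasps would play the role of `grasp_positions`, and the centroids would come from the per-object convex hull meshes.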

Planning: GPU-Accelerated Task and Motion Optimization

Task specification, in the form of a predicate conjunction over detected objects, is translated into a PDDL-style symbolic goal. cuTAMP enumerates feasible plan skeletons (e.g., compositions of pick-and-place primitives), instantiates these skeletons by sampling continuous parameters from perception output and heuristics, and performs differentiable batch optimization over particle populations per skeleton to resolve discrete-continuous coupling and constraint satisfaction (e.g., collision-freedom, stable placements, kinematic feasibility).
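On the symbolic side, a conjunction of On predicates maps naturally onto a PDDL-style goal; the exact syntax cuTAMP consumes is an assumption here, this shows the standard PDDL form:

```python
def to_pddl_goal(predicates):
    """Render a conjunction of (predicate, *args) tuples as a PDDL-style goal string."""
    atoms = " ".join("({} {})".format(p[0].lower(), " ".join(p[1:])) for p in predicates)
    return f"(:goal (and {atoms}))"

goal = to_pddl_goal([("On", "crackers", "tray1"), ("On", "crackers2", "tray2")])
# "(:goal (and (on crackers tray1) (on crackers2 tray2)))"
```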

Particles that survive optimization initiate parallel GPU motion planning via cuRobo, ultimately producing a timed joint-space and gripper trajectory for deterministic execution. The modular planning interface admits straightforward extension with additional primitives or symbolic predicates.
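The particle-based refinement can be caricatured with a toy differentiable cost: attract sampled placements toward a target while a hinge penalty keeps them out of a circular keep-out zone, then keep the feasible survivors. This illustrates the idea only; it is not cuTAMP's actual objective:

```python
import numpy as np

def optimize_placements(n_particles, target, obstacle, radius=0.05, steps=100, lr=0.1, seed=0):
    """Toy batched particle optimization over place poses (x, y)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-0.5, 0.5, size=(n_particles, 2))        # sampled placements
    for _ in range(steps):
        # Gradient of the attraction term 0.5 * ||x - target||^2.
        grad = x - target
        # Hinge penalty max(0, radius - dist)^2 around the obstacle pushes particles out.
        diff = x - obstacle
        dist = np.linalg.norm(diff, axis=1, keepdims=True) + 1e-9
        pen = np.maximum(0.0, radius - dist)
        grad += -2.0 * pen * diff / dist
        x -= lr * grad                                        # batched gradient step
    feasible = np.linalg.norm(x - obstacle, axis=1) >= radius  # surviving particles
    return x[feasible]
```

In the real system, each surviving particle would then seed a parallel cuRobo motion-planning query.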

Execution: Trajectory Tracking and Open-Loop Limitations

The produced trajectory is tracked with a bespoke joint impedance controller, empirically found to yield higher precision than open-source alternatives such as DROID’s Polymetis. Importantly, the system executes open-loop (no visual feedback during or after primitive completion), meaning demarcation between perception, plan synthesis, and execution is strictly enforced. This determinism enables deep failure tracing but also exposes the system to errors from unrecovered execution failures (e.g., grasp reattempts are not supported in the current implementation).
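The paper does not give the controller's equations; a generic joint impedance law of the kind described looks like the following, where the gains and the assumption that gravity/Coriolis compensation is handled by the robot's dynamics model are mine, not the authors':

```python
import numpy as np

def impedance_torque(q, dq, q_des, dq_des, kp, kd):
    """Generic joint impedance law: tau = Kp (q_des - q) + Kd (dq_des - dq).
    Gravity and Coriolis terms are assumed to be compensated elsewhere."""
    return kp * (q_des - q) + kd * (dq_des - dq)
```

At each control tick, `q_des` and `dq_des` would be read off the timed joint-space trajectory produced by the planner.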

Empirical Results and Comparative Evaluation

TiPToP is evaluated in 28 unique rearrangement and pick-and-place scenarios (simulation and real-world tabletop environments) via 165 trials, with extensive comparisons against π₀.₅-DROID, a state-of-the-art vision-language-action (VLA) policy model fine-tuned on 350 hours of demonstration data on the same hardware. Critical evaluation axes included:

  • Open-vocabulary semantic composition, i.e., instructions such as "serve peanut butter crackers on each tray," requiring nuanced cultural and referential reasoning.
  • Generalization to distractor-rich, multi-step, and semantically ambiguous tasks.
  • Zero-shot cross-embodiment deployment: The same TiPToP software stack was ported with minimal engineering effort to multiple robot platforms (e.g., UR5e, Trossen WidowX AI).

Strong numerical results were observed:

  • On semantic, distractor, and multi-step tasks, TiPToP unambiguously outperforms π₀.₅-DROID, e.g., success rates on hard distractor tasks up to 100% versus 20% for the VLA baseline, and 75% versus 52% on multi-step tasks.
  • Task progress is systematically higher for TiPToP, and in many cases, non-successful runs still achieve most subgoals, evidencing robustness to partial failures.
  • Average completion time for successful trials favors TiPToP: e.g., 14–18s for TiPToP versus 32–45s for π₀.₅-DROID on real-world tasks.

Failure Analysis and Module Attribution

173 additional targeted experiments yielded granular failure diagnoses via module-level tracing: Figure 4

Figure 4: Module-level failure analysis identifying grasping, mesh completion, VLM prediction, and TAMP optimization as bottlenecks.

  • Grasping failures (e.g., non-contact or unstable picks) dominate (31/55 failures), caused by imperfect predictions of M2T2 (scene-level), heuristic grasp generation for missing objects, and lack of closed-loop re-attempting.
  • Scene completion and mesh approximation errors, especially convex hulls over concave or occluded geometries (e.g., bananas), lead to infeasible plans or excessive collision conservatism.
  • VLM errors (incorrect or missing detections or bounding boxes), which affect symbolic grounding and mask extraction.
  • cuTAMP failures: inability to find feasible plans within compute budget, usually in heavily cluttered scenes.

The strengths of the modular design are most evident here: each failure is attributable to an independent block, supporting targeted research and engineering improvements.

System Modularity, Extensions, and Implications

The architectural separation facilitates both rapid extension and practical deployment. The addition of new low-level skills (e.g., whiteboard wiping, demonstrated in the paper) requires only local changes: new symbolic predicates, action primitives in TAMP, and semantic branch prompt extensions. Figure 5

Figure 5: Extension to wiping (beyond pick-and-place), pairing language, perception, and motion control.
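What "only local changes" might look like in code: a hypothetical skill registry pairing a new symbolic operator with its preconditions, effects, and a semantic-branch prompt extension. The registry, predicate strings, and field names below are invented for illustration; the real cuTAMP interfaces differ:

```python
# Hypothetical skill registry illustrating the local-changes-only extension pattern.
SKILLS = {}

def register_skill(name, preconditions, effects, prompt_hint):
    """Declare a new primitive by its symbolic model plus a VLM prompt extension."""
    SKILLS[name] = {"pre": preconditions, "eff": effects, "prompt": prompt_hint}

register_skill(
    "Wipe",
    preconditions=["Holding(robot, sponge)", "Reachable(surface)"],
    effects=["Clean(surface)"],
    prompt_hint="Report wipeable flat surfaces and their extents.",
)
```

The point of the pattern is that nothing in the perception or motion-planning modules needs to change: the planner discovers the new operator through the registry, and the semantic branch picks up the extra prompt.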

Furthermore, the ability to deploy on arbitrary robots is demonstrated, contingent only on URDF and controller integration—no retraining or perception/planning modifications are required. This supports credible claims of enhanced reproducibility and ease of benchmarking in robotics.

Theoretical and Practical Implications

TiPToP serves as direct evidence that late-binding modular systems, constructed from powerful foundation models and planners, can approach or exceed the task coverage of large, embodiment-specific, end-to-end VLA policies, without reliance on robot training data, demonstration collection, or joint policy tuning. The work highlights the tradeoff between explicit symbolic/geometric reasoning (supporting robust open-vocabulary and compound task generation; rapid debugging) and reactive closed-loop policies (supporting robust error recovery, compliant actuation, and real-world execution drift). The integrated approach advocated—modular planning systems informed by foundation models—sets a viable direction for tightly coupling learning and planning, and for benchmarking modular versus monolithic architectures.

For future work, the authors underscore:

  • Closed-loop planning or belief-space TAMP to recover from execution mistakes and support reactivity;
  • Improved shape reconstruction, e.g., multi-view perception, neural implicit meshes (SAM-3D);
  • Training or adaptation of grasping policies for higher success on challenging objects;
  • Hybridization with VLA policies as closed-loop reactive primitives;
  • Automated predicate induction and policy abstraction via learning.

This direction is expected to underpin the next generation of generalizable, high-assurance robotic manipulation systems.

Conclusion

TiPToP establishes a new reference architecture for planning-based manipulation via modular integration of vision, language, and planning foundation models. The system demonstrates that high-level compositionality, transferable deployment, and open-vocabulary semantics are feasible at scale, without training data, and are competitive with (and often superior to) leading VLA systems—particularly in semantic, distractor, and multi-step tasks. The modular structure is vital for identifying, reproducing, and correcting system-level failures, and promises fluid adaptation as improved vision and policy backends emerge. The open-source release and cross-platform compatibility make TiPToP a robust foundation for future modular manipulation research and for exploring synergies between structured planning and data-driven control.

Explain it Like I'm 14

What this paper is about

The paper introduces TiPToP, a robot system that can understand a natural-language instruction (like “put the crackers on each tray”), look at camera images of a table, and plan a step-by-step way to move objects to complete the task. It’s designed to “just work” without special training data for each robot, and to be easy to install and adapt to different robot arms.

What the researchers wanted to find out

They asked three simple questions:

  • How well can TiPToP do everyday, multi-step robot tasks from plain language, compared to a powerful learned system that needs lots of training?
  • How fast is TiPToP at finishing tasks?
  • When TiPToP fails, why does it fail?

How TiPToP works (in everyday language)

TiPToP has three main parts: seeing, planning, and doing.

1) Seeing (Perception)

The robot starts with two images from a small camera on its wrist (like your two eyes). From these, it:

  • Builds 3D: It uses a “stereo depth” model that turns the two images into a 3D “point cloud,” which is like a sculpture made of tiny dots showing where surfaces are in space.
  • Finds objects and their names: It asks a vision-language model (a very smart image+text AI) to draw boxes around objects in the picture and label them (e.g., “peanut butter crackers,” “tray”). It also turns the instruction (“put crackers on each tray”) into a clear goal the planner can use, like “On(crackers, tray).”
  • Outlines shapes: It uses a segmentation tool (think: coloring inside the lines) to get each object’s exact outline in the image.
  • Suggests grasps: It proposes many possible ways the robot could grab items in 3D.

Finally, it combines all this to make simple 3D shapes for each object (like a shrink-wrap around the object) and attaches the best candidate grasps to each object.

Key ideas explained:

  • Stereo depth: like how your two eyes help you sense distance.
  • Point cloud: a 3D map made of many dots.
  • Segmentation: coloring the exact pixels of an object.
  • Vision-LLM (VLM): an AI that can look at a picture and understand a text instruction.
  • Grasp candidates: different hand positions the robot could use to pick something up.

2) Planning (Thinking ahead)

TiPToP uses a method called Task and Motion Planning (TAMP). Think of it like writing a recipe (“pick up crackers,” “move to tray,” “place”) and also checking if each step is physically possible (no bumps, reachable by the arm, stable placements). It:

  • Lists possible action sequences (like different recipes to reach the goal).
  • Makes many quick guesses for the exact details (where to grab, where to place) and improves them using fast math on a graphics card (GPU), like having hundreds of assistants testing ideas in parallel.
  • Finds a smooth, collision-free path for the arm to follow.

Key ideas explained:

  • TAMP: planning the high-level steps and the detailed arm motions together.
  • GPU acceleration: using powerful parallel processors to try many options fast.
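A toy version of the "many assistants guessing in parallel" idea, with made-up numbers rather than anything from the real planner: score hundreds of random placement guesses at once and keep the best one.

```python
import numpy as np

# Score 500 random placement guesses simultaneously and keep the best,
# instead of trying them one at a time.
rng = np.random.default_rng(0)
guesses = rng.uniform(-1.0, 1.0, size=(500, 2))    # 500 candidate (x, y) spots
target = np.array([0.4, -0.2])                     # where we'd like the object to go
scores = np.linalg.norm(guesses - target, axis=1)  # lower score = better guess
best = guesses[scores.argmin()]
```

A GPU does the same trick with far more guesses, and also nudges each guess to improve it, which is why the planner can be both thorough and fast.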

3) Doing (Execution)

The robot then follows the planned path like a choreographed dance, sending joint motions to the arm and opening/closing the gripper at the right times.

Note: TiPToP currently executes “open-loop,” which means it doesn’t look again while moving. It trusts the plan and doesn’t adjust mid-action. That’s fast, but if something slips, it won’t correct itself unless you re-run the whole step.

What they found and why it matters

They tested TiPToP on 28 tabletop tasks both in simulation and on real robots (Franka, UR5e, WidowX), running 165 trials in total, and compared it to a strong learned system called π₀.₅-DROID that was trained on 350 hours of robot demonstrations.

Main results:

  • Success: TiPToP matched or beat the trained system overall, especially on:
    • Tasks with many distractor objects (it picked the right item among many).
    • Tasks with tricky language (like “largest toy” or “matching plate”).
    • Multi-step tasks (like moving something out of the way first, then placing).
  • Speed: It usually finished tasks faster, often about half the time on simple tasks. Planning the whole path up front made execution quick and purposeful.
  • Easy setup: It can be installed in under an hour on common setups and adapted to new robot arms with modest effort. No new robot-specific training data is needed.

Where it struggled:

  • Grasping: Most failures were from missed or unstable grasps. Because it doesn’t re-check during motion, it can’t retry automatically.
  • Shape approximations: It uses simple “shrink-wrap” shapes (convex hulls) for objects from one camera view. This can be wrong for bendy or concave shapes (like bananas), causing planning or collision mistakes.
  • Tiny objects: Very small items are hard to pick reliably in a single try.

Why this is important:

  • It shows that smart “modules” (seeing + planning) built from general-purpose AI models can rival systems that need lots of robot training data.
  • The system is modular, so each part can be swapped or upgraded as better AI tools appear (better depth, better grasping, better language understanding, faster planners).
  • It’s open-source, making it easier for others to build on and compare with.

What this could lead to next

The authors suggest practical upgrades that a team could add over time:

  • Look-and-replan: After each step, look again and fix mistakes (e.g., retry a slipped grasp).
  • See from more angles: Use more cameras or move the camera first to get better 3D shapes.
  • Better object shapes: Use new “shape completion” tools that guess a full 3D shape from limited views.
  • Mix “thinking” with “reacting”: Combine TiPToP’s careful planning with reactive learned skills (like a learned grabber that can adjust mid-move) to get the best of both worlds.

In simple terms, TiPToP is like a careful organizer: it looks, understands, plans a detailed sequence, and executes smoothly. With a bit more “street smarts” during execution—like checking and adjusting on the fly—it could become even more reliable for real-world chores.

Knowledge gaps, limitations, and open questions

Below is a single, concrete list of what remains missing, uncertain, or unexplored, framed to guide actionable follow-up work.

  • Closed-loop execution and recovery: add execution monitors (vision/tactile/force) to detect grasp/placement failure, support grasp retries, mid-trajectory corrections, and step-wise replan; quantify recovery gains versus open-loop execution.
  • Multi-view and active perception: plan next-best-view(s) or fuse multiple static cameras to reduce occlusions and improve geometry; evaluate improvements in grasp success and collision rates.
  • Shape reconstruction beyond convex hulls: replace single-view convex hulls with learned 3D completion or implicit meshes; benchmark collision false positives/negatives and placement stability versus ground-truth meshes.
  • Uncertainty-aware (belief-space) TAMP: propagate uncertainty from detection, segmentation, and depth into grasp/placement and collision constraints; evaluate robust plans and information-gathering actions.
  • Richer symbolic predicates and language grounding: systematically extend beyond On(a,b) to In/Inside/Under/LeftOf/Clear/ContainedIn/Open/Close/Containment/Stacking; support quantifiers and counts (each/exactly-N); validate predicate correctness against the scene.
  • Learning preconditions/effects for new skills: automatically infer abstract models for skills (e.g., wiping, pushing) from data; verify that learned models enable correct plan sequencing under TAMP.
  • Integrating reactive learned skills: define interfaces for invoking VLA or visuomotor policies as feedback-controlled primitives within plans; specify/learn their preconditions/effects and handoff logic.
  • Grasp robustness and selection: add collision-aware grasp filtering, grasp-quality simulation, and multi-attempt strategies; incorporate tactile/force feedback; support diverse grippers (suction, multi-finger) and evaluate cross-gripper generalization.
  • Small-object manipulation: develop specialized sensing (higher-res, macro depth) and grasp strategies for tiny or low-profile items; quantify improvements on AirPods/cashew-like tasks.
  • Slip detection and compliant control: use F/T and finger position signals to detect incipient slip; modulate grip force, regrasp, or adjust trajectories; evaluate reduction in transport losses.
  • Dynamic and human-in-the-loop environments: support online replanning under moving obstacles/humans; characterize latency/throughput required for safe, responsive behavior.
  • Explicit obstruction reasoning: add pre-grasp approach/visibility analysis and clearance tests to decide when to move obstructions, rather than relying on optimization failures to reveal the need.
  • Automatic capture-pose selection: plan a wrist-camera pose (or sequence) to maximize task-relevant coverage given kinematics and occlusions; measure impact versus a fixed capture pose.
  • Cross-embodiment scaling: test on a wider range of arms and grippers; quantify porting effort, performance variability, and required controller tuning across embodiments.
  • Compute footprint and deployment constraints: profile runtime, GPU memory, and latency per module (depth, VLM, TAMP) across hardware; demonstrate feasibility on on-robot compute and cloud-free settings.
  • Reliance on proprietary VLMs: evaluate open/local VLM alternatives for detection/grounding (accuracy, latency, cost); study robustness to prompt sensitivity and model/version drift; design fallbacks when VLM calls fail.
  • Calibration robustness: analyze sensitivity to camera intrinsics/extrinsics error and joint encoder drift; add auto-calibration/online refinement; quantify downstream effects on grasp/placement success.
  • Physical property uncertainty: incorporate estimates of mass, friction, and contact geometry into stability constraints; learn/update these online; measure impact on stacking and “on vs in” semantics.
  • Motion planning in heavy clutter: characterize cuRobo failure modes/timeouts; add fallback planners or learned guidance; set adaptive time budgets and multi-seed strategies.
  • Plan skeleton scalability: study complexity and pruning when objects/tasks scale; add heuristic or learned skeleton proposal/guidance; report worst-case and average enumeration costs.
  • Broader, standardized evaluation: benchmark on established suites (e.g., RLBench, BEHAVIOR, ManiSkill, CALVIN) and compare to additional baselines (PDDLStream, VoxPoser, OWL-TAMP, LLM3); report confidence intervals and statistical significance.
  • Task diversity beyond tabletop pick-and-place: add primitives and evaluations for articulated-object manipulation, tool use, nonprehensile actions (pushing, sliding), deformables/cables, and long-horizon mobile manipulation.
  • Semantic verification at execution: after acting, re-check that goals are satisfied (e.g., on vs in/inside) with perception; trigger corrective actions if semantic predicates are not met.
  • End-to-end error propagation: quantify how detection/segmentation/depth errors translate into grasp/plan failures; identify the most impactful error sources to prioritize model upgrades.
  • Robust segmentation/detection under occlusion and clutter: compare SAM-2 plus box prompts to 3D instance segmentation and open-vocabulary detectors; evaluate small-object and heavy-occlusion regimes.
  • Fairness and controls in comparisons: control for sensor differences (stereo vs monocular, external cameras), timing protocols, and idling; run matched-sensing ablations to isolate architectural effects.
  • Safety analysis and constraints: integrate human-safe force/velocity limits and forbidden zones into planning and control; add formal monitors/guarantees for collision avoidance during execution.
  • Learning from failures: turn logged execution failures into improvements (grasp priors, shape priors, predicate reliability); study online adaptation without embodiment-specific demonstrations.
  • Persistent world models: maintain object identities and states across tasks/episodes to enable multi-step projects, inventory tracking, and long-horizon goal fulfillment.

Practical Applications Derived from the Paper

Below are actionable, real-world applications grounded in TiPToP’s findings, methods, and innovations, grouped by deployment horizon. Each item identifies sectors, concrete use cases, potential tools/workflows, and key assumptions or dependencies that impact feasibility.

Immediate Applications

  • Industry — Light manufacturing, kitting, and assembly
    • Use cases:
    • Language-driven kitting and packing: “Put one of each item on each tray/bay,” “Pack pods onto tray,” “Place X on Y while avoiding obstacles.”
    • Distractor-rich pick-and-place on changing SKUs without retraining (e.g., small-batch, high-mix lines).
    • Obstruction-aware manipulation (moving a blocking item to access a target).
    • Tools/workflows:
    • “Semantic workcell” interface that accepts natural-language task templates and images.
    • A TiPToP-powered kitting station where operators specify goals in plain English.
    • Onboarding workflow: import URDF, calibrate a stereo wrist camera, run built-in cuRobo config, and deploy.
    • Assumptions/dependencies:
    • Static or quasi-static scenes; single-view stereo capture at t=0; open-loop execution.
    • Requires GPU (CUDA) and wrist-mounted stereo or dual-camera rig; accurate calibration.
    • Convex-hull mesh approximation can fail on strongly concave items (e.g., bananas).
  • Warehousing and fulfillment
    • Use cases:
    • Bin sorting and order consolidation with open-vocabulary instructions: “Sort blocks by color,” “Place the largest toy onto the purple plate/bin.”
    • Distractor rejection in cluttered totes without custom dataset collection.
    • Tools/workflows:
    • “Natural-language sorter” station using TiPToP as a drop-in inference-time planner for tote-to-bin routing.
    • Autogenerated subgoals via VLM-predicated goal grounding for SKU variants.
    • Assumptions/dependencies:
    • Object sizes compatible with a parallel gripper; single-viewpoint limitations; cloud or local VLM availability.
  • Labs and R&D (academia and industry)
    • Use cases:
    • Out-of-the-box baseline for multi-step tabletop manipulation from pixels and language—no robot training data needed.
    • Component-level failure auditing and ablations (depth, segmentation, grasps, planning) using TiPToP’s modular stack.
    • Cross-embodiment evaluation (e.g., Franka, UR5e, WidowX) with minimal porting effort.
    • Tools/workflows:
    • Benchmark harness comparing TAMP vs. VLA vs. hybrid methods on semantic/multi-step tasks.
    • Dataset generation of successful trajectories for imitation or RL fine-tuning; synthetic tasks in Isaac Sim.
    • Assumptions/dependencies:
    • GPU for cuTAMP/cuRobo; access to foundation models (SAM-2, FoundationStereo, VLM); correct URDF and collision spheres.
  • Service robotics (offices, labs, controlled home-like spaces)
    • Use cases:
    • Desk/lab tidying with semantic grounding: “Place peanut-butter crackers on each tray,” “Put markers into the cup.”
    • Simple wiping/cleaning tasks when extended with the provided Wipe primitive.
    • Tools/workflows:
    • Operator-in-the-loop language interface to issue episodic tasks; pre-validated plan preview.
    • Assumptions/dependencies:
    • Limited to tabletop-like, static scenes and open-loop execution; wiping assumes known surface and reachable region.
  • Education and training
    • Use cases:
    • Teaching TAMP, 3D perception, and VLM grounding in robotics courses without large-scale data collection.
    • “Skill extension labs” where students add new predicates/operators (e.g., Wipe) over a weekend.
    • Tools/workflows:
    • Reproducible course assignments in Isaac Sim; plug-and-play lab demos on common arms.
    • Assumptions/dependencies:
    • Entry-level GPU workstation; institution-friendly licensing for VLMs and vision models.
  • Systems integration and prototyping for robotics companies
    • Use cases:
    • Rapid proof-of-concept demos for prospective customers without collecting demonstrations.
    • Wrapping existing industrial grippers/cameras with a language-conditioned TAMP planner.
    • Tools/workflows:
    • TiPToP adapter kits: embodiment config, camera calibration utilities, controller bridges.
    • Pre-flight “plan and simulate” workflow to verify collision-free trajectories before execution.
    • Assumptions/dependencies:
    • Time-optimal open-loop trajectories require precise joint tracking; impedance control tuning is critical.
  • Process quality and safety auditing (policy/assurance)
    • Use cases:
    • Component-level “explainability” for failures (perception vs. planning vs. control) in acceptance testing and incident reports.
    • Tools/workflows:
    • Automated logs and Sankey-style dashboards that classify failure modes across trials.
    • Assumptions/dependencies:
    • Requires standardized task suites and labeling rules for “task progress” vs. “success” metrics.
  • Hobbyist and maker ecosystems (daily life)
    • Use cases:
    • Entry-level manipulation on affordable arms (e.g., Trossen WidowX AI) for household sorting, toy cleanup, and educational projects.
    • Tools/workflows:
    • Community-contributed predicate libraries and embodiment configs; GUI wizards for calibration.
    • Assumptions/dependencies:
    • Lighter hardware may limit payload/precision; stereo perception recommended.

Long-Term Applications

  • Hybrid reactive-planning robotics across sectors (manufacturing, logistics, service, healthcare)
    • Vision:
    • Combine TiPToP’s semantic TAMP with closed-loop VLAs as reactive primitives for grasp retries, slippage recovery, and fine manipulation (folding, cable tasks).
    • Potential products:
    • “TiPToP Hybrid Controller”: a planner that calls VLA-based micro-skills with declarative preconditions/effects.
    • Dependencies:
    • Robust skill abstractions (learned or engineered); safety guarantees for black-box policies.
  • Robust, uncertainty-aware deployment (belief-space TAMP)
    • Vision: Active perception and information-gathering actions; multi-view shape completion; probabilistic reasoning about grasp and placement uncertainty.
    • Potential products: “Perception-first planning” module that moves cameras to reduce occlusions before solving.
    • Dependencies: Extensions of cuTAMP to belief space; faster multi-view depth/shape completion (e.g., SAM-3D-class methods).
  • Home assistance and eldercare
    • Vision: Reliable household tidying, dish/tray preparation, and surface cleaning with semantic grounding and safer, reactive execution.
    • Potential products: Natural-language caregiver assistant that executes checklists (“Set up the breakfast tray,” “Clear the table”) with plan previews for approval.
    • Dependencies: Safety certification, reliable human-aware motion, handling of deformable/fragile objects, privacy-first on-device perception/VLMs.
  • Hospitals and clinical logistics
    • Vision: Non-sterile supply handling, sorting, and room turnover support (e.g., placing items on designated trays/carts; wiping non-critical surfaces).
    • Potential products: “Semantic cart restockers” controlled by nurses via simple sentences.
    • Dependencies: Infection control, robust ID of medical supplies, rigorous fail-safes, regulatory compliance.
  • Retail restocking and backroom operations
    • Vision: Language-directed put-away and facing: “Place the largest cereal on the top shelf,” “Group by brand/color.”
    • Potential products: Voice-driven backroom assistants that adapt to changing SKUs without retraining.
    • Dependencies: Handling of concave, deformable, or reflective packages; extended reach and shelf-aware planning.
  • Construction and recycling (material sorting)
    • Vision: Sorting heterogeneous items by category or attribute without per-class training.
    • Potential products: Open-vocabulary sorter cells for demolition debris or e-waste triage.
    • Dependencies: Heavy-duty end-effectors, dust/lighting robustness, multi-view perception, high-throughput constraints.
  • Agricultural post-harvest handling
    • Vision: Gentle pack-and-place by size/grade/color; open-vocabulary sorting (“Place ripe tomatoes into tray B”).
    • Potential products: On-farm semantic kitting and packing assistants.
    • Dependencies: Deformable, delicate grasping; humidity/dirt robustness; fast re-planning in dynamic bins.
  • Standardization and policy (auditability, certification)
    • Vision: Module-level transparency as a regulatory asset for certifying autonomy in shared workspaces (traceable failure causes; deterministic planning logs).
    • Potential products: Compliance toolkits that replay and attribute failures to perception/planning/control components with metrics like Task Progress.
    • Dependencies: Sector-specific standards for logging, explainability, and human-in-the-loop overrides.
  • Commercialization of a modular manipulation SDK
    • Vision: A supported “TiPToP Pro” offering with:
      • Embodiment adapters (Franka, UR, ABB, Fanuc), GUI task editors, plan simulators, and fleet management.
      • Predicate and skill marketplaces (community and vendor-contributed).
    • Dependencies: Long-term support for foundation models (licensing, on-device variants), sustained GPU availability, vendor controller integrations.
  • Data generation and simulation-for-learning at scale
    • Vision: Automated curriculum generation of multi-step, semantic tasks to pretrain or finetune VLAs and grasp networks.
    • Potential products: “TiPToP Task Forge”: Isaac Sim pipelines that synthesize labeled trajectories and failure taxonomies for training robust policies.
    • Dependencies: Domain randomization/bridging to real; scalable simulation infrastructure.
  • Human-robot interaction with interactive goal refinement
    • Vision: Robots that ask clarifying questions when VLM grounding is uncertain and update the plan on-the-fly.
    • Potential products: Conversational plan editors that visualize the TAMP skeleton, allow human edits, and re-optimize in seconds.
    • Dependencies: Real-time re-planning loop, uncertainty estimates from VLMs, tight latency bounds for mixed-initiative control.

Notes on cross-cutting assumptions and risks:

  • Hardware/software requirements: CUDA-capable GPU; accurate camera extrinsics/intrinsics; URDF and controller with precise joint tracking; stereo or dual cameras preferred.
  • Perception scope: Single-viewpoint depth and convex hull meshing can misrepresent concave/deformable/reflective objects; multi-view or learned shape completion mitigates this.
  • Execution: Current open-loop assumption limits robustness to slips or failed grasps; adding closed-loop retries and re-planning is essential for high reliability.
  • Data/privacy/licensing: Some VLMs and models may require cloud access or specific licenses; on-device models may be needed for sensitive environments.
  • Safety: Industrial and healthcare deployments require certified safety layers, human-aware motion, and fault detection before executing open-loop plans.
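The closed-loop retries and re-planning called for in the execution note above can be sketched as a verify-and-retry wrapper around the open-loop plan. This is a minimal illustrative sketch, not part of the TiPToP codebase; every function and attribute name here (`execute_with_retries`, `perceive`, `check_effect`, `replan`, `action.retries`) is hypothetical:

```python
# Hypothetical closed-loop executor: run each open-loop primitive, then
# re-observe and verify its expected effect before committing to the next
# action. All names are illustrative, not the actual TiPToP API.
def execute_with_retries(plan, perceive, check_effect, replan, max_retries=2):
    """Execute symbolic actions while verifying each expected effect."""
    i = 0
    while i < len(plan):
        action = plan[i]
        action.execute()                 # open-loop primitive (e.g., pick, place)
        scene = perceive()               # fresh observation of the workspace
        if check_effect(action, scene):  # did the predicate (e.g., On(a, b)) hold?
            i += 1                       # effect achieved: advance to next action
        elif action.retries < max_retries:
            action.retries += 1          # transient failure (slip): retry primitive
        else:
            plan = replan(scene, plan[i:])  # give up on retries: re-plan remainder
            i = 0
    return True
```

In a real system, `check_effect` could re-run the segmentation/VLM stack on the new image to test whether the goal predicate became true, and `replan` could invoke the TAMP solver again from the observed state.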

Glossary

  • 6-DoF: Six degrees of freedom describing full 3D position and orientation of an object or end-effector. "predict 6-DoF grasp poses from point clouds."
  • Belief-space planning: Planning over probability distributions of states to handle uncertainty. "Belief-space planning."
  • camera intrinsics: Parameters defining a camera’s internal geometry (focal length, principal point, etc.). "camera intrinsics K"
  • camera-to-end-effector extrinsics: Rigid transform from the camera frame to the robot’s end-effector frame. "camera-to-end-effector extrinsics"
  • Cartesian control: Control in task space (x, y, z, orientation) rather than joint space. "IK-based Cartesian control."
  • convex hull: The smallest convex shape enclosing a set of points; used to approximate object geometry. "compute the convex hull to form a watertight mesh."
  • cuRobo: A GPU-accelerated motion planning library for fast trajectory computation. "cuRobo, a GPU-accelerated motion planner"
  • cuTAMP: A GPU-parallelized Task and Motion Planning algorithm for optimizing discrete and continuous variables. "TiPToP uses cuTAMP, a GPU-parallelized Task and Motion Planning algorithm,"
  • DROID: A standardized robot hardware/software platform used for benchmarking. "a standard DROID setup"
  • FoundationStereo: A foundation model for stereo depth estimation from RGB image pairs. "We use FoundationStereo, a foundation model for stereo depth estimation,"
  • forward kinematics (FK): Computing the pose of the robot’s end-effector from joint angles. "forward kinematics (FK)"
  • Gemini Robotics-ER 1.5: A large vision-language model used for object detection and goal grounding. "Gemini Robotics-ER 1.5, a VLM,"
  • grasp generation: Predicting feasible grasp poses for objects from sensor data. "Foundation models for grasp generation"
  • inverse kinematics: Computing joint angles that achieve a desired end-effector pose. "inverse kinematics."
  • IsaacSim: A physics-based simulator for robotics experimentation and development. "IsaacSim"
  • joint impedance controller: A controller that regulates motion by simulating compliance (stiffness/damping) in joint space. "joint impedance controller."
  • KDTree: A spatial data structure for efficient nearest-neighbor queries in k-dimensional space. "KDTree"
  • M2T2: A model that predicts ranked 6-DoF grasp poses from point clouds. "We use M2T2 to predict ranked 6-DoF grasp poses"
  • Motion primitive: A parameterized low-level action (e.g., pick, place) used to compose complex behaviors. "operates primarily over pick-and-place primitives"
  • On(a, b): A symbolic predicate specifying that object a should be placed on object or surface b. "On(a, b) specifies that object a should be placed on object or surface b"
  • open-loop: Executing a precomputed plan without using feedback during execution. "executed open-loop with no further visual observations."
  • Particle Initialization: Sampling initial continuous parameters (grasps, placements, configurations) for optimization in TAMP. "Particle Initialization."
  • Particle Optimization: Differentiable refinement of sampled parameters to satisfy constraints in TAMP. "Particle Optimization."
  • PDDL: A formal language (Planning Domain Definition Language) for describing planning problems. "PDDL-style symbolic planner"
  • PDDLStream: A TAMP framework that augments PDDL planning with sampling-based continuous reasoning. "PDDLStream"
  • plan skeletons: Sequences of abstract actions without bound continuous parameters. "plan skeletons --- sequences of symbolic actions without committed continuous parameters."
  • predicate: A symbolic relation over objects used to specify goals/conditions in planning. "a symbolic goal G expressed as a conjunction of predicates"
  • RANSAC: A robust method for model fitting in noisy data by iteratively sampling consensus sets. "RANSAC"
  • SAM: Segment Anything Model for promptable image segmentation. "SAM"
  • SAM-2: An improved promptable segmentation model for pixel-level masks. "SAM-2"
  • stereo baseline: The distance between two cameras in a stereo pair used for depth triangulation. "stereo baseline b"
  • Task and Motion Planning (TAMP): Planning that jointly reasons over discrete actions and continuous motions/constraints. "Task and Motion Planning (TAMP)"
  • unprojecting depth: Converting depth pixels into 3D points in camera/world coordinates. "Unprojecting depth to 3D."
  • URDF: Unified Robot Description Format, an XML format describing a robot’s kinematics and geometry. "The embodiment must possess a camera, gripper, URDF, and trajectory tracking controller to be supported."
  • Vision-Language-Action (VLA): Models that map images and language instructions directly to robot actions. "Vision-Language-Action (VLA) models such as π0.5 and OpenVLA offer an appealing input-output specification"
  • Vision-Language Model (VLM): Models that jointly process visual inputs and text to perform tasks like grounding and detection. "VLMs combine vision and language understanding"
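Several of the perception terms above (camera intrinsics K, stereo baseline b, unprojecting depth) compose into one standard pinhole-camera computation: triangulate metric depth as z = f · b / disparity, then unproject pixel (u, v) into 3D via z · K⁻¹ [u, v, 1]ᵀ. The following is a minimal sketch of that textbook math, not TiPToP's implementation (which delegates depth estimation to FoundationStereo); the function names and example numbers are illustrative:

```python
# Pinhole-model sketch tying together three glossary terms:
# stereo triangulation (z = f * b / d) and depth unprojection
# ((X, Y, Z) = z * K^-1 [u, v, 1]). Illustrative only.

def stereo_depth(disparity_px, focal_px, baseline_m):
    """Metric depth from stereo disparity, focal length f, and baseline b."""
    return focal_px * baseline_m / disparity_px

def unproject(u, v, z, fx, fy, cx, cy):
    """Unproject pixel (u, v) at depth z into camera-frame 3D coordinates,
    using intrinsics K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]."""
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return (x, y, z)

# Example: 64 px disparity with f = 640 px and a 6.4 cm baseline
z = stereo_depth(64.0, 640.0, 0.064)   # -> 0.64 m
point = unproject(320.0, 240.0, z, fx=640.0, fy=640.0, cx=320.0, cy=240.0)
# A pixel at the principal point unprojects to (0.0, 0.0, z).
```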

