UniPlan: Unified Task & Motion Planning
- UniPlan is a unified task and motion planning framework that integrates extended PDDL domains, visual-topological mapping, and language grounding to enable robust mobile manipulation.
- The system extends basic PDDL with spatial graph embedding, bimanual coordination, and cost optimization, supporting complex navigation and manipulation in large indoor environments.
- Experimental evaluations show UniPlan achieves over 90% success with lower planning costs and efficient execution in campus-scale scenarios compared to baseline methods.
UniPlan refers to a class of unified task and motion planning frameworks in robotics, artificial intelligence, and operations research, with recent instantiations notably in vision-language-based mobile manipulation planning. These frameworks are characterized by their integration of symbolic planning (typically using PDDL), high-level domain and environment representations, and vision-language grounding to connect perceptual inputs with task specifications. The most prominent modern system titled UniPlan combines extended mobile-manipulation PDDL domains, visual-topological map representations, language and visual grounding, and efficient multi-step planning in large indoor environments (Ye et al., 9 Feb 2026).
1. Unified Domain Representation: Extended Mobile-Manipulation PDDL
UniPlan leverages the Planning Domain Definition Language (PDDL) as the substrate for encoding robot capabilities and environmental structure at scale. The baseline mobile-manipulation domain (including actions such as pick, place, open, and pour) is programmatically extended to support:
- Spatial Graph Embedding: Addition of predicates such as (robot_at_node ?r ?n), (object_at_node ?o ?n), (has_door ?n1 ?n2), (connected ?n1 ?n2), and a numeric function (travel_cost ?n1 ?n2) allow the explicit modeling of navigation topology, door connectivity, and cost structure within large buildings.
- Navigation and Traversability Actions: New operator schemas are introduced to support moving through the graph and manipulating the environment:

```lisp
(:action move_robot
  :parameters (?r ?from ?to)
  :precondition (and (robot_at_node ?r ?from)
                     (connected ?from ?to))
  :effect (and (robot_at_node ?r ?to)
               (not (robot_at_node ?r ?from))
               (increase (total-cost) (travel_cost ?from ?to))))

(:action open_door
  :parameters (?r ?hand ?from ?to)
  :precondition (and (robot_has_hand ?r ?hand)
                     (robot_at_node ?r ?from)
                     (has_door ?from ?to)
                     (hand_free ?r ?hand)
                     (not (connected ?from ?to)))
  :effect (and (connected ?from ?to)
               (connected ?to ?from)
               (increase (total-cost) 1)))
```
- Bimanual Coordination: All manipulation predicates and effects are generalized to account for multiple end-effectors per robot, enabling two-handed coordinated tasks.
- Cost Optimization: All actions increment a global (total-cost) variable; planners are instructed to minimize this cost metric, with navigation costs derived from (travel_cost ?from ?to) and atomic manipulation steps costing 1.
This unified domain supports mobile manipulation across multi-room environments with arbitrary topology, door traversals, and heterogeneous objects (Ye et al., 9 Feb 2026).
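As an illustration of how such a domain could be instantiated, the sketch below emits the spatial-graph facts for a PDDL problem's `:init` section from an edge list. The helper name and edge format are assumptions for illustration, not part of UniPlan:

```python
# Sketch: emitting spatial-graph :init facts from an adjacency description.
# The function name (emit_init) and tuple layout are illustrative assumptions.

def emit_init(edges):
    """edges: iterable of (n1, n2, cost, has_door) tuples."""
    facts = []
    for n1, n2, cost, has_door in edges:
        if has_door:
            # door edges start unconnected; an open_door action connects them
            facts.append(f"(has_door {n1} {n2})")
        else:
            facts.append(f"(connected {n1} {n2})")
        facts.append(f"(= (travel_cost {n1} {n2}) {cost})")
    return facts

facts = emit_init([("hall", "lab", 4, True), ("hall", "lobby", 2, False)])
```

A real exporter would also emit the symmetric facts and object declarations; this fragment shows only the predicate/function encoding described above.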
2. Visual-Topological Map Construction
UniPlan constructs an explicit world model as a visual-topological graph:
- Node Types: Fine-grained pose nodes (waypoints), room nodes, and asset nodes (furniture, devices).
- Adjacency and Traversal: An adjacency matrix encodes spatial connectivity, while edge attributes such as has_door and travel_cost specify traversability and path costs.
- Visual Anchoring: Each asset node anchors real-world scene images, enabling image-based grounding.
This abstraction captures both navigational structure and grounding information for symbolic and language-guided reasoning. The visual-topological map provides the substrate for task localization, connection of semantic and geometric information, and context for scene understanding (Ye et al., 9 Feb 2026).
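The map abstraction above can be sketched as a small typed-graph structure with door and cost attributes on edges. Class and attribute names here are illustrative assumptions, not UniPlan's actual API:

```python
# Sketch of a visual-topological map: typed nodes (pose/room/asset) plus
# weighted, door-annotated edges. Names are assumptions for illustration.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    name: str
    kind: str                          # "pose", "room", or "asset"
    image_path: Optional[str] = None   # asset nodes anchor a scene image

@dataclass
class TopoMap:
    nodes: dict = field(default_factory=dict)
    # (a, b) -> {"cost": float, "has_door": bool}
    edges: dict = field(default_factory=dict)

    def add_node(self, node):
        self.nodes[node.name] = node

    def connect(self, a, b, cost, has_door=False):
        attrs = {"cost": cost, "has_door": has_door}
        self.edges[(a, b)] = self.edges[(b, a)] = attrs

m = TopoMap()
m.add_node(Node("kitchen", "room"))
m.add_node(Node("coffee_machine", "asset", image_path="scenes/kitchen_01.jpg"))
m.connect("kitchen", "coffee_machine", cost=1.5)
```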
3. Language and Perception Grounding
Given a natural-language instruction, UniPlan performs a two-stage grounding process:
- Task-Oriented Node Retrieval: An LLM ranks and retrieves a relevant node subset from a textual index describing each asset, using cosine similarity in embedding space.
- Visual-LLM Grounding: For each selected node, a vision-LLM (VLM) is prompted with the instruction and an anchor image, generating PDDL-compliant symbol sets, initial predicates, and goals as JSON objects.
- Example VLM prompt structure:
"You are an expert PDDL problem generator. Here is the instruction T, here is image I_n, here is the PDDL domain. Output JSON objects, init-predicates, and goal-predicates."
The output is a fully grounded PDDL problem instance, with objects, initial state, and goal fluents reflecting the perceptually available world as parsed by vision-LLMs (Ye et al., 9 Feb 2026).
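The retrieval stage described above can be sketched as cosine-similarity ranking over an embedded asset index. The vectors below are toy stand-ins for real embeddings, and the function names are assumptions:

```python
# Sketch of task-oriented node retrieval: rank asset nodes by cosine
# similarity between an instruction embedding and per-asset embeddings.
# Toy 2-d vectors stand in for a real embedding model.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve(instr_vec, index, k=2):
    """index: {node_name: embedding}; returns the top-k node names."""
    ranked = sorted(index, key=lambda n: cosine(instr_vec, index[n]),
                    reverse=True)
    return ranked[:k]

index = {"coffee_machine": [0.9, 0.1],
         "bookshelf":      [0.1, 0.9],
         "sink":           [0.7, 0.3]}
top = retrieve([1.0, 0.0], index, k=2)  # instruction about making coffee
```

Only the retrieved nodes are then passed to the VLM grounding stage, which keeps prompt sizes bounded regardless of building scale.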
4. Map Compression, Reconnection, and Efficient Planning
To reduce planning complexity while preserving correctness:
- Graph Compression: The system extracts a dense subgraph over the set of retrieved nodes (and robot start pose), computing all-pairs shortest path costs (Dijkstra) and storing atomic move sequences and door-crossing metadata.
- Planning Over Compressed Topology: The result is a complete graph on key nodes for efficient, abstract plan search. Any high-level move action is later expanded into concrete waypoint sequences, with open_door actions precisely injected before traversing doors.
- Planning Execution: The unified, grounded problem is provided to a standard metric-optimal PDDL solver (e.g., Fast-Downward with A*-LMcut), with cost-minimization over navigation and manipulation (Ye et al., 9 Feb 2026).
- Example Output: For a multi-step instruction like "Prepare two cups of coffee and place them on the meeting table," resulting plans interleave navigation, perception-triggered manipulation, and door operations, and are cost-optimal in the compressed state space.
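The compression step above can be sketched as all-pairs shortest paths (Dijkstra) restricted to the key nodes; UniPlan additionally stores the underlying move sequences and door-crossing metadata, which this minimal sketch omits. Function names are assumptions:

```python
# Sketch of graph compression: Dijkstra shortest-path costs over a weighted
# adjacency dict, kept only between key nodes, yielding a complete cost
# matrix for abstract plan search.
import heapq

def dijkstra(adj, src):
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in adj.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

def compress(adj, key_nodes):
    """Return a complete cost matrix over key_nodes."""
    out = {}
    for s in key_nodes:
        dist = dijkstra(adj, s)
        out[s] = {t: dist[t] for t in key_nodes if t != s}
    return out

adj = {"a": {"b": 1}, "b": {"a": 1, "c": 2}, "c": {"b": 2, "d": 5}, "d": {"c": 5}}
costs = compress(adj, ["a", "c", "d"])
```

The resulting matrix feeds the abstract planner; each chosen high-level move is later expanded back into its stored waypoint sequence, with `open_door` actions injected at door crossings.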
5. Experimental Evaluation and Benchmarks
UniPlan has been evaluated in simulated campus-scale environments (over 50 tasks, with difficulty stratification):
- Baselines: LLM-as-Planner (direct autoregressive action generation); SayPlan (hierarchical PDDL planning); DELTA (autoregressive subgoal planning).
- Metrics: Success rate (full goal achievement), cumulative plan cost (sum of travel_cost and per-manipulation unit cost), and planning time.
- Performance:
- UniPlan: 90.2% ± 2.5% success, cost 48.3 ± 5.1, time 3.8s ± 1.2s
- SayPlan: 65.0% ± 4.3% success, cost 75.6 ± 7.8, time 5.1s ± 1.8s
- DELTA: 62.5% ± 3.9% success, cost 80.1 ± 6.5, time 7.2s ± 2.4s
- LLM-as-Planner: 35.0% ± 5.0% success, cost 102 ± 15, time 1.2s ± 0.4s
UniPlan's higher success rate and lower cost are statistically significant (p<0.01) (Ye et al., 9 Feb 2026).
6. Impact, Limitations, and Future Directions
UniPlan demonstrates that:
- A single, holistically-extended PDDL domain can represent and solve general mobile-manipulation tasks in large, visually-grounded environments.
- Integration of symbolic planning, vision-language grounding, and topology compression enables robust and efficient task execution.
- The approach generalizes beyond tabletop domains to arbitrary-scale scenes with navigation, manipulation, and complex spatial arrangements.
Limitations include reliance on the quality of vision-language grounding and potential scaling issues as the map graph grows. Future work may focus on closed-loop execution, improved uncertainty handling, and dynamic map updating.
7. Comparison Within the Broader Landscape
"UniPlan" should not be confused with:
- U-Plan: a classical hierarchical planner handling incomplete and uncertain information via Dempster-Shafer intervals and plan reapplication/merging (Mansell et al., 2013, Mansell, 2013).
- UniPlane: a unified neural plane detection and reconstruction system for 3D geometry from videos (Huang et al., 2024).
- UniPlanner: a multi-dataset integrated planner for autonomous vehicles (Yang et al., 28 Oct 2025).
- UPP: Unified Path Planner for grid-based safety-optimal motion (Arora et al., 29 May 2025).
The defining feature of UniPlan in vision-language task planning is the explicit, programmatic extension of PDDL domains for navigation and manipulation, tightly coupled with map-based and language-visual grounding for long-horizon mobile manipulation (Ye et al., 9 Feb 2026).