UniPlan: Unified Task & Motion Planning
- UniPlan is a unified task and motion planning framework that integrates extended PDDL domains, visual-topological mapping, and language grounding to enable robust mobile manipulation.
- The system extends basic PDDL with spatial graph embedding, bimanual coordination, and cost optimization, supporting complex navigation and manipulation in large indoor environments.
- Experimental evaluations show UniPlan achieves over 90% success with lower planning costs and efficient execution in campus-scale scenarios compared to baseline methods.
UniPlan refers to a class of unified task and motion planning frameworks in robotics, artificial intelligence, and operations research, with recent instantiations notably in vision-language-based mobile manipulation planning. These frameworks are characterized by their integration of symbolic planning (typically using PDDL), high-level domain and environment representations, and vision-language grounding to connect perceptual inputs with task specifications. The most prominent modern system titled UniPlan combines extended mobile-manipulation PDDL domains, visual-topological map representations, language and visual grounding, and efficient multi-step planning in large indoor environments (Ye et al., 9 Feb 2026).
1. Unified Domain Representation: Extended Mobile-Manipulation PDDL
UniPlan leverages the Planning Domain Definition Language (PDDL) as the substrate for encoding robot capabilities and environmental structure at scale. The baseline mobile-manipulation domain (including actions such as pick, place, open, and pour) is programmatically extended to support:
- Spatial Graph Embedding: Addition of predicates such as (robot_at_node ?r ?n), (object_at_node ?o ?n), (has_door ?n1 ?n2), (connected ?n1 ?n2), and a numeric function (travel_cost ?n1 ?n2) allow the explicit modeling of navigation topology, door connectivity, and cost structure within large buildings.
- Navigation and Traversability Actions: New operator schemas are introduced to support moving through the graph and manipulating the environment:

```lisp
(:action move_robot
  :parameters (?r ?from ?to)
  :precondition (and (robot_at_node ?r ?from)
                     (connected ?from ?to))
  :effect (and (robot_at_node ?r ?to)
               (not (robot_at_node ?r ?from))
               (increase (total-cost) (travel_cost ?from ?to))))

(:action open_door
  :parameters (?r ?hand ?from ?to)
  :precondition (and (robot_has_hand ?r ?hand)
                     (robot_at_node ?r ?from)
                     (has_door ?from ?to)
                     (hand_free ?r ?hand)
                     (not (connected ?from ?to)))
  :effect (and (connected ?from ?to)
               (connected ?to ?from)
               (increase (total-cost) 1)))
```
- Bimanual Coordination: All manipulation predicates and effects are generalized to account for multiple end-effectors per robot, enabling two-handed coordinated tasks.
- Cost Optimization: All actions increment a global (total-cost) variable; planners are instructed to minimize this cost metric, with navigation costs derived from (travel_cost ?from ?to) and atomic manipulation steps costing 1.
This unified domain supports mobile manipulation across multi-room environments with arbitrary topology, door traversals, and heterogeneous objects (Ye et al., 9 Feb 2026).
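As an illustration of how such a domain could be instantiated, the sketch below emits the spatial-graph facts for a PDDL problem's `:init` section from an edge list. The helper name and edge format are assumptions for illustration, not part of UniPlan:

```python
# Sketch: emitting spatial-graph :init facts from an adjacency description.
# The function name (emit_init) and tuple layout are illustrative assumptions.

def emit_init(edges):
    """edges: iterable of (n1, n2, cost, has_door) tuples."""
    facts = []
    for n1, n2, cost, has_door in edges:
        if has_door:
            # door edges start unconnected; an open_door action connects them
            facts.append(f"(has_door {n1} {n2})")
        else:
            facts.append(f"(connected {n1} {n2})")
        facts.append(f"(= (travel_cost {n1} {n2}) {cost})")
    return facts

facts = emit_init([("hall", "lab", 4, True), ("hall", "lobby", 2, False)])
```

A real exporter would also emit the symmetric facts and object declarations; this fragment shows only the predicate/function encoding described above.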
2. Visual-Topological Map Construction
UniPlan constructs an explicit world model as a visual-topological graph:
- Node Types: Fine-grained pose nodes (waypoints), room nodes, and asset nodes (furniture, devices).
- Adjacency and Traversal: An adjacency matrix encodes spatial connectivity, while edge attributes such as has_door and travel_cost specify traversability and path costs.
- Visual Anchoring: Each asset node anchors real-world scene images, enabling image-based grounding.
This abstraction captures both navigational structure and grounding information for symbolic and language-guided reasoning. The visual-topological map provides the substrate for task localization, connection of semantic and geometric information, and context for scene understanding (Ye et al., 9 Feb 2026).
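The map abstraction above can be sketched as a small typed-graph structure with door and cost attributes on edges. Class and attribute names here are illustrative assumptions, not UniPlan's actual API:

```python
# Sketch of a visual-topological map: typed nodes (pose/room/asset) plus
# weighted, door-annotated edges. Names are assumptions for illustration.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    name: str
    kind: str                          # "pose", "room", or "asset"
    image_path: Optional[str] = None   # asset nodes anchor a scene image

@dataclass
class TopoMap:
    nodes: dict = field(default_factory=dict)
    # (a, b) -> {"cost": float, "has_door": bool}
    edges: dict = field(default_factory=dict)

    def add_node(self, node):
        self.nodes[node.name] = node

    def connect(self, a, b, cost, has_door=False):
        attrs = {"cost": cost, "has_door": has_door}
        self.edges[(a, b)] = self.edges[(b, a)] = attrs

m = TopoMap()
m.add_node(Node("kitchen", "room"))
m.add_node(Node("coffee_machine", "asset", image_path="scenes/kitchen_01.jpg"))
m.connect("kitchen", "coffee_machine", cost=1.5)
```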
3. Language and Perception Grounding
Given a natural-language instruction, UniPlan performs a two-stage grounding process:
- Task-Oriented Node Retrieval: An LLM ranks and retrieves a relevant node subset from a textual index describing each asset, using cosine similarity in embedding space.
- Visual-LLM Grounding: For each selected node, a vision-LLM (VLM) is prompted with the instruction and an anchor image, generating PDDL-compliant symbol sets, initial predicates, and goals as JSON objects.
- Example VLM prompt structure:
"You are an expert PDDL problem generator. Here is the instruction T, here is image I_n, here is the PDDL domain. Output JSON objects, init-predicates, and goal-predicates."
The output is a fully grounded PDDL problem instance, with objects, initial state, and goal fluents reflecting the perceptually available world as parsed by vision-LLMs (Ye et al., 9 Feb 2026).
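The retrieval stage described above can be sketched as cosine-similarity ranking over an embedded asset index. The vectors below are toy stand-ins for real embeddings, and the function names are assumptions:

```python
# Sketch of task-oriented node retrieval: rank asset nodes by cosine
# similarity between an instruction embedding and per-asset embeddings.
# Toy 2-d vectors stand in for a real embedding model.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve(instr_vec, index, k=2):
    """index: {node_name: embedding}; returns the top-k node names."""
    ranked = sorted(index, key=lambda n: cosine(instr_vec, index[n]),
                    reverse=True)
    return ranked[:k]

index = {"coffee_machine": [0.9, 0.1],
         "bookshelf":      [0.1, 0.9],
         "sink":           [0.7, 0.3]}
top = retrieve([1.0, 0.0], index, k=2)  # instruction about making coffee
```

Only the retrieved nodes are then passed to the VLM grounding stage, which keeps prompt sizes bounded regardless of building scale.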
4. Map Compression, Reconnection, and Efficient Planning
To reduce planning complexity while preserving correctness:
- Graph Compression: The system extracts a dense subgraph over the set of retrieved nodes (and robot start pose), computing all-pairs shortest path costs (Dijkstra) and storing atomic move sequences and door-crossing metadata.
- Planning Over Compressed Topology: The result is a complete graph on key nodes for efficient, abstract plan search. Any high-level move action is later expanded into concrete waypoint sequences, with open_door actions precisely injected before traversing doors.
- Planning Execution: The unified, grounded problem is provided to a standard metric-optimal PDDL solver (e.g., Fast-Downward with A*-LMcut), with cost-minimization over navigation and manipulation (Ye et al., 9 Feb 2026).
- Example Output: For a multi-step instruction like "Prepare two cups of coffee and place them on the meeting table," resulting plans interleave navigation, perception-triggered manipulation, and door operations, and are cost-optimal in the compressed state space.
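The compression step above can be sketched as all-pairs shortest paths (Dijkstra) restricted to the key nodes; UniPlan additionally stores the underlying move sequences and door-crossing metadata, which this minimal sketch omits. Function names are assumptions:

```python
# Sketch of graph compression: Dijkstra shortest-path costs over a weighted
# adjacency dict, kept only between key nodes, yielding a complete cost
# matrix for abstract plan search.
import heapq

def dijkstra(adj, src):
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in adj.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

def compress(adj, key_nodes):
    """Return a complete cost matrix over key_nodes."""
    out = {}
    for s in key_nodes:
        dist = dijkstra(adj, s)
        out[s] = {t: dist[t] for t in key_nodes if t != s}
    return out

adj = {"a": {"b": 1}, "b": {"a": 1, "c": 2}, "c": {"b": 2, "d": 5}, "d": {"c": 5}}
costs = compress(adj, ["a", "c", "d"])
```

The resulting matrix feeds the abstract planner; each chosen high-level move is later expanded back into its stored waypoint sequence, with `open_door` actions injected at door crossings.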
5. Experimental Evaluation and Benchmarks
UniPlan has been evaluated in simulated campus-scale environments (over 50 tasks, with difficulty stratification):
- Baselines: LLM-as-Planner (direct autoregressive action generation); SayPlan (hierarchical PDDL planning); DELTA (autoregressive subgoal planning).
- Metrics: Success rate (full goal achievement), cumulative plan cost (sum of travel_cost and per-manipulation unit cost), and planning time.
- Performance:
- UniPlan: 90.2% ± 2.5% success, cost 48.3 ± 5.1, time 3.8s ± 1.2s
- SayPlan: 65.0% ± 4.3% success, cost 75.6 ± 7.8, time 5.1s ± 1.8s
- DELTA: 62.5% ± 3.9% success, cost 80.1 ± 6.5, time 7.2s ± 2.4s
- LLM-as-Planner: 35.0% ± 5.0% success, cost 102 ± 15, time 1.2s ± 0.4s
UniPlan's higher success rate and lower cost are statistically significant (p<0.01) (Ye et al., 9 Feb 2026).
6. Impact, Limitations, and Future Directions
UniPlan demonstrates that:
- A single, holistically-extended PDDL domain can represent and solve general mobile-manipulation tasks in large, visually-grounded environments.
- Integration of symbolic planning, vision-language grounding, and topology compression enables robust and efficient task execution.
- The approach generalizes beyond tabletop domains to arbitrary-scale scenes with navigation, manipulation, and complex spatial arrangements.
Limitations include reliance on the quality of vision-language grounding and potential scaling issues as the map graph grows. Future work may focus on closed-loop execution, improved uncertainty handling, and dynamic map updating.
7. Comparison Within the Broader Landscape
"UniPlan" should not be confused with:
- U-Plan: a classical hierarchical planner handling incomplete and uncertain information via Dempster-Shafer intervals and plan reapplication/merging (Mansell et al., 2013, Mansell, 2013).
- UniPlane: a unified neural plane detection and reconstruction system for 3D geometry from videos (Huang et al., 2024).
- UniPlanner: a multi-dataset integrated planner for autonomous vehicles (Yang et al., 28 Oct 2025).
- UPP: Unified Path Planner for grid-based safety-optimal motion (Arora et al., 29 May 2025).
The defining feature of UniPlan in vision-language task planning is the explicit, programmatic extension of PDDL domains for navigation and manipulation, tightly coupled with map-based and language-visual grounding for long-horizon mobile manipulation (Ye et al., 9 Feb 2026).