- The paper presents XDiffuser, which separates extrinsic graph planning from local diffusion to ensure globally coherent trajectory generation.
- It details a three-stage method: constructing connectivity graphs using temporal distance, planning waypoints via classical search, and guiding diffusion with these waypoints.
- Empirical results demonstrate significant success rate improvements and scalability in complex settings like multi-agent path planning and inspection tasks.
Plan First, Diffuse Later: Extrinsic Graph Guidance for Long-Horizon Diffusion Planning
Overview and Motivation
The paper "Plan First, Diffuse Later: Extrinsic Graph Guidance for Long-Horizon Diffusion Planning" (2605.16863) presents a new approach for long-horizon trajectory planning in robots and agents using diffusion models. The core challenge addressed is the extrapolation from short, often suboptimal, offline trajectory data to global, long-horizon problem settings, where naïve local generative stitching fails to ensure global consistency. To mitigate the limitations of purely local inference and compositional diffusion (which face significant obstacles as horizon length and task complexity grow), the authors introduce XDiffuser, an architecture that explicitly separates high-level plan search from local trajectory synthesis.
Methodological Contributions
The principal methodological innovation in XDiffuser is the use of an extrinsically-guided planning phase, performed outside the iterative diffusion/denoising process. This phase constructs a graph over dataset states using temporal distance representation (TDR) to capture coarse connectivity. Classical graph search algorithms, optionally customized for specific combinatorial constraints (e.g., MAPF, TSP), are then employed to generate a global waypoint scaffold for downstream trajectory synthesis.
XDiffuser operates in three stages:
- Connectivity Graph Construction: Offline, a graph is formed with vertices as sampled dataset states and edges weighted by learned temporal distances. This design encodes feasible short-range transitions while ignoring motion dynamics or local cost details.
- Planning via Graph Algorithms: At inference, task-specific vertices (start, goal, POIs) are grafted onto the graph. Graph search (shortest path, prioritized, or covering tour) yields a structured route, with sparse waypoints determined by downsampling.
- Waypoint-Guided Diffusion: The precomputed waypoint sequence serves as a soft energy/constraint to guide the compositional diffusion process, enforcing global structure during iterative trajectory denoising. This mechanism ensures generated trajectories are not only locally consistent but also globally coherent.
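The first two stages can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: Euclidean distance (`math.dist`) stands in for the learned temporal distance, and the function names, the `max_dist` edge threshold, and the `stride` downsampling parameter are all hypothetical.

```python
import heapq
import math

def build_graph(states, temporal_distance, max_dist):
    """Connect every pair of sampled states whose distance is at most max_dist."""
    graph = {i: [] for i in range(len(states))}
    for i in range(len(states)):
        for j in range(i + 1, len(states)):
            d = temporal_distance(states[i], states[j])
            if d <= max_dist:
                graph[i].append((j, d))
                graph[j].append((i, d))  # symmetric, undirected edge
    return graph

def shortest_path(graph, start, goal):
    """Dijkstra's algorithm; returns a list of vertex indices, or None."""
    dist = {start: 0.0}
    parent = {}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == goal:
            break
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    if goal not in dist:
        return None  # graph is disconnected between start and goal
    path = [goal]
    while path[-1] != start:
        path.append(parent[path[-1]])
    return path[::-1]

def downsample(path, stride):
    """Keep every stride-th vertex, always including the final one."""
    kept = path[::stride]
    if kept[-1] != path[-1]:
        kept.append(path[-1])
    return kept

# Toy example: seven states on a line; Euclidean distance is an ASSUMED
# stand-in for the learned temporal distance described in the text.
states = [(float(x), 0.0) for x in range(7)]
graph = build_graph(states, math.dist, max_dist=1.5)
route = shortest_path(graph, 0, 6)
waypoints = [states[i] for i in downsample(route, stride=3)]
```

In this toy case only neighbouring states are connected, so the route visits all seven vertices and downsampling keeps the states at x = 0, 3, and 6 as sparse waypoints.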
This decomposition allows XDiffuser to retain the sample efficiency and high-dimensional expressiveness of diffusion trajectory planners, while delegating global exploration and combinatorial reasoning to efficient, classical search procedures.
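The waypoint guidance in the third stage can be illustrated with a simple quadratic energy that pulls selected trajectory points toward their assigned waypoints. The summary does not give the paper's exact guidance term, so the energy form, the `weight` and `step_size` parameters, and the `denoise_fn` callback below are assumptions made purely for illustration.

```python
def waypoint_energy_grad(traj, waypoints, indices, weight):
    """Gradient of 0.5 * weight * sum_k ||traj[indices[k]] - waypoints[k]||^2."""
    grad = [[0.0] * len(point) for point in traj]
    for idx, wp in zip(indices, waypoints):
        for d in range(len(wp)):
            grad[idx][d] = weight * (traj[idx][d] - wp[d])
    return grad

def guided_step(traj, denoise_fn, waypoints, indices, weight, step_size):
    """One guided update: apply the denoiser, then descend the waypoint energy."""
    denoised = denoise_fn(traj)
    grad = waypoint_energy_grad(denoised, waypoints, indices, weight)
    return [
        [x - step_size * g for x, g in zip(point, grad_point)]
        for point, grad_point in zip(denoised, grad)
    ]

# Toy example: an identity "denoiser" (ASSUMPTION, stands in for a trained
# diffusion model) and one waypoint constraining the middle trajectory point.
traj = [[0.0, 0.0], [1.0, 5.0], [2.0, 0.0]]
guided = guided_step(
    traj,
    denoise_fn=lambda t: t,
    waypoints=[(1.0, 0.0)],
    indices=[1],
    weight=1.0,
    step_size=0.5,
)
```

Here only the constrained middle point moves (halfway toward its waypoint), while unconstrained points are left to the denoiser, which is the sense in which the guidance acts as a soft rather than hard constraint.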
Empirical Evaluation
The paper offers comprehensive empirical validation on the OGBench suite, as well as in multi-agent path finding (MAPF) and inspection planning benchmarks. Across these diverse tasks, XDiffuser exhibits several key results:
- Goal-Reaching Performance: On challenging long-horizon goal-reaching tasks such as AntMaze Explore Large, XDiffuser achieves a 98.5% success rate, an absolute improvement of more than 70 percentage points over the base compositional diffusion planner. Notably, XDiffuser consistently outperforms both intrinsic-search-based enhancements (C-MCTD, CDGS) and pure graph-following (GAS) in settings requiring substantial extrapolation and long-horizon coordination.
- Modularity and Generalization: By decoupling search from denoising, XDiffuser can reuse the same pretrained diffusion model across different task types. In MAPF, utilizing prioritized graph search, XDiffuser coordinated up to four agents, a regime where all baseline single-agent or soft-guidance variations failed catastrophically (e.g., 4-agent case: 58% success for XDiffuser vs. 13% for population-based diffusion search).
- Scalability in Inspection Planning: When integrated with a MILP-based graph IP solver in a drone bridge inspection task, XDiffuser attained over 95% coverage of points-of-interest for all but the smallest instances, vastly outperforming intrinsic-search variants which plateaued well below 50% coverage.
- Efficiency: Extrinsic planning confines search complexity to the graph domain, drastically reducing trajectory denoiser queries relative to intrinsic, branching approaches; the computational cost does not scale with the full denoising depth.
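The prioritized graph search mentioned for the MAPF setting can be sketched as follows. This is a generic textbook-style prioritized planner under simplifying assumptions (an unweighted graph, unit-time moves, wait actions, agents that rest at their goal), not the paper's exact procedure; all names and the `horizon` parameter are hypothetical.

```python
from collections import deque

def space_time_bfs(adj, start, goal, res_v, res_e, horizon):
    """Shortest path in (vertex, time) space that avoids reserved vertices and edge swaps."""
    if (start, 0) in res_v:
        return None
    parent = {(start, 0): None}
    queue = deque([(start, 0)])
    while queue:
        v, t = queue.popleft()
        # Accept the goal only if the agent can rest there until the horizon.
        if v == goal and all((v, s) not in res_v for s in range(t, horizon + 1)):
            path, node = [], (v, t)
            while node is not None:
                path.append(node[0])
                node = parent[node]
            return path[::-1]
        if t == horizon:
            continue
        for nxt in [v] + adj[v]:  # wait in place, or move to a neighbour
            node = (nxt, t + 1)
            swap = (nxt, v, t) in res_e  # head-on swap with an earlier agent
            if node not in parent and node not in res_v and not swap:
                parent[node] = (v, t)
                queue.append(node)
    return None

def prioritized_plan(adj, tasks, horizon):
    """Plan agents one by one; later agents avoid earlier agents' reservations."""
    res_v, res_e, paths = set(), set(), []
    for start, goal in tasks:
        path = space_time_bfs(adj, start, goal, res_v, res_e, horizon)
        if path is None:
            return None  # prioritized planning is incomplete and can fail
        paths.append(path)
        padded = path + [path[-1]] * (horizon + 1 - len(path))
        for t, v in enumerate(padded):
            res_v.add((v, t))
        for t in range(horizon):
            res_e.add((padded[t], padded[t + 1], t))
    return paths

# Toy example: corridor 0-1-2 with a side pocket 3 attached to vertex 1.
adj = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
paths = prioritized_plan(adj, [(0, 2), (3, 0)], horizon=6)
```

In this example the second agent waits one step in the pocket while the first agent passes through the corridor. Swapping the two agents' roles so that they must exchange ends of the corridor (`[(0, 2), (2, 0)]`) makes this fixed-priority scheme return `None`, which illustrates why prioritized planning trades completeness for speed.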
Theoretical and Practical Implications
XDiffuser advances the state of the art in several respects:
- Separation of Global and Local Reasoning: The architecture formalizes the intuition, long recognized in hierarchical control, that global combinatorial reasoning is orthogonal to local dynamic feasibility. By fully decoupling planning from local inference, XDiffuser circumvents the brittle mode-averaging issues endemic to compositional diffusion and avoids the exponential test-time evaluation cost of embedded search.
- Task Flexibility and Zero-Shot Adaptation: Changing only the graph and search goal at test time allows for adaptation to unseen specifications (multi-agent, combinatorial coverage). This property enables rapid deployment and repurposing of the same learned model without retraining or fine-tuning, opening the door to foundation diffusion policies augmented with planning interfaces.
- Efficiency: Centralizing expensive search in a sparse, low-diameter graph enables scaling to higher agent counts, longer horizons, and combinatorial objectives otherwise infeasible for full trajectory-level search.
- Data Regime Robustness: Particularly in low-quality or suboptimal data regimes, XDiffuser provides substantial gains, as the global plan can circumvent local training-data idiosyncrasies.
Limitations and Future Directions
XDiffuser's reliance on simple, uniform state sampling for graph construction and symmetric, undirected connectivity implies limitations in settings with high-dimensional, sparse, or stochastic dynamics. The method's effectiveness hinges on maintaining global graph connectivity and capturing true temporal reachability; broken or poorly-connected graphs can degrade solution quality. For systems with strong directionality or state-conditional reachability, more principled, uncertainty-aware, or learned graph abstractions are warranted. Moreover, integrating learned feasibility or likelihood scores directly on edges, beyond temporal distance, could further enhance downstream guidance and robustness.
An open avenue is to better couple the extrinsic planning layer and the diffusion model, possibly via end-to-end training of edge costs or integrating uncertainty estimates, and to investigate the approach in environments with partial observability or non-stationarity.
Conclusion
Plan First, Diffuse Later (2605.16863) contributes a modular, efficient planning paradigm that uses classical global search to provide coarse plans and modern compositional diffusion models to synthesize locally feasible, smooth, and data-consistent trajectories. The strong empirical performance in challenging, varied settings, especially in the low-data regime and in combinatorial zero-shot tasks, underscores the practical viability of this approach for scalable, long-horizon planning in robotics. The separation of concerns (planning for global structure, diffusion for local realism) establishes a blueprint for combining generative models with discrete search, with significant implications for the design of future AI planning systems.