Plan First, Diffuse Later: Extrinsic Graph Guidance for Long-Horizon Diffusion Planning

Published 16 May 2026 in cs.RO, cs.AI, and cs.LG | (2605.16863v1)

Abstract: Compositional diffusion models offer a promising route to long-horizon planning by denoising multiple overlapping sub-trajectories while ensuring that together they constitute a global solution. However, enforcing local behavior over long chains is often insufficient for a coherent global structure to emerge. Recent works tackle this limitation through intrinsic search, which explores multiple paths during the denoising process. While intrinsic search improves global coherence, it comes at the cost of repeated evaluations of an already compute-heavy model. In this work, we argue that extrinsic search, performed outside the denoising process, offers a more effective mode of exploration for long-horizon planning while naturally enabling the use of classical algorithms to solve unseen combinatorial tasks at test time. Our eXtrinsic search-guided Diffuser (XDiffuser) first computes a plan over a state-space graph -- serving as a lightweight local connectivity oracle for the diffusion model. The plan is then used to guide denoising for a single trajectory, effectively offloading the burden of exploration. XDiffuser outperforms diffusion-based baselines on long-horizon tasks, with particularly large gains in the low-quality data regime and on unseen tasks beyond goal-reaching, including multi-agent coordination and TSP-style reasoning. Project website: https://yanivhass.github.io/XDiffuser-site/


Authors (4)

Summary

The paper presents XDiffuser, which separates extrinsic graph planning from local diffusion to ensure globally coherent trajectory generation.
It details a three-stage method: constructing connectivity graphs using temporal distance, planning waypoints via classical search, and guiding diffusion with these waypoints.
Empirical results demonstrate significant success rate improvements and scalability in complex settings like multi-agent path planning and inspection tasks.


Overview and Motivation

The paper "Plan First, Diffuse Later: Extrinsic Graph Guidance for Long-Horizon Diffusion Planning" (2605.16863) presents an approach to long-horizon trajectory planning for robots and agents using diffusion models. The core challenge is extrapolating from short, often suboptimal, offline trajectories to global, long-horizon problems, where naïve local generative stitching fails to ensure global consistency. Purely local inference and compositional diffusion both run into increasing difficulty as horizon length and task complexity grow. To address this, the authors introduce XDiffuser, an architecture that explicitly separates high-level plan search from local trajectory synthesis.

Methodological Contributions

The principal methodological innovation in XDiffuser is an extrinsically guided planning phase, performed outside the iterative denoising process. This phase constructs a graph over dataset states, using a temporal distance representation (TDR) to capture coarse connectivity. Classical graph search algorithms, optionally customized for specific combinatorial constraints (e.g., MAPF, TSP), then generate a global waypoint scaffold for downstream trajectory synthesis.
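The graph phase described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: Euclidean distance stands in for the learned temporal distance, the k-nearest-neighbour rule and the `max_dist` cutoff are hypothetical choices, and the function names (`build_knn_graph`, `shortest_path`, `downsample`) are invented for this example.

```python
import heapq
import math

def build_knn_graph(states, distance, k=4, max_dist=float("inf")):
    """Connect each state to its k nearest neighbours under `distance`.

    Edges are stored symmetrically (undirected) and weighted by the distance.
    """
    graph = {i: {} for i in range(len(states))}
    for i, s in enumerate(states):
        scored = sorted((distance(s, t), j) for j, t in enumerate(states) if j != i)
        for d, j in scored[:k]:
            if d <= max_dist:
                graph[i][j] = d
                graph[j][i] = d
    return graph

def shortest_path(graph, start, goal):
    """Dijkstra's algorithm; returns vertex ids from start to goal, or None."""
    best = {start: 0.0}
    parent = {start: None}
    frontier = [(0.0, start)]
    while frontier:
        cost, u = heapq.heappop(frontier)
        if u == goal:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        if cost > best[u]:
            continue  # stale queue entry
        for v, w in graph[u].items():
            new_cost = cost + w
            if new_cost < best.get(v, math.inf):
                best[v] = new_cost
                parent[v] = u
                heapq.heappush(frontier, (new_cost, v))
    return None

def downsample(path, stride):
    """Keep every `stride`-th vertex, always retaining the final one."""
    kept = path[::stride]
    if kept[-1] != path[-1]:
        kept.append(path[-1])
    return kept

def euclid(a, b):
    # Assumption: stand-in for a learned temporal distance between states.
    return math.dist(a, b)

# Toy "dataset": a 5x5 grid with a wall at x == 2, open only at y == 4.
states = [(x, y) for x in range(5) for y in range(5) if not (x == 2 and y < 4)]
graph = build_knn_graph(states, euclid, k=4, max_dist=1.0)
start, goal = states.index((0, 0)), states.index((4, 0))
path = shortest_path(graph, start, goal)
waypoints = [states[i] for i in downsample(path, stride=3)]
```

In this toy maze the only route from `(0, 0)` to `(4, 0)` passes through the gap at `(2, 4)`, so the sparse waypoints encode a detour that purely local stitching would have to discover on its own.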

XDiffuser operates in three stages:

Connectivity Graph Construction: Offline, a graph is formed with vertices as sampled dataset states and edges weighted by learned temporal distances. This design encodes feasible short-range transitions while ignoring motion dynamics or local cost details.
Planning via Graph Algorithms: At inference, task-specific vertices (start, goal, POIs) are grafted onto the graph. Graph search (shortest path, prioritized, or covering tour) yields a structured route, with sparse waypoints determined by downsampling.
Waypoint-Guided Diffusion: The precomputed waypoint sequence serves as a soft energy/constraint to guide the compositional diffusion process, enforcing global structure during iterative trajectory denoising. This mechanism ensures generated trajectories are not only locally consistent but also globally coherent.

This decomposition allows XDiffuser to retain the sample efficiency and high-dimensional expressiveness of diffusion trajectory planners, while delegating global exploration and combinatorial reasoning to efficient, classical search procedures.
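To make the waypoint-guidance stage concrete, the following is a toy sketch of energy-based guidance inside an iterative denoising loop. Everything here is an assumption made for illustration: the quadratic energy E(x) = sum_k ||x[t_k] - w_k||^2, the neighbour-averaging `toy_denoise` function (a stand-in for a learned diffusion model), the linear noise schedule, and the `scale` parameter are not taken from the paper.

```python
import math
import random

def toy_denoise(traj):
    """Stand-in for a learned denoiser: pulls each interior point toward
    the average of its neighbours, which favours smooth trajectories."""
    out = list(traj)
    for t in range(1, len(traj) - 1):
        out[t] = tuple(
            0.5 * traj[t][d] + 0.25 * (traj[t - 1][d] + traj[t + 1][d])
            for d in range(2)
        )
    return out

def waypoint_energy_grad(traj, anchors):
    """Gradient of E(x) = sum_k ||x[t_k] - w_k||^2 with respect to x."""
    grad = [(0.0, 0.0)] * len(traj)
    for t, w in anchors.items():
        grad[t] = tuple(2.0 * (traj[t][d] - w[d]) for d in range(2))
    return grad

def guided_sample(horizon, anchors, steps=200, scale=0.25, seed=0):
    """Start from Gaussian noise; alternate denoising and energy guidance."""
    rng = random.Random(seed)
    traj = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(horizon)]
    for step in range(steps):
        sigma = 0.2 * (1.0 - (step + 1) / steps)  # noise level anneals to zero
        traj = toy_denoise(traj)
        grad = waypoint_energy_grad(traj, anchors)
        traj = [
            tuple(x[d] - scale * g[d] + sigma * rng.gauss(0, 1) for d in range(2))
            for x, g in zip(traj, grad)
        ]
    return traj

# Waypoints from a hypothetical graph plan, pinned to timesteps 0, 6, and 12.
anchors = {0: (0.0, 0.0), 6: (2.0, 4.0), 12: (4.0, 0.0)}
traj = guided_sample(horizon=13, anchors=anchors)
miss = math.dist(traj[6], anchors[6])  # distance of the midpoint from its waypoint
```

Because the guidance is a soft energy rather than a hard constraint, the middle point is pulled toward its waypoint without being pinned to it; the smoothing term and the guidance term settle at a compromise.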

Empirical Evaluation

The paper offers comprehensive empirical validation on the OGBench suite, as well as in multi-agent path finding (MAPF) and inspection planning benchmarks. Across these diverse tasks, XDiffuser exhibits several key results:

Goal-Reaching Performance: On challenging long-horizon goal-reaching tasks such as AntMaze Explore Large, XDiffuser achieves a 98.5% success rate, an absolute improvement of more than 70 percentage points over the base compositional diffusion planner. XDiffuser also consistently outperforms both intrinsic-search-based enhancements (C-MCTD, CDGS) and pure graph-following (GAS) in settings requiring substantial extrapolation and long-horizon coordination.
Modularity and Generalization: By decoupling search from denoising, XDiffuser can reuse the same pretrained diffusion model across different task types. In MAPF, using prioritized graph search, XDiffuser coordinates up to four agents, a regime in which all baseline single-agent and soft-guidance variants fail catastrophically (e.g., in the 4-agent case, 58% success for XDiffuser vs. 13% for population-based diffusion search).
Scalability in Inspection Planning: When integrated with a MILP-based graph IP solver in a drone bridge inspection task, XDiffuser attained over 95% coverage of points-of-interest for all but the smallest instances, vastly outperforming intrinsic-search variants which plateaued well below 50% coverage.
Efficiency: Extrinsic planning confines search to the graph domain, drastically reducing the number of trajectory denoiser queries relative to intrinsic, branching approaches. Because exploration happens on the graph, the cost of search does not scale with the full denoising depth.
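The prioritized graph search mentioned in the MAPF result can be illustrated with a generic textbook version of prioritized planning: agents are planned one at a time, and each later agent treats earlier agents' space-time positions as obstacles. This is a sketch under assumptions, not the paper's planner; the grid graph, unit-time moves, the `max_t` horizon, and all function names are hypothetical.

```python
from collections import deque

def grid_graph(width, height):
    """4-connected grid as an adjacency-list dict keyed by (x, y)."""
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    return {
        (x, y): [(x + dx, y + dy) for dx, dy in steps
                 if 0 <= x + dx < width and 0 <= y + dy < height]
        for x in range(width) for y in range(height)
    }

def plan_agent(graph, start, goal, occupied, moves, max_t):
    """Breadth-first search over (vertex, time), avoiding reserved
    vertices (`occupied`) and head-on swaps along edges (`moves`)."""
    queue = deque([(start, 0)])
    parent = {(start, 0): None}
    while queue:
        v, t = queue.popleft()
        # Accept the goal only if the agent can park there undisturbed.
        if v == goal and all((goal, s) not in occupied for s in range(t, max_t + 1)):
            path, node = [], (v, t)
            while node is not None:
                path.append(node[0])
                node = parent[node]
            return path[::-1]
        if t == max_t:
            continue
        for nxt in list(graph[v]) + [v]:  # move to a neighbour, or wait
            if (nxt, t + 1) in occupied:  # vertex conflict
                continue
            if (nxt, v, t) in moves:      # another agent moves nxt -> v now
                continue
            if (nxt, t + 1) not in parent:
                parent[(nxt, t + 1)] = (v, t)
                queue.append((nxt, t + 1))
    return None

def prioritized_plan(graph, tasks, max_t=20):
    """Plan agents in the given priority order; returns None on failure."""
    occupied, moves, paths = set(), set(), []
    for start, goal in tasks:
        path = plan_agent(graph, start, goal, occupied, moves, max_t)
        if path is None:
            return None
        paths.append(path)
        for t in range(max_t + 1):  # the agent stays at its goal after arriving
            occupied.add((path[min(t, len(path) - 1)], t))
        for t in range(len(path) - 1):
            moves.add((path[t], path[t + 1], t))
    return paths

# Two agents whose shortest paths cross at the centre of a 3x3 grid.
paths = prioritized_plan(grid_graph(3, 3), [((0, 1), (2, 1)), ((1, 0), (1, 2))])
```

Here the second agent waits one step at its start so that the first can clear the centre cell. Prioritized planning is fast but incomplete: a poor priority order can leave a later agent with no feasible path even when a joint solution exists.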

Theoretical and Practical Implications

XDiffuser advances the state of the art in several respects:

Separation of Global and Local Reasoning: The architecture formalizes the intuition, long recognized in hierarchical control, that global combinatorial reasoning is orthogonal to local dynamic feasibility. By fully decoupling planning from local inference, XDiffuser circumvents the brittle mode-averaging issues endemic to compositional diffusion and avoids the exponential test-time evaluation cost of embedded search.
Task Flexibility and Zero-Shot Adaptation: Changing only the graph and search goal at test time allows for adaptation to unseen specifications (multi-agent, combinatorial coverage). This property enables rapid deployment and repurposing of the same learned model without retraining or fine-tuning, opening the door to foundation diffusion policies augmented with planning interfaces.
Efficiency: Centralizing expensive search in a sparse, low-diameter graph enables scaling to higher agent counts, longer horizons, and combinatorial objectives otherwise infeasible for full trajectory-level search.
Data Regime Robustness: Particularly in low-quality or suboptimal data regimes, XDiffuser provides substantial gains, as the global plan can circumvent local training-data idiosyncrasies.
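The task-flexibility point above amounts to keeping the graph fixed and swapping the search objective. As an illustration, the sketch below computes a TSP-style visiting order over points of interest using shortest-path distances on the graph. It substitutes brute-force enumeration for the MILP-based solver mentioned earlier, so it is only practical for a handful of points; the ring graph and function names are hypothetical, and the graph is assumed to be connected.

```python
import heapq
from itertools import permutations

def dijkstra(graph, source):
    """Shortest-path distance from `source` to every reachable vertex."""
    dist = {source: 0.0}
    frontier = [(0.0, source)]
    while frontier:
        d, u = heapq.heappop(frontier)
        if d > dist[u]:
            continue  # stale queue entry
        for v, w in graph[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(frontier, (d + w, v))
    return dist

def best_visit_order(graph, start, pois):
    """Exhaustively pick the POI order with the shortest open tour from `start`."""
    dist = {v: dijkstra(graph, v) for v in [start, *pois]}

    def tour_cost(order):
        stops = [start, *order]
        return sum(dist[a][b] for a, b in zip(stops, stops[1:]))

    return min(permutations(pois), key=tour_cost)

# Toy graph: a ring of six vertices with unit edge weights.
ring = {i: {(i + 1) % 6: 1.0, (i - 1) % 6: 1.0} for i in range(6)}
order = best_visit_order(ring, start=0, pois=[2, 5, 3])
```

The resulting order would then be expanded into graph paths between consecutive stops and downsampled into waypoints, with the diffusion model itself left untouched.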

Limitations and Future Directions

XDiffuser's reliance on simple, uniform state sampling for graph construction and on symmetric, undirected connectivity limits it in settings with high-dimensional, sparse, or stochastic dynamics. The method's effectiveness hinges on maintaining global graph connectivity and capturing true temporal reachability; broken or poorly connected graphs can degrade solution quality. For systems with strong directionality or state-conditional reachability, more principled, uncertainty-aware, or learned graph abstractions are warranted. Moreover, integrating learned feasibility or likelihood scores directly on edges, beyond temporal distance, could further enhance downstream guidance and robustness.
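Since the approach depends on the graph staying connected, a cheap diagnostic before planning is to count connected components. The sketch below is a generic breadth-first check, not something described in the paper, and it assumes a symmetric (undirected) adjacency-list graph.

```python
from collections import deque

def connected_components(graph):
    """Return the vertex sets of each connected component of an
    undirected graph given as {vertex: iterable_of_neighbours}."""
    seen, components = set(), []
    for root in graph:
        if root in seen:
            continue
        component, queue = {root}, deque([root])
        seen.add(root)
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in seen:
                    seen.add(v)
                    component.add(v)
                    queue.append(v)
        components.append(component)
    return components

# Toy graph with two separate pairs and one isolated vertex.
toy = {0: [1], 1: [0], 2: [3], 3: [2], 4: []}
components = connected_components(toy)
```

If the start and goal fall in different components, no waypoint plan exists on the graph, which signals that it needs denser sampling or a looser edge threshold.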

An open avenue is to better couple the extrinsic planning layer and the diffusion model, possibly via end-to-end training of edge costs or integrating uncertainty estimates, and to investigate the approach in environments with partial observability or non-stationarity.

Conclusion

Plan First, Diffuse Later (2605.16863) contributes a modular, efficient planning paradigm that uses classical global search to provide coarse plans and modern compositional diffusion models to synthesize locally feasible, smooth, and data-consistent trajectories. The strong empirical performance in challenging, varied settings, especially in the low-quality data regime and in combinatorial zero-shot tasks, underscores the practical viability of this approach for scalable, long-horizon planning in robotics. The separation of concerns (planning for global structure, diffusion for local realism) establishes a blueprint for combining generative models with discrete search, with significant implications for the design of future AI planning systems.
