Generative Skill Chaining: Diffusion-Based Planning
- Generative Skill Chaining (GSC) is a probabilistic framework that uses diffusion generative models to learn parameterized skills for long-horizon manipulation tasks.
- It composes independently trained skill priors via a product-of-experts approach, integrating classifier-based constraint guidance for efficient, constraint-aware planning.
- GSC demonstrates improved success rates and sample efficiency in both simulated and real-world manipulation tasks, including open-loop execution on a Franka Panda arm.
Generative Skill Chaining (GSC) is a probabilistic framework for long-horizon manipulation planning that addresses the synthesis of complex action sequences from a learned library of parameterized skills. Unlike traditional search-based or greedy skill chaining, GSC composes skill-centric diffusion models in parallel to enable efficient, constraint-aware plan generation across previously unseen task instances. GSC prioritizes generalizability and sample efficiency by combining the expressivity of diffusion generative modeling with principled probabilistic inference and classifier-based constraint guidance (Mishra et al., 2023).
1. Formalization of the Long-Horizon Skill Planning Problem
GSC operates in a continuous state space S, where a state s ∈ S encodes the configuration of robotic agents and objects, typically with 6-DoF poses for each rigid body. The agent is endowed with a finite library of parameterized skills Π = {π_1, ..., π_N}, each mapping a continuous parameter a and a current state s into a stochastic transition s' ~ T_π(s, a).
Given, at test time:
- s_0: initial state,
- g: goal predicate on the final state,
- Φ = (π_1, ..., π_K): a “skeleton” specifying the ordered instantiation of skills and associated objects,
- {h_j}: differentiable geometric or logical constraints.
The planning objective is to produce a sequence of parameters (a^{(0)}, ..., a^{(K-1)}) and corresponding states (s^{(0)}, ..., s^{(K)}) satisfying:
- s^{(0)} = s_0,
- s^{(i)} ~ T_{π_i}(s^{(i-1)}, a^{(i-1)}) for i = 1, ..., K,
- g(s^{(K)}) holds and every constraint h_j is satisfied along the trajectory.
GSC assumes the symbolic skeleton Φ is given, aiming to efficiently infer the continuous parameters by leveraging diffusion-based skill priors.
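As a concrete illustration, the test-time inputs above can be bundled in a small container; the class and field names here are hypothetical, not taken from the paper's code.

```python
from dataclasses import dataclass, field
from typing import Callable, Sequence

import numpy as np

# Hypothetical container for a GSC planning problem; names are illustrative.
@dataclass
class PlanningProblem:
    s0: np.ndarray                          # initial state s_0
    skeleton: Sequence[str]                 # ordered skill names Φ = (π_1, ..., π_K)
    goal: Callable[[np.ndarray], bool]      # goal predicate g on the final state
    constraints: Sequence[Callable] = field(default_factory=tuple)  # differentiable h_j

    @property
    def K(self) -> int:
        # number of skills in the skeleton, i.e. plan length
        return len(self.skeleton)
```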
2. GSC Framework Overview
GSC is structured in two primary phases:
A. Offline Skill-Prior Training (per skill):
- For each skill π ∈ Π, collect a dataset D_π = {(s, a, s')} (about 5,000 samples per skill) from expert or exploration-driven demonstrations.
- Train an unconditional diffusion model over concatenated triplets x = (s, a, s').
- Learn a time-indexed score network ε_π(x, t) via denoising score matching.
B. Test-Time Chaining and Inference:
- Given the skeleton Φ, define a joint block vector X = (s^{(0)}, a^{(0)}, s^{(1)}, a^{(1)}, ..., s^{(K)}).
- Formulate sampling from the composed skill priors using a product-of-experts approach, integrating each independently-trained skill score and adding differentiable constraint gradients.
- Execute a parallel reverse diffusion chain of length T over X to sample a candidate trajectory.
- Optionally leverage a skill-success predictor for post-sampling validation and local replanning.
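The decomposition of the joint block vector used during chaining can be sketched as follows; dimensions are illustrative, not the paper's.

```python
import numpy as np

def decompose(X, K, s_dim, a_dim):
    """Split the joint block vector X = (s^0, a^0, s^1, ..., a^{K-1}, s^K)
    into its K+1 state blocks and K action blocks."""
    states, actions, i = [], [], 0
    for k in range(K + 1):
        states.append(X[i:i + s_dim]); i += s_dim
        if k < K:
            actions.append(X[i:i + a_dim]); i += a_dim
    assert i == len(X), "X length must be (K+1)*s_dim + K*a_dim"
    return states, actions
```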
3. Mathematical Foundations
3.1. Diffusion Model for Skill Priors:
For each skill π, the forward (variance-exploding) process perturbs data triplets x_0 = (s, a, s') as

    x_t = x_0 + σ_t z,   z ~ N(0, I),

with the noise scale σ_t increasing in t. The score network ε_π is trained to minimize the denoising score-matching loss

    L(π) = E_{x_0, t, z} || ε_π(x_0 + σ_t z, t) + z ||²,

so that ε_π(x_t, t) ≈ σ_t ∇_{x_t} log p_t(x_t). During inference, denoising is performed as

    x̃_0 = x_t + σ_t ε_π(x_t, t),

followed by reverse sampling x_{t-1} ~ N(x̃_0, σ_{t-1}² I).
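A minimal sketch of how such denoising score-matching training pairs are formed, assuming a variance-exploding schedule; the dataset size, triplet dimension, and σ range are illustrative, not the paper's values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dataset and schedule: flattened (s, a, s') triplets of size 9
# and a geometric variance-exploding σ schedule.
x0 = rng.normal(size=(5000, 9))
sigmas = np.geomspace(0.01, 10.0, 100)

def dsm_batch(x0, sigmas, rng):
    """Form one denoising score-matching batch: noisy inputs x_t, timestep
    indices t, and regression targets -z. A network ε(x_t, t) fit to these
    targets approximates σ_t ∇ log p_t(x_t)."""
    t = rng.integers(0, len(sigmas), size=len(x0))   # random timestep per sample
    z = rng.normal(size=x0.shape)                    # forward-process noise
    x_t = x0 + sigmas[t][:, None] * z                # x_t = x_0 + σ_t z
    return x_t, t, -z
```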
3.2. Skill Chaining via Product-of-Experts:
Let X = (s^{(0)}, a^{(0)}, s^{(1)}, a^{(1)}, ..., s^{(K)}). The joint, unnormalized density composes the K independently trained skill priors as a product of experts,

    p(X) ∝ prod_{i=1..K} p_{π_i}(s^{(i-1)}, a^{(i-1)}, s^{(i)}) · p_{π_i}(s^{(i)})^{-γ_i},

where the negative exponent discounts the marginal over each shared interface state. The composite score for each reverse step is

    S_t = sum_{i=1..K} [ ε_{π_i}(s^{(i-1)}, a^{(i-1)}, s^{(i)}, t) − γ_i ε_{π_i}(s^{(i)}, t) ].

The weight γ_i ∈ [0, 1] balances forward and backward consistency at each skill interface.
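The product-of-experts composition can be illustrated with toy Gaussian experts whose scores are analytic; this is a sketch standing in for the learned networks, and the block size `dim` is an assumption.

```python
import numpy as np

# Toy stand-in for the learned networks: each skill prior over its
# (s^{i}, a^{i}, s^{i+1}) triplet is an isotropic Gaussian N(mu_i, I),
# whose score is simply mu_i - x. All s/a blocks share size `dim` here.
def composed_score(X, mus, gamma, dim):
    """Product-of-experts score over X = (s^0, a^0, s^1, ..., s^K): sum the
    K triplet scores, then subtract a γ-weighted marginal score at each
    interior interface state so it is not counted twice."""
    K = len(mus)
    S = np.zeros_like(X)
    for i in range(K):
        lo = 2 * i * dim                       # offset of skill i's triplet
        blk = slice(lo, lo + 3 * dim)
        S[blk] += mus[i] - X[blk]              # expert i's full triplet score
        if i > 0:                              # s^{i} is shared with expert i-1
            sb = slice(lo, lo + dim)
            S[sb] -= gamma * (mus[i][:dim] - X[sb])   # marginal correction
    return S
```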
3.3. Constraint Handling by Classifier Guidance:
For a soft constraint h_j(X) ∈ (0, 1], classifier guidance augments the overall score as

    S_t ← S_t + sum_j α_j ∇_{X_t} log h_j(X̃_0),

where each constraint is evaluated at the denoised estimate X̃_0 and weighted by a hyperparameter α_j.
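A minimal sketch of this guidance step, using an illustrative differentiable constraint h(x) = exp(-||x − c||²) whose log-gradient is analytic; the target c and weight α are hypothetical.

```python
import numpy as np

# Illustrative soft constraint h(x) = exp(-||x - c||²): log h is
# differentiable with ∇ log h(x) = -2 (x - c).
def grad_log_h(x, c):
    return -2.0 * (x - c)

def guided_score(S, X_t, sigma_t, c, alpha):
    """Add classifier-based constraint guidance to a composed score S,
    evaluating the constraint at the denoised estimate X̃_0 = X_t + σ_t S."""
    x0_hat = X_t + sigma_t * S
    return S + alpha * grad_log_h(x0_hat, c)
```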
4. Inference Algorithm and Planning Procedure
GSC's inference executes a single parallel reverse diffusion chain to sample full K-step plans. The process is as follows:
Input: s0, Φ=(π_1,...,π_K), {ε_{π_i}}, {h_j, α_j}, T, {σ_t}, {γ_i}
Initialize: sample X_T ~ N(0, I), then overwrite the s^{(0)} block with s0 (kept fixed at every step)
for t = T down to 1:
    decompose X_t into (s_t^{(0)}, a_t^{(0)}, ..., s_t^{(K)})
    # product-of-experts score from the independently trained skill priors
    S_t = sum_i [ε_{π_i}(s_t^{(i-1)}, a_t^{(i-1)}, s_t^{(i)}, t) - γ_i * ε_{π_i}(s_t^{(i)}, t)]
    tilde_X_0 = X_t + σ_t * S_t                           # unguided denoised estimate
    S_t = S_t + sum_j α_j * ∇_{X_t} log h_j(tilde_X_0)    # constraint guidance
    tilde_X_0 = X_t + σ_t * S_t                           # guided denoised estimate
    X_{t-1} ~ N(tilde_X_0, σ_{t-1}^2 I)
Return X_0 = (s^{(0)}, a^{(0)}, ..., s^{(K)})
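Under toy assumptions, the reverse chain above can be exercised end to end: a single Gaussian "skill prior" N(mu, I) stands in for the learned networks, so the exact noise prediction ε(x, t) = σ_t (mu − x) / (1 + σ_t²) is available in closed form. mu, T, and the schedule are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: data distribution N(mu, I), so p_t = N(mu, (1 + σ_t²) I) and the
# exact noise prediction is ε(x, t) = σ_t (mu - x) / (1 + σ_t²).
mu = np.array([1.0, -1.0, 0.5])
T = 50
sigmas = np.geomspace(0.02, 5.0, T)            # σ_1 < ... < σ_T

N = 2000                                       # run many chains in parallel
X = sigmas[-1] * rng.normal(size=(N, 3))       # X_T ~ N(0, σ_T² I)
for t in range(T - 1, -1, -1):
    eps = sigmas[t] * (mu - X) / (1.0 + sigmas[t] ** 2)  # exact σ_t ∇ log p_t
    x0_hat = X + sigmas[t] * eps                         # denoised estimate X̃_0
    sig_prev = sigmas[t - 1] if t > 0 else 0.0
    X = x0_hat + sig_prev * rng.normal(size=(N, 3))      # X_{t-1} ~ N(X̃_0, σ² I)
# the sample mean of the final X lands close to mu
```

With the composed multi-skill score and constraint gradients substituted for the single Gaussian score, the same loop realizes the full GSC sampler.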
Post-processing involves validating sampled trajectories with an auxiliary skill-success predictor and, if a step fails, replanning from that step onward. This approach enables robust handling of perturbations and modular replanning.
5. Skill Prior Training and Model Architecture
For each skill π, training uses successful transitions (s, a, s') from D_π. Skill score networks adopt a DiT-style (transformer-based) architecture with:
- Hidden size 128,
- 4 blocks,
- 4 attention heads,
- MLP ratio 4,
- Dropout 0.1.
Loss is the standard denoising score-matching objective. Training runs approximately 100 epochs with the Adam optimizer at a fixed learning rate.
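For reference, the hyperparameters above can be collected in a config object; the class and field names are assumptions, not the paper's code.

```python
from dataclasses import dataclass

# Hypothetical config collecting the DiT-style hyperparameters listed above.
@dataclass(frozen=True)
class SkillScoreNetConfig:
    hidden_size: int = 128
    depth: int = 4          # number of transformer blocks
    num_heads: int = 4
    mlp_ratio: int = 4
    dropout: float = 0.1

    @property
    def head_dim(self) -> int:
        # per-head width for multi-head attention
        return self.hidden_size // self.num_heads

    @property
    def mlp_hidden(self) -> int:
        # feed-forward hidden width implied by the MLP ratio
        return self.hidden_size * self.mlp_ratio
```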
6. Experimental Evaluation and Comparative Analysis
GSC was evaluated on three long-horizon manipulation task suites: Hook Reach, Constrained Packing, and Rearrangement Push, with skeleton lengths 4–8. Each configuration was tested over 100 randomized environments. Baseline comparisons included:
- Random CEM (uniform prior),
- STAP (policy-CEM with learned prior),
- DAF (generalization-oriented skill chaining).
Performance metrics indicate GSC achieves success rates comparable or superior to baselines, with strictly lower search cost and no per-task retraining. For example, in Rearrangement Push Task 2 (6-step skeleton), success rates were:
- Random CEM: 0.10,
- STAP: 0.52,
- GSC: 0.60.
Ablation experiments on constraint guidance demonstrated that adding task-specific constraints (e.g., maximizing inter-place pose separation) improved success from 0.50–0.80 (no guidance) to 1.00 on 2 out of 3 packing tasks.
Real-world hardware validation was performed on a Franka Panda arm with RealSense depth sensing and 6-DoF scene estimation via AprilTags. Plans generated by GSC in simulation were executed open-loop, and GSC demonstrated the capacity to replan in response to small object pose perturbations.
7. Strengths, Limitations, and Potential Extensions
Strengths:
- Multi-modal generative capacity enables sampling of diverse, plausible skill transitions.
- Compositionality: generalizes to arbitrary-length skill chains without retraining on full plans.
- Parallel inference: joint sampling over steps mitigates exponential search explosion.
- Flexibility: constraint guidance enables task-constrained and geometry-aware planning at inference.
Limitations:
- Assumes a given skeleton Φ; symbolic skeleton generation (full TAMP) is out of scope.
- Requires complete, low-dimensional state observability.
- Depends on curated demonstration data for every primitive; does not support unsupervised skill discovery.
- Trade-off between solution quality and inference time set by the number of diffusion steps.
Proposed Extensions:
- Learning joint distributions over skeletons and continuous actions via macro diffusion models.
- End-to-end pixel-level diffusion models for vision-to-skill planning.
- Online adaptation of skill priors from observed failures in deployment.
- Accounting for dynamics uncertainty through stochastic reverse SDEs.
GSC introduces a new methodology for long-horizon skill planning by training per-skill diffusion-based priors and composing them via product-of-experts and classifier guidance, providing generalizable, efficient, and constraint-aware manipulation planning without combinatorial search (Mishra et al., 2023).