Conditional Neural Movement Primitives
- Conditional Neural Movement Primitives (CNMPs) are conditional, probabilistic auto-encoders that model continuous robot trajectories from demonstration data.
- Vector-Quantized CNMP (VQ-CNMP) introduces a discrete skill bottleneck to cluster demonstrations into symbolic skills while preserving trajectory execution fidelity.
- The neuro-symbolic planning pipeline integrates high-level LLM-based symbolic planning with gradient-based low-level refinement for precise robotic control.
Conditional Neural Movement Primitives (CNMPs) are conditional, probabilistic auto-encoders that model continuous time-series such as robot trajectories. The CNMP framework is designed for learning from demonstration, enabling unsupervised discovery and encoding of complex skills, as well as providing high-fidelity reconstruction and inference of trajectories. Vector-Quantized CNMP (VQ-CNMP) augments the CNMP architecture with a discrete skill bottleneck, supporting unsupervised clustering of demonstrations into symbolic skills while retaining the expressiveness of continuous representations for trajectory execution. VQ-CNMP serves as the backbone of a neuro-symbolic bi-level planning pipeline, enabling both symbolic high-level planning with LLMs and differentiable low-level trajectory refinement via gradient-based optimization (Aktas et al., 2024).
1. Foundations of Conditional Neural Movement Primitives
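CNMPs are an unsupervised sequence-modeling framework for robot trajectories, operating over datasets of demonstrations where each demonstration is a time-ordered series of sensorimotor pairs:
$$\tau = \{(t_i, x_i)\}_{i=1}^{T}$$
Each context pair $(t_i, x_i)$, $i = 1, \dots, n$, is encoded via a neural function $E_\theta$, yielding pointwise latents $r_i$:
$$r_i = E_\theta(t_i, x_i)$$
Contextual aggregation is performed by averaging these latents:
$$r = \frac{1}{n} \sum_{i=1}^{n} r_i$$
The aggregated latent $r$ is concatenated with a query timestamp $t_q$ and passed to a decoder $Q_\phi$, producing the parameters of a Gaussian distribution over the sensorimotor state:
$$(\mu_q, \sigma_q) = Q_\phi([r; t_q])$$
This defines the conditional likelihood of the state $x_q$ at time $t_q$:
$$p(x_q \mid r, t_q) = \mathcal{N}\big(x_q;\, \mu_q, \operatorname{diag}(\sigma_q^2)\big)$$
The training objective is the negative log-likelihood over held-out queries,
$$\mathcal{L}_{\mathrm{CNMP}} = -\log p(x_q \mid r, t_q).$$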
This configuration enables CNMPs to learn flexible, temporally-conditioned generative models from unordered demonstration segments.
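The encode, average, decode, and score steps described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the single-layer `encode` and `decode` functions, the random weights `W` and `V`, the softplus on the standard deviation, and the 1-D sine demonstration are all assumptions chosen to keep the example self-contained.

```python
import math
import random

def encode(t, x, W):
    """Toy encoder (assumed): one linear layer + tanh mapping (t, x) to a latent."""
    inp = [t] + list(x)
    return [math.tanh(sum(w * v for w, v in zip(row, inp))) for row in W]

def aggregate(latents):
    """Mean over pointwise latents; the result ignores the order of context points."""
    n = len(latents)
    return [sum(col) / n for col in zip(*latents)]

def decode(r, t_q, V):
    """Toy decoder (assumed): linear map from [r; t_q] to (mu, sigma) for a 1-D state."""
    inp = list(r) + [t_q]
    mu = sum(w * v for w, v in zip(V[0], inp))
    raw = sum(w * v for w, v in zip(V[1], inp))
    sigma = math.log1p(math.exp(raw)) + 1e-3  # softplus keeps sigma positive
    return mu, sigma

def gaussian_nll(x, mu, sigma):
    """Negative log-likelihood of x under N(mu, sigma^2)."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (x - mu) ** 2 / (2 * sigma ** 2)

rng = random.Random(0)
latent_dim = 4
W = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(latent_dim)]      # encoder weights
V = [[rng.uniform(-1, 1) for _ in range(latent_dim + 1)] for _ in range(2)]  # decoder weights

# One hypothetical demonstration of a 1-D state sampled at 20 time steps.
demo = [(i / 19, math.sin(math.pi * i / 19)) for i in range(20)]
t_q, x_q = demo[10]                               # held-out query point
context = rng.sample(demo[:10] + demo[11:], 3)    # context excludes the query

r = aggregate([encode(t, [x], W) for t, x in context])
mu_q, sigma_q = decode(r, t_q, V)
loss = gaussian_nll(x_q, mu_q, sigma_q)           # quantity minimized during training
```

Because the aggregation is a mean, shuffling the context points leaves `r` unchanged, which is what allows training on unordered demonstration segments.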
2. Vector-Quantized CNMP: Discrete Skill Bottleneck
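VQ-CNMP introduces a discrete codebook $\{e_k\}_{k=1}^{K}$, $e_k \in \mathbb{R}^{d}$, creating a symbolic bottleneck for skill discovery. After context aggregation, the continuous latent $r$ is quantized to its nearest codeword:
$$k^* = \arg\min_{k} \lVert r - e_k \rVert_2, \qquad z = e_{k^*}$$
Decoding then proceeds as in the original architecture, with $z$ as the decoder input:
$$(\mu_q, \sigma_q) = Q_\phi([z; t_q])$$
The VQ-CNMP loss combines the CNMP likelihood with the vector-quantization and commitment penalties:
$$\mathcal{L}_{\mathrm{VQ\text{-}CNMP}} = -\log p(x_q \mid z, t_q) + \lVert \mathrm{sg}[r] - e_{k^*} \rVert_2^2 + \beta\, \lVert r - \mathrm{sg}[e_{k^*}] \rVert_2^2$$
where $\mathrm{sg}[\cdot]$ denotes the stop-gradient operator and $\beta$ is a small constant weighting the commitment term (a standard choice, following VQ-VAE, is $\beta = 0.25$).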
Codebook optimization employs standard SGD, moving each codeword toward the encoder outputs that were assigned to it during the quantization step.
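The quantization step and the codeword update can be illustrated with plain Python. This is a sketch under stated assumptions: the 2-D codebook, the latent `r`, the learning rate, and the helper names are hypothetical, and the default `beta=0.25` is the common VQ-VAE value rather than a setting confirmed by the source. Without an autodiff framework the stop-gradient cannot be expressed directly, so the code only evaluates the two penalty terms, which are numerically equal up to the factor `beta` and differ only in which argument would receive gradients.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def quantize(r, codebook):
    """Return (index, codeword) of the codeword nearest to r."""
    k = min(range(len(codebook)), key=lambda i: sq_dist(r, codebook[i]))
    return k, codebook[k]

def vq_penalties(r, e, beta=0.25):
    """Values of the codebook term and the commitment term for latent r and codeword e."""
    d = sq_dist(r, e)
    return d, beta * d

def sgd_codeword_step(e, r, lr=0.1):
    """One SGD step on ||r - e||^2 with respect to e only (r treated as constant)."""
    return [ei - lr * 2.0 * (ei - ri) for ei, ri in zip(e, r)]

# Hypothetical codebook with three 2-D codewords and one aggregated latent.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
r = [0.8, 1.3]

k, e = quantize(r, codebook)                   # hard assignment
codebook_term, commit_term = vq_penalties(r, e)
codebook[k] = sgd_codeword_step(e, r)          # selected codeword moves toward r
```

Only the selected codeword is updated, so codewords that are never nearest to any latent receive no gradient from this term.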
3. Self-Supervised Fine-Tuning for Latent Precision
Following initial unsupervised training, each demonstration is assigned a codebook index $k$, and demonstrations are clustered by these discrete assignments. A second CNMP is then trained in a supervised regime on these self-generated labels: trajectories from each cluster are encoded and decoded with the corresponding fixed latent $e_k$. This procedure improves trajectory reconstruction precision and raises success rates in downstream low-level planning tasks (see Section 6).
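The cluster-assignment step that produces the labels for fine-tuning can be sketched as follows. The codebook, the per-demonstration latents, and the helper names are hypothetical, and nearest-codeword assignment is assumed as the labeling rule; the second-stage training loop itself is omitted.

```python
from collections import defaultdict

def nearest_index(r, codebook):
    """Index of the codeword closest to r (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(r, codebook[k])))

def group_by_code(demo_latents, codebook):
    """Map each code index to the list of demonstration ids assigned to it."""
    groups = defaultdict(list)
    for demo_id, r in enumerate(demo_latents):
        groups[nearest_index(r, codebook)].append(demo_id)
    return dict(groups)

# Hypothetical values: two codewords and five aggregated demonstration latents.
codebook = [[0.0, 0.0], [1.0, 1.0]]
demo_latents = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.3], [1.1, 0.8], [0.7, 0.9]]
groups = group_by_code(demo_latents, codebook)

# Pairs for the second stage: (fixed latent e_k, demonstration id).
training_pairs = [(codebook[k], demo_id)
                  for k, ids in sorted(groups.items())
                  for demo_id in ids]
```

Each pair tells the second-stage model which fixed latent to decode from when reconstructing a given demonstration.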
4. Neuro-Symbolic Bi-Level Planning Pipeline
The bi-level pipeline exploits the structure learned in VQ-CNMP, integrating symbolic planning at the skill level with continuous refinement for physical task execution.
- High-Level Skill Planning: Each codebook vector $e_k$ is associated with a skill description (e.g., "take tomato from left cupboard and add to pan"). A multi-modal LLM (e.g., ChatGPT-4o) is prompted with an overhead image, the list of skill descriptions, and a goal stated in natural language. The LLM returns an ordered list of skill indices $(k_1, \dots, k_m)$ intended to achieve the specified goal.
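- Low-Level Trajectory Planning (Gradient-Based Optimization): For each high-level step $j$, the latent $z_j$ of the selected skill is refined to align trajectory predictions with precise environmental constraints (e.g., object positions). The loss at contact time $t_c$ is
$$\mathcal{L}_{\mathrm{ref}}(z_j) = \lVert \mu(z_j, t_c) - p^* \rVert_2^2$$
with $\mu(z_j, t_c)$ the decoder mean and $p^*$ as the target 3D position. Gradient descent with step size $\eta$ is applied to $z_j$:
$$z_j \leftarrow z_j - \eta\, \nabla_{z_j} \mathcal{L}_{\mathrm{ref}}(z_j)$$
Refinement iterates until convergence, after which the decoder is used to generate the full low-level trajectory for the skill step. Final plans concatenate these trajectories sequentially.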
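The low-level refinement loop can be illustrated with a toy decoder whose gradient is available in closed form. Everything here is an assumption made for the sake of a runnable example: the linear decoder mean `A z + b t` stands in for the trained network, the squared-error loss and the tolerance-based stopping rule are illustrative choices, and the numeric values are hypothetical. In practice the gradient would come from automatic differentiation through the trained decoder.

```python
def mat_vec(A, v):
    """Matrix-vector product for lists of lists."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def decoder_mean(z, t, A, b):
    """Toy linear decoder mean (assumed stand-in): mu(z, t) = A z + b t."""
    return [m + bi * t for m, bi in zip(mat_vec(A, z), b)]

def refine_latent(z, t_c, target, A, b, lr=0.1, tol=1e-8, max_steps=1000):
    """Gradient descent on ||mu(z, t_c) - target||^2 with respect to z."""
    z = list(z)
    for step in range(max_steps + 1):
        residual = [m - p for m, p in zip(decoder_mean(z, t_c, A, b), target)]
        loss = sum(e * e for e in residual)
        if loss < tol or step == max_steps:
            return z, loss, step
        # Analytic gradient for the linear decoder: 2 * A^T * residual.
        grad = [2.0 * sum(A[i][j] * residual[i] for i in range(len(A)))
                for j in range(len(z))]
        z = [zj - lr * gj for zj, gj in zip(z, grad)]

# Hypothetical decoder parameters, start latent (e.g. a codeword), and 3D target.
A = [[1.0, 0.2, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
b = [0.5, -0.2, 0.3]
z0 = [0.0, 0.0, 0.0]
t_c = 0.6
target = [0.4, 0.1, 0.9]

z, loss, steps = refine_latent(z0, t_c, target, A, b)
```

After refinement, decoding `z` over the full range of timestamps would yield the trajectory for that skill step; repeating this per step and concatenating the results gives the final plan.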
5. Modifications to Inference and Comparison to Standard CNMP
At inference, standard CNMPs encode the context and decode from the continuous latent $r$, preserving the original trajectory distribution. In VQ-CNMP, inference defaults to hard quantization: the nearest codebook vector $e_{k^*}$ is used for decoding, optionally refined through the gradient-based update of Section 4 for task-specific adjustments. After self-supervised fine-tuning, inference uses the assigned codeword $e_k$ directly, replacing nearest-neighbor search with a deterministic encoding.
This bifurcation between standard and vector-quantized inference regimes enables flexible trade-offs between symbolic skill abstraction and low-level control fidelity.
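The three inference regimes differ only in which latent is handed to the decoder, which the following sketch makes explicit. The function name `select_latent`, the mode strings, and the example values are hypothetical and serve only to contrast the regimes.

```python
def nearest_index(r, codebook):
    """Index of the codeword closest to r (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(r, codebook[k])))

def select_latent(mode, codebook=None, r=None, assigned_index=None):
    """Choose the decoder input for a given inference regime."""
    if mode == "continuous":   # standard CNMP: decode from the aggregated latent
        return list(r)
    if mode == "quantized":    # VQ-CNMP default: snap to the nearest codeword
        return list(codebook[nearest_index(r, codebook)])
    if mode == "assigned":     # after fine-tuning: look up the assigned codeword
        return list(codebook[assigned_index])
    raise ValueError(f"unknown inference mode: {mode}")

# Hypothetical codebook and aggregated context latent.
codebook = [[0.0, 0.0], [1.0, 1.0]]
r = [0.8, 0.9]

z_continuous = select_latent("continuous", r=r)
z_quantized = select_latent("quantized", codebook=codebook, r=r)
z_assigned = select_latent("assigned", codebook=codebook, assigned_index=0)
```

Any of the three latents can subsequently be passed through gradient-based refinement when a task-specific target must be met.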
6. Empirical Validation: Skill Discovery, Symbolic Planning, and Control
Experiments were performed in a simulated kitchen with five parametrized skills, each with 100 demonstration trajectories and randomized object placements. Key quantitative results include:
- Skill Discovery: Across 100 VQ-CNMPs trained with $K$ codewords, perfect cluster-to-skill mapping was achieved in 27% of runs; mean clustering accuracy was 0.97. Increasing $K$ above the true number of skills gradually reduced clustering performance, as excess codes subdivide true skill classes.
- Skill Labeling by LLM: Using decoded trajectories as visual prompts for ChatGPT-4o, labeling accuracy with five consecutive snapshots was approximately 60% (with randomized snapshots, up to 71% for "pickup tomato"), indicating incomplete but non-trivial recognition by state-of-the-art LLMs.
- High-Level Planning Accuracy: Given only the five required skills, plans generated for assembling "stew of X, Y, Z" had 59.7% symbolic correctness. Introducing distractor skills (relevant or irrelevant) decreased performance, but explicit provision of object locations and hidden-object skills increased success up to 98.4%.
- Low-Level Trajectory Execution: Self-supervised fine-tuned VQ-CNMP significantly exceeded unsupervised baselines in control success (e.g., for "pickup": 100% vs. 80-90%, for "put-in-pan": 80-90% vs. 0%), with convergence times of ~70 versus ~400 gradient steps.
| Object | Pickup (unsup.) | Pickup (self-sup.) | Put-in-pan (unsup.) | Put-in-pan (self-sup.) |
|---|---|---|---|---|
| Tomato | 80% | 100% | 0% | 90% |
| Mushroom | 40% | 100% | 0% | 85% |
| Potato | 0% | 90% | 0% | 80% |
| Salt | 40% | 100% | 0% | 85% |
| Oil | 90% | 100% | 0% | 80% |
This combination of discrete skill discovery, symbolic planning, and differentiable trajectory refinement illustrates the capacity of VQ-CNMP for integrated neuro-symbolic planning in continuous control domains (Aktas et al., 2024).
7. Significance and Prospects
VQ-CNMP extends the expressiveness of CNMPs by introducing discrete, interpretable skill symbols, supporting unsupervised skill abstraction and reliable low-level action planning. Self-supervised fine-tuning resolves cluster assignment ambiguities and increases trajectory precision. The bi-level pipeline unifies symbolic reasoning and differentiable control, facilitating scalable skill discovery, labeling, planning, and execution in complex, sensorimotor environments.
A plausible implication is that future neuro-symbolic planners may further enhance generalization by incorporating active skill discovery, context-driven codebook adaptation, or closed-loop integration with multimodal LLMs for semantically grounded task specification and robust execution in unstructured environments (Aktas et al., 2024).