Conditional Neural Movement Primitives

Updated 14 March 2026

Conditional Neural Movement Primitives (CNMPs) are conditional, probabilistic auto-encoders that model continuous robot trajectories from demonstration data.
Vector-Quantized CNMP (VQ-CNMP) introduces a discrete skill bottleneck to cluster demonstrations into symbolic skills while preserving trajectory execution fidelity.
The neuro-symbolic planning pipeline integrates high-level LLM-based symbolic planning with gradient-based low-level refinement for precise robotic control.

Conditional Neural Movement Primitives (CNMPs) are conditional, probabilistic auto-encoders that model continuous time-series such as robot trajectories. The CNMP framework is designed for learning from demonstration, enabling unsupervised discovery and encoding of complex skills, as well as providing high-fidelity reconstruction and inference of trajectories. Vector-Quantized CNMP (VQ-CNMP) augments the CNMP architecture with a discrete skill bottleneck, supporting unsupervised clustering of demonstrations into symbolic skills while retaining the expressiveness of continuous representations for trajectory execution. VQ-CNMP serves as the backbone of a neuro-symbolic bi-level planning pipeline, enabling both symbolic high-level planning with LLMs and differentiable low-level trajectory refinement via gradient-based optimization (Aktas et al., 2024).

1. Foundations of Conditional Neural Movement Primitives

CNMPs are an unsupervised sequence-modeling framework for robot trajectories. They operate on a dataset of demonstrations $\mathcal{D} = \{ D_1, \ldots, D_M \}$, where each demonstration $D_j$ is a time-ordered series of sensorimotor pairs:

$D_j = \{ (t_i, x_i) \}_{i=1}^{n_j}, \quad t_i \in \mathbb{R}, \ x_i \in \mathbb{R}^d.$

Each pair $(t_i, x_i)$ is encoded by a neural function $E_\theta$, yielding pointwise latents $z_i$:

$z_i = E_\theta(t_i, x_i).$

Contextual aggregation is performed by averaging these latents over the $n$ context points:

$z_e = \frac{1}{n} \sum_{i=1}^n z_i.$

The aggregated latent $z_e$ is concatenated with a query timestamp $t^*$ and passed to a decoder $Q_\phi$, producing the parameters of a Gaussian distribution over the sensorimotor state:

$(\mu^*, \sigma^*) = Q_\phi(z_e, t^*).$

This defines the conditional likelihood of the state $x^*$ at time $t^*$:

$p(x^* \mid t^*, z_e) = \mathcal{N}\big(x^*;\, \mu^*, \operatorname{diag}(\sigma^{*2})\big).$

The training objective is the negative log-likelihood over held-out queries,

$\mathcal{L}_{\mathrm{CNMP}} = -\log p(x^* \mid t^*, z_e).$

This configuration enables CNMPs to learn flexible, temporally conditioned generative models from unordered demonstration segments.
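The encode, aggregate, decode, and score steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `encode` and `decode` are hypothetical hand-written stand-ins for the learned encoder $E_\theta$ and the decoder network, and the likelihood is assumed to be a diagonal Gaussian.

```python
import math

def encode(t, x):
    """Hypothetical stand-in for E_theta: one (t, x) pair -> pointwise latent z_i."""
    return [math.tanh(t + xi) for xi in x] + [math.tanh(t)]

def aggregate(latents):
    """Average the pointwise latents z_i into a single context latent z_e."""
    n = len(latents)
    return [sum(column) / n for column in zip(*latents)]

def decode(z_e, t_query):
    """Hypothetical stand-in for the decoder: (z_e, t*) -> (mu, sigma)."""
    dim = len(z_e) - 1                      # state dimension d used by encode()
    mu = [z_e[k] + 0.1 * t_query for k in range(dim)]
    sigma = [0.1 + abs(z_e[-1])] * dim      # keeps sigma strictly positive
    return mu, sigma

def gaussian_nll(x, mu, sigma):
    """Negative log-likelihood of x under a diagonal Gaussian N(mu, sigma^2)."""
    return sum(
        0.5 * math.log(2.0 * math.pi * s * s) + (xi - m) ** 2 / (2.0 * s * s)
        for xi, m, s in zip(x, mu, sigma)
    )

# Toy demonstration: three context points and one held-out query, d = 2.
context = [(0.0, [0.0, 1.0]), (0.5, [0.2, 0.8]), (1.0, [0.4, 0.6])]
t_query, x_query = 0.75, [0.3, 0.7]

z_e = aggregate([encode(t, x) for t, x in context])
mu, sigma = decode(z_e, t_query)
loss = gaussian_nll(x_query, mu, sigma)
```

Because the latents are averaged, `z_e` does not depend on the order in which the context points are presented, which is what allows conditioning on arbitrary subsets of a demonstration.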

2. Vector-Quantized CNMP: Discrete Skill Bottleneck

VQ-CNMP introduces a discrete codebook $\mathcal{C} = \{ e_k \}_{k=1}^{K}$, $e_k \in \mathbb{R}^{d_z}$, creating a symbolic bottleneck for skill discovery. After context aggregation, the continuous latent $z_e$ is quantized:

$z_q = e_{k^*}, \quad k^* = \arg\min_{k} \| z_e - e_k \|_2.$

Decoding then proceeds as in the original architecture:

$(\mu^*, \sigma^*) = Q_\phi(z_q, t^*).$

The VQ-CNMP loss combines the CNMP likelihood with the vector-quantization and commitment penalties:

$\mathcal{L}_{\mathrm{VQ\text{-}CNMP}} = \mathcal{L}_{\mathrm{CNMP}} + \| \mathrm{sg}[z_e] - z_q \|_2^2 + \beta \, \| z_e - \mathrm{sg}[z_q] \|_2^2,$

where $\mathrm{sg}[\cdot]$ denotes the stop-gradient operator and $\beta$ is a small constant (standard choice $\beta = 0.25$).

The codebook is optimized with standard SGD: each codeword moves toward the encoder outputs that were assigned to it in the quantization step.
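The quantization step and the codeword update can be illustrated with a small plain-Python sketch. The codebook values, the learning rate, and the helper names below are hypothetical, and `beta=0.25` is the common VQ-VAE default assumed here. Plain Python has no automatic differentiation, so the stop-gradient appears only in the comments and in the hand-written gradient.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def quantize(z_e, codebook):
    """Return (k*, e_k*), the index and value of the codeword nearest to z_e."""
    k_star = min(range(len(codebook)),
                 key=lambda k: squared_distance(z_e, codebook[k]))
    return k_star, codebook[k_star]

def vq_penalties(z_e, z_q, beta=0.25):
    """Values of the codebook term and the commitment term.

    Both are built from ||z_e - z_q||^2; they differ only in which argument
    would receive gradient (the codeword vs. the encoder output).
    """
    d2 = squared_distance(z_e, z_q)
    return d2, beta * d2

def sgd_codeword_step(e_k, z_e, lr=0.1):
    """One SGD step on ||sg[z_e] - e_k||^2 w.r.t. e_k; the gradient is 2 (e_k - z_e)."""
    return [e - lr * 2.0 * (e - z) for e, z in zip(e_k, z_e)]

# Hypothetical 3-entry codebook in a 2-D latent space.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
z_e = [0.9, 0.2]

k_star, z_q = quantize(z_e, codebook)
codebook_term, commitment_term = vq_penalties(z_e, z_q)
codebook[k_star] = sgd_codeword_step(z_q, z_e)   # selected codeword moves toward z_e
```

Only the selected codeword is updated, so codewords that are never chosen receive no gradient from this term.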

3. Self-Supervised Fine-Tuning for Latent Precision

Following initial unsupervised training, each demonstration $D_j$ is assigned a codebook index $k_j$, and demonstrations are clustered by these discrete assignments. A second CNMP is then trained in a fully supervised regime that uses the assignments as labels: trajectories from each cluster are encoded and decoded with the corresponding fixed latent $e_{k_j}$. This procedure improves trajectory reconstruction precision and raises success rates in downstream low-level planning (see Section 6).
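The bookkeeping behind this step, grouping demonstrations by their assigned code and pairing each one with its cluster's fixed codeword, might look as follows. The demonstration ids, assignments, and codebook are hypothetical placeholders, and the actual retraining of the second CNMP is not shown.

```python
from collections import defaultdict

def cluster_by_code(assignments):
    """Group demonstration ids by their assigned codebook index."""
    clusters = defaultdict(list)
    for demo_id, k in assignments.items():
        clusters[k].append(demo_id)
    return dict(clusters)

def supervised_pairs(clusters, codebook):
    """Pair every demonstration with the fixed codeword of its cluster."""
    return [(demo_id, codebook[k])
            for k, demos in sorted(clusters.items())
            for demo_id in demos]

# Hypothetical assignments produced by the first, unsupervised VQ-CNMP.
assignments = {"demo_0": 2, "demo_1": 0, "demo_2": 2, "demo_3": 1}
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

clusters = cluster_by_code(assignments)
pairs = supervised_pairs(clusters, codebook)
```

Each `(demo_id, codeword)` pair would then serve as one training example in which the latent is held fixed instead of being inferred from context.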

4. Neuro-Symbolic Bi-Level Planning Pipeline

The bi-level pipeline exploits the structure learned in VQ-CNMP, integrating symbolic planning at the skill level with continuous refinement for physical task execution.

High-Level Skill Planning: Each codebook vector $e_k$ is associated with a skill description (e.g., "take tomato from left cupboard and add to pan"). A multimodal LLM (e.g., ChatGPT-4o) is prompted with an overhead image, the list of skill descriptions, and a goal formulated in natural language. The LLM returns an ordered list of skill indices $(k_1, \ldots, k_T)$ to achieve the specified goal.
Low-Level Trajectory Planning (Gradient-Based Optimization): For each high-level step $k_s$, the latent $z = e_{k_s}$ is refined to align trajectory predictions with precise environmental constraints (e.g., object positions). The loss at contact time $t_c$ is

$\mathcal{L}_{\mathrm{plan}}(z) = \| \mu(t_c; z) - p_{\mathrm{target}} \|_2^2,$

where $\mu(t_c; z)$ is the decoder mean at $t_c$, with $p_{\mathrm{target}}$ as the target 3D position. Gradient descent is applied to $z$:

$z \leftarrow z - \eta \, \nabla_z \mathcal{L}_{\mathrm{plan}}(z).$

Refinement iterates until convergence ($\mathcal{L}_{\mathrm{plan}} < \epsilon$), after which the decoder is used to generate the full low-level trajectory for the skill step. Final plans concatenate these trajectories sequentially.
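The low-level refinement loop can be sketched with a toy decoder. Everything specific below is an assumption made for illustration: the decoder mean is taken to be linear, $\mu(t; z) = W z + b\,t$, so the gradient of the contact-time loss has the closed form $2 W^\top (\mu - p)$, and the matrices, codewords, targets, step size, and tolerance are hypothetical. A learned decoder would obtain the same gradient by automatic differentiation.

```python
def decode_mean(z, t, W, b):
    """Hypothetical linear decoder mean: mu(t; z) = W z + b t."""
    return [sum(w * zk for w, zk in zip(row, z)) + bi * t
            for row, bi in zip(W, b)]

def contact_loss(z, t_c, target, W, b):
    """Squared distance between the predicted and target position at time t_c."""
    mu = decode_mean(z, t_c, W, b)
    return sum((m - p) ** 2 for m, p in zip(mu, target))

def contact_grad(z, t_c, target, W, b):
    """Analytic gradient of the contact loss w.r.t. z: 2 W^T (mu - p)."""
    mu = decode_mean(z, t_c, W, b)
    residual = [m - p for m, p in zip(mu, target)]
    return [2.0 * sum(W[i][k] * residual[i] for i in range(len(W)))
            for k in range(len(z))]

def refine_latent(z0, t_c, target, W, b, lr=0.1, eps=1e-8, max_steps=1000):
    """Gradient descent on z until the contact loss drops below eps."""
    z = list(z0)
    for step in range(max_steps):
        if contact_loss(z, t_c, target, W, b) < eps:
            return z, step
        grad = contact_grad(z, t_c, target, W, b)
        z = [zk - lr * gk for zk, gk in zip(z, grad)]
    return z, max_steps

# Toy setup: 2-D latent, 3-D position, contact at the final time t_c = 1.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [0.0, 0.0, 0.5]
t_c = 1.0
codebook = [[0.0, 0.0], [1.0, 1.0]]   # hypothetical skill codewords
plan = [0, 1]                          # skill indices from the high-level planner
targets = [decode_mean([0.3, -0.2], t_c, W, b),   # reachable by construction
           decode_mean([0.8, 1.1], t_c, W, b)]

full_plan, step_counts = [], []
for k, target in zip(plan, targets):
    z_star, steps = refine_latent(codebook[k], t_c, target, W, b)
    step_counts.append(steps)
    # Decode five waypoints per skill and append them to the overall plan.
    full_plan.extend(decode_mean(z_star, i / 4.0, W, b) for i in range(5))
```

Each refinement starts from the skill's codeword, so the optimizer only has to make a local adjustment to reach the target, and the decoded segments are appended in plan order.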

5. Modifications to Inference and Comparison to Standard CNMP

At inference, standard CNMPs encode the context and decode from the continuous latent $z_e$, preserving the original trajectory distribution. In VQ-CNMP, inference defaults to hard quantization: the nearest codebook vector $z_q$ is used for decoding, optionally refined through the gradient-based update of Section 4 for task-specific adjustments. After self-supervised fine-tuning, inference uses the assigned codeword $e_k$ directly, eschewing nearest-neighbor search in favor of a deterministic encoding.

This bifurcation between standard and vector-quantized inference regimes enables flexible trade-offs between symbolic skill abstraction and low-level control fidelity.

6. Empirical Validation: Skill Discovery, Symbolic Planning, and Control

Experiments were performed in a simulated kitchen with five parametrized skills, each with 100 demonstration trajectories and randomized object placements. Key quantitative results include:

Skill Discovery: Across 100 trained VQ-CNMPs with $K = 5$ codewords, perfect cluster-to-skill mapping was achieved in 27% of runs; mean clustering accuracy was 0.97. Increasing $K$ above the true number of skills gradually reduced clustering performance, as excess codes subdivide true skill classes.
Skill Labeling by LLM: Using decoded trajectories as visual prompts for ChatGPT-4o, labeling accuracy with five consecutive snapshots was approximately 60% (randomized snapshots up to 71% for "pickup tomato"), indicating incomplete but non-trivial recognition by state-of-the-art LLMs.
High-Level Planning Accuracy: Given only the five required skills, plans generated for assembling "stew of X, Y, Z" had 59.7% symbolic correctness. Introducing distractor skills (relevant or irrelevant) decreased performance, but explicit provision of object locations and hidden-object skills increased success up to 98.4%.
Low-Level Trajectory Execution: The self-supervised fine-tuned VQ-CNMP substantially exceeded the unsupervised baseline in control success across the five objects (for "pickup": 90-100% vs. 0-90%; for "put-in-pan": 80-90% vs. 0%), with convergence times of ~70 versus ~400 gradient steps. Per-object success rates:

Object	Pickup (unsupervised)	Pickup (self-supervised)	Put-in-pan (unsupervised)	Put-in-pan (self-supervised)
Tomato	80%	100%	0%	90%
Mushroom	40%	100%	0%	85%
Potato	0%	90%	0%	80%
Salt	40%	100%	0%	85%
Oil	90%	100%	0%	80%
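As a worked summary of the table, the per-condition averages can be computed directly from its rows. The averages are derived here for illustration and are not figures reported by the source.

```python
# Success rates (%) transcribed from the table above, in row order:
# Tomato, Mushroom, Potato, Salt, Oil.
success = {
    ("pickup", "unsupervised"): [80, 40, 0, 40, 90],
    ("pickup", "self-supervised"): [100, 100, 90, 100, 100],
    ("put-in-pan", "unsupervised"): [0, 0, 0, 0, 0],
    ("put-in-pan", "self-supervised"): [90, 85, 80, 85, 80],
}

# Unweighted mean over the five objects for each (skill, training regime) pair.
mean_success = {key: sum(rates) / len(rates) for key, rates in success.items()}

for (skill, regime), value in mean_success.items():
    print(f"{skill:11s} {regime:16s} {value:5.1f}%")
```

Averaged over objects, pickup rises from 50% to 98% and put-in-pan from 0% to 84% after self-supervised fine-tuning.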

This combination of discrete skill discovery, symbolic planning, and differentiable trajectory refinement illustrates the capacity of VQ-CNMP for integrated neuro-symbolic planning in continuous control domains (Aktas et al., 2024).

7. Significance and Prospects

VQ-CNMP extends the expressiveness of CNMPs by introducing discrete, interpretable skill symbols, supporting unsupervised skill abstraction and reliable low-level action planning. Self-supervised fine-tuning resolves cluster assignment ambiguities and increases trajectory precision. The bi-level pipeline unifies symbolic reasoning and differentiable control, facilitating scalable skill discovery, labeling, planning, and execution in complex sensorimotor environments.

A plausible implication is that future neuro-symbolic planners may further enhance generalization by incorporating active skill discovery, context-driven codebook adaptation, or closed-loop integration with multimodal LLMs for semantically grounded task specification and robust execution in unstructured environments (Aktas et al., 2024).

References

Aktas et al. (2024). VQ-CNMP: Neuro-Symbolic Skill Learning for Bi-Level Planning.
