Conditional Neural Movement Primitives

Updated 14 March 2026
  • Conditional Neural Movement Primitives (CNMPs) are conditional, probabilistic auto-encoders that model continuous robot trajectories from demonstration data.
  • Vector-Quantized CNMP (VQ-CNMP) introduces a discrete skill bottleneck to cluster demonstrations into symbolic skills while preserving trajectory execution fidelity.
  • The neuro-symbolic planning pipeline integrates high-level LLM-based symbolic planning with gradient-based low-level refinement for precise robotic control.

Conditional Neural Movement Primitives (CNMPs) are conditional, probabilistic auto-encoders that model continuous time-series such as robot trajectories. The CNMP framework is designed for learning from demonstration, enabling unsupervised discovery and encoding of complex skills, as well as providing high-fidelity reconstruction and inference of trajectories. Vector-Quantized CNMP (VQ-CNMP) augments the CNMP architecture with a discrete skill bottleneck, supporting unsupervised clustering of demonstrations into symbolic skills while retaining the expressiveness of continuous representations for trajectory execution. VQ-CNMP serves as the backbone of a neuro-symbolic bi-level planning pipeline, enabling both symbolic high-level planning with LLMs and differentiable low-level trajectory refinement via gradient-based optimization (Aktas et al., 2024).

1. Foundations of Conditional Neural Movement Primitives

CNMPs provide an unsupervised sequence modeling framework for robot trajectories, operating over datasets of demonstrations $\mathcal{D} = \{ D_1, \ldots, D_M \}$, where each demonstration $D_j$ is a time-ordered series of sensorimotor pairs:

$$D_j = \{ (t_i, x_i) \}_{i=1}^{n_j}, \quad t_i \in \mathbb{R}, \ x_i \in \mathbb{R}^d.$$

Each $(t_i, x_i)$ is encoded via a neural function $E_\theta$, yielding pointwise latents $z_i$:

$$z_i = E_\theta(t_i, x_i).$$

Contextual aggregation is performed by averaging these latents:

$$z_e = \frac{1}{n} \sum_{i=1}^n z_i.$$

The aggregated latent $z_e$ is concatenated with a query timestamp $t^*$ and passed to a decoder $Q_\phi$, producing the parameters of a Gaussian distribution over the sensorimotor state:

$$(\mu(t^*), \sigma^2(t^*)) = Q_\phi([z_e; t^*]).$$

This defines the conditional likelihood of the state $x$ at time $t^*$:

$$p(x \mid t^*, z_e) = \mathcal{N}(x; \mu(t^*), \sigma^2(t^*)).$$

The training objective is the negative log-likelihood over held-out queries,

$$\mathcal{L}_{\text{CNMP}} = - \sum_{(t^*, x) \in \text{queries}} \log p(x \mid t^*, z_e).$$

This configuration enables CNMPs to learn flexible, temporally conditioned generative models from unordered demonstration segments, because the mean aggregation is invariant to the order of the encoded points.
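The encode, average, decode, and likelihood steps above can be sketched in a few lines of NumPy. This is a minimal illustration with random, untrained weights, not the authors' implementation: the layer sizes, the `tanh` activations, the softplus used to keep the variance positive, and all function names (`init_mlp`, `cnmp_predict`, `gaussian_nll`) are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, latent_dim, hidden = 3, 8, 16  # state dim, latent size, hidden width (illustrative values)

def init_mlp(sizes):
    """Random weights for a small MLP; a trained CNMP would learn these."""
    return [(rng.normal(0.0, 0.3, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    for W, b in params[:-1]:
        x = np.tanh(x @ W + b)
    W, b = params[-1]
    return x @ W + b

encoder = init_mlp([1 + d, hidden, latent_dim])      # E_theta: (t_i, x_i) -> z_i
decoder = init_mlp([latent_dim + 1, hidden, 2 * d])  # Q_phi: [z_e; t*] -> (mu, raw variance)

def cnmp_predict(t_ctx, x_ctx, t_query):
    # Encode every context pair, then average into a single latent z_e.
    z_i = mlp(encoder, np.concatenate([t_ctx[:, None], x_ctx], axis=1))
    z_e = z_i.mean(axis=0)
    # Concatenate z_e with each query time and decode.
    inp = np.concatenate([np.tile(z_e, (len(t_query), 1)), t_query[:, None]], axis=1)
    out = mlp(decoder, inp)
    mu, raw = out[:, :d], out[:, d:]
    var = np.log1p(np.exp(raw)) + 1e-4  # softplus: an assumed way to keep sigma^2 > 0
    return mu, var

def gaussian_nll(x, mu, var):
    # Negative log-likelihood of a diagonal Gaussian, summed over queries and dimensions.
    return 0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

# Toy demonstration: a 3-D trajectory sampled at 50 time steps.
t = np.linspace(0.0, 1.0, 50)
x = np.stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t), t], axis=1)
ctx = rng.choice(50, size=5, replace=False)  # random context points
mu, var = cnmp_predict(t[ctx], x[ctx], t)
nll = gaussian_nll(x, mu, var)
print(mu.shape, float(nll))
```

Because the context is reduced by a mean, reordering the context points leaves the prediction unchanged, which is why the model can be conditioned on an arbitrary subset of observations.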

2. Vector-Quantized CNMP: Discrete Skill Bottleneck

VQ-CNMP introduces a discrete codebook $\mathcal{V} = \{ v_1, \ldots, v_K \}$, $v_k \in \mathbb{R}^D$, creating a symbolic bottleneck for skill discovery. After context aggregation, the continuous latent $z_e$ is quantized:

$$k^* = \arg \min_{m = 1, \ldots, K} \| z_e - v_m \|_2, \quad z_q = v_{k^*}.$$

Decoding then proceeds as in the original architecture:

$$(\mu(t^*), \sigma^2(t^*)) = Q_\phi([z_q; t^*]).$$

The VQ-CNMP loss combines the CNMP likelihood with the vector-quantization and commitment penalties:

$$\mathcal{L} = \mathcal{L}_{\text{CNMP}} + \| \text{sg}(z_e) - z_q \|_2^2 + \beta \| z_e - \text{sg}(z_q) \|_2^2,$$

where $\text{sg}(\cdot)$ denotes the stop-gradient operator and $\beta$ is a small constant (standard choice $\beta = 0.25$).

Codebook optimization employs standard SGD, updating each codeword towards encoder outputs selected during the quantization step.
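The quantization step and the two penalty terms can be illustrated with a small worked example. The sketch below assumes a hand-picked 2-D codebook and a learning rate of 0.1; the function names `quantize` and `vq_penalties` are hypothetical. Since NumPy has no stop-gradient, the code only evaluates the terms and applies the codebook gradient by hand: both penalties share the same forward value, and $\text{sg}(\cdot)$ only decides whether the codeword or the encoder receives the gradient.

```python
import numpy as np

def quantize(z_e, codebook):
    """Return the index and vector of the codeword nearest to z_e (L2 distance)."""
    dists = np.linalg.norm(codebook - z_e, axis=1)
    k_star = int(np.argmin(dists))
    return k_star, codebook[k_star]

def vq_penalties(z_e, z_q, beta=0.25):
    """Forward values of the codebook term and the commitment term."""
    sq = float(np.sum((z_e - z_q) ** 2))
    return sq, beta * sq

# Illustrative values only.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
z_e = np.array([0.9, 1.2])

k_star, z_q = quantize(z_e, codebook)
codebook_term, commit_term = vq_penalties(z_e, z_q)

# One SGD step on the codebook term: d/dv ||sg(z_e) - v||^2 = 2 (v - z_e),
# so the selected codeword moves toward the encoder output.
lr = 0.1  # assumed learning rate
v_new = z_q - lr * 2.0 * (z_q - z_e)

print(k_star, codebook_term, commit_term, v_new)
```

After the update the selected codeword is closer to $z_e$ than before, which is the behavior the SGD codebook update described above relies on.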

3. Self-Supervised Fine-Tuning for Latent Precision

Following initial unsupervised training, each trajectory demonstration is assigned a codebook index $k^*$, and demonstrations are clustered by these discrete assignments. A second CNMP is then re-trained using the cluster assignments as labels; the supervision comes from the model's own discrete codes rather than from human annotation, hence the term self-supervised. Trajectories from each cluster are encoded and decoded with the corresponding fixed latent $z_q = v_k$. This procedure improves trajectory reconstruction precision and raises success rates in downstream low-level planning tasks (see the execution results in Section 6).
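The clustering step that precedes fine-tuning amounts to a batched nearest-codeword assignment followed by grouping. The sketch below assumes each demonstration has already been reduced to one aggregated latent; the helper names `assign_codes` and `group_by_code` and the toy values are hypothetical.

```python
import numpy as np
from collections import defaultdict

def assign_codes(latents, codebook):
    """Nearest-codeword index for each demonstration latent (rows of `latents`)."""
    # Pairwise L2 distances with shape (num_demos, num_codewords).
    dists = np.linalg.norm(latents[:, None, :] - codebook[None, :, :], axis=2)
    return dists.argmin(axis=1)

def group_by_code(assignments):
    """Map each codebook index to the list of demonstration indices assigned to it."""
    clusters = defaultdict(list)
    for demo_idx, k in enumerate(assignments):
        clusters[int(k)].append(demo_idx)
    return dict(clusters)

# Illustrative values: four demonstrations, two codewords.
codebook = np.array([[0.0, 0.0], [2.0, 2.0]])
latents = np.array([[0.1, -0.1], [1.9, 2.2], [0.3, 0.2], [2.1, 1.8]])

assignments = assign_codes(latents, codebook)
clusters = group_by_code(assignments)
print(clusters)
```

During fine-tuning, every demonstration in `clusters[k]` would be decoded from the same fixed latent `codebook[k]`.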

4. Neuro-Symbolic Bi-Level Planning Pipeline

The bi-level pipeline exploits the structure learned in VQ-CNMP, integrating symbolic planning at the skill level with continuous refinement for physical task execution.

  • High-Level Skill Planning: Each codebook vector $v_k$ is associated with a skill description (e.g., "take tomato from left cupboard and add to pan"). A multi-modal LLM (e.g., ChatGPT-4o) is prompted with an overhead image, the list of skill descriptions, and a goal stated in natural language. The LLM returns an ordered list of skill indices $[k_1, \ldots, k_N]$ to achieve the specified goal.
  • Low-Level Trajectory Planning (Gradient-Based Optimization): For each high-level step $k_i$, the latent $z^{(0)} = v_{k_i}$ is refined to align trajectory predictions with precise environmental constraints (e.g., object positions). The loss at contact time $t_c$ is

$$L^{(j)} = \| \mu_c - p_{\text{obj}} \|_2^2,$$

where $\mu_c = \mu(t_c)$ is the decoder's predicted mean at the contact time for the current latent $z^{(j)}$, and $p_{\text{obj}}$ is the target 3D position. Gradient descent is applied to $z$:

$$z^{(j+1)} = z^{(j)} - \alpha \nabla_z L^{(j)}.$$

Refinement iterates until convergence ($\| \mu_c - p_{\text{obj}} \| < \epsilon$), after which the decoder is used to generate the full low-level trajectory for the skill step. Final plans concatenate these trajectories sequentially.
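The refinement loop can be sketched as follows. To keep the example self-contained, the trained decoder is replaced by a toy linear map $\mu_c = W z + b$, for which the gradient has the closed form $2 W^\top (W z + b - p_{\text{obj}})$; with a real neural decoder the gradient would come from automatic differentiation. The matrix `W`, the offset `b`, the step size, the tolerance, and the function names are all assumptions made for illustration.

```python
import numpy as np

# Toy stand-in for the decoder evaluated at the contact time: mu_c = W z + b.
W = np.array([[1.0, 0.0, 0.0, 0.5],
              [0.0, 1.0, 0.0, -0.5],
              [0.0, 0.0, 1.0, 0.25]])
b = np.array([0.1, -0.2, 0.3])

def decode_contact(z):
    return W @ z + b

def refine_latent(z0, p_obj, alpha=0.3, eps=1e-4, max_steps=500):
    """Gradient descent on z for L = ||mu_c - p_obj||^2; stops when the error is below eps."""
    z = z0.copy()
    for step in range(max_steps):
        residual = decode_contact(z) - p_obj
        if np.linalg.norm(residual) < eps:
            return z, step
        grad = 2.0 * W.T @ residual  # analytic gradient for the linear toy decoder
        z = z - alpha * grad
    return z, max_steps

v_k = np.array([0.2, 0.0, -0.1, 0.4])  # codeword chosen by the high-level plan (illustrative)
p_obj = np.array([0.5, 0.1, 0.6])      # target 3D object position (illustrative)

z_star, n_steps = refine_latent(v_k, p_obj)
print(n_steps, decode_contact(z_star))
```

Starting from the codeword keeps the search inside the region of latent space associated with the chosen skill, while the small gradient steps adapt the contact point to the observed object position.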

5. Modifications to Inference and Comparison to Standard CNMP

At inference, standard CNMPs encode context and decode from the continuous latent $z_e$, preserving the original trajectory distribution. In VQ-CNMP, inference defaults to hard quantization: the nearest codebook vector $v_k$ is used for decoding, optionally refined through the gradient-based update of Section 4 for task-specific adjustments. After self-supervised fine-tuning, inference uses the assigned $v_k$ directly, replacing the nearest-neighbor search with a fixed, deterministic latent.

This bifurcation between standard and vector-quantized inference regimes enables flexible trade-offs between symbolic skill abstraction and low-level control fidelity.

6. Empirical Validation: Skill Discovery, Symbolic Planning, and Control

Experiments were performed in a simulated kitchen with five parametrized skills, each with 100 demonstration trajectories and randomized object placements. Key quantitative results include:

  • Skill Discovery: Across 100 VQ-CNMPs trained with $K=5$ codewords, perfect cluster-to-skill mapping was achieved in 27% of runs; mean clustering accuracy was 0.97. Increasing $K$ above the true number of skills gradually reduced clustering performance, as excess codes subdivide true skill classes.
  • Skill Labeling by LLM: Using decoded trajectories as visual prompts for ChatGPT-4o, labeling accuracy with five consecutive snapshots was approximately 60% (randomized snapshots up to 71% for "pickup tomato"), indicating incomplete but non-trivial recognition by state-of-the-art LLMs.
  • High-Level Planning Accuracy: Given only the five required skills, plans generated for assembling "stew of X, Y, Z" had 59.7% symbolic correctness. Introducing distractor skills (relevant or irrelevant) decreased performance, but explicit provision of object locations and hidden-object skills increased success up to 98.4%.
  • Low-Level Trajectory Execution: Self-supervised fine-tuned VQ-CNMP significantly exceeded unsupervised baselines in control success (e.g., for "pickup": 100% vs. 80-90%, for "put-in-pan": 80-90% vs. 0%), with convergence times of ~70 versus ~400 gradient steps.

| Object | Pickup (unsup.) | Pickup (self-sup.) | Put-in-pan (unsup.) | Put-in-pan (self-sup.) |
|---|---|---|---|---|
| Tomato | 80% | 100% | 0% | 90% |
| Mushroom | 40% | 100% | 0% | 85% |
| Potato | 0% | 90% | 0% | 80% |
| Salt | 40% | 100% | 0% | 85% |
| Oil | 90% | 100% | 0% | 80% |

This combination of discrete skill discovery, symbolic planning, and differentiable trajectory refinement illustrates the capacity of VQ-CNMP for integrated neuro-symbolic planning in continuous control domains (Aktas et al., 2024).

7. Significance and Prospects

VQ-CNMP extends the expressiveness of CNMPs by introducing discrete, interpretable skill symbols, supporting unsupervised skill abstraction and reliable low-level action planning. Self-supervised fine-tuning resolves cluster assignment ambiguities and increases trajectory precision. The bi-level pipeline unifies symbolic reasoning and differentiable control, facilitating scalable skill discovery, labeling, planning, and execution in complex, sensorimotor environments.

A plausible implication is that future neuro-symbolic planners may further enhance generalization by incorporating active skill discovery, context-driven codebook adaptation, or closed-loop integration with multimodal LLMs for semantically grounded task specification and robust execution in unstructured environments (Aktas et al., 2024).

References (1)
