Command-Conditioned Locomotion
- Command-conditioned locomotion is a control framework that uses explicit high-level commands to directly modulate low-level movement in robots and animated characters.
- It leverages reinforcement learning along with temporal dynamics and morphology-aware encoders to achieve robust generalization and sim-to-real transfer across varied platforms.
- The approach integrates diverse command inputs—from velocity and posture to gait and semantic cues—facilitating agile, task-adaptive locomotion in complex environments.
Command-conditioned locomotion refers to the class of control frameworks—spanning robotics, graphics, and simulation—where an agent’s low-level locomotor behavior is directly modulated by explicit, parameterized high-level commands. These commands may encode desired velocities, body postures, waypoint targets, gait patterns, or even semantic/user intent. Command-conditioning supports agile, robust, and task-adaptive movement in legged robots, humanoids, soft robots, and animated characters. The field encompasses a spectrum of architectures, reinforcement learning (RL) methods, reward routing mechanisms, and system interfaces, with an emphasis on generalization, sim-to-real robustness, and applicability to diverse morphologies.
1. Problem Formulation and Command Spaces
Command-conditioned locomotion is formalized as an infinite-horizon Markov Decision Process (MDP) or, in partial observability settings, a Partially Observable MDP (POMDP). At each decision step $t$, the agent receives:
- A proprioceptive or proprioceptive+exteroceptive observation vector $o_t$
- An explicit command vector $c_t$ specifying task objectives such as:
- Desired base linear and angular velocities: $(v_x, v_y, \omega_z)$
- Posture goals: body height, pitch, and roll
- Gait identity or phase: one-hot encodings or phase parameters
- Waypoints: target positions specified in the robot or world frame
- Contact patterns: binary matrices over legs and timesteps
- Semantic goals, e.g., landmark destinations or natural language utterances (VR/LLM applications)
The policy outputs an action $a_t$, possibly conditioned on a learned dynamics embedding $z_t$ to account for platform-specific properties or unmodeled environmental dynamics. Action spaces are typically desired joint position offsets, torques, or actuation targets for tracking via PD, QP, or learned controllers (Rytz et al., 21 May 2025, Miao et al., 6 Mar 2025, Xue et al., 5 Feb 2025, Atanassov et al., 22 Sep 2025, Peng et al., 27 May 2025).
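A minimal sketch of how such a policy input might be assembled is shown below; the `Command` dataclass, `policy_input` helper, field names, and dimensions are illustrative assumptions rather than the interface of any cited system.

```python
# Hypothetical command/observation interface (names and sizes are assumptions).
from dataclasses import dataclass, field
from typing import Optional
import numpy as np

@dataclass
class Command:
    lin_vel_xy: np.ndarray = field(default_factory=lambda: np.zeros(2))  # desired base velocity v_x, v_y (m/s)
    yaw_rate: float = 0.0                                                 # desired yaw rate omega_z (rad/s)
    body_height: float = 0.30                                             # posture goal (m)
    gait_id: np.ndarray = field(default_factory=lambda: np.zeros(4))      # one-hot gait selector

    def as_vector(self) -> np.ndarray:
        return np.concatenate([self.lin_vel_xy, [self.yaw_rate, self.body_height], self.gait_id])

def policy_input(proprio: np.ndarray, cmd: Command,
                 dyn_embedding: Optional[np.ndarray] = None) -> np.ndarray:
    """Concatenate proprioception o_t, command c_t, and an optional dynamics embedding z_t."""
    parts = [proprio, cmd.as_vector()]
    if dyn_embedding is not None:
        parts.append(dyn_embedding)
    return np.concatenate(parts)

# Example: 48-D proprioception, trot-like gait command, no dynamics embedding.
obs = np.zeros(48)
cmd = Command(lin_vel_xy=np.array([0.5, 0.0]), yaw_rate=0.2, gait_id=np.eye(4)[1])
x = policy_input(obs, cmd)  # vector fed to the policy network
```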
2. Dynamics Conditioning and Morphological Generalization
A central challenge in generalizing command-conditioned policies across robot morphologies and dynamic regimes is robust input encoding and adaptation. Approaches include:
- Temporal Dynamics Encoders (GRU-based): Learning a hidden state $h_t$ over recent state trajectories; the hidden vector summarizes temporally local dynamics and is concatenated with the observation and command $(o_t, c_t)$ before policy evaluation (Rytz et al., 21 May 2025).
- Morphology-Aware Embeddings: Encoding the robot's physical parameters (mass, inertias, link lengths, friction coefficients, PD gains) into a morphology vector, which is mapped via an MLP to a latent embedding and concatenated with $(o_t, c_t)$, as sketched below (Rytz et al., 21 May 2025). Morphology-conditioning was found to yield superior command-tracking and zero-shot transfer, particularly for heavier, highly varied platforms.
- Large-Scale Randomization: Training policies across procedurally generated robot and terrain instances exposes the policy to a wide envelope of dynamic conditions, supporting robust cross-robot policy deployment (Rytz et al., 21 May 2025, Miao et al., 6 Mar 2025).
These conditioning strategies trade off structural bias and adaptation flexibility. Morphology-aware embeddings provide strong priors with lower variance but higher bias, whereas history-based (GRU, RNN) encoders are adaptable but may overfit simulator artifacts (Rytz et al., 21 May 2025).
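The following PyTorch sketch contrasts the two conditioning routes under assumed dimensions and module names (`TemporalDynamicsEncoder`, `MorphologyEncoder`); it is a schematic of the general pattern, not the architecture of any specific paper.

```python
# Two conditioning routes: a GRU over recent history vs. a morphology MLP.
# All sizes are placeholders; PyTorch usage is an assumption of this sketch.
import torch
import torch.nn as nn

class TemporalDynamicsEncoder(nn.Module):
    """GRU over a short window of recent states/actions -> latent dynamics vector."""
    def __init__(self, in_dim: int, latent_dim: int = 32):
        super().__init__()
        self.gru = nn.GRU(in_dim, latent_dim, batch_first=True)

    def forward(self, history: torch.Tensor) -> torch.Tensor:  # history: (B, T, in_dim)
        _, h = self.gru(history)
        return h[-1]                                            # (B, latent_dim)

class MorphologyEncoder(nn.Module):
    """MLP mapping physical parameters (masses, link lengths, gains, ...) to an embedding."""
    def __init__(self, morph_dim: int, latent_dim: int = 32):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(morph_dim, 64), nn.ELU(), nn.Linear(64, latent_dim))

    def forward(self, morph_params: torch.Tensor) -> torch.Tensor:
        return self.mlp(morph_params)

# The policy then consumes [observation | command | conditioning vector(s)]:
obs, cmd = torch.zeros(1, 48), torch.zeros(1, 8)
z_dyn = TemporalDynamicsEncoder(in_dim=60)(torch.zeros(1, 20, 60))   # history-based route
z_morph = MorphologyEncoder(morph_dim=24)(torch.zeros(1, 24))        # morphology-based route
policy_in = torch.cat([obs, cmd, z_dyn, z_morph], dim=-1)
```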
3. RL Architecture, Reward Structuring, and Curriculum
Command-conditioned locomotion policies are typically trained by model-free deep RL, often via Proximal Policy Optimization (PPO):
- Policy Representation: Multi-layer perceptrons (MLPs), sometimes with recurrent (LSTM/GRU) components to model latent phase or long-term dependencies (Rytz et al., 21 May 2025, Peng et al., 27 May 2025, Mishra et al., 23 May 2025).
- Reward Functions: Composite rewards combine:
- Velocity (and optionally posture) tracking: negative quadratic tracking errors or exponentiated quadratic forms, e.g., $\exp(-\lVert v_t^{\mathrm{cmd}} - v_t \rVert^2 / \sigma)$
- Gait regularity, foot slip/clearance, energy/torque penalties, action smoothness
- Adversarial style rewards for naturalistic gaits (AMP, WGAN-div) where relevant (Huang et al., 26 Feb 2025, Miao et al., 6 Mar 2025)
- Contact- or phase-aligned objectives (e.g., tracking binary foot contacts or homogeneous gait clocks) (Atanassov et al., 22 Sep 2025, Tang et al., 2023, Tan et al., 2023)
- Reward Routing and Gait IDs: For multi-gait/humanoid control, a compact reward router activates only those terms corresponding to the currently commanded gait, using a gait-ID one-hot vector to mitigate cross-objective interference (Peng et al., 27 May 2025); a schematic example follows this list.
- Curriculum and History-Aware Learning: Automatic curriculum mechanisms can sequence commands for the RL agent, incrementally increasing task difficulty and range using an RNN-driven scheduler to account for prior performance (Mishra et al., 23 May 2025). Multi-phase curricula are essential to stably integrate complex behaviors such as standing, walking, and running within unified policies (Peng et al., 27 May 2025).
- Observation Construction: Policies often concatenate the command input with proprioception, potentially privileged or history-augmented observations, and sometimes with learned embeddings representing exteroceptive features (heightmaps, depth data) or latent dynamics (Miao et al., 6 Mar 2025, Rytz et al., 21 May 2025, Xue et al., 5 Feb 2025).
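A hedged sketch of how a composite, gait-routed reward might be organized is given below; the term names, weights, and gait table are assumptions for illustration and do not reproduce the reward definitions of the cited works.

```python
# Composite reward with gait-ID routing (all terms and weights are illustrative).
import numpy as np

def tracking_reward(target: np.ndarray, actual: np.ndarray, sigma: float = 0.25) -> float:
    """Exponentiated-quadratic tracking term: exp(-||target - actual||^2 / sigma)."""
    return float(np.exp(-np.sum((target - actual) ** 2) / sigma))

def composite_reward(state: dict, cmd: dict, gait_id: int) -> float:
    terms = {
        "lin_vel": tracking_reward(cmd["lin_vel"], state["lin_vel"]),
        "ang_vel": tracking_reward(cmd["ang_vel"], state["ang_vel"]),
        "torque": -1e-4 * float(np.sum(state["torques"] ** 2)),            # energy penalty
        "action_rate": -1e-2 * float(np.sum(state["action_delta"] ** 2)),  # smoothness penalty
        "contact_phase": state["contact_match"],                           # phase-aligned contact term
        "air_time": state["feet_air_time"],                                # running-style term
    }
    # Reward routing: activate only the terms relevant to the commanded gait.
    active = {
        0: ["lin_vel", "ang_vel", "torque", "action_rate"],                    # stand/walk
        1: ["lin_vel", "ang_vel", "torque", "action_rate", "contact_phase"],   # trot
        2: ["lin_vel", "ang_vel", "action_rate", "air_time"],                  # run
    }[gait_id]
    return sum(terms[k] for k in active)

# Example call with toy values.
state = {"lin_vel": np.array([0.45, 0.0]), "ang_vel": np.array([0.0]),
         "torques": np.zeros(12), "action_delta": np.zeros(12),
         "contact_match": 0.8, "feet_air_time": 0.3}
cmd = {"lin_vel": np.array([0.5, 0.0]), "ang_vel": np.array([0.0])}
r = composite_reward(state, cmd, gait_id=1)
```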
4. Command Interfaces: Types and Abstractions
The diversity of command spaces underscores the field’s breadth:
- Velocity + Posture Commanding: Standard in quadruped/humanoid controllers (e.g., commanded $(v_x, v_y, \omega_z)$ plus posture targets); supports low-level tracking of high-level motion-planning outputs (Miao et al., 6 Mar 2025, Rytz et al., 21 May 2025, Xue et al., 5 Feb 2025).
- Gait/Contact Conditioning: Gait is specified either directly via clock or phase variables, or as contact-patterns over legs/timesteps. Contact-conditioned controllers can execute diverse gaits, complex stepping sequences, and manipulation primitives by interpolating or switching contact objectives (Atanassov et al., 22 Sep 2025, Tang et al., 2023).
- Waypoint and Trajectory Control: Robot motion is specified by a stream of waypoints, which the low-level policy is trained to approach, thus decoupling high-level navigation (classical planning or LLM outputs) from continuous tracking (Wang et al., 27 Jun 2025, Shi et al., 30 May 2025).
- Textual/Natural Language and Semantic Inputs: LLM-based systems translate human commands or VR speech into precise movement coordinates or action codes, either via prompt engineering (VR teleportation) or by mapping user intent into contact-patterns or trajectory templates (Özdel et al., 24 Apr 2025, Tang et al., 2023).
- High-Dimensional/Behavioral Commands: Controllers such as HugWBC expose a joint space of task and behavior parameters (velocity, stepping frequency, swing height, waist yaw, etc.) for fine-grained control (Xue et al., 5 Feb 2025).
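To make the lower-level abstractions concrete, the sketch below generates a periodic gait clock, a trot-like per-leg contact pattern, and a behavior-style command dictionary; the phase offsets, duty cycle, and parameter names are assumptions, not values from the cited controllers.

```python
# Gait-clock and contact-pattern commands (illustrative trot-like parameters).
import numpy as np

def gait_clock(t: float, period: float = 0.6) -> np.ndarray:
    """Phase command as (sin, cos) of a shared gait clock."""
    phase = 2 * np.pi * (t % period) / period
    return np.array([np.sin(phase), np.cos(phase)])

def contact_pattern(t: float, period: float = 0.6, duty: float = 0.5) -> np.ndarray:
    """Binary stance/swing command per leg (FL, FR, RL, RR) for a trot-like gait."""
    offsets = np.array([0.0, 0.5, 0.5, 0.0])          # diagonal leg pairs in phase
    phases = ((t / period) + offsets) % 1.0
    return (phases < duty).astype(np.float32)          # 1 = commanded stance

# Behavior-style command in the spirit of high-dimensional interfaces (keys assumed):
behavior_cmd = {"vel_x": 0.8, "vel_y": 0.0, "yaw_rate": 0.0,
                "step_freq_hz": 1.7, "swing_height_m": 0.08, "waist_yaw_rad": 0.0}
```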
5. Generalization, Robustness, and Sim-to-Real Transfer
Command-conditioned locomotion controllers exhibit strong generalization when trained under appropriate dynamics randomization, domain transfer curricula, and sufficient model structural bias:
- Zero-shot Transfer: Policies conditioned on broad morphology/dynamics distributions achieved robust command tracking on unseen robot types without per-platform fine-tuning. E.g., PAL's zero-shot transfer from simulation to hardware quadrupeds and across different mass/inertia ranges (Rytz et al., 21 May 2025), and GeCCo's contact-conditioned policy performing on previously unseen stepping stones and manipulation scenarios (Atanassov et al., 22 Sep 2025).
- Failure Modes and Correction: Careful policy design—conditioning on multiple reference models and capturing actuator delays, using light history or morphology encoders—reduces overfitting to any single simulator or robot, and cuts command-tracking error by up to 30% versus single-model training (Rytz et al., 21 May 2025). Dedicated auxiliary modules (e.g., recovery controllers, domain-adaptive trackers) mitigate instabilities due to rare events (Gangapurwala et al., 2020).
- Preference and Multi-objective Trade-off: Recent works allow command weights (e.g., force-compliance versus tracking) to be adjusted in real-time, exposing a Pareto frontier of behaviors and supporting explicit navigation-compliance tradeoffs (Leng et al., 12 Oct 2025).
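A simple sketch of the per-episode dynamics randomization that underpins such generalization results is shown below; the parameter names and ranges are illustrative assumptions, not the schedules used in the cited papers.

```python
# Per-episode dynamics randomization (ranges and names are assumptions).
import numpy as np

RANDOMIZATION_RANGES = {
    "base_mass_scale": (0.8, 1.3),   # multiplicative perturbation of base mass
    "friction_coeff":  (0.4, 1.25),  # foot-ground friction
    "motor_strength":  (0.85, 1.15), # actuator gain scaling
    "pd_kp_scale":     (0.9, 1.1),   # PD stiffness scaling
    "push_vel_ms":     (0.0, 1.0),   # magnitude of random base pushes
}

def sample_episode_dynamics(rng: np.random.Generator) -> dict:
    """Draw one set of per-episode dynamics parameters from uniform ranges."""
    return {k: float(rng.uniform(lo, hi)) for k, (lo, hi) in RANDOMIZATION_RANGES.items()}

rng = np.random.default_rng(0)
episode_params = sample_episode_dynamics(rng)  # applied to the simulator before each rollout
```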
6. Applications and System Integration
Command-conditioned locomotion frameworks have been systematically deployed in multiple domains:
- Legged Robotics: End-to-end command-to-joint policies on quadrupeds and humanoids, encompassing walking, running, standing, posture tracking, agile maneuvers over rough/unstructured terrain, and full-body loco-manipulation (Miao et al., 6 Mar 2025, Peng et al., 27 May 2025, Atanassov et al., 22 Sep 2025, Rytz et al., 21 May 2025, Xue et al., 5 Feb 2025, Gangapurwala et al., 2020).
- Soft and Bioinspired Robotics: Command-driven gaits (walking, turning, swimming, payload transport) for soft robots leveraging magnetic or SMA actuation, using state-machines mapping command codes to excitation patterns (Javadi et al., 4 Oct 2025, Patterson et al., 2020).
- Character Animation and VR: Real-time animation pipelines (e.g., MotionPersona) map user prompts, shape vectors, and trajectory commands into high-fidelity, character-specific motions (Shi et al., 30 May 2025). LLM-driven locomotion in VR allows natural-language–to–action translation, broadening accessibility (Özdel et al., 24 Apr 2025).
- Hierarchical and Hybrid Planning: Decoupled architectures process high-level navigation or user intent (text/vision/waypoints) and inject the result as commands to a general low-level locomotion skill module (Wang et al., 27 Jun 2025, Seo et al., 2022, Tan et al., 2023, Tang et al., 2023).
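The decoupling in such hierarchies can be illustrated with a small sketch that converts a planner-issued waypoint into a velocity command for a generic low-level policy; the conversion law, gains, and function names are assumptions for illustration only.

```python
# Waypoint -> velocity-command conversion for a decoupled hierarchy (illustrative).
import numpy as np

def waypoint_to_velocity_cmd(base_xy: np.ndarray, base_yaw: float,
                             waypoint_xy: np.ndarray,
                             v_max: float = 0.8, k_yaw: float = 1.5) -> np.ndarray:
    """Map the current waypoint to a (v_x, v_y, yaw_rate) command in the robot frame."""
    delta_world = waypoint_xy - base_xy
    c, s = np.cos(base_yaw), np.sin(base_yaw)
    # Rotate the world-frame offset into the robot frame.
    delta_body = np.array([ c * delta_world[0] + s * delta_world[1],
                           -s * delta_world[0] + c * delta_world[1]])
    heading_err = np.arctan2(delta_body[1], delta_body[0])
    v = np.clip(delta_body, -v_max, v_max)             # simple proportional approach
    return np.array([v[0], v[1], np.clip(k_yaw * heading_err, -1.0, 1.0)])

# Control loop (schematic): planner -> waypoint -> velocity command -> policy
# cmd = waypoint_to_velocity_cmd(base_xy, base_yaw, planner.next_waypoint())
# action = policy(np.concatenate([proprio, cmd]))
```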
Performance across metrics—velocity tracking RMSE, task success, imitation distance, robustness to perturbations—consistently demonstrates the advantages of command-conditioned paradigms, with quantitative gains over reward-free or baseline systems (Rytz et al., 21 May 2025, Wang et al., 27 Jun 2025, Atanassov et al., 22 Sep 2025, Mishra et al., 23 May 2025, Shi et al., 30 May 2025).
7. Insights, Limitations, and Directions
Empirical findings from the literature highlight several key points:
- Design Recommendations:
- Expose the RL policy to diverse morphology and dynamic parameters.
- Use morphology-conditioning for highly heterogeneous robot platforms; combine with light history-based encoders to handle transient unmodeled dynamics (Rytz et al., 21 May 2025).
- Employ reward routing or gait-ID selectors in multi-gait/multi-behavior controllers to avoid objective interference (Peng et al., 27 May 2025).
- Blend high-level planners (trajectory/LLM/text/waypoint) with robust low-level command-conditioned policies for scalable, generic navigation and manipulation (Wang et al., 27 Jun 2025, Atanassov et al., 22 Sep 2025).
- Limitations:
- Overfitting of history-based dynamic inference modules to simulator idiosyncrasies may degrade transfer to hardware or larger platforms (Rytz et al., 21 May 2025).
- Pathological commands (e.g., waypoints inside untraversable obstacles) can still confound decoupled architectures unless rejection mechanisms are trained (Wang et al., 27 Jun 2025).
- Pure behavior cloning or single-command RL policies lack the flexibility and robustness necessary for deployment in heterogeneous or unpredictable environments (Peng et al., 27 May 2025, Huang et al., 26 Feb 2025).
- Real-time adaptation to entirely novel command spaces (e.g., new manipulator commands, complex team behaviors) remains an open area at scale.
Command-conditioned locomotion thus constitutes the organizing principle for versatile, robust, and user-adaptive movement in modern legged robotics and character animation. Its distinguishing features—explicit command input, broad generalization, and integration with high-level planning—define the prevailing approach to learning and deploying locomotive behavior in complex and unstructured scenes (Rytz et al., 21 May 2025, Peng et al., 27 May 2025, Wang et al., 27 Jun 2025, Atanassov et al., 22 Sep 2025, Xue et al., 5 Feb 2025, Özdel et al., 24 Apr 2025, Tang et al., 2023, Leng et al., 12 Oct 2025).