DreamControl: Human-Inspired Whole-Body Humanoid Control for Scene Interaction via Guided Diffusion

Published 17 Sep 2025 in cs.RO, cs.AI, and cs.LG | (2509.14353v1)

Abstract: We introduce DreamControl, a novel methodology for learning autonomous whole-body humanoid skills. DreamControl leverages the strengths of diffusion models and Reinforcement Learning (RL): our core innovation is the use of a diffusion prior trained on human motion data, which subsequently guides an RL policy in simulation to complete specific tasks of interest (e.g., opening a drawer or picking up an object). We demonstrate that this human motion-informed prior allows RL to discover solutions unattainable by direct RL, and that diffusion models inherently promote natural looking motions, aiding in sim-to-real transfer. We validate DreamControl's effectiveness on a Unitree G1 robot across a diverse set of challenging tasks involving simultaneous lower and upper body control and object interaction.

Summary

  • The paper introduces DreamControl, a novel method that integrates guided diffusion-based human motion priors with reinforcement learning to achieve human-like whole-body control.
  • It employs a two-stage approach, first synthesizing human motion plans and then training an RL policy to track them, achieving the best success rates on 9 of 11 complex loco-manipulation tasks.
  • Quantitative metrics and user studies validate its natural motion generation, demonstrating promising scalability for humanoid robotics applications.

Introduction and Motivation

The paper presents DreamControl, a methodology for autonomous whole-body humanoid skill acquisition that integrates guided diffusion models with reinforcement learning (RL). The approach is motivated by the limitations of current humanoid control paradigms, which often rely on teleoperation data, focus on upper-body manipulation, or require extensive reward engineering for RL. DreamControl leverages abundant human motion data to inform a diffusion prior, which guides RL policies in simulation for complex loco-manipulation tasks involving both lower and upper body coordination and object interaction.

Methodology

Stage 1: Human Motion Prior via Guided Diffusion

DreamControl utilizes OmniControl, a diffusion transformer trained on large-scale human motion datasets (e.g., HumanML3D), capable of generating trajectories conditioned on text and spatiotemporal guidance. This enables the synthesis of realistic, human-like motion plans for a wide variety of tasks, such as picking up objects, opening drawers, or pressing buttons. The generated trajectories are retargeted to the Unitree G1 robot using kinematic optimization (PyRoki), accounting for differences in morphology and physical constraints. Post-processing filters out dynamically infeasible or collision-prone trajectories using heuristics on torso angle, pelvis height, and scene collisions.
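
To make the filtering step concrete, the sketch below illustrates the kind of heuristic feasibility check described above (torso angle, pelvis height, scene collisions). The threshold values, the trajectory layout, and the contains_any collision helper are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def is_feasible(traj, scene_meshes, max_torso_tilt_rad=0.6, min_pelvis_height_m=0.35):
    """Heuristic feasibility filter for a retargeted trajectory (illustrative thresholds).

    traj: dict of per-frame arrays, e.g. traj["torso_tilt"] (T,), traj["pelvis_height"] (T,),
          and traj["body_points"] (T, K, 3) keypoints -- a hypothetical layout.
    scene_meshes: scene geometry objects exposing a hypothetical contains_any(points) check.
    """
    if np.any(np.abs(traj["torso_tilt"]) > max_torso_tilt_rad):
        return False  # torso leans too far to be dynamically plausible
    if np.any(traj["pelvis_height"] < min_pelvis_height_m):
        return False  # pelvis dips unrealistically low
    for mesh in scene_meshes:
        if mesh.contains_any(traj["body_points"].reshape(-1, 3)):
            return False  # a body keypoint penetrates the scene
    return True

# Keep only trajectories that pass all checks before handing them to RL:
# candidates = [t for t in retargeted_trajs if is_feasible(t, scene_meshes)]
```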

Stage 2: RL Policy Training with Reference Trajectories

The RL stage formulates each task as a goal-conditioned Markov Decision Process, where the agent receives proprioceptive observations, future reference trajectory frames, and privileged task-specific information. The action space comprises 27-DoF body joint angles and binary hand states. The reward function combines dense tracking terms (joint angles, keypoints, root position/orientation, hand states), smoothness penalties (torques, accelerations, action rate changes, foot sliding), and sparse task-specific rewards (e.g., object height for pick, button press proximity).
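
As a concrete illustration of this reward structure, the sketch below combines dense tracking terms, smoothness penalties, and a sparse task bonus for a Pick-style task. The term names, weights w, and functional forms are assumptions, not the paper's exact formulation.

```python
import torch

def step_reward(obs, act, ref, task, w):
    """Illustrative per-step reward: dense tracking + smoothness penalties + sparse task bonus."""
    # Dense tracking: stay close to the reference trajectory frame.
    r_joint = torch.exp(-w["joint"] * (obs["q"] - ref["q"]).square().sum(-1))
    r_keypt = torch.exp(-w["keypt"] * (obs["keypoints"] - ref["keypoints"]).square().sum((-2, -1)))
    r_root  = torch.exp(-w["root"] * (obs["root_pos"] - ref["root_pos"]).square().sum(-1))
    r_hand  = (obs["hand_state"] == ref["hand_state"]).float().mean(-1)

    # Smoothness: penalize large torques and rapid action changes.
    p_smooth = (w["torque"] * obs["torques"].square().sum(-1)
                + w["rate"] * (act - obs["prev_action"]).square().sum(-1))

    # Sparse task reward, e.g. the object is lifted above a target height for Pick.
    r_task = w["task"] * (obs["object_height"] > task["lift_height"]).float()

    return r_joint + r_keypt + r_root + r_hand - p_smooth + r_task
```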

Policies are trained in IsaacLab using PPO with 8192 parallel environments for 2000 iterations. Scene synthesis ensures that objects are placed at locations corresponding to interaction points in the reference trajectory, facilitating exploration and task completion.
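
The snippet below restates the reported training setup as an illustrative configuration and sketches how scene synthesis could place the object at the reference trajectory's interaction point. The field names and the spawn_object helper are hypothetical, not IsaacLab's actual API.

```python
# Illustrative configuration mirroring the numbers reported above (field names are assumptions).
train_cfg = {
    "algorithm": "PPO",
    "num_envs": 8192,            # parallel simulated environments
    "max_iterations": 2000,      # policy-update iterations
    "actor_hidden": (512, 256, 256),
    "critic_hidden": (512, 256, 256),
}

def synthesize_scene(env, ref_traj, interaction_frame):
    """Place the task object where the reference trajectory interacts with it,
    so sparse task rewards become reachable during exploration (hypothetical helper)."""
    target_pos = ref_traj["keypoints"][interaction_frame]["right_wrist"]
    env.spawn_object(env.task_object, position=target_pos)
```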

Experimental Results

Task Library and Baselines

DreamControl is evaluated on 11 tasks: Pick, Bimanual Pick, Pick from Ground (side/top grasp), Press Button, Open Drawer, Open Door, Precise Punch, Precise Kick, Jump, and Sit. Baselines include TaskOnly (sparse rewards), TaskOnly+ (sparse + engineered dense rewards), and TrackingOnly (tracking rewards).

Key findings:

  • TaskOnly achieves 0% success across all tasks, indicating that sparse rewards alone are insufficient for discovering meaningful whole-body motions.
  • TaskOnly+ improves on simple tasks but fails on those requiring coordinated whole-body movement (e.g., Pick from Ground, Jump).
  • TrackingOnly performs well on motion-centric tasks but struggles with fine-grained object interactions.
  • DreamControl outperforms all baselines, achieving the highest success rates on 9/11 tasks, with robust performance on both manipulation and loco-manipulation tasks.

Human-likeness and Motion Quality

Human-likeness is quantitatively assessed using Fréchet Inception Distance (FID) on HumanML3D, average absolute jerk, and a user study. DreamControl consistently yields lower FID and jerk values, indicating closer alignment with human motion and smoother trajectories. In the user study, 82–95% of participants preferred DreamControl's trajectories over TaskOnly+, confirming the qualitative improvement in naturalness.
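
For reference, average absolute jerk can be computed from sampled keypoint positions as the third finite difference divided by the cube of the sampling interval; the sketch below is a generic implementation (the 50 Hz rate in the usage comment is only an example).

```python
import numpy as np

def average_absolute_jerk(positions, dt):
    """Average absolute jerk (m/s^3) of a keypoint trajectory.

    positions: (T, 3) array of a keypoint's positions sampled every dt seconds.
    """
    jerk = np.diff(positions, n=3, axis=0) / dt**3  # third finite difference of position
    return np.abs(jerk).mean()

# Example usage for a trajectory sampled at 50 Hz:
# avg_jerk = average_absolute_jerk(wrist_positions, dt=0.02)
```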

Sim-to-Real Transfer

Policies are successfully deployed on the Unitree G1 robot for selected tasks after removing privileged observations and adapting the input space for real-world constraints. Object localization combines OWLv2 2D detections with depth estimation to recover 3D target positions. The lower body is frozen during interaction tasks to mitigate perception latency, and root velocity penalties are added for stability. DreamControl demonstrates effective sim-to-real transfer, with successful execution of pick, bimanual pick, button press, drawer opening, punch, and squat tasks.
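
A minimal sketch of the kind of perception pipeline described here: a 2D detection (e.g., from OWLv2) is back-projected into a 3D point using a depth image and the camera intrinsics. It assumes a standard pinhole model and a simple bounding-box format, and is not the paper's exact implementation.

```python
import numpy as np

def detection_to_camera_point(bbox, depth_image, K):
    """Back-project the center of a 2D detection into a 3D point in the camera frame.

    bbox: (x_min, y_min, x_max, y_max) in pixels (exact detector output format may differ).
    depth_image: per-pixel depth in meters.
    K: 3x3 camera intrinsics matrix.
    """
    u = int((bbox[0] + bbox[2]) / 2)
    v = int((bbox[1] + bbox[3]) / 2)
    z = depth_image[v, u]                    # depth at the detection center
    x = (u - K[0, 2]) * z / K[0, 0]          # (u - cx) * z / fx
    y = (v - K[1, 2]) * z / K[1, 1]          # (v - cy) * z / fy
    return np.array([x, y, z])               # transform into the robot base frame separately
```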

Implementation Details

  • Trajectory Generation: OmniControl is used in zero-shot mode, with task-specific text prompts and spatiotemporal guidance for interaction points. Retargeting is performed via PyRoki, optimizing keypoint positions and joint angles.
  • Filtering and Refinement: Heuristic-based filtering removes infeasible trajectories. Additional refinements ensure consistent initial poses and disable non-functional limbs for each task.
  • RL Architecture: Policies and critics are implemented as MLPs with hidden layers (512, 256, 256); a minimal sketch is shown after this list. Observations include proprioception, future reference frames, and task-specific information.
  • Reward Engineering: Dense tracking and smoothness terms are weighted per task. Sparse rewards are designed to be indicative of task success (e.g., object height, proximity to goal).
  • Environment Randomization: Object location, mass, and friction are randomized to promote robustness and generalization.
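
As referenced in the RL Architecture item above, here is a minimal PyTorch sketch of the (512, 256, 256) actor/critic MLPs; the observation/action sizes and the ELU activation are assumptions.

```python
import torch.nn as nn

class ActorCritic(nn.Module):
    """Minimal actor/critic MLPs with (512, 256, 256) hidden layers (activation assumed)."""
    def __init__(self, obs_dim, act_dim, hidden=(512, 256, 256)):
        super().__init__()
        def mlp(out_dim):
            layers, in_dim = [], obs_dim
            for h in hidden:
                layers += [nn.Linear(in_dim, h), nn.ELU()]
                in_dim = h
            layers.append(nn.Linear(in_dim, out_dim))
            return nn.Sequential(*layers)
        self.actor = mlp(act_dim)   # joint-angle targets and hand-state logits
        self.critic = mlp(1)        # state-value estimate used by PPO

    def forward(self, obs):
        return self.actor(obs), self.critic(obs)
```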

Implications and Future Directions

DreamControl demonstrates that guided diffusion priors over human motion can substantially improve RL-based whole-body humanoid control, enabling robust, natural, and transferable skills without reliance on teleoperation data. The approach reduces the need for reward engineering and facilitates sim-to-real transfer by promoting human-like motion plans.

Practical implications:

  • Enables scalable skill acquisition for humanoid robots using abundant human motion data.
  • Facilitates deployment in real-world environments with minimal adaptation.
  • Provides a framework for integrating generative models with RL for complex, long-horizon tasks.

Theoretical implications:

  • Validates the utility of multimodal generative priors for exploration and policy learning in high-dimensional, long-horizon control problems.
  • Highlights the importance of fine-grained spatiotemporal guidance for scene interaction and task completion.

Future work:

  • Extension to skill composition, dexterous manipulation, and complex object geometries.
  • Scaling to broader task repertoires and diverse robot morphologies.
  • Integration with vision-based policies and student-teacher distillation for end-to-end perception-action learning.

Conclusion

DreamControl establishes a data-efficient paradigm for autonomous whole-body humanoid control by combining guided diffusion priors with RL. The methodology achieves high success rates, superior human-likeness, and effective sim-to-real transfer across a diverse set of tasks. The results suggest that leveraging human motion priors via generative models is a promising direction for scalable, general-purpose humanoid robotics.

Explain it Like I'm 14

DreamControl: A simple explanation

Overview

This paper introduces DreamControl, a way to teach humanoid robots full-body skills—like opening a drawer, picking up a box, pressing a button, jumping, and sitting—so they move more like humans and can handle real-world scenes. The key idea is to combine two kinds of AI:

  • A motion “planner” that dreams up human-like movements (using a diffusion model trained on human motion).
  • Trial-and-error learning (reinforcement learning, or RL) that turns those plans into reliable robot actions.

Together, these let robots learn complex tasks that need both balance and precise arm/hand control.

What problems are they trying to solve?

In simple terms, the paper aims to answer:

  • How can a humanoid robot learn whole-body tasks (using legs, arms, and hands together) without collecting huge amounts of robot demonstration data?
  • How can we help robots figure out long, multi-step movements (like reaching down, gripping, then lifting) while still keeping balance and moving smoothly?
  • How can we make the robot’s motions look natural—more “human”—so skills learned in simulation work better in the real world?

How does DreamControl work?

Think of DreamControl in two stages. First, it “dreams” up good-looking, human-like motions. Then, it “practices” those motions until it can do them on its own.

Here’s the approach, explained with everyday ideas:

  • Stage 1: Dreaming up motion plans
    • A diffusion model (like an artist that creates motion sketches from hints) is trained on human movement data. It takes a text instruction (for example, “press the elevator button”) plus a hint about where and when, like “the right hand should touch this spot at 3 seconds.”
    • This produces a full human-like motion sequence that fits the scene (e.g., moving toward the button).
    • Because humans and robots have different bodies, the plan is “retargeted” (like tailoring clothes) to fit the robot’s shape and joint limits. Extra checks and fixes remove unrealistic motions.
  • Stage 2: Practicing via reinforcement learning (RL)
    • RL is like practicing a sport with scores. The robot tries the motion in a simulator and gets points (rewards) for:
      • Tracking the “dreamed” motion (so it looks natural and stays balanced).
      • Actually finishing the task (like successfully picking up an object).
    • The scene is set up to match the motion plan: if the plan says “touch this point at time t,” the simulator places the object near that location. This makes it easier for the robot to discover what actions lead to success.
    • Over many practice rounds, the robot learns a policy (a set of rules) that can complete the task reliably, without needing the motion model during the actual run.

Key terms in simple language:

  • Diffusion model: A generator that creates realistic motion by starting with rough guesses and refining them, guided by text and “touch this spot at this time” hints.
  • Prior: A helpful starting point or bias from what humans usually do, so the robot’s plan is sensible.
  • Retargeting: Adjusting a human motion to fit the robot’s body and joints.
  • RL (reinforcement learning): Learning by trial and error with rewards for good behavior.
  • Sim-to-real: Training in a virtual world (simulation) and then using that skill on a real robot.

What did they find, and why does it matter?

They tested DreamControl on a Unitree G1 humanoid robot in simulation and then on real hardware. The tasks included:

  • Picking up objects (single-hand and two-hand lifts)
  • Picking from the ground (side and top grasps)
  • Pressing a button
  • Opening a drawer and a door
  • Precise punch and kick
  • Jumping and sitting

What the results showed:

  • DreamControl consistently outperformed methods that used only task rewards or only tracking. On 9 of 11 tasks, it had the best success rates, often near or at 100% in simulation.
  • Robots moved more “human-like”:
    • Lower “jerk” (less jittery motion), meaning smoother movement.
    • Better similarity scores to human motion data.
    • In user studies, people overwhelmingly preferred DreamControl’s movements as more natural.
  • Real-world tests worked:
    • They deployed policies to a physical humanoid for several tasks (like bimanual picking and pressing a button).
    • They used a depth camera and an object detection model to find the target in 3D, then executed the learned policy. Because real-time vision can be slow, they kept some parts of the robot steady to stay safe and reliable.

Why it’s important:

  • It reduces the need for expensive robot demonstration data. Instead, it learns from human motion data, which is easier to get (e.g., motion capture or video).
  • It helps robots find workable, natural-looking solutions that simple trial-and-error often misses.
  • Natural motions transfer better from simulation to the real world and are safer around people.

What’s the bigger impact?

DreamControl is a promising step toward more capable humanoid assistants that can use their whole bodies to interact with the world. By “dreaming” human-like motion plans and “practicing” them until they’re robust, robots can learn a wide range of skills with less hand-tuned reward design and less demonstration data.

Future directions include:

  • Composing multiple skills into longer activities (e.g., open the drawer, then retrieve and carry an item).
  • More dexterous hand control (not just open/close).
  • Better perception and faster vision, so the robot can move dynamically while tracking objects.

Overall, DreamControl shows that combining guided motion generation with reinforcement learning makes humanoid robots more skilled, more natural, and more practical in real settings.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, consolidated list of gaps and open issues left unresolved by the paper that future work could address.

  • Lack of quantitative sim-to-real evaluation: the paper reports real-world deployments qualitatively (videos) but provides no success rates, error bars, or robustness statistics across trials, objects, and scenes; designing controlled benchmarks and reporting standardized metrics for sim-to-real transfer is needed.
  • Perception bottleneck and static object pose assumption: OWLv2 is run only on the first frame due to latency, forcing a fixed object estimate and freezing the lower body; evaluating and integrating low-latency detectors/trackers (e.g., distilled OWL, ASYNC tracking, depth-temporal filtering) to enable closed-loop perception and active base motion is an open direction.
  • Limited real-world task coverage: only a subset of the 11 simulated skills are deployed (e.g., no real-world jump, door opening, kick); extending deployment to all skills and analyzing task-specific failure modes would strengthen claims.
  • No end-to-end visuomotor policy: real hardware relies on hand-engineered perception-to-state pipelines; training vision-conditioned policies (teacher–student distillation, end-to-end from pixels) and quantifying gains vs. privileged-state policies are missing.
  • Reliance on binary hand control: hands are controlled as open/close; exploring dexterous manipulation (finger-level control, compliance, force control, tactile feedback) and evaluating on in-hand tasks and challenging grasps remains open.
  • Skill composition and hierarchy: the method trains per-task primitive skills; mechanisms for composing primitives (hierarchical RL, skill graphs, options) for long-horizon multi-step tasks (e.g., walk → reach → grasp → place) are not developed.
  • Object geometry and contact modeling: interactions use simple objects and limited friction/mass randomization; generalization to complex geometries (handles, deformables, articulated objects), contact-rich manipulation (force closure, compliance), and explicit contact planning is not addressed.
  • Locomotion-integrated manipulation: many deployments freeze the lower body; evaluating loco-manipulation that requires stepping, foot placement, balance recovery, or terrain variation (stairs, slopes) is not covered.
  • Domain gap from human priors: OmniControl is trained on HumanML3D (often free-space motions, limited object interactions); the impact of dataset biases and morphology differences (e.g., G1’s height causing shoulder-level picks) on policy performance is not quantified; fine-tuning the prior with robot-centric/object-interaction datasets is an open question.
  • Out-of-distribution task handling is ad hoc: for tasks outside the prior (e.g., drawer pulling), IK-based trajectory synthesis is used without a principled evaluation; a systematic method to adapt/extend the diffusion prior to novel tasks (e.g., via conditional fine-tuning, video-to-motion distillation) is needed.
  • Trajectory retargeting and feasibility: retargeting uses optimization with heuristic residuals, and trajectories are filtered post hoc; incorporating physics-aware constraints during generation (contact consistency, torque/velocity limits) or learning a feasibility classifier to pre-filter trajectories is not explored.
  • Reference trajectory dependence and time-indexing: the non-privileged (deployment) policy removes reference observations but adds time encoding; analyzing how time-indexing impacts reactivity, recovery, and variability (e.g., interruptions, moving targets) and whether policies can be fully event/state-driven is an open issue.
  • Reward design sensitivity: despite claims of reduced reward engineering, the approach still uses many tracking and smoothing terms plus task-specific sparse rewards; sensitivity analyses of reward weights, ablations of tracking terms, and automatic reward tuning (e.g., meta-RL, preference-based RL) are missing.
  • Guidance specification and automation: spatiotemporal guidance is manually designed per task; methods to infer guidance from perception (e.g., affordance prediction, keypoint detectors, language-grounded spatial targets) and evaluate guidance accuracy vs. task success are needed.
  • Prior selection and comparison: OmniControl is chosen without benchmarking against alternative priors (e.g., transformer autoregressive models, flow-matching motion generators, PULSE-like VAEs); comparative studies on guidance fidelity, diversity, sample efficiency, and downstream RL performance are absent.
  • Sample efficiency and compute costs: training uses PPO with 8192 parallel environments for 2000 iterations; reporting environment steps, wall-clock time, and comparing sample efficiency vs. baselines (with and without priors) would clarify scalability.
  • Long-horizon limits: trajectories are capped at ~9.8 s; evaluating tasks requiring tens of seconds and memory (e.g., navigation → manipulation → placement) and exploring recurrent/transformer critics/policies for long-horizon credit assignment is left open.
  • Stability, safety, and recovery: while foot sliding and smoothness are penalized, there is no explicit treatment of safety (fall recovery, collision avoidance), or formal stability constraints; developing safety-aware training and runtime monitors would improve reliability.
  • Scene complexity and clutter: environments appear relatively simple with a single target object; testing in cluttered scenes (occlusions, distractors), multi-object tasks, and variable lighting/sensing conditions is needed to validate robustness.
  • Generalization across morphologies: the method is validated on Unitree G1; assessing portability to different humanoids (e.g., Digit, GR1), varied hand morphologies, and non-humanoid manipulators via systematic retargeting studies remains open.
  • Language-conditioned autonomy: although the prior accepts text, the deployed system does not leverage language at inference; integrating language-conditioned planning and perception (text-to-guidance, open-vocabulary task execution) and evaluating task generalization across prompts is an open question.
  • Quantifying sim fidelity and model mismatch: the paper does not characterize IsaacSim → hardware dynamics gaps (joint friction, delays, compliance); systematic identification, calibration, and domain randomization strategies tailored to whole-body manipulation need evaluation.
  • Metrics for “human-ness”: FID on HumanML3D and jerk are used, but FID may not correlate with naturalness for robot morphology; exploring morphology-normalized metrics, contact-consistency scores, and human preference models trained on robot videos would yield more reliable assessments.
  • Robustness to perception noise and drift: there is no stress testing against pose estimation error, camera extrinsics miscalibration, occlusions, or moving targets; quantifying tolerance thresholds and adding closed-loop correction mechanisms (e.g., visual servoing) are important next steps.
  • Policy adaptability under distribution shift: policies appear time-scripted and may not adapt when the object or environment changes mid-episode; evaluating and enhancing adaptability (e.g., via online RL, meta-learning, or model-based replanning) is an open avenue.
  • Explicit evaluation of tracking vs. task rewards: the interplay between tracking the prior and achieving the task is not dissected (e.g., when tracking hurts task success or vice versa); diagnostics that measure deviation-from-prior vs. task completion would inform when to relax tracking.
  • Real-world contact forces and compliance: no measurements or limits on contact forces (e.g., during punch/kick, button press) are reported; incorporating force sensing, compliance control, and safety thresholds, and studying their impact on success and hardware wear are needed.
  • Automatic scene synthesis alignment: in sim, objects are placed using reference trajectory frames; devising perception-driven scene alignment that matches the guidance generation on hardware (e.g., estimating object pose, affordances, and offsets consistently) and studying mismatch effects is an open problem.

Glossary

  • Ablations: controlled experimental variations used to validate design choices or components. "alongside ablations that validate our design choices."
  • Bimanual manipulation: manipulation that requires coordinated control of both arms and hands simultaneously. "The long-horizon and high-dimensional nature of bimanual manipulation leads to a particularly challenging RL exploration problem"
  • Bottleneck VAE: a variational autoencoder with a constrained latent space used to learn compact representations. "taking the form of a bottleneck VAE that directly predicts actions"
  • Dense reward: a reward signal provided at many time steps, offering continuous guidance during learning. "trained with a dense reward for accurately tracking keypoints from an input trajectory"
  • Diffusion policies: policies parameterized by diffusion models to generate actions or trajectories. "diffusion policies~\cite{chi2023diffusion} (and related flow matching based approaches~\cite{black2410pi0}) have shown promise"
  • Diffusion prior: a generative prior over motions or actions learned via diffusion models. "the use of a diffusion prior trained on human motion data"
  • Diffusion transformer: a transformer-based diffusion model for sequence generation. "we thus use a diffusion transformer~\cite{peebles2023scalable,tevet2023human,chi2023diffusion}"
  • DoF (Degrees of Freedom): independent axes of motion or control in a robot mechanism. "Our simulated robot is a 27-DoF Unitree G1"
  • Domain gap: a mismatch between training and deployment data distributions that hinders generalization. "An exception occurs in the Pick task, which we conjecture is due to a domain gap:"
  • Flow matching: a generative modeling technique that matches probability flows instead of diffusion steps. "flow matching based approaches~\cite{black2410pi0}"
  • Fréchet Inception Distance (FID): a metric for comparing distributions of generated and real trajectories or images. "We report FID and jerk (m/s^3)"
  • Goal-conditioned RL: reinforcement learning where policies are conditioned on an explicit goal specification. "we train goal-conditioned RL policies to track these generated trajectories while completing some task of interest"
  • Human motion prior: a statistical prior learned from human motion data to guide planning or control. "first generating motion plans externally through a pre-trained human motion prior."
  • HumanML3D: a dataset of human motion used to train motion models. "Since OmniControl is trained on HumanML3d~\cite{Guo_2022_CVPR}"
  • Imitation Learning (IL): learning policies by mimicking expert demonstrations. "require teleoperated trajectories for IL."
  • Inpainting: filling or synthesizing missing parts of a sequence or image given constraints. "our trajectory generation stage is analogous to image or video inpainting."
  • Inverse kinematics (IK): computing joint configurations that achieve desired end-effector poses. "by using IK-based optimization"
  • Jerk: the third derivative of position; a measure of motion smoothness. "we calculate average absolute jerk"
  • Keypoints: specific 3D positions on a robot or human body used for tracking and control. "tracking keypoints from an input trajectory"
  • Loco-manipulation: combined locomotion and manipulation tasks requiring whole-body coordination. "whole-body manipulation and loco-manipulation tasks"
  • Long-horizon: tasks or planning problems that unfold over many seconds or steps. "which is a long-horizon problem, spanning up to tens of seconds."
  • Motion retargeting: adapting human motion trajectories to a robot’s morphology. "retarget these generated trajectories to the G1 form factor"
  • Multimodal (action distributions): distributions with multiple plausible modes reflecting diverse behaviors. "a natural fit for the multimodal nature of action distributions in manipulation"
  • On-policy RL: reinforcement learning that updates policies using data collected from the current policy. "There are also on-policy RL approaches trained in simulated environments"
  • Open-vocabulary object detection: detecting objects from free-form text labels not restricted to a fixed set. "an off-the-shelf open-vocabulary object detection model, OWLv2"
  • PPO (Proximal Policy Optimization): a popular on-policy RL algorithm for stable policy updates. "using PPO~\cite{schulman2017proximal}"
  • Privileged information: simulator-only state or physics data unavailable in real deployments. "a 'privileged' variant with access to internal simulator states"
  • Proprioception: internal sensing of joint positions, velocities, and body dynamics. "we include proprioception information"
  • Quaternions: a four-parameter representation of 3D orientation used to avoid singularities. "the orientation of the root in quaternions"
  • Reward engineering: designing reward functions to guide RL toward desired behaviors. "which is very hard without careful reward engineering"
  • Sim-to-real transfer: deploying policies learned in simulation to real robots successfully. "aiding in sim-to-real transfer."
  • Spatiotemporal guidance: constraints specifying where and when parts of the body should be during motion generation. "spatiotemporal guidance (e.g., enforcing a wrist position at a specific time)"
  • Student-teacher distillation: training a compact policy (student) to imitate a stronger model (teacher). "vision-based policies could be trained via student-teacher distillation (e.g.,~\cite{lin2025sim})"
  • Teleoperation: controlling a robot remotely by a human operator to collect trajectories. "collecting large teleoperation data can be labor-intensive and difficult to scale"
  • Tracking rewards: rewards that encourage following a reference trajectory closely. "TrackingOnly: only tracking rewards."
  • Underactuation: systems with fewer control inputs than degrees of freedom, limiting direct control. "due to high degrees of freedom, underactuation, and a high center of mass."
  • Zero-shot: using a model on new tasks without task-specific finetuning. "in a 'zero-shot' fashion"
