MaskedManipulator: Versatile Whole-Body Control for Loco-Manipulation (2505.19086v1)

Published 25 May 2025 in cs.RO, cs.AI, and cs.GR

Abstract: Humans interact with their world while leveraging precise full-body control to achieve versatile goals. This versatility allows them to solve long-horizon, underspecified problems, such as placing a cup in a sink, by seamlessly sequencing actions like approaching the cup, grasping, transporting it, and finally placing it in the sink. Such goal-driven control can enable new procedural tools for animation systems, enabling users to define partial objectives while the system naturally "fills in" the intermediate motions. However, while current methods for whole-body dexterous manipulation in physics-based animation achieve success in specific interaction tasks, they typically employ control paradigms (e.g., detailed kinematic motion tracking, continuous object trajectory following, or direct VR teleoperation) that offer limited versatility for high-level goal specification across the entire coupled human-object system. To bridge this gap, we present MaskedManipulator, a unified and generative policy developed through a two-stage learning approach. First, our system trains a tracking controller to physically reconstruct complex human-object interactions from large-scale human mocap datasets. This tracking controller is then distilled into MaskedManipulator, which provides users with intuitive control over both the character's body and the manipulated object. As a result, MaskedManipulator enables users to specify complex loco-manipulation tasks through intuitive high-level objectives (e.g., target object poses, key character stances), and MaskedManipulator then synthesizes the necessary full-body actions for a physically simulated humanoid to achieve these goals, paving the way for more interactive and life-like virtual characters.

Summary

MaskedManipulator: Versatile Whole-Body Control for Loco-Manipulation

The paper "MaskedManipulator: Versatile Whole-Body Control for Loco-Manipulation" addresses whole-body loco-manipulation in physics-based animation: controlling a simulated humanoid that must both move through its environment and dexterously handle objects. The authors present MaskedManipulator, a unified generative policy built in two stages. A tracking controller is first trained to reproduce human motion capture data in simulation, and it is then distilled into a policy that can be driven by sparse, high-level goals.

Overview

Simulated humanoid agents must combine whole-body locomotion with fine object manipulation, and both need to be precise. Existing methods often generalize poorly across tasks because a high-level goal admits a broad space of valid solutions, while each solution must still be executed with the physical precision that intricate human-object interaction demands.

MaskedManipulator is designed to overcome these challenges by integrating spatiotemporal goal-conditioning for both the humanoid and the manipulated objects. The solution builds on human demonstrations, specifically from the GRAB dataset, enabling it to exhibit complex interaction sequences such as grasping, object relocation, and hand-to-hand transfers.

Technical Approach

1. MimicManipulator

The first stage, MimicManipulator, is a full-information physics-based tracking controller trained using reinforcement learning (RL). This system learns from the rich kinematic data of human-object interactions provided by motion capture, aiming to physically reconstruct these actions with high fidelity. The training incorporates robust reward formulations to ensure dynamic feasibility and emphasize nuances of object handling.

Reward Configuration: The reward function is designed to rigorously penalize discrepancies between simulated outcomes and reference motions, focusing on translation, rotation, contact positions, and velocities.
Prioritized Training: Training uses mechanisms such as prioritized sampling to revisit complex and previously failed sequences more often, improving robustness across diverse interaction tasks.
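
The two ideas above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the exponential-of-error form, the term weights and scales, the dictionary keys, and the failure-rate floor are all assumptions chosen for illustration. The reward compares simulated and reference states on translation, rotation, contact positions, and velocities, and the sampler draws motion clips in proportion to how often they currently fail.

```python
import numpy as np

# Hypothetical weights and error scales; the summary does not give the paper's values.
WEIGHTS = {"position": 0.3, "rotation": 0.3, "contact": 0.2, "velocity": 0.2}
SCALES = {"position": 100.0, "rotation": 10.0, "contact": 100.0, "velocity": 0.1}


def quat_angle(q_sim, q_ref):
    """Rotation angle in radians between unit quaternions of shape (..., 4)."""
    dot = np.abs(np.sum(q_sim * q_ref, axis=-1))
    return 2.0 * np.arccos(np.clip(dot, 0.0, 1.0))


def mean_sq_dist(a, b):
    """Mean squared Euclidean distance between rows of a and b."""
    return np.mean(np.sum((a - b) ** 2, axis=-1))


def tracking_reward(sim, ref):
    """Weighted sum of exp(-scale * error) terms; equals 1.0 for perfect tracking."""
    errors = {
        "position": mean_sq_dist(sim["position"], ref["position"]),
        "rotation": np.mean(quat_angle(sim["rotation"], ref["rotation"]) ** 2),
        "contact": mean_sq_dist(sim["contact"], ref["contact"]),
        "velocity": mean_sq_dist(sim["velocity"], ref["velocity"]),
    }
    return float(sum(WEIGHTS[k] * np.exp(-SCALES[k] * errors[k]) for k in WEIGHTS))


def sampling_probs(failure_rates, floor=0.05):
    """Clip selection probabilities proportional to failure rate, with a floor
    so that clips the controller already tracks well are still revisited."""
    w = np.maximum(np.asarray(failure_rates, dtype=float), floor)
    return w / w.sum()


def sample_clip(failure_rates, rng):
    """Draw one clip index according to the prioritized distribution."""
    p = sampling_probs(failure_rates)
    return int(rng.choice(len(p), p=p))
```

In this sketch a clip's failure rate would be tracked during training, for example as a running average of episode outcomes, and passed to `sample_clip` whenever a new episode is reset.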

2. MaskedManipulator

The second stage distills the learned expertise of MimicManipulator into MaskedManipulator, which is trained via online teacher-student distillation. During distillation, parts of the goal specification are masked, so the resulting policy can be controlled with sparse objectives.
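
A minimal sketch of the masking idea follows, under stated assumptions: goals are stored as an array of shape `(frames, targets, dim)`, each entry is kept independently with probability `keep_prob`, and the student is regressed onto the teacher's action with a mean squared error. The function names, the Bernoulli masking scheme, and the loss are hypothetical and are not taken from the paper.

```python
import numpy as np


def sample_goal_mask(num_frames, num_targets, keep_prob, rng):
    """Boolean mask over (future frame, target) pairs; True means the goal is visible."""
    mask = rng.random((num_frames, num_targets)) < keep_prob
    if not mask.any():
        # Keep at least one goal so the student is never left without an objective.
        mask[rng.integers(num_frames), rng.integers(num_targets)] = True
    return mask


def visible_goals(goals, mask):
    """List the unmasked goals as (frame index, target index, target value)."""
    frames, targets = np.nonzero(mask)
    return [(int(f), int(k), goals[f, k]) for f, k in zip(frames, targets)]


def distillation_loss(student_action, teacher_action):
    """Mean squared error between the student's and the teacher's action."""
    diff = np.asarray(student_action) - np.asarray(teacher_action)
    return float(np.mean(diff ** 2))
```

In an online setup of this kind, the student would act in the simulator from `visible_goals(...)` only, while the teacher, which sees the full reference, labels the states the student visits; the loss then pulls the student's action toward the teacher's.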

Policy Architecture: MaskedManipulator uses a transformer that encodes each goal as a distinct token, which lets it handle a variable number of goals. The paper compares three policy variants: deterministic, Conditional Variational Autoencoder (C-VAE), and diffusion, each with different trade-offs in versatility and generalization.
Generative Control: The diffusion policy can generate novel, physically plausible behaviors, which makes humanoid control more practical in scenarios not covered by the training data.
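
To make the generative variant more concrete, here is a sketch of how a diffusion policy could sample an action, using the standard DDPM ancestral sampling rule. This is a generic textbook procedure, not the paper's implementation: the linear noise schedule, the number of steps, and the `denoiser(x, t, goal_tokens)` interface are assumptions, and in practice the denoiser would be a trained network conditioned on the goal tokens.

```python
import numpy as np


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumed choice) and its cumulative alpha products."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def sample_action(denoiser, goal_tokens, action_dim, num_steps=50, rng=None):
    """Draw one action by iteratively denoising Gaussian noise (DDPM sampling).

    `denoiser(x, t, goal_tokens)` is a hypothetical callable that predicts the
    noise contained in `x` at step `t`, conditioned on the goal tokens.
    """
    rng = np.random.default_rng() if rng is None else rng
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = rng.standard_normal(action_dim)  # start from pure noise
    for t in reversed(range(num_steps)):
        eps_hat = denoiser(x, t, goal_tokens)
        # Posterior mean of the previous, less noisy sample.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        # Add fresh noise on every step except the last one.
        noise = rng.standard_normal(action_dim) if t > 0 else np.zeros(action_dim)
        x = mean + np.sqrt(betas[t]) * noise
    return x
```

Because each call starts from fresh noise, repeated sampling with the same sparse goals can yield different actions, which is the property that allows a stochastic policy to cover several valid ways of reaching an underspecified goal.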

Results

The reported results show success on complex, chained manipulation tasks, including teleoperation-style pose matching and long-horizon chaining of sparse goals. The diffusion-based policy stands out in generalization: it synthesizes human-like actions from under-specified goals while maintaining high success rates relative to the deterministic and other stochastic models.

Implications and Future Directions

The MaskedManipulator framework is relevant to both character animation and robotics. Because it generates diverse behaviors from sparse high-level goals, it suits interactive environments that need lifelike, adaptive humanoid characters.

Two directions for future work are finer control granularity and broader reconstruction coverage. Progress on these could give users control over specific manipulation strategies, such as specifying exact contact locations on an object, and would broaden the usability of MaskedManipulator in both animation and real-world robotics applications.

Overall, the method provides a foundation for realistic, adaptable whole-body humanoid control, with room for later integration of additional sensory inputs and more complex interaction environments.