
MimicDroid: In-Context Learning for Humanoid Robot Manipulation from Human Play Videos (2509.09769v1)

Published 11 Sep 2025 in cs.RO

Abstract: We aim to enable humanoid robots to efficiently solve new manipulation tasks from a few video examples. In-context learning (ICL) is a promising framework for achieving this goal due to its test-time data efficiency and rapid adaptability. However, current ICL methods rely on labor-intensive teleoperated data for training, which restricts scalability. We propose using human play videos -- continuous, unlabeled videos of people interacting freely with their environment -- as a scalable and diverse training data source. We introduce MimicDroid, which enables humanoids to perform ICL using human play videos as the only training data. MimicDroid extracts trajectory pairs with similar manipulation behaviors and trains the policy to predict the actions of one trajectory conditioned on the other. Through this process, the model acquired ICL capabilities for adapting to novel objects and environments at test time. To bridge the embodiment gap, MimicDroid first retargets human wrist poses estimated from RGB videos to the humanoid, leveraging kinematic similarity. It also applies random patch masking during training to reduce overfitting to human-specific cues and improve robustness to visual differences. To evaluate few-shot learning for humanoids, we introduce an open-source simulation benchmark with increasing levels of generalization difficulty. MimicDroid outperformed state-of-the-art methods and achieved nearly twofold higher success rates in the real world. Additional materials can be found on: ut-austin-rpl.github.io/MimicDroid

Summary

  • The paper introduces a meta-training framework that uses context-target pairs from human play videos for few-shot humanoid manipulation.
  • The methodology employs similarity-based context retrieval and visual masking to bridge the human-robot visual gap and enhance action prediction.
  • Experimental results show significant improvements in success rates and reduced grasping errors across simulation benchmarks and real-world evaluations.

MimicDroid: In-Context Learning for Humanoid Robot Manipulation from Human Play Videos

Introduction and Motivation

MimicDroid addresses the challenge of enabling humanoid robots to learn manipulation tasks from only a few examples by leveraging in-context learning (ICL) from human play videos. The approach is motivated by the need for scalable, diverse training data and rapid adaptation to novel objects and environments, requirements that traditional teleoperated robot demonstrations cannot meet due to their high cost and limited diversity. Human play videos, consisting of continuous, unscripted interactions, provide a rich source of task-agnostic data capturing a wide range of manipulation behaviors and object configurations.

Methodology

MimicDroid's core innovation is the use of meta-training for ICL, where context-target pairs are constructed from human play videos. For each target segment, the method retrieves the top-k most similar trajectory segments based on observation-action similarity, forming the context for in-context learning. The policy is trained to predict the actions of the target segment conditioned on the context, instilling the ability to exploit recurring observation-action patterns for rapid adaptation (Figure 1). The retrieval step is sketched below.

Figure 1: MimicDroid's meta-training pipeline constructs context-target pairs from human play videos, applies visual masking to bridge the human-robot visual gap, and enables in-context learning for humanoid manipulation.
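
To make the pair-construction step concrete, the following minimal sketch retrieves context segments for a given target segment. The mean-pooled segment embedding, cosine-similarity metric, and function names are illustrative assumptions, not the paper's exact observation-action similarity.

```python
import numpy as np

def segment_embedding(segment):
    """Mean-pool the per-timestep features of a segment.

    `segment` is assumed to be a (T, D) array of concatenated visual
    features and wrist-pose/action features; mean pooling is an
    illustrative choice.
    """
    return segment.mean(axis=0)

def retrieve_context(target_segment, candidate_segments, k=10):
    """Return the top-k candidates most similar to the target segment."""
    query = segment_embedding(target_segment)
    keys = np.stack([segment_embedding(s) for s in candidate_segments])
    query = query / (np.linalg.norm(query) + 1e-8)
    keys = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + 1e-8)
    scores = keys @ query                    # cosine similarity (assumed metric)
    top_idx = np.argsort(-scores)[:k]
    return [candidate_segments[i] for i in top_idx]

def build_meta_training_pair(segments, target_idx, k=10):
    """Form one (context, target) meta-training sample.

    The target segment itself is excluded from the candidate pool so the
    policy cannot trivially copy it.
    """
    target = segments[target_idx]
    candidates = [s for i, s in enumerate(segments) if i != target_idx]
    return retrieve_context(target, candidates, k=k), target
```

At test time the same conditioning mechanism applies, except the context comes from the few demonstration videos provided for the new task rather than from retrieved play segments.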

To bridge the embodiment gap, MimicDroid retargets human wrist poses estimated from RGB videos to the humanoid robot, leveraging kinematic similarity. Random patch masking is applied during training to reduce overfitting to human-specific cues, improving transferability to the robot embodiment (Figure 2). The masking transform is sketched below.

Figure 2: Examples of target and retrieved context segments, illustrating the similarity-based retrieval mechanism for constructing in-context learning samples.
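
The masking augmentation can be written as a simple image transform. The snippet below follows the reported ranges (1–16 patches, applied with probability 0.8), but the patch size and zero fill value are assumptions rather than the paper's exact settings.

```python
import numpy as np

def random_patch_mask(image, max_patches=16, patch_size=32, p=0.8, rng=None):
    """Zero out a few random square patches of an (H, W, C) image.

    With probability `p`, between 1 and `max_patches` patches are masked;
    otherwise the image is returned unchanged. Patch size and the zero
    fill value are illustrative assumptions.
    """
    rng = rng or np.random.default_rng()
    if rng.random() > p:
        return image
    masked = image.copy()
    h, w = image.shape[:2]
    n_patches = rng.integers(1, max_patches + 1)
    for _ in range(n_patches):
        y = rng.integers(0, max(1, h - patch_size))
        x = rng.integers(0, max(1, w - patch_size))
        masked[y:y + patch_size, x:x + patch_size] = 0
    return masked
```

Because the masking is agnostic to what it occludes, it requires no external hand-segmentation module, in contrast to hand-specific masking strategies.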

Simulation Benchmark and Evaluation Protocol

A novel simulation benchmark is introduced to systematically evaluate few-shot learning for humanoid manipulation. The benchmark comprises 8 hours of play data across 30 objects and 8 kitchen environments, with evaluation structured into three levels of increasing difficulty:

  • L1: Seen objects and environments
  • L2: Unseen objects, seen environments
  • L3: Unseen objects and environments

This structure enables rigorous assessment of generalization across both the abstract (free-floating hand) and humanoid (GR1) embodiments (Figure 3). The evaluation protocol is sketched below.

Figure 3: Simulation benchmark overview, showing the diversity of objects, environments, and the systematic evaluation protocol for generalization.
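
The protocol itself reduces to a loop over levels and tasks with a fixed number of few-shot context examples per task. The sketch below captures only this structure; the rollout and context-sampling callables are placeholders, not the released benchmark's API.

```python
from typing import Callable, Dict, List

# Generalization levels of the benchmark, as defined in the paper.
LEVELS: Dict[str, Dict[str, str]] = {
    "L1": {"objects": "seen", "environments": "seen"},
    "L2": {"objects": "unseen", "environments": "seen"},
    "L3": {"objects": "unseen", "environments": "unseen"},
}

def evaluate_few_shot(
    rollout: Callable[[str, str, List], bool],
    tasks_per_level: Dict[str, List[str]],
    sample_context: Callable[[str, int], List],
    n_context: int = 3,
    episodes_per_task: int = 10,
) -> Dict[str, float]:
    """Generic few-shot evaluation loop over the three levels.

    `rollout(level, task, context)` runs one episode conditioned on the
    few-shot context clips and returns True on success;
    `sample_context(task, n)` supplies the context examples. Both
    callables stand in for whatever simulator or robot stack is used.
    """
    results = {}
    for level, tasks in tasks_per_level.items():
        assert level in LEVELS
        successes, total = 0, 0
        for task in tasks:
            context = sample_context(task, n_context)
            for _ in range(episodes_per_task):
                successes += bool(rollout(level, task, context))
                total += 1
        results[level] = successes / max(total, 1)
    return results
```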

Implementation Details

The policy model is built on a long-context transformer backbone with modality-specific encoders for vision and proprioception. Training samples consist of three context trajectories and one target trajectory, with each trajectory processed into observation-action pairs using hand pose estimation (WiLoR) and visual feature extraction (DINOv2). Random patch masking is applied to input images with high probability, and the policy predicts action chunks to model the multimodal nature of human play data.

Key implementation parameters are listed below, followed by a schematic training step:

  • Context retrieval: Top-k similar segments (k = 10 for training; k varied in ablations)
  • Action prediction: 32-step action chunks per timestep
  • Visual masking: 1–16 random patches per image, applied with probability p = 0.8
  • Training hardware: 8×A5000 GPUs, 200 epochs
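
A schematic training step, assuming a PyTorch-style policy and a frozen DINOv2-like vision encoder, might look as follows. The tensor shapes, interfaces, and plain L2 chunk loss are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn.functional as F

CHUNK = 32  # length of the action chunk predicted at each target timestep

def training_step(policy, vision_encoder, batch, optimizer):
    """One meta-training step on a (context, target) sample.

    `batch` is assumed to hold masked context/target images, target
    proprioception, context actions, and retargeted wrist-pose action
    chunks for the target; `vision_encoder` stands in for a frozen
    DINOv2-style backbone and `policy` for the long-context transformer.
    All interfaces and shapes are illustrative.
    """
    # Encode context and target observations into token sequences.
    ctx_tokens = vision_encoder(batch["context_images"])   # e.g. (B, N_ctx*T, D)
    tgt_tokens = vision_encoder(batch["target_images"])    # e.g. (B, T, D)

    # The transformer conditions target action prediction on the full
    # context (observations and actions) plus target proprioception.
    pred_chunks = policy(
        context_obs=ctx_tokens,
        context_actions=batch["context_actions"],
        target_obs=tgt_tokens,
        target_proprio=batch["target_proprio"],
    )                                                       # e.g. (B, T, CHUNK, A)
    assert pred_chunks.shape[-2] == CHUNK

    # Plain L2 regression on the ground-truth action chunks (an assumed
    # objective; the paper may use a different loss).
    loss = F.mse_loss(pred_chunks, batch["target_action_chunks"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference, the same forward pass is used, with the few provided demonstration videos serving as the context.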

Experimental Results

Generalization and Few-Shot Learning

MimicDroid demonstrates superior generalization through ICL compared to task-conditioned baselines (Vid2Robot, H2R) and parameter-efficient fine-tuning (PEFT). In simulation, MimicDroid achieves success rates of up to 0.73 in L1 (abstract), 0.44 in L2 (humanoid), and 0.27 in L3 (abstract), outperforming all baselines. In real-world evaluations, MimicDroid attains a nearly twofold improvement over Vid2Robot, with success rates of 0.53 (L1), 0.23 (L2), and 0.08 (L3) (Figure 4).

Figure 4: MimicDroid achieves higher success rates and reduces task misidentification and grasping errors compared to baselines, highlighting the efficacy of in-context learning.

Ablation Studies

  • Number of in-context examples: Performance increases with more context examples, plateauing beyond three due to training-time context-length constraints.
  • Context retrieval (k): Performance benefits from more retrieved segments but degrades when high k introduces excessive noise.
  • Training data scaling: Success rates improve consistently as training data grows, with L1 and L2 showing +24% absolute gains when scaling from 128k to 320k frames (Figure 5).

Figure 5: Performance trends with number of in-context examples, context retrieval size, and training data scaling, demonstrating the importance of data diversity and context selection.

Visual Gap Bridging

Visual masking is critical for transferability. Removing masking results in a sharp performance drop (−17%) when transferring to the humanoid embodiment, compared to only −3% with masking. Random patch masking matches the performance of hand-specific masking strategies (EgoMimic) without requiring external segmentation modules.

Failure Analysis

Failure cases are dominated by task misidentification (26%), missed grasps (16%), and other errors (8%). MimicDroid reduces both misidentification and grasping errors compared to Vid2Robot, but struggles with novel motion sequences in L3 and overfits to specific hand sizes in cluttered environments.

Real-World Generalization

MimicDroid generalizes to both seen and unseen objects and environments in real-world evaluations, performing pick-and-place and articulated-object manipulation tasks with high data efficiency (Figure 6).

Figure 6: Real-world evaluation examples, showing MimicDroid's ability to generalize to novel objects and environments using few-shot in-context learning.

Limitations and Future Directions

Current limitations include reliance on high-quality human play videos, action extraction via hand pose estimation (which fails under occlusion), and lack of semantic generalization across task variants. Future work should explore augmentation with web-scale human videos, integration of full-body motion estimation, and meta-training with language-trajectory pairs to enable semantic generalization.

Conclusion

MimicDroid establishes a scalable framework for few-shot humanoid manipulation via in-context learning from human play videos. By leveraging meta-training on context-target pairs, retargeting human wrist poses, and applying visual masking, MimicDroid achieves strong generalization and data efficiency in both simulation and real-world settings. The method sets a foundation for future research in adaptive robot assistants capable of learning from diverse, unscripted human interactions.
