SafeMimic: Towards Safe and Autonomous Human-to-Robot Imitation for Mobile Manipulation (2506.15847v1)

Published 18 Jun 2025 in cs.RO and cs.AI

Abstract: For robots to become efficient helpers in the home, they must learn to perform new mobile manipulation tasks simply by watching humans perform them. Learning from a single video demonstration from a human is challenging as the robot needs to first extract from the demo what needs to be done and how, translate the strategy from a third to a first-person perspective, and then adapt it to be successful with its own morphology. Furthermore, to mitigate the dependency on costly human monitoring, this learning process should be performed in a safe and autonomous manner. We present SafeMimic, a framework to learn new mobile manipulation skills safely and autonomously from a single third-person human video. Given an initial human video demonstration of a multi-step mobile manipulation task, SafeMimic first parses the video into segments, inferring both the semantic changes caused and the motions the human executed to achieve them and translating them to an egocentric reference. Then, it adapts the behavior to the robot's own morphology by sampling candidate actions around the human ones, and verifying them for safety before execution in a receding horizon fashion using an ensemble of safety Q-functions trained in simulation. When safe forward progression is not possible, SafeMimic backtracks to previous states and attempts a different sequence of actions, adapting both the trajectory and the grasping modes when required for its morphology. As a result, SafeMimic yields a strategy that succeeds in the demonstrated behavior and learns task-specific actions that reduce exploration in future attempts. Our experiments show that our method allows robots to safely and efficiently learn multi-step mobile manipulation behaviors from a single human demonstration, from different users, and in different environments, with improvements over state-of-the-art baselines across seven tasks.

Summary

  • The paper presents SafeMimic, a novel framework enabling mobile manipulation robots to safely and autonomously learn complex skills from a single human video demonstration.
  • SafeMimic processes video demos using vision-language models and human tracking, employs safety Q-functions for safe exploration with backtracking, and refines actions via a policy memory.
  • Experimental results show SafeMimic outperforms baselines in safety and efficiency across diverse tasks and environments, significantly reducing unsafe actions and the need for extensive robot training.

Overview of SafeMimic: Autonomous Learning for Mobile Manipulation Through Human-to-Robot Imitation

The paper, "SafeMimic: Towards Safe and Autonomous Human-to-Robot Imitation for Mobile Manipulation," presents a novel framework designed to safely and autonomously enable robots to learn mobile manipulation skills from a single third-person human video demonstration. With the objective of facilitating robots as efficient helpers in domestic settings, SafeMimic addresses the challenging task of imitation through an advanced methodology that mitigates the need for extensive human supervision.

Core Components and Methodology

SafeMimic operates by first parsing a human video demonstration into distinct segments, which it further processes to deduce the semantic changes and associated human motions. The methodology comprises the following stages:

  1. Video Parsing and Translation: The human demonstration is divided into navigation and manipulation segments using human motion tracking and vision-language models (VLMs). This step derives both the intended task and the sequence of physical actions. The parsing stage thus generates an initial motion plan while translating the third-person observations into first-person (egocentric) actions suited to the robot's morphology.
  2. Safe Exploration and Adaptation: SafeMimic uses an ensemble of safety Q-functions, pre-trained in simulation, to verify candidate actions sampled around the parsed human actions before execution, within a receding-horizon planning loop. Unique to this approach is the capability to backtrack upon recognizing dead-ends, which lets the robot autonomously attempt alternative strategies.
  3. Action Refinement and Learning: Output from the exploration stage feeds into a policy memory module, where SafeMimic records successful action sequences, so that subsequent attempts require less exploration. Learning and adaptation are further informed by the actions and semantic changes extracted during video parsing. A simplified sketch of the exploration and memory loop from stages 2 and 3 follows this list.
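The loop described in stages 2 and 3 can be made concrete with a short sketch. The Python below is a minimal illustration rather than the authors' implementation: the `SafetyQEnsemble`, `robot`, and `policy_memory` interfaces are hypothetical placeholders, and the candidate sampling, success check, and backtracking policy are simplified assumptions based on the description above.

```python
import numpy as np


class SafetyQEnsemble:
    """Hypothetical ensemble of safety Q-functions trained in simulation."""

    def __init__(self, models, threshold=0.9):
        self.models = models        # each maps (state, action) -> estimated safety score
        self.threshold = threshold

    def is_safe(self, state, action):
        # Conservative check: every ensemble member must judge the action safe.
        scores = [m(state, action) for m in self.models]
        return min(scores) >= self.threshold


def safe_mimic_rollout(segments, robot, ensemble, policy_memory,
                       n_candidates=16, noise_scale=0.05, max_backtracks=3):
    """Receding-horizon imitation with safety verification and backtracking.

    `segments` holds the per-segment reference actions recovered from the human
    video, already translated to the robot's egocentric frame.
    """
    history = []                    # (state, segment index) checkpoints for backtracking
    i, backtracks = 0, 0
    while i < len(segments):
        state = robot.get_state()
        reference = segments[i]

        # Replay a previously successful action for this segment, if one is stored.
        cached = policy_memory.get(i)
        candidates = ([cached] if cached is not None else []) + [
            reference + np.random.normal(0.0, noise_scale, size=reference.shape)
            for _ in range(n_candidates)
        ]

        # Keep only candidates the safety ensemble verifies before execution.
        safe_actions = [a for a in candidates if ensemble.is_safe(state, a)]

        if safe_actions:
            action = safe_actions[0]
            history.append((state, i))
            robot.execute(action)
            if robot.segment_succeeded(i):
                policy_memory[i] = action   # remember what worked for future attempts
                i += 1
            continue

        # No safe forward progression: backtrack to an earlier checkpoint and retry.
        if history and backtracks < max_backtracks:
            state, i = history.pop()
            robot.reset_to(state)
            backtracks += 1
        else:
            return False                    # stop rather than risk an unsafe action
    return True
```

Taking the minimum over ensemble members before thresholding is one conservative way to aggregate safety estimates; the paper's exact aggregation, thresholds, and backtracking criteria may differ.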

Experimental Validation and Results

The framework was evaluated across seven multi-step mobile manipulation tasks in diverse environments. The experiments show that SafeMimic surpasses baseline methods in both efficiency and safety, and that it adapts to variations in task settings, human demonstrators, and environments. Importantly, the safety Q-functions significantly decreased the incidence of unsafe actions by predicting potential risks before execution and selecting safe action sequences.

Implications and Future Directions

SafeMimic represents an advancement in the capability of robots to learn complex tasks from minimal human input, with significant implications for enhancing robotic autonomy in human environments. By achieving skill acquisition from single demonstrations, the framework advances towards reducing the cost and complexity traditionally associated with robot training regimes.

Future research directions might include expanding the state space of safety Q-functions to accommodate additional failure modes and exploring integration with other learning paradigms to enhance generalization across broader task domains. Furthermore, leveraging large-scale simulated environments for pretraining, while refining real-world adaptation procedures, could open avenues for broader applicability in unstructured settings.

In conclusion, SafeMimic marks a step forward in artificial intelligence, bridging the gap between human demonstrations and autonomous robotic capabilities, while ensuring safety and reliability in real-world operation. This research reflects a promising trajectory towards intelligent, adaptable, and efficient robotic systems, pivotal for future advancements in service robotics.