- The paper introduces HAPO, a framework that integrates human feedback to distinguish desirable from undesirable actions in Vision-Language-Action models.
- It employs an adaptive reweighting algorithm using binary desirability signals to refine action preference optimization for better task performance.
- Experimental results demonstrate HAPO's superior generalization and robustness, highlighting its potential for continuous improvement in dynamic robotic environments.
Overview of Robotic Policy Learning via Human-assisted Action Preference Optimization
The paper, "Robotic Policy Learning via Human-assisted Action Preference Optimization," offers a novel approach to refining Vision-Language-Action (VLA) models, crucial for the effective deployment of robotic systems in real-world environments. The authors highlight the limitations of traditional VLA models, notably their reliance on expert demonstrations, which inhibits the systems' capacity for correction and adaptation from failures. This paper introduces the Human-assisted Action Preference Optimization (HAPO) method to address these challenges.
Key Contributions
The HAPO framework comprises two core components designed to enhance the performance and learning capacity of VLA models:
- Human-Robot Collaboration Framework: This component enables reliable task execution through real-time human interventions. By employing a human-in-the-loop mechanism, the system gathers interaction trajectories that feed into an action preference optimization process. Notably, actions are annotated as desirable or undesirable based on human intervention, providing a richer dataset for model training.
- Action Preference Optimization with Adaptive Reweighting: The paper proposes an adaptive reweighting algorithm to tackle the issues of irreversible interactions and token probability mismatches in autoregressive VLA models. The optimization process leverages binary desirability signals and draws on Kahneman and Tversky's prospect theory to formulate a preference alignment objective. This allows VLA models to learn from sub-optimal human interventions, promoting failure avoidance and the adoption of corrective actions (a minimal sketch of such an objective appears after this list).
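The paper's exact objective is not reproduced here, but a KTO-style formulation with class-balanced reweighting gives a sense of how binary desirability signals can be turned into a loss. The function and hyperparameter names below (`hapo_loss`, `beta`, `w_desirable`, `w_undesirable`) and the inverse-frequency reweighting scheme are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a KTO-style preference objective with adaptive reweighting
# for an autoregressive VLA policy. Names and the reweighting scheme are
# illustrative assumptions, not the paper's exact formulation.
import torch

def hapo_loss(policy_logps, ref_logps, desirable, beta=0.1,
              w_desirable=1.0, w_undesirable=1.0):
    """
    policy_logps: (B,) summed action-token log-probs under the current policy
    ref_logps:    (B,) the same quantity under a frozen reference policy
    desirable:    (B,) binary labels derived from human intervention (1 = desirable)
    """
    # Implicit reward: log-ratio of the policy against the reference.
    reward = policy_logps - ref_logps

    # Reference point z0, approximated by the detached batch mean of the
    # log-ratio (a stand-in for the KL baseline used in KTO-style objectives).
    z0 = reward.mean().detach().clamp(min=0)

    # Prospect-theory-style value: desirable samples are pushed above z0,
    # undesirable samples below it.
    value = torch.where(
        desirable.bool(),
        torch.sigmoid(beta * (reward - z0)),
        torch.sigmoid(beta * (z0 - reward)),
    )

    # Adaptive reweighting: inverse-frequency weights so that imbalanced
    # desirable/undesirable counts do not dominate the gradient.
    n = desirable.numel()
    n_d = desirable.sum().clamp(min=1)
    n_u = (n - desirable.sum()).clamp(min=1)
    weights = torch.where(desirable.bool(),
                          w_desirable * n / (2 * n_d),
                          w_undesirable * n / (2 * n_u))

    # Loss decreases as each weighted sample's value approaches 1.
    return (weights * (1.0 - value)).mean()
```

In this sketch the reference point is computed per batch and detached so it acts purely as a baseline; the paper's adaptive reweighting may use a different weighting rule, but the overall shape (binary labels, a sigmoid value function, per-class scaling) follows the description above.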
Technical Details
The process begins with the collection of expert demonstration data, which is used to fine-tune an initial policy model. Human interventions during policy execution are then used to classify actions as desirable or undesirable, with an emphasis on the latter because failed actions provide critical learning opportunities. Using a combination of balanced sampling and adaptive reweighting, the optimization process aligns the model's action preferences toward successful task completion; a sketch of this annotation and sampling step follows below.
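To make the data side of this pipeline concrete, the sketch below shows one plausible way to label interaction steps and draw class-balanced batches. The labeling rule (robot actions immediately preceding a human takeover are marked undesirable, human corrections desirable) and all helper names are assumptions for illustration, not the paper's exact procedure.

```python
# Sketch of intervention labeling and class-balanced sampling.
# The labeling rule and helper names are assumptions for illustration only.
import random
from dataclasses import dataclass

@dataclass
class Segment:
    observation: object   # observation at this step
    action: object        # action (or action-token sequence) executed
    desirable: bool       # label derived from human intervention signals

def annotate(trajectory):
    """Label each step of one rollout from human-intervention signals.

    trajectory: list of (observation, action, human_acted) tuples.
    Human corrections are labeled desirable; the robot action immediately
    preceding a takeover is labeled undesirable, since it triggered the
    intervention. All remaining robot actions default to desirable.
    """
    segments = []
    for t, (obs, act, human_acted) in enumerate(trajectory):
        next_is_takeover = t + 1 < len(trajectory) and trajectory[t + 1][2]
        desirable = bool(human_acted) or not next_is_takeover
        segments.append(Segment(obs, act, desirable))
    return segments

def balanced_batch(segments, batch_size):
    """Draw (at most) equal numbers of desirable and undesirable segments."""
    pos = [s for s in segments if s.desirable]
    neg = [s for s in segments if not s.desirable]
    half = batch_size // 2
    return (random.sample(pos, min(half, len(pos))) +
            random.sample(neg, min(half, len(neg))))
```

Batches drawn this way could then be fed to a preference objective like the one sketched earlier, so that the relatively rare undesirable segments still contribute meaningfully to each update.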
This approach contrasts with traditional behavior cloning and reinforcement learning methods by mitigating their inherent limitations: behavior cloning often fails to exploit valuable failure trajectories, while reinforcement learning faces scalability constraints when applied to large VLA models. By focusing on preference optimization, HAPO offers a promising alternative that reduces the need for large-scale fine-tuning and limits the amount of human intervention required to train large policies.
Experimental Validation
The efficacy of the HAPO approach is validated through both simulated and real-world experiments. Simulation results demonstrate superior generalization and robustness across various manipulation tasks, with the system showing improved adaptation to in-distribution scenarios and resilience against novel task disruptions. Lifelong learning experiments further demonstrate the framework’s capability for iterative improvement, highlighting its potential for continuous adaptation and performance enhancement.
Moreover, the research includes real-world experiments, notably on complex fine-grained tasks, confirming the practical viability of HAPO in dynamic, unconstrained environments. This indicates the framework's ability to perform robustly across a range of conditions and tasks.
Implications and Future Directions
The introduction of HAPO opens new avenues for enhancing robotic systems' deployment in real-world applications through increased adaptability and self-improvement capabilities. By integrating human feedback more effectively, this approach not only addresses existing challenges in VLA model deployment but also sets the stage for future developments in AI that leverage human-machine collaboration.
In terms of future work, there is scope for extending HAPO to different types of VLA models and exploring its applicability to broader categories of tasks beyond manipulation. Additionally, incorporating more sophisticated human feedback mechanisms could further improve the performance and adaptability of robotic systems. The research sets a precedent for future exploration into more efficient and effective interactions between human operators and intelligent robotic systems.