Post-Convergence Sim-to-Real Policy Transfer: A Principled Alternative to Cherry-Picking
The paper presents a novel approach to the post-convergence transfer problem for reinforcement learning (RL) policies moving from simulation to real-world settings. The authors critique the prevalent practice of "cherry-picking" policies for deployment after convergence, when empirical reward trajectories in the simulator have become noisy yet stable and no longer correlate reliably with real-world performance. Instead of relying on heuristic selection, the paper introduces a systematic, analytical framework for improving sim-to-real policy transfer.
Problem Context
RL is widely used to develop control policies for autonomous agents such as legged robots. During training, RL policies are optimized in a simulator by maximizing predefined reward functions. Once the trained policies reach convergence, selecting a policy to deploy in the real environment is not simply a matter of choosing the one with the highest simulated reward, because the simulator cannot fully reproduce real-world conditions. While much of the research effort focuses on improving pre-convergence training and narrowing the sim-to-real gap (for instance, through domain randomization or high-fidelity simulators), these methods do not directly resolve the uncertainty inherent in post-convergence policy selection.
Methodological Approach
To tackle post-convergence policy selection, the authors propose evaluating the worst-case performance of RL policies under simulated conditions, thereby establishing performance guarantees that are provably better indicators of true real-world behavior. Their methodology rests on a convex optimization model, specifically a Quadratically Constrained Linear Program (QCLP), that imposes distributional divergence constraints between the simulator and possible real-world outcomes. Solving this program yields worst-case performance estimates that reliably predict the relative real-world rankings of policies trained in simulation.
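To make the construction concrete, the sketch below casts worst-case return estimation as a small QCLP: the objective is linear in the adversarial trajectory weights, and the divergence constraint is quadratic. The chi-square divergence ball, the radius eps, and the use of cvxpy are illustrative assumptions, not the paper's exact formulation.

```python
# A minimal sketch of worst-case return estimation as a QCLP, assuming the
# divergence constraint is a chi-square ball of radius `eps` around the
# simulator's empirical trajectory distribution. Names and data are
# illustrative, not the paper's notation.
import numpy as np
import cvxpy as cp

def worst_case_return(returns: np.ndarray, eps: float) -> float:
    """Lower-bound a policy's return by adversarially reweighting simulated
    trajectories within a divergence ball around the empirical distribution."""
    n = len(returns)
    p = np.full(n, 1.0 / n)          # empirical (simulator) trajectory weights
    q = cp.Variable(n, nonneg=True)  # adversarial (worst-case) weights

    objective = cp.Minimize(returns @ q)           # linear objective
    constraints = [
        cp.sum(q) == 1,                            # q is a distribution
        cp.sum(cp.square(q - p) / p) <= eps,       # quadratic divergence ball
    ]
    cp.Problem(objective, constraints).solve()
    return float(returns @ q.value)

# Example: compare two converged policies by their worst-case estimates.
rng = np.random.default_rng(0)
returns_a = rng.normal(100.0, 5.0, size=200)   # simulated returns, policy A
returns_b = rng.normal(102.0, 20.0, size=200)  # higher mean, higher variance
print(worst_case_return(returns_a, eps=0.5))
print(worst_case_return(returns_b, eps=0.5))
```

Under this assumed setup, a policy with a slightly lower simulated mean but much lower variance can receive the higher worst-case estimate, which is the kind of ranking behavior the paper argues is more predictive of real-world outcomes.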
Key Findings and Implications
Extensive empirical evaluations show that the proposed worst-case estimator aligns more closely with real-world policy performance than traditional simulation-based metrics. Using locomotion policies for humanoid robots, the authors demonstrate the approach in diverse settings, including undisturbed and disturbed conditions. Of particular significance is the theoretical result that the estimated worst-case rewards have reduced variance, which improves ranking consistency even when direct simulations fail to capture real-world complexities.
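As an illustration of how such alignment might be quantified, the snippet below compares the ranking induced by a simulation-side proxy metric (here, a worst-case estimate) with the ranking induced by measured real-world returns using Spearman rank correlation. The choice of Spearman correlation and the numbers are assumptions for illustration, not the paper's evaluation protocol.

```python
# A minimal sketch of measuring ranking consistency between a simulation-side
# proxy metric and real-world returns. Spearman correlation is an assumed
# consistency measure; the scores below are made up for illustration.
import numpy as np
from scipy.stats import spearmanr

# Hypothetical scores for five converged policies.
worst_case_estimates = np.array([88.0, 95.0, 91.0, 79.0, 99.0])  # worst-case (sim)
mean_sim_rewards     = np.array([97.0, 96.0, 99.0, 94.0, 98.0])  # naive sim metric
real_world_returns   = np.array([85.0, 93.0, 90.0, 76.0, 97.0])  # measured on hardware

rho_worst, _ = spearmanr(worst_case_estimates, real_world_returns)
rho_mean, _ = spearmanr(mean_sim_rewards, real_world_returns)
print(f"worst-case estimator vs. real world: rho = {rho_worst:.2f}")
print(f"mean sim reward vs. real world:      rho = {rho_mean:.2f}")
```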
This approach has significant implications for both practice and theory. Practically, it offers a structured framework that could be applied across domains where policy transfer from simulation to the real world is critical, such as autonomous vehicles and robotics. Theoretically, it invites further work on advanced optimization techniques and adaptive methods to overcome the computational challenges posed by high-dimensional state spaces and complex dynamics.
Future Directions
This work opens several avenues for further research, including the exploration of more sophisticated adversarial methods for simulating realistic disturbances and the investigation of adaptive discretization techniques for efficiently handling large-scale optimizations. Refining the balance between computational feasibility and model fidelity also remains a key challenge for robust policy transfer. While this paper provides a foundational step toward formalizing post-convergence policy selection, continued efforts to improve modeling efficiency and real-world applicability are encouraged.
In summary, this paper contributes a principled alternative to heuristic policy selection post-convergence, providing significant insights that could reshape practices in sim-to-real policy transfer.