Learning Unified Force and Position Control for Legged Loco-Manipulation
(2505.20829v1)
Published 27 May 2025 in cs.RO
Abstract: Robotic loco-manipulation tasks often involve contact-rich interactions with the environment, requiring the joint modeling of contact force and robot position. However, recent visuomotor policies often focus solely on learning position or force control, overlooking their co-learning. In this work, we propose the first unified policy for legged robots that jointly models force and position control learned without reliance on force sensors. By simulating diverse combinations of position and force commands alongside external disturbance forces, we use reinforcement learning to learn a policy that estimates forces from historical robot states and compensates for them through position and velocity adjustments. This policy enables a wide range of manipulation behaviors under varying force and position inputs, including position tracking, force application, force tracking, and compliant interactions. Furthermore, we demonstrate that the learned policy enhances trajectory-based imitation learning pipelines by incorporating essential contact information through its force estimation module, achieving approximately 39.5% higher success rates across four challenging contact-rich manipulation tasks compared to position-control policies. Extensive experiments on both a quadrupedal manipulator and a humanoid robot validate the versatility and robustness of the proposed policy across diverse scenarios.
This paper presents a novel approach for training a unified force and position control policy for legged robots using reinforcement learning, without requiring explicit force sensors. The core idea is to learn a policy that estimates external forces based on the robot's historical states and adjusts target positions and velocities to compensate, enabling versatile loco-manipulation behaviors.
The approach is based on an impedance-control formulation. For the end-effector, the desired target position $X_{\text{target}}$ is defined as

$$X_{\text{target}} = X_{\text{cmd}} + K\left(F_{\text{ext}} + F_{\text{cmd}} - F_{\text{react}}\right)$$

where $X_{\text{cmd}}$ is the position command, $F_{\text{cmd}}$ is the force command, $F_{\text{react}}$ is the reaction force from the environment, $F_{\text{ext}}$ is the external disturbance force, and $K$ is the stiffness gain that converts force into a position offset. The net external force on the end-effector, combining $F_{\text{ext}}$ and $F_{\text{react}}$, is not measured but estimated by the learned policy. Depending on the commands and interactions, this formulation yields a range of behaviors (a short sketch after the base-velocity extension below illustrates each case):
Position Control: when $F_{\text{cmd}} = 0$ and there is no significant $F_{\text{ext}}$ or $F_{\text{react}}$, $X_{\text{target}} \approx X_{\text{cmd}}$, yielding position tracking.
Force Control: when in contact ($F_{\text{react}} \neq 0$) and applying $F_{\text{cmd}}$, the end-effector moves until $F_{\text{react}} \approx F_{\text{cmd}}$, at which point $X_{\text{target}} \approx X_{\text{cmd}}$.
Impedance Control: when subjected to $F_{\text{ext}}$ without applying $F_{\text{cmd}}$, $X_{\text{target}} \approx X_{\text{cmd}} + K F_{\text{ext}}$, producing compliant behavior in which the position deviates in proportion to the external force.
Force Tracking: a special case of impedance control in which $X_{\text{cmd}}$ is dynamically updated by $\Delta X_{\text{cmd}} = K F_{\text{ext}}$ (following Equation A.4), so the end-effector follows the external force and remains at the displaced position after the force is removed.
Hybrid Position and Force Control: position control along some directions combined with force control along the perpendicular directions.
This formulation extends to other body parts, such as the robot base, which is typically controlled through velocity commands. The target base velocity is derived analogously:

$$V^{\text{base}}_{\text{target}} = V^{\text{base}}_{\text{cmd}} + D\,F_{\text{base}}$$

where $V^{\text{base}}_{\text{cmd}}$ is the base velocity command, $F_{\text{base}}$ is the net external force on the base, and $D$ is the damping gain that converts force into a velocity offset.
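The target-computation logic can be summarized in a few lines. The following is a minimal sketch, assuming NumPy, per-axis scalar gains, and the sign conventions above; the gain values and function names are illustrative placeholders, not the authors' implementation.

```python
import numpy as np

# Illustrative gains; the paper does not report numeric values in this summary.
K_EE = 0.005    # m/N: end-effector position offset per Newton of force
D_BASE = 0.01   # (m/s)/N: base velocity offset per Newton of force

def ee_target_position(x_cmd, f_cmd, f_ext, f_react, k=K_EE):
    """X_target = X_cmd + K * (F_ext + F_cmd - F_react)."""
    return x_cmd + k * (f_ext + f_cmd - f_react)

def base_target_velocity(v_cmd, f_base, d=D_BASE):
    """V_base_target = V_base_cmd + D * F_base."""
    return v_cmd + d * f_base

x_cmd = np.array([0.5, 0.0, 0.4])
zero = np.zeros(3)

# Position control: no force command, no contact -> track X_cmd.
print(ee_target_position(x_cmd, zero, zero, zero))

# Force control: push with F_cmd; once the reaction balances it, X_target ~ X_cmd.
f_cmd = np.array([20.0, 0.0, 0.0])
print(ee_target_position(x_cmd, f_cmd, zero, f_cmd))

# Impedance control: an external push displaces the target compliantly.
f_ext = np.array([0.0, 10.0, 0.0])
print(ee_target_position(x_cmd, zero, f_ext, zero))

# Force tracking: fold the offset back into X_cmd so the displacement persists.
x_cmd = x_cmd + K_EE * f_ext
```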
The unified policy is trained with reinforcement learning (PPO) in Isaac Gym (Makoviychuk et al., 2021). The architecture consists of an observation encoder that processes historical states, a state estimator that predicts the robot state and the net external forces, and an actor that outputs joint position targets (Figure 1a). The observation includes the base orientation and velocity, joint positions and velocities, the previous action, the commands ($V^{\text{base}}_{\text{cmd}}$, $X^{\text{ee}}_{\text{cmd}}$, $F^{\text{ee}}_{\text{cmd}}$, $F^{\text{base}}_{\text{cmd}}$), and feet timing.
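A rough PyTorch sketch of the three described modules is given below. The layer sizes, history length, observation layout, and estimator targets are assumptions for illustration; only the encoder/estimator/actor decomposition comes from the paper.

```python
import torch
import torch.nn as nn

class UnifiedForcePositionPolicy(nn.Module):
    """Sketch: history encoder -> state/force estimator -> actor (dimensions are placeholders)."""

    def __init__(self, obs_dim=70, history_len=20, latent_dim=64,
                 est_dim=9,        # e.g., base velocity (3) + net EE force (3) + net base force (3)
                 num_joints=18):   # placeholder joint count
        super().__init__()
        # Observation encoder: compresses a window of past proprioceptive observations.
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim * history_len, 256), nn.ELU(),
            nn.Linear(256, latent_dim), nn.ELU(),
        )
        # State estimator: predicts unobserved robot state and net external forces.
        self.estimator = nn.Linear(latent_dim, est_dim)
        # Actor: current observation + latent + estimate -> joint position targets.
        self.actor = nn.Sequential(
            nn.Linear(obs_dim + latent_dim + est_dim, 256), nn.ELU(),
            nn.Linear(256, 128), nn.ELU(),
            nn.Linear(128, num_joints),
        )

    def forward(self, obs, obs_history):
        latent = self.encoder(obs_history.flatten(start_dim=1))
        estimate = self.estimator(latent)   # supervised against simulator ground truth during training
        action = self.actor(torch.cat([obs, latent, estimate], dim=-1))
        return action, estimate
```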
To train the policy across diverse scenarios, random position, velocity, and force commands are sampled, along with external disturbance forces applied to the end-effector and base (Figure 1d, Appendix A.3.1). Forces are ramped up, held, and ramped down according to a schedule. Training uses a two-stage curriculum (Appendix A.3.3): initially focusing on reaching and locomotion, then introducing force commands and disturbances, which empirically provides more stable training. The reward function penalizes deviations from target end-effector positions and base velocities, collisions, joint limits, torques, velocities, accelerations, and action rates, while rewarding stable contact during locomotion (Table A.1). Domain randomization is applied to friction, body mass, center of mass, motor strength, payload, and external pushes (Table A.2) to improve robustness and sim-to-real transfer.
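The ramp-up/hold/ramp-down disturbance schedule can be sketched as follows. Timings, force ranges, and names are placeholders; the actual schedule and sampling ranges are specified in Appendix A.3.1 of the paper.

```python
import numpy as np

def sample_force_schedule(t_ramp=0.5, t_hold=1.5, max_force=60.0, rng=np.random):
    """Sample one disturbance episode and return force(t): ramp up, hold, ramp down."""
    direction = rng.normal(size=3)
    direction /= np.linalg.norm(direction)
    target = rng.uniform(0.0, max_force) * direction

    def force(t):
        if t < t_ramp:                          # ramp up
            return target * (t / t_ramp)
        if t < t_ramp + t_hold:                 # hold
            return target
        if t < 2.0 * t_ramp + t_hold:           # ramp down
            return target * (1.0 - (t - t_ramp - t_hold) / t_ramp)
        return np.zeros(3)                      # no disturbance afterwards

    return force
```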
A significant demonstrated application is force-aware imitation learning. Because typical position-based teleoperation datasets lack contact information, the learned force-position policy is used as the low-level controller during teleoperation (Figure 1b), and the joint states, base states, commands, estimated end-effector contact forces, and camera images are recorded (Appendix A.2, Figure A.2). These data are then used to train a diffusion-based imitation learning policy that takes the robot state, the estimated force, and images as input and predicts position and force commands for the low-level policy. Including the estimated force supplies the contact information that vision and position alone miss, improving performance on contact-rich tasks.
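As a concrete illustration of the recorded data and the diffusion policy's inputs, consider the sketch below. Field names and the exact observation layout are assumptions; the paper records joint states, base states, commands, estimated contact forces, and images (Appendix A.2).

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TeleopFrame:
    """One recorded teleoperation step for force-aware imitation learning (names illustrative)."""
    joint_pos: np.ndarray      # measured joint positions
    joint_vel: np.ndarray      # measured joint velocities
    base_state: np.ndarray     # base orientation and velocity
    ee_pos_cmd: np.ndarray     # operator position command sent to the low-level policy
    ee_force_cmd: np.ndarray   # operator force command sent to the low-level policy
    ee_force_est: np.ndarray   # net end-effector force estimated by the low-level policy
    rgb: np.ndarray            # camera image

def diffusion_policy_inputs(frame: TeleopFrame) -> dict:
    """Inputs to the imitation policy, which predicts future position and force commands."""
    return {
        "state": np.concatenate([frame.joint_pos, frame.joint_vel, frame.base_state]),
        "force": frame.ee_force_est,   # the contact signal missing from position-only datasets
        "image": frame.rgb,
    }
```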
The paper evaluates the learned policy through various experiments on a Unitree B2-Z1 quadrupedal manipulator and a Unitree G1 humanoid robot.
Position Tracking: In simulation, the policy achieves end-effector position tracking errors mostly within 0.1 m (Figure 2b), with state estimation errors within 0.05 m (Figure 2a). Tracking error increases slightly but remains within 0.1 m when simulated forces matching the force commands are applied (Figure 2c).
Force Control: Real-world tests on the B2-Z1 show average force errors within 10 N for force commands ranging from 0 to 60 N (Figure 2d, Appendix A.4, Figure A.3). Force estimation errors are between 5 and 10 N across the discrete command levels. While some sim-to-real discrepancies remain, particularly on the Y-axis, the estimation is deemed sufficient for the target tasks.
Force-aware Imitation Learning: On four real-world tasks (wipe-blackboard, open-cabinet, close-cabinet, open-drawer-occlusion), the force-aware policy improves success rates by approximately 39.5% over a vision-only baseline (Figure 3c, Table A.3). In wipe-blackboard, for example, it maintains consistent contact pressure (Figure 3a), which is crucial for task success. For the cabinet and drawer tasks with push-to-open mechanisms, the estimated contact force reveals when the required push has been applied despite visual occlusion (Figure A.4), something vision-only policies struggle to infer, leading to higher success rates.
Basic Manipulation Policies: The paper demonstrates the policy's ability to perform force control (supporting weight with a force command, Figure 4a), force tracking (following external force with zero force command, Figure 4c), impedance control (compliant response to disturbances, Figure 4d), and base force tracking (yielding to pushes, Figure 4b), highlighting the versatility of the unified approach.
Cross Embodiment: The framework generalizes to both quadruped and humanoid robots, adapting base velocity for locomotion and balancing against external forces (Figure 1c, Figure 4b).
Practical limitations include degraded force estimation accuracy during high-frequency interactions and near the edges of the workspace, as well as remaining sim-to-real gaps, especially in force accuracy. Future work could improve force estimation by incorporating velocity and acceleration terms, enhance sim-to-real transfer through domain randomization and real-to-sim corrections, and extend the framework to multi-point force estimation and whole-body contact tasks.
In summary, the paper presents a practical, sensor-free method for unified force and position control in legged robots via RL and state estimation, demonstrating its effectiveness in diverse manipulation and locomotion tasks and its potential for collecting richer, contact-aware imitation learning data.