Contact-Grounded Policy for Robots

Updated 3 July 2026

Contact-Grounded Policy (CGP) is a framework that explicitly models and uses contact data to drive robot actions in complex, dynamic scenarios.
It leverages multimodal sensing and structured contact representations to enable hybrid force-position control and compliant execution.
Empirical studies show CGP methods outperform vision-only policies, achieving higher success rates and enhanced safety in manipulation and locomotion.

Contact-Grounded Policy (CGP) is a unified paradigm for robot policy learning in contact-rich scenarios, distinguished by its explicit modeling, prediction, representation, or grounding of contact events, phases, or dynamics to inform perception, action selection, and compliant control. Originating in both manipulation and loco-manipulation contexts, CGP frameworks leverage multimodal sensing (vision, force/torque, tactile, kinematic), structured representations (contact events, contact plans, force profiles), and physically grounded policy architectures to achieve robust and generalizable behavior across a diverse set of contact-dense tasks.

1. Formal Definition and Theoretical Foundations

A Contact-Grounded Policy is any policy π_θ(a|s,C), with parameters θ, that receives as input not only the state s of the robot and its environment but also structured, physically meaningful contact information C, and produces actions a that are consistent with contact-specific requirements. Here C may represent future or desired contact events, contact schedules, interaction frames, force/torque histories, or tactile feedback. The mathematical instantiation of C varies by domain:

For contact-rich manipulation, C may be the future-contact schedule or force/torque trajectory, as in FoAR’s prediction of contact probability φ̂ₜ and selective force/vision fusion (He et al., 2024).
For locomotion and loco-manipulation settings, C can be a contact plan or contact goal schedule $c = \{(p_i, \tau_i, e_i)\}_{i=1}^N$, associating a desired position $p_i$, timing $\tau_i$, and active end-effector $e_i$ with each contact event (Omar et al., 4 Oct 2025, Ciebielski et al., 2024).
In dexterous or tactile-mediated manipulation, C includes compact representations such as center-of-pressure (CoP) vectors, or joint state–tactile latent distributions (Pan et al., 27 May 2026, Xu et al., 5 Mar 2026).
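As a minimal sketch of how a contact plan of the form $c = \{(p_i, \tau_i, e_i)\}$ might be held in code, the snippet below stores each event as a small record and looks up the next scheduled contact for one end-effector. The names `ContactEvent` and `next_contact`, the end-effector labels, and the numeric values are hypothetical and chosen for illustration; they are not taken from the cited works.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ContactEvent:
    """One entry (p_i, tau_i, e_i) of a contact plan (illustrative structure)."""
    position: Tuple[float, float, float]  # p_i: desired contact position
    time: float                           # tau_i: desired contact time in seconds
    end_effector: str                     # e_i: end-effector that makes contact


def next_contact(plan: List[ContactEvent], t: float,
                 end_effector: str) -> Optional[ContactEvent]:
    """Return the earliest event for `end_effector` scheduled at or after time t."""
    upcoming = [ev for ev in plan
                if ev.end_effector == end_effector and ev.time >= t]
    return min(upcoming, key=lambda ev: ev.time) if upcoming else None


# Hypothetical three-step plan alternating between two feet.
plan = [
    ContactEvent((0.3, 0.1, 0.0), 0.5, "left_foot"),
    ContactEvent((0.6, -0.1, 0.0), 1.0, "right_foot"),
    ContactEvent((0.9, 0.1, 0.0), 1.5, "left_foot"),
]
goal = next_contact(plan, t=0.7, end_effector="left_foot")
```

In a policy of the form π_θ(a|s,C), the fields of `goal` (position, time remaining until contact, end-effector identity) would be one plausible way to populate C at each control step.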

The policy’s objective is typically formulated as a combination of action prediction error and contact consistency, as in:

$\min_\theta \mathbb{E}_{(s, a^*, \phi^*) \sim \mathcal{D}} [ \| \hat{a}(s;\theta) - a^* \|^2 + \alpha \, \ell_{BCE}(\hat{\phi}(s;\theta), \phi^*) ]$

where $\hat{a}$ are predicted actions and $\hat{\phi}$ are contact predictions (He et al., 2024).
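The objective above can be sketched for a single sample and for a batch average as follows. This is an illustrative pure-Python version that assumes a scalar contact probability per sample; the function names `bce`, `cgp_loss`, and `batch_loss`, and the clipping constant `eps`, are assumptions introduced here rather than details from the cited work.

```python
import math
from typing import Sequence, Tuple


def bce(p_hat: float, p_true: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy between a predicted probability and a 0/1 label."""
    p = min(max(p_hat, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(p_true * math.log(p) + (1.0 - p_true) * math.log(1.0 - p))


def cgp_loss(a_hat: Sequence[float], a_star: Sequence[float],
             phi_hat: float, phi_star: float, alpha: float = 1.0) -> float:
    """Squared action error plus alpha-weighted contact-prediction BCE."""
    action_err = sum((x - y) ** 2 for x, y in zip(a_hat, a_star))
    return action_err + alpha * bce(phi_hat, phi_star)


Sample = Tuple[Sequence[float], Sequence[float], float, float]


def batch_loss(batch: Sequence[Sample], alpha: float = 1.0) -> float:
    """Monte Carlo estimate of the expectation over the dataset D."""
    return sum(cgp_loss(*sample, alpha=alpha) for sample in batch) / len(batch)


# One sample: action off by 1 in the first dimension, contact predicted at 0.5
# while the label says contact (1.0).
loss = cgp_loss([1.0, 0.0], [0.0, 0.0], phi_hat=0.5, phi_star=1.0, alpha=2.0)
```

The weight α trades off action imitation against contact prediction; with α = 0 the objective reduces to plain behavior cloning on actions.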

Grounding in contact is further emphasized by direct inclusion of contact rewards, hybrid and structured action spaces (force-position impedance/admittance), and trajectory generation in contact-consistent latent spaces (Fang et al., 25 Feb 2026, Luo et al., 12 Apr 2026, Zhou et al., 2024).

2. Policy Architectures and Contact Representation

CGP instantiations differ in observation structure, network design, and form of contact conditioning:

A. Sensor Modalities and Inputs

Vision (RGB, depth, point cloud)
Force/torque (wrist FT, joint torque, external wrenches)
Proprioception (end-effector pose, joint states)
Tactile (taxel arrays, vision-based tactile images)
Contact event/plan (contact flags, positions, timings, scheduled sequence)

B. Structured Contact Inputs

Contact representations are made explicit in the policy input:

Future contact goal: next footfall, duration, and contact pose (Ciebielski et al., 2024, Omar et al., 4 Oct 2025)
Contact phase indicators and probabilities (He et al., 2024)
Dynamically inferred interaction frames decomposing motion/force control axes (Fang et al., 25 Feb 2026)
Multimodal concatenation and gating/fusion (e.g., via attention, FiLM, or product-of-experts) (He et al., 2024, Liu et al., 24 Feb 2025, Xu et al., 5 Mar 2026, Luo et al., 12 Apr 2026)
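One simple form of contact-gated fusion can be sketched as below: force features are scaled by a predicted contact probability before being concatenated with vision features, so that force readings contribute little when no contact is expected. This is a deliberately simplified illustration; the function `gated_fusion` and its `threshold` parameter are hypothetical, and the cited methods use learned attention, FiLM, or product-of-experts mechanisms rather than this exact rule.

```python
from typing import List, Sequence


def gated_fusion(vision_feat: Sequence[float], force_feat: Sequence[float],
                 contact_prob: float, threshold: float = 0.0) -> List[float]:
    """Concatenate vision features with force features scaled by a contact gate.

    The gate equals `contact_prob` when it reaches `threshold`, else 0, so
    force features are suppressed when contact is judged unlikely.
    """
    if not 0.0 <= contact_prob <= 1.0:
        raise ValueError("contact_prob must lie in [0, 1]")
    gate = contact_prob if contact_prob >= threshold else 0.0
    return list(vision_feat) + [gate * f for f in force_feat]


# Free-space motion: force features are zeroed out.
free_space = gated_fusion([1.0, 2.0], [4.0], contact_prob=0.0)
# Likely contact: force features pass through, scaled by the probability.
in_contact = gated_fusion([1.0], [4.0, -2.0], contact_prob=0.5)
```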

C. Policy Networks

Feed-forward or recurrent MLPs for joint state and contact-plan encoding (Omar et al., 4 Oct 2025)
Diffusion policy architectures for trajectory denoising and multimodal action chunking, with contact-aware fusion (He et al., 2024, Zhou et al., 2024, Xu et al., 5 Mar 2026, Luo et al., 12 Apr 2026)
Transformer-based fusion of vision and force/tactile, with cross-attention to enforce contact modality focus at phases of interest (Liu et al., 24 Feb 2025, Luo et al., 12 Apr 2026)
Lightweight vision-based impedance prediction heads for stiffness/admittance control (Wang et al., 14 Mar 2026)
Gated tri-modal fusions and FiLM conditioning for disentangling global intent (vision) and local refinement (force/tactile) (Fang et al., 25 Feb 2026, Luo et al., 12 Apr 2026, Zhou et al., 2024)
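FiLM conditioning, mentioned in the last item, modulates each feature with a scale γ and shift β computed from a conditioning signal. A minimal sketch follows, assuming that γ and β come from single linear layers over a force/tactile conditioning vector; the helper names and the tiny weight matrices are made up for illustration and do not reflect any cited architecture.

```python
from typing import List, Sequence


def linear(weights: Sequence[Sequence[float]], bias: Sequence[float],
           x: Sequence[float]) -> List[float]:
    """Dense layer y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def film(features: Sequence[float], cond: Sequence[float],
         w_gamma: Sequence[Sequence[float]], b_gamma: Sequence[float],
         w_beta: Sequence[Sequence[float]], b_beta: Sequence[float]) -> List[float]:
    """Feature-wise linear modulation: out_k = gamma_k(cond) * h_k + beta_k(cond)."""
    gamma = linear(w_gamma, b_gamma, cond)
    beta = linear(w_beta, b_beta, cond)
    return [g * h + b for h, g, b in zip(features, gamma, beta)]


# Two vision-derived features modulated by a one-dimensional force signal.
out = film(
    features=[1.0, 2.0],
    cond=[0.5],
    w_gamma=[[2.0], [0.0]], b_gamma=[1.0, 1.0],  # gamma = [2.0, 1.0]
    w_beta=[[0.0], [4.0]], b_beta=[0.0, 0.0],    # beta  = [0.0, 2.0]
)
```

Here the vision features carry the global intent and the conditioning vector only rescales and shifts them, which is one way to realize the division into global intent and local refinement described above.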

3. Control Strategies: Hybrid, Reactive, and Compliant Execution

A key tenet of CGP approaches is the explicit division and fusion of motion and force control, realized by:

Reactive hybrid position/force control: Selection based on contact phase or predicted contact state; e.g., FoAR’s phasewise fusion and error correction using predicted direction d for under-contacted phases (He et al., 2024).
Admittance/impedance control: CGP policies output target forces and poses, executed via admittance or impedance controllers that transform force errors into compliant kinematic adjustments (Zhou et al., 2024, Wang et al., 14 Mar 2026, Luo et al., 12 Apr 2026).
Instantaneous interaction frame estimation: Local bases constructed from the environmental stiffness matrix and demonstration-derived intent vectors enable per-axis selection of force versus position control (Fang et al., 25 Feb 2026).
Contact-consistency mappings: Learning explicit mappings from predicted tactile/kinematic pairs to control setpoints ensures that desired contact evolution is realized by compliant control (as in multi-finger dexterity) (Xu et al., 5 Mar 2026).
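The per-axis split between force and position control, combined with an admittance-style update, can be sketched as follows. This is a simplified first-order illustration under stated assumptions: forces are expressed as the force the robot applies to the environment along each axis, the admittance gain is a single scalar, and the name `hybrid_step` and all numeric values are hypothetical rather than drawn from the cited controllers.

```python
from typing import List, Sequence


def hybrid_step(x: Sequence[float], x_des: Sequence[float],
                f_meas: Sequence[float], f_des: Sequence[float],
                force_axes: Sequence[bool],
                admittance: float = 0.002, dt: float = 0.01) -> List[float]:
    """Compute one position command with per-axis force/position selection.

    Position-controlled axes track x_des directly. Force-controlled axes move
    from the current position x with velocity admittance * (f_des - f_meas),
    so an under-applied force pushes further along the positive axis.
    """
    if not (len(x) == len(x_des) == len(f_meas) == len(f_des) == len(force_axes)):
        raise ValueError("all inputs must have the same length")
    cmd = []
    for i, use_force in enumerate(force_axes):
        if use_force:
            velocity = admittance * (f_des[i] - f_meas[i])  # m/s from force error
            cmd.append(x[i] + velocity * dt)
        else:
            cmd.append(x_des[i])
    return cmd


# Track position in x and y, regulate force along z (e.g. pressing on a surface).
cmd = hybrid_step(
    x=[0.0, 0.0, 0.1], x_des=[0.1, 0.0, 0.1],
    f_meas=[0.0, 0.0, 2.0], f_des=[0.0, 0.0, 5.0],
    force_axes=[False, False, True], admittance=0.01, dt=0.1,
)
```

In an interaction-frame formulation, `force_axes` would be expressed in the estimated local frame rather than a fixed world frame, and a full admittance controller would typically add virtual mass and damping terms.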

4. Training Procedures and Contact Grounding Mechanisms

CGP frameworks implement various training protocols tailored to grounding contact knowledge:

Imitation learning with contact loss: Behavior cloning via action and contact-prediction losses, with demonstration datasets augmented by haptic, tactile, or contact event labels (He et al., 2024, Liu et al., 24 Feb 2025, Zhou et al., 2024).
Reinforcement learning with contact-centric rewards: Structured contact-event rewards, e.g., alignment to reference contact event sequences, or contact plan tracking, with domain randomization and staged curriculum sampling (Omar et al., 4 Oct 2025, Kim et al., 29 Jun 2026).
Diffusion models and denoising objectives: Conditional generative models trained to produce entire trajectories (including force/tactile) consistent with multimodal observation and scheduled contact (Xu et al., 5 Mar 2026, Zhou et al., 2024).
Contact geometry optimization: In sim-to-real transfer, object/surface representations are adapted to optimize the match between simulated and real contact event sequences, increasing transfer fidelity and policy grounding (Kim et al., 29 Jun 2026).
Contact representation calibration: Closed-loop calibrations for physics-grounded tactile features (e.g., Center-of-Pressure) enable contact-driven policy conditioning in both sim and real (Pan et al., 27 May 2026).
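For reference, a center-of-pressure feature can be computed from a taxel array as the pressure-weighted mean of taxel positions, CoP = Σ pᵢ rᵢ / Σ pᵢ. The sketch below uses this standard definition on a planar sensor; the function name `center_of_pressure`, the `min_total` cutoff, and the example layout are assumptions, and the calibration procedure of the cited work is not reproduced here.

```python
from typing import Optional, Sequence, Tuple


def center_of_pressure(pressures: Sequence[float],
                       positions: Sequence[Tuple[float, float]],
                       min_total: float = 1e-6) -> Optional[Tuple[float, float]]:
    """Pressure-weighted mean of taxel positions on a planar sensor.

    Returns None when the total pressure is below `min_total`, i.e. when
    there is effectively no contact and the CoP is undefined.
    """
    if len(pressures) != len(positions):
        raise ValueError("pressures and positions must have the same length")
    total = sum(pressures)
    if total < min_total:
        return None
    cx = sum(p * x for p, (x, _) in zip(pressures, positions)) / total
    cy = sum(p * y for p, (_, y) in zip(pressures, positions)) / total
    return (cx, cy)


# Two taxels 4 units apart; three quarters of the load is on the second one.
cop = center_of_pressure([1.0, 3.0], [(0.0, 0.0), (4.0, 0.0)])
no_contact = center_of_pressure([0.0, 0.0], [(0.0, 0.0), (4.0, 0.0)])
```

Because the result depends only on relative pressures, a feature of this kind is insensitive to a uniform gain error across the taxels.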

5. Empirical Results: Performance, Robustness, and Generalization

Empirical analyses demonstrate that CGP methods significantly outperform vision-only or naively force-concatenating baselines in diverse, challenging scenarios:

Contact-rich manipulation and wiping: FoAR achieves up to a 100% action success rate and nearly doubles success/segment counts relative to prior art in wiping, peeling, and chopping, while remaining robust under disturbances (He et al., 2024). FACTR yields 87.5% average test success (vs. 21.3% for vision-only) by enforcing curriculum-guided force attention (Liu et al., 24 Feb 2025). AdmitDiff reports 83% mean success and a 48.8% mean force reduction (Zhou et al., 2024).
Dexterous, multi-point manipulation: CGP approaches that fully ground contact and tactile evolution outperform both visuotactile and visuomotor diffusion policies, with up to +15% success rate gain and strong tactile-kinematic tracking (Xu et al., 5 Mar 2026).
Locomotion and manipulation unification: Contact-explicit CGP policies generalize to multi-gait, multi-contact manipulation, with a 100% success rate on unseen gaits and 60% fewer manipulation failures on out-of-distribution objects and shapes (Omar et al., 4 Oct 2025, Ciebielski et al., 2024).
Force regulation and interaction safety: Stiffness Copilot dynamically adjusts robot impedance, matching the collision safety of low stiffness with the efficiency of high stiffness (Wang et al., 14 Mar 2026). Hybrid force-position policies using adaptive interaction frames excel in plug-in tasks and scrape-hard cases, and exhibit lower over-push error than baselines (Fang et al., 25 Feb 2026).
Sim-to-real transfer: Contact-grounded policies using CoP conditioning and contact-centric imitation achieve the highest zero-shot transfer performance and resilience to observation degradation (e.g., masked taxels) (Pan et al., 27 May 2026, Kim et al., 29 Jun 2026).
Human-aligned, physically grounded data: Multimodal, CGP-compatible demonstration interfaces (e.g., OmniUMI) support high-fidelity force tracking and generalize to force-sensitive pick-and-place, erasing, and tactile-informed release (Luo et al., 12 Apr 2026).

6. Significance, Variations, and Limitations

Contact-Grounded Policy frameworks codify several distinct but complementary strategies for closing the action-perception loop in contact-dense settings:

Explicit contact modeling enables robust physical interaction, generalization across environmentally variable regimes, and efficient sim-to-real transfer. Unifying locomotion and manipulation through contact plan grounding has enabled scalable multi-task policy learning (Omar et al., 4 Oct 2025).
Adaptive, phase-aware multimodal fusion, key in policies like FoAR and FACTR, prevents force-noise contamination during non-contact phases and focuses learning capacity on the appropriate modalities (He et al., 2024, Liu et al., 24 Feb 2025).
Physically grounded representation and compliance (e.g., via interaction frames (IF), CoP, or tactile features) are necessary for fine, high-dimensional manipulation and safety-constrained real-world use (Fang et al., 25 Feb 2026, Pan et al., 27 May 2026).
Limitations include task and sensor specificity (contact-consistency mappings must be retrained), reliance on accurate labeling or event segmentation, and limited scalability to destructive or highly variable contacts (Xu et al., 5 Mar 2026, Fang et al., 25 Feb 2026).

Potential future extensions involve end-to-end learning of contact scheduling, compliance parameters, multi-contact and bi-manual policy unification, and integration of explicit tactile/force planning in global policy architectures (He et al., 2024, Fang et al., 25 Feb 2026, Xu et al., 5 Mar 2026).

7. Representative CGP Frameworks and Results

CGP Family	Contact Representation	Core Contribution
FoAR (He et al., 2024)	Phase-specific force prediction, visual-force fusion	Reactive phase-wise control, selective fusion
FACTR (Liu et al., 24 Feb 2025)	Curriculum-forced attention to force during training	Superior generalization to unseen objects
AdmitDiff (Zhou et al., 2024)	Diffusion over action and force trajectories	Admittance-based compliant execution
Force Policy (Fang et al., 25 Feb 2026)	Interaction frame estimation, hybrid control	Task-aligned IF enables OOD-object generalization
CoP CGP (Pan et al., 27 May 2026)	Physics-grounded center-of-pressure (CoP)	Sim-to-real tactile transfer via differentiable calibration
CGP (Visuotactile) (Xu et al., 5 Mar 2026)	Generative prediction in tactile-kinematic latent space	Multi-fingered, multi-contact dexterity
Contact Plan CGP (Omar et al., 4 Oct 2025, Ciebielski et al., 2024)	Time-indexed event/goal schedule	Unified multi-gait and multi-task generalization
ConCent (Kim et al., 29 Jun 2026)	Contact event sequence as reward	Real-to-sim-to-real grounding of local contact geometry

This diversity illustrates that CGP is not a singular architecture, but an explicit design and training principle—grounding learning, perception, control, and policy at the granularity of contact, both spatially and temporally.