Contact-Rich Manipulation Tasks

Updated 12 June 2026

Contact-rich manipulation tasks are defined as robotic operations that integrate compliant force regulation and multimodal sensing to manage complex, multi-phase contact dynamics.
Recent advancements leverage visuo-tactile sensing, force/torque estimation, and large-scale datasets to precisely estimate state transitions and perform sub-millimeter task execution.
Physics-informed methodologies and adaptive control strategies are employed to overcome frictional constraints, ensure safety, and enhance overall task robustness in dynamic environments.

Contact-rich manipulation tasks comprise a class of robotic behaviors in which successful execution depends on strategic, compliant, and often multi-phase physical interaction with objects and environments, including sustained or dynamically changing surface contacts. These tasks—such as insertion, assembly, wiping, cutting, and dense packing—are characterized by complex contact dynamics, nontrivial frictional constraints, and frequent transitions between distinct contact modes. Effective solutions demand both precise perception (visuo-tactile, force/torque) and advanced control–planning methodologies capable of leveraging force information, modeling state transitions, and reasoning over contact-rich manifolds.

1. Defining Features and Physical Foundations

Contact-rich manipulation tasks fundamentally involve states and actions where the robot exploits or regulates contact forces to accomplish objectives that would be infeasible or unreliable via pure position control. Notable hallmarks include:

Multi-contact phenomena: Execution typically involves transitions between free-space motion, sticking, sliding, and separation. Contact configurations and corresponding constraint manifolds are hybrid and change discretely over time (Hegeler et al., 2023, Katayama et al., 2022).
Complex constraints: Tasks require explicit handling of kinematic, dynamic, and frictional constraints—including both holonomic and nonholonomic effects—across various contact modes.
Compliance and force regulation: High-precision contact-rich tasks are sensitive to small misalignments; compliant impedance or admittance control is essential to compensate for hardware compliance and unmodeled environment variations, and to maintain safety (Zhou et al., 2024, Li et al., 2022).
Sensing modalities: Vision alone is insufficient because of occlusions and the lack of surface force cues; tactile (e.g., GelSight, bio-inspired skins), force/torque (F/T), and audio/vibration signals are indispensable for alignment, slip detection, and dynamic response (Fan et al., 30 May 2025, Yu et al., 2023, Zheng et al., 19 Mar 2026, He et al., 2024).
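
The contact modes listed above (separation, sticking, sliding) can be illustrated with a minimal sketch for a single point contact under the standard Coulomb friction model. The function names, the friction coefficient, and the thresholds are illustrative assumptions, not taken from any of the cited works.

```python
import math


def inside_friction_cone(normal_force, tangential_force, mu=0.5):
    """Return True if the tangential force satisfies |f_t| <= mu * f_n (Coulomb)."""
    return math.hypot(*tangential_force) <= mu * normal_force


def classify_contact_mode(normal_force, tangential_speed,
                          force_eps=1e-3, speed_eps=1e-4):
    """Label a point contact as 'separated', 'sticking', or 'sliding'.

    normal_force: compressive normal force in N (positive when pressing).
    tangential_speed: relative tangential speed at the contact in m/s.
    force_eps, speed_eps: assumed noise thresholds (hypothetical values).
    """
    if normal_force <= force_eps:
        return "separated"      # no load carried by the contact
    if tangential_speed > speed_eps:
        return "sliding"        # relative motion along the surface
    return "sticking"           # in contact with no relative motion


if __name__ == "__main__":
    print(classify_contact_mode(0.0, 0.0))
    print(classify_contact_mode(5.0, 0.0))
    print(classify_contact_mode(5.0, 0.01))
    print(inside_friction_cone(5.0, (1.0, 0.0)))
```

In a hybrid-systems view, each returned label corresponds to a different active constraint set, which is why a planner or controller has to track mode changes explicitly.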

2. Sensing, Perception, and Dataset Infrastructure

State-of-the-art contact-rich manipulation relies on the tight coupling of visual and tactile (including F/T) sensing. Several recent frameworks have advanced both hardware and large-scale dataset creation:

Vision-based tactile and multimodal grippers: MagicGripper integrates a high-resolution, elastomeric grid tactile sensor with proximity and RGB channels, offering spatial resolutions of ~0.15 mm and accurate force estimation (Fan et al., 30 May 2025). Xense, Daimon, and GelSight Mini have also been used for marker-based tactile imaging (Zheng et al., 19 Mar 2026).
Large-scale visuo-tactile-action datasets: The OmniViTac dataset provides >21,000 real-world trajectories covering 86 contact paradigms and >100 objects, enabling modeling of diverse interaction patterns such as assembly, wiping, peeling, and in-hand adjustment (Zheng et al., 19 Mar 2026).
Hand-held and robot-free data collection: ViTaMIn employs a compliant fin-ray gripper with dual visual-tactile cameras for hand-held, in-the-wild demonstration capture, coupled with visual–tactile contrastive pretraining (Liu et al., 8 Apr 2025).
Human tactile-guided demonstration: MimicTouch collects multi-modal tactile (GelSight), pose, and vibration data from human fingers directly, supporting learning of human-level “blindfolded” control strategies (Yu et al., 2023).

These developments have enabled robust multimodal state estimation, tactile feature learning, and prediction of short-horizon contact dynamics essential for high performance.

3. Modeling, Learning, and Policy Structures

Contact-rich manipulation presents a spectrum of modeling and learning challenges, addressed by varied approaches:

Visuo-tactile world modeling: OmniVTA predicts short-horizon contact developments and task transitions with a two-stream diffusion world model, integrating a self-supervised tactile encoder and a differentiable implicit representation (Zheng et al., 19 Mar 2026).
Deep visuo-tactile policies: ViTaL decomposes policy learning into a high-level vision-language-based reaching phase (e.g., with Molmo VLM) and a reusable local visuo-tactile policy, leveraging transformer architectures and residual reinforcement learning to yield sub-millimeter precision across task instances (Zhao et al., 16 Jun 2025).
Dense self-supervised rewards: DREM derives dense, progress-aligned reward functions by learning an embedding over multimodal observations (vision + F/T) and defining reward as progress along this manifold, enabling stable and efficient RL for insertion tasks (Wu et al., 2020).
Impedance and admittance control integration: Admittance Visuomotor Policy Learning employs diffusion-based planning to output both end-effector poses and desired contact forces, closed in real time via an admittance controller for efficient contact-phase regulation (Zhou et al., 2024).
Force-aware and reflexive control: FoAR combines real-time force/torque sensing, vision, and a learned future contact predictor to modulate force-feature usage dynamically and applies a reactive position control overlay for robust task-phase transitions (He et al., 2024). OmniVTA further incorporates a 60 Hz latent tactile reflexive controller to enable rapid error correction during unforeseen deviations (Zheng et al., 19 Mar 2026).
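
As a rough illustration of how an admittance controller can close the loop on a desired contact force, the sketch below simulates a one-dimensional force-tracking admittance law, M·a + D·v = f_des − f_meas, pressing against a linear-spring surface. It is a generic textbook-style sketch, not the controller of Zhou et al.; the mass, damping, environment stiffness, and time step are assumed values chosen only so the example is stable.

```python
def admittance_step(x, v, f_meas, f_des, mass=1.0, damping=80.0, dt=0.001):
    """One semi-implicit Euler step of M*a + D*v = f_des - f_meas."""
    a = (f_des - f_meas - damping * v) / mass
    v_next = v + a * dt
    x_next = x + v_next * dt
    return x_next, v_next


def contact_force(x, surface=0.0, k_env=1000.0):
    """Reaction force of an assumed linear-spring surface (zero in free space)."""
    return k_env * max(0.0, x - surface)


def simulate(f_des=5.0, x0=-0.01, steps=3000):
    """Approach the surface from 1 cm away and settle at the desired force."""
    x, v = x0, 0.0
    for _ in range(steps):
        x, v = admittance_step(x, v, contact_force(x), f_des)
    return x, contact_force(x)


if __name__ == "__main__":
    x_final, f_final = simulate()
    print(f"final position {x_final:.4f} m, final force {f_final:.2f} N")
```

In free space the measured force is zero, so the commanded position drifts toward the surface; once contact is made, the force error shrinks and the motion settles where the measured force matches the desired one. In a learned pipeline, f_des would be supplied by the policy rather than fixed.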

4. Planning, Control, and Constraint Handling

Physical and hybrid system modeling underpins robust contact-rich manipulation:

Implicit and explicit contact planning: For quasi-static tasks, planning is framed as an optimization over an implicit equilibrium manifold, with cost functions derived from physically meaningful haptic metrics (Yang et al., 2024). SE(2) global planning via mutual reachable sets and convex set graphs yields globally near-optimal plans that interleave contact-rich transfers with regrasp transits (Liu et al., 15 Jan 2026).
Contact-implicit control formulations: Linear Complementarity Quadratic Programming (LCQP) enables real-time, online control by relaxing and enforcing quasi-static force balance and contact complementarity constraints (normal/tangential forces, frictional regime) (Katayama et al., 2022).
Constraint extraction from demonstration: Visual-only demonstration can be clustered into discrete contact modes, each represented by a holonomic kinematic constraint manifold (e.g., sphere, hinge, plane). Online residuals between measured forces and constraint force subspaces enable robust contact-phase detection and admittance-based control (Hegeler et al., 2023).
Compliance adaptation and anomaly handling: Adaptive impedance control, combined with predictive force modeling (e.g., BiGRU-MDN), achieves simultaneous force tracking and rapid compliance in the presence of contact perturbations, with anomaly detection via likelihood-based thresholds (Gao et al., 2020).
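
The likelihood-based anomaly check mentioned in the last item can be sketched as follows, assuming a predictor that outputs a one-dimensional Gaussian mixture over the next contact force. The mixture parameters here are hand-written stand-ins for a network's output, and the threshold value is a hypothetical choice; neither is taken from Gao et al.

```python
import math


def mixture_log_likelihood(f, weights, means, stds):
    """Log-density of force f under a 1-D Gaussian mixture."""
    density = sum(
        w * math.exp(-0.5 * ((f - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for w, m, s in zip(weights, means, stds)
    )
    # Floor the density so math.log never receives zero.
    return math.log(max(density, 1e-300))


def is_anomalous(f, weights, means, stds, min_log_likelihood=-6.0):
    """Flag a measurement whose log-likelihood falls below an assumed threshold."""
    return mixture_log_likelihood(f, weights, means, stds) < min_log_likelihood


if __name__ == "__main__":
    # Hypothetical predicted distribution over the next contact force (N).
    weights, means, stds = [0.7, 0.3], [5.0, 6.0], [0.3, 0.5]
    print(is_anomalous(5.1, weights, means, stds))
    print(is_anomalous(12.0, weights, means, stds))
```

A controller could react to a flagged measurement by lowering stiffness or pausing, while unflagged measurements continue to drive normal force tracking.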

5. Sample Results, Benchmarks, and Applications

Contact-rich manipulation methods have demonstrated significant quantitative and qualitative advances across a spectrum of benchmarks:

Task	Method	Success Rate / Metric	Sensor Modality	Citation
Peg-in-hole/USB insertion	DREM (SAC+learned reward)	Peg ~90%, USB ~80%	Vision + F/T	(Wu et al., 2020)
Plug-in-socket, USB, assembly	ViTaL	90% in unseen scenes	Wrist RGB + tactile	(Zhao et al., 16 Jun 2025)
Fragile fruit, tube, scissors	ViTaMIn	70–100% (↑45% w/ tactile)	Vision + vision-based tactile	(Liu et al., 8 Apr 2025)
Assembly, dense packing	MimicTouch (policy MSE)	0.26 MSE (T+A); RL TBA	GelSight tactile + audio	(Yu et al., 2023)
Wiping, drawing, door, etc.	Admittance Diffusion	Avg. 83% (+15%)	Vision + Force	(Zhou et al., 2024)
Many objects, six patterns	OmniVTA	60–90% across patterns	Multi-tactile + vision	(Zheng et al., 19 Mar 2026)
Flexible object (zipper)	Deep Predictive	93% (w/ tactile)	Vision + tactile grid	(Ichiwara et al., 2021)

Further core demonstrations:

MagicGripper achieves sub-millimeter contact localization and ~0.05 N force RMSE in plug insertion, object alignment, and slip detection (Fan et al., 30 May 2025).
FILIC demonstrates force-guided imitation learning via sensorless F/T estimation, achieving 80–90% insertion success in tasks previously limited by torque or vision-only policies (Ge et al., 21 Sep 2025).
Contact SLAM achieves <5 mm localization error in occluded (blind) socket assembly using only tactile exploration and factor-graph filtering (Wang et al., 11 Dec 2025).

6. Challenges, Limitations, and Ongoing Directions

Persistent open questions and current research frontiers include:

Embodiment gap and transfer: Significant performance drops are observed when transferring from human tactile demonstrations (e.g., MimicTouch VINN policy) to robot execution. Addressing morphological and kinematic mismatch—through residual RL, adaptation, or domain randomization—is a major area of active work (Yu et al., 2023).
Sample and demo efficiency: Data-efficient learning is advanced by pretraining (ViTaMIn), large-scale datasets (OmniViTac), and implicit or self-supervised reward modeling. However, hard-to-acquire demonstrations, especially in the presence of task variability or occlusions, remain a bottleneck (Zheng et al., 19 Mar 2026, Liu et al., 8 Apr 2025).
Generalization and robustness: Scene and pose generalization now reach ~90% success in localize-then-execute stratified policies (ViTaL); tactile features are essential for robustness against unobserved scene changes or dynamic perturbations. Adaptive gating and planning over implicit hybrid manifolds are ongoing research topics (Zhao et al., 16 Jun 2025, He et al., 2024).
Hardware and sensor limitations: Sensorized grippers must balance compliance, spatial and force resolution, and durability under high cycle counts and mechanical stress. Robustness to fabrication and calibration variances is nontrivial, but progress in elastomeric and grid-based tactile sensing is closing the performance gap (Fan et al., 30 May 2025).

7. Synthesis and Emerging Methodological Principles

Current research in contact-rich manipulation emphasizes physically-grounded learning, multimodal perception, and closed-loop force-aware planning:

Physics-informed architectures: Differentiable simulators, constraint extraction from demonstration, and hybrid (mode-switching) control generate physically valid and efficient behaviors (Katayama et al., 2022, Liu et al., 15 Jan 2026).
Multimodal fusion and reflexive control: Integrating fast tactile feedback (≥60 Hz), vision, and force/torque at both planning and control time scales—for example, through fusing world-model predictions with fast reflexive corrections—proves essential for stability and generalization under both nominal and off-nominal conditions (Zheng et al., 19 Mar 2026, He et al., 2024).
Adaptive behaviors: Modern frameworks employ context- and phase-aware policy adaptation, switching between force- and visually-dominated control as the task phase or predicted contact probability changes. This adaptability underpins recent advances in both success rate and safety (Zhou et al., 2024, He et al., 2024).
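
One simple way to picture the phase-aware switching described above is a gate that admits force features into the fused observation only when a predicted contact probability is high enough. This is a deliberately simplified sketch: the function name, the concatenation-based fusion, and the threshold are assumptions for illustration and do not reproduce the architecture of FoAR or any other cited method.

```python
def gate_force_features(vision_feat, force_feat, contact_prob, threshold=0.5):
    """Fuse features, scaling force features by the predicted contact probability.

    Below the assumed threshold the force features are zeroed, so the
    downstream policy is effectively vision-dominated in free space.
    """
    if not 0.0 <= contact_prob <= 1.0:
        raise ValueError("contact_prob must lie in [0, 1]")
    gate = contact_prob if contact_prob >= threshold else 0.0
    return list(vision_feat) + [gate * f for f in force_feat]


if __name__ == "__main__":
    print(gate_force_features([1.0, 2.0], [4.0], 0.2))
    print(gate_force_features([1.0, 2.0], [4.0], 0.75))
```

Zeroing noisy force readings during free-space motion keeps them from perturbing the policy, while near contact the same readings are passed through at a weight that grows with the predictor's confidence.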

Contact-rich manipulation has evolved into a rigorous, multidisciplinary field integrating geometry, hybrid system theory, physically-motivated learning, high-bandwidth multimodal sensing, and compliant control. Continued progress—especially towards robust, general-purpose solutions—rests on scalable datasets, sim-to-real transfer techniques, and the principled fusion of physical and data-driven intelligence.