Vision-Based Force Estimation
- Vision-Based Force Estimation is a methodology that infers contact forces from visual cues such as RGB images, depth maps, and optical flow, eliminating the need for dedicated force sensors.
- Techniques employ advanced imaging, CNNs, transformers, and sensor fusion to map deformation cues to force vectors through direct regression or physics-based models.
- This approach facilitates real-time robotic interaction, tactile sensing, and dynamic manipulation by providing scalable, non-intrusive feedback across diverse applications.
Vision-Based Force Estimation refers to the class of methodologies in which visual data—ranging from standard RGB images and depth maps to optical flow fields, event streams, and marker morphology—are leveraged to infer applied contact forces. This paradigm circumvents the need for dedicated hardware such as strain-gauge force sensors by extracting deformation cues observable through vision modalities, enabling scalable, non-intrusive, and generalized force feedback and haptic reasoning across robotics, teleoperation, tactile sensing, and dynamic interaction with deformable matter.
1. Sensor Modalities, System Architectures, and Problem Formulations
Vision-based force estimation systems fall broadly into several architectural categories according to their data modalities and physical interaction domains:
- Tactile Sensors with Internal Cameras: Devices such as GelSight, DIGIT, 9DTact, and markerless visuotactile sensors incorporate a gel or elastomer layer deformed under contact and viewed by embedded cameras with structured illumination. The deformation patterns (e.g. photometric shading, marker displacement, scattered light) encode the spatial distribution and magnitude of force (Castaño-Amoros et al., 2024, Lin et al., 2023, Shahidzadeh et al., 2024).
- External Observational Setups: Approaches such as VFTS utilize external cameras (e.g. fisheye, stereo, or event-based sensors) to monitor compliant manipulators or robot end-effectors. The observed global deformations or silhouette dynamics are regressed to 6-axis force/torque vectors (Collins et al., 2022, Guo et al., 2024).
- Image-to-Force for Deformable Tissue and Exosuits: Structured light, frequency-domain optical flow (SurgeMOD), or multi-view setups reconstruct high-resolution 3D deformation fields. These are mapped to force estimates via learned regression or analytic dynamic constraints, enabling force prediction in surgical manipulation or wearable assistive exosuits (Wang et al., 15 Jan 2025, Reyzabal et al., 2024, Refai et al., 4 Aug 2025).
- Hybrid Vision+State Systems: In surgical and telemanipulation contexts, visual inputs are fused with robot kinematics, motor currents, or state vectors (velocity, joint position), either as explicit model inputs or via late fusion in neural encoder-decoder models (Chua et al., 2020, Reyzabal et al., 2024, Annamraju et al., 29 Apr 2025, Yang et al., 2024).
- Force-Map and Field Estimation: Coarse-grained force-distribution or "force-map" regressors, trained purely in simulation with domain randomization, provide spatially resolved—though approximate—contact load/distribution predictions from single or multi-view images, facilitating robust manipulation planning (Hanai et al., 2023).
Across all architectures, the mapping from visual observations to forces can be formulated as direct regression (e.g. mapping gel images to scalar/vector force), structured estimation (e.g. per-voxel or per-marker force), or constrained optimization (e.g. enforcing stiffness, dynamic constraints, or modal structure).
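The direct-regression formulation above can be made concrete with a deliberately minimal sketch: a linear least-squares map from flattened tactile-image features to a 3-D force vector. All data, shapes, and the linear model here are illustrative assumptions; actual systems replace the linear map with the deep networks described in the next section.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "tactile images": 200 samples of flattened 8x8 intensity patches.
X = rng.normal(size=(200, 64))
# Assumed ground-truth linear map to a force vector (Fx, Fy, Fz), plus noise.
W_true = rng.normal(size=(64, 3))
F = X @ W_true + 0.01 * rng.normal(size=(200, 3))

# Direct regression: fit the image-to-force map by least squares.
W_hat, *_ = np.linalg.lstsq(X, F, rcond=None)

# Predict the force vector for a new observation.
f_pred = rng.normal(size=(1, 64)) @ W_hat
print(f_pred.shape)  # (1, 3)
```

Structured estimation and constrained optimization extend this pattern by predicting per-marker or per-voxel outputs and by adding physical penalty terms to the fitting objective.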
2. Core Algorithms and Learning Pipelines
The majority of state-of-the-art vision-based force estimators rely on supervised learning with ground-truth force labels. The key elements are:
- Input Representations: Inputs include raw RGB or grayscale images, depth reconstructions via photometric stereo, synthetic images produced by domain adaptation, marker fields, and compressed representations (e.g. event-frame composites). For instance, 9DTact computes and stacks darker/brighter deformation maps, while event-based systems (Force-EvT) aggregate asynchronous polarity events into image frames (Guo et al., 2024, Lin et al., 2023).
- Feature Extraction:
- CNNs and ResNets for spatial feature extraction, often with multi-scale feature fusion (e.g. RGBmod combines features from multiple ResNet layers for robust force prediction (Castaño-Amoros et al., 2024)).
- Transformers (ViT, DINOv2 backbones in FeelAnyForce and Force-EvT) are leveraged for global context, especially when processing dense event or marker field inputs (Guo et al., 2024, Shahidzadeh et al., 2024).
- GNNs (e.g. DeepLabCut keypoints fused via GraphSAGE for tool-pose/position estimation in surgical scenes (Yang et al., 2024)).
- Regression and Loss Functions:
- Mean squared error (MSE) and L1 loss dominate for direct force regression (Wang et al., 15 Jan 2025, Lin et al., 2023).
- Multi-objective heads for multi-task learning (FeelAnyForce's force+depth objectives (Shahidzadeh et al., 2024)).
- Regularization via dropout, weight decay, and cycle consistency or identity losses in domain adaptation (TransForce (Chen et al., 2024)).
- Domain Transfer and Cross-Sensor Calibration:
- CycleGAN-based domain translation transfers style and illumination cues to support force estimation on novel visuotactile sensor hardware, with only minimal fine-tuning (e.g. 100-sample calibrations in FeelAnyForce, TransForce (Shahidzadeh et al., 2024, Chen et al., 2024)).
- Calibration and Model-Based Fusion:
- Some systems fit polynomial or physically-parametrized relationships post-hoc (DIGIT force as a cubic polynomial of maximum gel depth (Zhu et al., 2022)).
- Others hybridize with analytical models (e.g., fusion of motor currents, vision-based regression, and force/torque sensors via Kalman filtering (Annamraju et al., 29 Apr 2025)).
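The post-hoc polynomial calibration above (force as a cubic polynomial of maximum gel indentation depth, as in the DIGIT example) can be sketched as follows. The depth–force pairs and coefficients are synthetic placeholders, not calibration data from any cited sensor; real pairs would be collected by pressing the sensor against a reference force/torque sensor.

```python
import numpy as np

# Synthetic calibration pairs: maximum gel indentation depth (mm) vs
# measured normal force (N), generated from an assumed cubic law.
depth = np.linspace(0.0, 1.5, 30)
force = 0.5 * depth + 1.2 * depth**2 + 0.8 * depth**3

# Fit a cubic polynomial: force = c3*d^3 + c2*d^2 + c1*d + c0.
coeffs = np.polyfit(depth, force, deg=3)
calib = np.poly1d(coeffs)

# Estimate force from a newly observed indentation depth.
print(round(float(calib(1.0)), 2))  # 2.5
```

The fitted `calib` then converts each frame's vision-derived depth estimate into a force reading at negligible computational cost.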
3. Quantitative Performance and Evaluation
Vision-based force estimation achieves competitive accuracy suitable for real-time robot interaction feedback:
| System/Domain | Error Metric | Value / Range | Comments |
|---|---|---|---|
| 9DTact (markerless, 6D) | MAE (N, N·m) | 0.3–0.4 N, ~0.01 N·m | Generalizes to unseen objects (Lin et al., 2023) |
| FeelAnyForce (GelSight Mini) | Mean normalized L1 error (%) | 4.2 % (unseen objects) | 200 K samples, ViT-Base backbone (Shahidzadeh et al., 2024) |
| Grasping (RGBmod/DIGIT) | Mean rel. error | 0.125 ± 0.153 (everyday objects) | No markers, 320×240 RGB (Castaño-Amoros et al., 2024) |
| Force-EvT (event camera, soft gripper) | RMSE, mean percent error | 0.13 N, 13 % | ViT-Base, 1.6 N range (Guo et al., 2024) |
| VFTS (external, 6-axis) | RMSE | 1.688 N, 0.185 N·m | Outperforms motor currents (Collins et al., 2022) |
| DaFoEs (surgical tool) | Relative error (mixed dataset) | 5 % (recurrent), 12 % (non-rec) | Cross-domain, ViT/LSTM (Reyzabal et al., 2024) |
| Force Map (object stacks) | % reduction in disturbance | –26 % (translation), –39 % (rotation) | Simulation-only, rough map (Hanai et al., 2023) |
| CNN Exosuit (sim, 7 pts) | RMSE, normalized RMSE (%) | 0.04 N, ~2.7 % | Closed-loop on soft exosuits (Refai et al., 4 Aug 2025) |
| Tac3D (binocular vision, force field) | Displacement RMSE (mm); Force–sensor match | <0.03 mm; close fit | Markers, real time (Zhang et al., 2022) |
These accuracies lie consistently within the regime required for feedback control, slip detection, manipulation planning, and haptic rendering. Accuracy degrades gracefully under novel objects, sensor variants, or increased material stiffness, and the degradation can be mitigated via limited fine-tuning and domain adaptation.
4. Special Topics: Physical Modeling, Force Distribution, and Dynamic Interaction
- Model-Driven vs. Pure Learning Approaches: Some systems derive physically-motivated constraints, e.g. force = stiffness × displacement via estimated local tool–tissue displacement (Vision+State Fusion (Yang et al., 2024)), frequency-domain modal dynamics (SurgeMOD (Reyzabal et al., 2024)), or compliance matrix inversion (Tac3D FEM-based inversion (Zhang et al., 2022)).
- Force Distribution and Friction Mapping: High-resolution marker-based or photometric-stereo systems enable spatially-resolved estimation of force (as a vector field) and friction coefficient distribution, which supports advanced planning and slip prevention (Zhang et al., 2022, Zhang et al., 2019).
- Sensorless Estimation and Multi-Domain Generalization: The utility of dataset-mixing and kinematically-aligned data augmentation for robust force estimation across surgical tools, phantom structures, and variable workspace configurations is demonstrated in DaFoEs (Reyzabal et al., 2024).
- Temporal and Sequential Modelling: LSTM or transformer models employed in TransForce, DaFoEs, and FeelAnyForce improve accuracy in the presence of dynamic deformation (slip, shear, impact), especially in shear- or tangential-force channels (Chen et al., 2024, Reyzabal et al., 2024, Shahidzadeh et al., 2024).
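The stiffness-times-displacement constraint mentioned above can be sketched in a few lines. The stiffness matrix and displacement values here are hypothetical placeholders, not parameters identified in any cited system; in practice the stiffness would be estimated from data or a mechanical model, and the displacement from tracked visual keypoints.

```python
import numpy as np

# Hypothetical per-axis contact stiffness (N/mm), stiffer along z.
K = np.diag([0.8, 0.8, 2.5])

# Visually estimated tool-tissue contact point (mm) at rest and now,
# e.g. from tracked keypoints before and after contact.
p_rest = np.array([10.0, 5.0, 0.0])
p_now = np.array([10.5, 5.2, -1.0])
disp = p_now - p_rest

# Hooke-style model-driven estimate: force = stiffness * displacement.
f_est = K @ disp
print(f_est)  # approximately [0.4, 0.16, -2.5] N
```

Learned estimators can embed this relation as a loss term or fuse it with a regressed force as a physics prior.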
5. Limitations, Robustness, and Current Frontiers
- Modalities and Generalization: Vision-based estimators are often biased toward normal-force estimation because intensity-based cues are the most accessible, while shear or torque estimation requires informative marker displacements or sophisticated temporal models (Chen et al., 2024, Shahidzadeh et al., 2024). Performance also drops for stiff or deeply occluded contacts and in cross-sensor transfer without adaptation.
- Calibration and Adaptation: Minor gel-manufacturing differences, marker variations, or illumination changes necessitate either (a) robust domain adaptation via style-transfer and sequential translation (Chen et al., 2024, Shahidzadeh et al., 2024) or (b) physics-inspired recalibration pipelines (Zhu et al., 2022, Zhang et al., 2022).
- Real-Time and Resource Constraints: Many pipelines (ResNet-18, RGBmod, DenseNet-169) achieve 25–100 Hz inference on commodity hardware, whereas transformer and recurrent models trade throughput for accuracy (e.g. 12.5 Hz for the RCNN and RViT variants in DaFoEs (Reyzabal et al., 2024)).
- Physical Coverage: Current systems assume quasi-static or small-strain elastic interactions. High-speed impacts, viscoelastic-dominated regimes, and rich deformable-object interactions remain largely unsolved and call for future work (dynamic datasets, real-world exosuits, multiple tissue types) (Refai et al., 4 Aug 2025).
6. Scientific and Application Impact
Vision-based force estimation is central to a wide range of domains, including:
- Robotic grasp stability and slip prevention (grippers, manipulation) (Collins et al., 2022, Lin et al., 2023)
- Haptic feedback for teleoperation and medical robotics (surgical robots, tissue palpation, exosuit actuation) (Chua et al., 2020, Zhu et al., 2022, Refai et al., 4 Aug 2025)
- Adaptive grasping and planning based on force-map and friction distributions (Hanai et al., 2023, Zhang et al., 2022, Zhang et al., 2019)
- Multimodal fusion for robust contact inference in the absence of ground-truth sensors (Yang et al., 2024, Annamraju et al., 29 Apr 2025)
- Rapid calibration and deployment across custom sensor hardware (Shahidzadeh et al., 2024, Chen et al., 2024)
A plausible implication is that purely visual force estimation, with robust generalization and lightweight learning architectures, will further displace traditional sensing modalities in settings where embedded hardware is impractical or cost-prohibitive.
7. Ongoing Research Directions
- Temporal and Multi-View Fusion: Transformer-style temporal models, multi-view vision (stereo, event, optical flow), and fusion with inertial data to improve coverage of dynamic or complex-contact scenarios (Guo et al., 2024, Reyzabal et al., 2024).
- Dense Physical Field Estimation: Learning on dense force maps for physically grounded, manipulation-robust task planning (packing, lifting, tool-use) (Hanai et al., 2023, Zhang et al., 2022).
- Robustness and Adaptation: CycleGAN and domain-randomized pipelines to minimize data-collection needs when transferring to new sensor gels, lighting, or geometries (Chen et al., 2024, Shahidzadeh et al., 2024).
- Closed-Loop Real-Time Feedback: Uniting low-latency vision estimation with adaptive or haptic control for stable and responsive manipulation (Zhu et al., 2022, Refai et al., 4 Aug 2025, Lin et al., 2023).
- Expanding Biomechanical Scope: Physical modeling of viscoelastic/dynamic effects and force estimation in in-vivo and ex-vivo human tissues, including soft exosuit–tissue coupling (Annamraju et al., 29 Apr 2025, Refai et al., 4 Aug 2025).
- Physics-Informed Learning: Integration of physics priors, modal analysis, and explicit mechanical constraints in network loss design and data labeling (Reyzabal et al., 2024, Wang et al., 15 Jan 2025).
In summary, vision-based force estimation is a rapidly evolving field at the intersection of robotics, tactile perception, medical instrumentation, and machine learning. Recent advances enable high-resolution, markerless, and generalized force inference across diverse applications, grounded in both data-driven regression and physics-based modeling (Castaño-Amoros et al., 2024, Shahidzadeh et al., 2024, Lin et al., 2023, Guo et al., 2024, Annamraju et al., 29 Apr 2025).