- The paper introduces the RPC-TCN method, which fuses RGB and point cloud spatial features with temporal convolutional networks to accurately predict surgical forces from vision data alone.
- Experiments show RPC-TCN achieves low mean absolute errors (0.604% in phantom, 0.427% ex vivo) and high correlation (0.99+) with ground truth forces, outperforming methods using single modalities or frames.
- This research demonstrates that combining multi-modal vision data and temporal modeling enables highly accurate surgical force estimation, reducing the need for potentially complex physical force sensors.
The paper presents a vision-based approach to force prediction during robotic surgery that combines multi-modal inputs and temporal modeling. The proposed method, the RGB-Point Cloud Temporal Convolutional Network (RPC-TCN), integrates both 2D and 3D spatial features along with temporal convolutional features to infer the contact forces during surgical manipulation.
The approach is divided into two main modules:
- Spatial Block:
- Uses an RGB image input of size 224×224 and a depth image input of size 151×151. The depth image is transformed into a 3D point cloud by applying camera intrinsic parameters and normalization. Specifically, the conversion is given by
- $x_{pc} = \frac{(x_D - c_x)\, z_D}{f_x} - \bar{x}_D, \quad y_{pc} = \frac{(y_D - c_y)\, z_D}{f_y} - \bar{y}_D, \quad z_{pc} = z_D - \bar{z}_D,$
- where:
- $x_D, y_D$: pixel indices of the depth image
- $z_D$: depth value at the pixel
- $c_x, c_y$: principal point coordinates
- $f_x, f_y$: focal lengths
- $\bar{x}_D, \bar{y}_D, \bar{z}_D$: mean values used for normalization.
- The method then leverages a pre-trained VGG16 network to extract a 4096-dimensional feature vector from the RGB image and a pre-trained PointNet to extract a 512-dimensional feature from the point cloud (after uniform downsampling from 22,801 points to 2048 points). The concatenation yields a 4608-dimensional feature vector that encodes complementary spatial information.
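To make the Spatial Block concrete, the following is a minimal sketch of the depth-to-point-cloud conversion and feature fusion under stated assumptions: it uses PyTorch/torchvision for the VGG16 backbone, taps one of VGG16's 4096-dimensional fully-connected layers, and treats `pointnet_encoder` and the intrinsics tuple as hypothetical placeholders rather than the authors' exact implementation.

```python
import numpy as np
import torch
import torchvision.models as models

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project an H x W depth image into an (H*W, 3) point cloud and
    zero-center it: x_pc = (x_D - c_x) z_D / f_x - mean_x, and likewise for y, z."""
    h, w = depth.shape
    y_d, x_d = np.mgrid[0:h, 0:w]                    # pixel indices of the depth image
    x = (x_d - cx) * depth / fx
    y = (y_d - cy) * depth / fy
    pc = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    return pc - pc.mean(axis=0, keepdims=True)       # subtract per-axis means

def uniform_downsample(pc, n_points=2048):
    """Uniformly subsample the 151*151 = 22,801 points down to 2,048."""
    idx = np.linspace(0, len(pc) - 1, n_points).astype(int)
    return pc[idx]

# Pre-trained VGG16, truncated so the forward pass returns a 4096-d feature
# (which of the two 4096-d FC layers the paper taps is an assumption here).
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
vgg.classifier = torch.nn.Sequential(*list(vgg.classifier.children())[:-3])
vgg.eval()

def spatial_features(rgb_224, depth_151, intrinsics, pointnet_encoder):
    """Fuse the 4096-d RGB feature and 512-d point cloud feature into a 4608-d vector."""
    fx, fy, cx, cy = intrinsics
    with torch.no_grad():
        f_rgb = vgg(rgb_224.unsqueeze(0))                        # (1, 4096)
        pc = uniform_downsample(depth_to_point_cloud(depth_151, fx, fy, cx, cy))
        pc_t = torch.from_numpy(pc).float().T.unsqueeze(0)       # (1, 3, 2048)
        f_pc = pointnet_encoder(pc_t)                            # (1, 512), assumed encoder API
    return torch.cat([f_rgb, f_pc], dim=1)                       # (1, 4608)
```

The mean subtraction in `depth_to_point_cloud` mirrors the normalization terms $\bar{x}_D, \bar{y}_D, \bar{z}_D$ in the conversion equation above.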
- Temporal Block:
- The concatenated spatial features for each time step are aggregated over a sliding window (15 frames) to form a temporal feature tensor. The temporal block employs a Temporal Convolutional Network (TCN) that processes these sequential features through a hierarchy of convolutional layers. Each convolutional layer in the TCN applies filters $W^{(i)} \in \mathbb{R}^{d \times F_{l-1}}$ with bias $b \in \mathbb{R}^{F_l}$, where the activations of the $l$-th layer are computed by
- $E^{(l)} = f\big(W \ast E^{(l-1)} + b\big),$
- with $f(\cdot)$ denoting a non-linear activation function (ReLU was empirically found to outperform other choices) and $\ast$ representing the convolution operation. Batch normalization follows each convolutional layer to promote stable training.
- The last fully-connected layer employs a linear regression to map the final temporal features to the force prediction, expressed as
- $\hat{Y}_t = U E^{(L)} + c,$
- where $U$ is the filter of the last layer, $c$ is the bias, and $\hat{Y}_t$ is the predicted force at time $t$.
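A minimal sketch of how such a temporal block could be assembled is given below, assuming PyTorch; the number of layers, channel widths, kernel size, and the flattening of the 15-frame window before the final linear regression are illustrative choices, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TemporalBlock(nn.Module):
    """Illustrative TCN over a 15-frame window of 4608-d fused spatial features."""
    def __init__(self, in_dim=4608, hidden=(512, 128), kernel_size=3, window=15):
        super().__init__()
        layers, prev = [], in_dim
        for ch in hidden:
            # Each stage computes E^(l) = f(W * E^(l-1) + b) with ReLU as f,
            # followed by batch normalization (placed after the convolution).
            layers += [nn.Conv1d(prev, ch, kernel_size, padding=kernel_size // 2),
                       nn.BatchNorm1d(ch),
                       nn.ReLU()]
            prev = ch
        self.tcn = nn.Sequential(*layers)
        # Final fully-connected layer as a linear regression: Y_hat_t = U E^(L) + c.
        self.regressor = nn.Linear(prev * window, 1)

    def forward(self, x):            # x: (batch, 4608, 15), features over the sliding window
        e = self.tcn(x)              # (batch, hidden[-1], 15)
        return self.regressor(e.flatten(1)).squeeze(-1)   # predicted force per window

# Usage: a batch of four 15-frame windows of fused spatial features.
features = torch.randn(4, 4608, 15)
force_hat = TemporalBlock()(features)    # shape (4,)
```

Each `Conv1d` + `BatchNorm1d` + `ReLU` stage mirrors $E^{(l)} = f(W \ast E^{(l-1)} + b)$ with batch normalization, and the final `nn.Linear` plays the role of $\hat{Y}_t = U E^{(L)} + c$.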
Two experimental setups are reported: a phantom study and an ex vivo liver study. Both use synchronized RGB, depth, and force data acquired at 30 fps, with ground-truth force measured by an OptoForce sensor (accuracy up to 12.5 mN). In both studies, the data were segmented and split into 80% for training, 5% for validation, and 15% for testing.
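As a point of reference, the windowing and split could look like the sketch below; pairing each 15-frame window with the force at its last frame and using a simple chronological split are assumptions made for illustration, not details confirmed by the paper.

```python
import numpy as np

def make_windows(features, forces, window=15):
    """Pair each 15-frame window of per-frame spatial features (T, 4608)
    with the force measured at the window's last frame (an assumed alignment)."""
    X = np.stack([features[t - window + 1:t + 1].T
                  for t in range(window - 1, len(features))])
    y = forces[window - 1:]
    return X, y              # X: (N, 4608, 15), y: (N,)

def split_80_5_15(X, y):
    """80% training, 5% validation, 15% testing, in chronological order."""
    n = len(X)
    i_tr, i_va = int(0.80 * n), int(0.85 * n)
    return (X[:i_tr], y[:i_tr]), (X[i_tr:i_va], y[i_tr:i_va]), (X[i_va:], y[i_va:])
```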
Key numerical results include:
- Phantom Study:
- RPC-TCN achieved a mean absolute error (MAE) of 1.45 N, corresponding to a percentage error of 0.604% relative to the maximum measured force magnitude.
- Ex vivo Liver Study:
- The MAE of RPC-TCN is 0.814 N, equivalent to a percentage error of 0.427%.
Additional comparisons were made against alternative methods:
- A single-frame regression approach based solely on RGB features resulted in substantially higher errors (MAE of 7.06 N and 10.4 N in the phantom and liver studies, respectively).
- Temporal models using only RGB features (RGB-TCN) or only point-cloud data (Point Cloud-TCN) performed better than the single-frame method but were still outperformed by the fusion model (RPC-TCN).
The paper also details an analysis of error distribution across force bins. Notably, while the RGB-TCN method showed lower errors for smaller force magnitudes, it suffered from increased error variance at higher forces. In contrast, the point cloud modality provided a more uniform error distribution across different force levels. The fusion approach of RPC-TCN resulted in a consistent error trend across the force spectrum.
Additionally, the authors report high correlation coefficients (0.995 for the phantom and 0.996 for the ex vivo liver experiments) between the predicted and ground truth force values, indicating strong linear correspondence. Visualizations (e.g., correlation matrices and error vs. epoch trend plots) further support the stability and robustness of the proposed approach.
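The reported metrics (MAE, percentage error relative to the maximum force magnitude, and Pearson correlation), along with the per-bin error analysis, can be computed from predicted and ground-truth force traces roughly as in the sketch below; the number and placement of force bins here are illustrative assumptions.

```python
import numpy as np

def force_metrics(y_pred, y_true, n_bins=5):
    """Compute MAE, percentage error w.r.t. the max force magnitude, Pearson
    correlation, and per-bin MAE across force levels (bin edges are assumed)."""
    err = np.abs(y_pred - y_true)
    mae = err.mean()
    pct_error = 100.0 * mae / np.abs(y_true).max()     # error relative to max force magnitude
    corr = np.corrcoef(y_pred, y_true)[0, 1]           # Pearson correlation coefficient
    edges = np.linspace(y_true.min(), y_true.max(), n_bins + 1)
    bin_idx = np.clip(np.digitize(y_true, edges) - 1, 0, n_bins - 1)
    per_bin_mae = [err[bin_idx == b].mean() if np.any(bin_idx == b) else np.nan
                   for b in range(n_bins)]
    return mae, pct_error, corr, per_bin_mae
```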
Limitations and Future Directions:
- The training data were biased toward smaller forces, which reduced prediction accuracy for larger force magnitudes.
- The model was trained and validated on a single phantom and ex vivo liver specimen, suggesting potential overfitting to these specific conditions.
- Future work is proposed to include a more diverse range of tissues and the integration of monocular depth estimation techniques (building on previous studies) to enhance clinical utility.
Overall, the paper demonstrates that the fusion of 2D visual cues with 3D spatial information, in conjunction with temporal convolutional modeling, can significantly enhance the accuracy and robustness of surgical force estimation without the complexities associated with dedicated hardware-based force sensors.