SATA: Surgical Action–Text Alignment Dataset
- The paper introduces SATA, a dataset linking expert-annotated surgical video clips to concise, spatially grounded text descriptions for improved robotic surgery action modeling.
- It details a rigorous annotation protocol with four distinct suturing sub-tasks, ensuring precise alignment between video actions and clinical text descriptions.
- The integration of inverse dynamics for pseudo-kinematics and synthetic data augmentation significantly reduces error rates and enhances robotic task success.
The Surgical Action–Text Alignment (SATA) Dataset is a curated collection of expert-annotated surgical video clips with aligned, spatially grounded textual descriptions focused on surgical robot perception and control. Developed in the context of advancing vision–language–action (VLA) policy learning for robotic surgery, SATA provides finely segmented video-action pairs across key suturing sub-tasks, with clinical-grade text descriptions validated by surgical professionals. SATA is pivotal to addressing the scarcity of paired video–action data for scaling autonomous surgical skill acquisition, particularly through integration with synthetic data generation and inverse dynamics modeling (He et al., 29 Dec 2025).
1. Dataset Composition and Surgical Task Taxonomy
SATA comprises 2,447 expert-annotated video clips, totaling over 300,000 frames and approximately 10,000 seconds (∼2.8 hours) of surgical procedure footage at a nominal 30 fps. Source data encompass eight distinct procedures, such as prostatectomy, hysterectomy, colorectal intervention, and cholecystectomy. Central to SATA is a four-class, mutually exclusive action taxonomy for suturing, reflecting surgical physical-AI decompositions:
| Action Class | Definition | Number of Clips |
|---|---|---|
| Needle grasping | “Go-to-grasp” trajectory to securely pick up the needle tip | 689 |
| Needle puncture | Controlled insertion of needle into tissue—emphasis on entry and depth | 989 |
| Suture pulling | Drawing thread through tissue after puncture | 475 |
| Knotting | Looping/tightening suture material to secure tissue layers | 294 |
This structure provides granular, clinically meaningful subtask delineation tuned for robust action–text pair modeling in robotic suturing contexts.
2. Annotation Protocol and Data Alignment Methodologies
SATA’s clip extraction aggregated footage from credentialed YouTube sources and six public datasets (GraSP, SAR-RARP50, MultiBypass140, SurgicalActions160, AutoLaparo, HeiCo). Manual segmentation ensured each clip highlights exactly one of the four defined actions.
For each clip, surgical domain experts authored a single concise sentence covering: (1) tool–tissue interaction (“punctures,” “grasps”), (2) spatial relation (e.g., left vs. right instrument, anatomical target), and (3) anatomical context (“dorsal venous complex,” “peritoneal fold”). No automated prompt engineering was involved; all text annotation was manual.
Quality control followed a two-stage protocol: primary annotation by a surgical resident and secondary review by an attending surgeon, with random spot-checks on 10% of clips. All sampled checks met “clinically accurate” standards, and no clip underwent more than one relabeling. While no formal inter-annotator agreement metrics (e.g., Cohen’s kappa) were assessed, the multi-stage vetting provides high empirical annotation reliability.
Each action–text pair is released as a JSONL entry, for example:
```json
{
  "clip_id": "SAR_RARP50_042",
  "source": "SAR-RARP50",
  "start_frame": 102,
  "end_frame": 157,
  "action_class": "needle_puncture",
  "text_description": "The left needle driver punctures the right side of the patient’s dorsal venous complex."
}
```
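Such entries can be parsed with a few lines of Python; the filename below is hypothetical, and only the fields shown in the example above are assumed:

```python
import json
from collections import Counter

# Hypothetical path to the released annotation file.
SATA_ANNOTATIONS = "sata_annotations.jsonl"

def load_clips(path):
    """Parse one SATA annotation per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def class_counts(clips):
    """Count clips per action class, e.g. Counter({'needle_puncture': 989, ...})."""
    return Counter(clip["action_class"] for clip in clips)
```

Filtering on `action_class` then recovers per-subtask splits such as the needle-puncture subset.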
3. Kinematics Alignment and Inverse Dynamics Model
SATA addresses the deficit in paired video–kinematics data by leveraging pseudo-kinematics derived via an inverse dynamics model (IDM). The IDM is trained to map a pair of consecutive frames $(o_t, o_{t+1})$ to the corresponding control vector $a_t$. Synthetic video rollouts generated by the SurgWorld model are then fed through the IDM, producing predicted pseudo-actions $\hat{a}_t$.
Each action $a_t$ at time $t$ is represented as a 20-dimensional vector:

$$a_t = \left[\, \Delta p^{L}_t,\; r^{L}_t,\; g^{L}_t,\; \Delta p^{R}_t,\; r^{R}_t,\; g^{R}_t \,\right] \in \mathbb{R}^{20},$$

with:
- $\Delta p^{L}_t \in \mathbb{R}^{3}$: Cartesian offset of the left tool tip,
- $r^{L}_t \in \mathbb{R}^{6}$: 6D continuous rotation of the left end-effector,
- $g^{L}_t \in \mathbb{R}$: gripper opening of the left arm, and analogously $(\Delta p^{R}_t, r^{R}_t, g^{R}_t)$ for the right arm.
The IDM training objective is the mean-squared error between predicted and ground-truth action vectors:

$$\mathcal{L}_{\mathrm{IDM}} = \frac{1}{T} \sum_{t=1}^{T} \left\| \hat{a}_t - a_t \right\|_2^2,$$

where $\hat{a}_t$ is the predicted action and $a_t$ is the ground truth from teleoperation demonstrations.
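As a minimal numpy sketch of this representation and loss (the packing order and helper names are assumptions for illustration, not the released tooling):

```python
import numpy as np

def pack_action(dp_l, r6_l, g_l, dp_r, r6_r, g_r):
    """Concatenate per-arm components into the 20-D action vector:
    3-D Cartesian offset + 6-D continuous rotation + 1-D gripper, per arm."""
    a = np.concatenate([dp_l, r6_l, [g_l], dp_r, r6_r, [g_r]])
    assert a.shape == (20,)
    return a

def idm_loss(pred, target):
    """Mean-squared error between predicted and ground-truth action vectors."""
    pred, target = np.asarray(pred), np.asarray(target)
    return float(np.mean((pred - target) ** 2))
```

The 3 + 6 + 1 layout per arm accounts for the stated 20 dimensions (2 × 10).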
The world-model (SurgWorld) loss is a flow-matching objective in latent video space, conditioned on both interpolated context frames and the encoded text description. In standard flow-matching form,

$$\mathcal{L}_{\mathrm{WM}} = \mathbb{E}_{\tau,\, z_0,\, z_1} \left[ \left\| v_\theta\!\left( z_\tau, \tau \mid c \right) - \left( z_1 - z_0 \right) \right\|_2^2 \right], \qquad z_\tau = (1 - \tau)\, z_0 + \tau\, z_1,$$

where $c$ is the initial context (conditioning frames and text embedding), $z_1$ the latent encoding of the target video, and $z_0$ a noise sample.
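A toy sketch of the standard flow-matching regression target follows, with a stand-in velocity field rather than the SurgWorld network:

```python
import numpy as np

def flow_matching_loss(v_theta, z0, z1, tau):
    """Standard flow-matching target: along the straight path
    z_tau = (1 - tau) * z0 + tau * z1, the target velocity is z1 - z0."""
    z_tau = (1.0 - tau) * z0 + tau * z1
    target = z1 - z0
    pred = v_theta(z_tau, tau)
    return float(np.mean((pred - target) ** 2))

# Stand-in "network" that always predicts zero velocity, so the loss
# reduces to mean((z1 - z0) ** 2); a real model would be trained to
# regress the velocity from (z_tau, tau) and the conditioning c.
zero_field = lambda z, tau: np.zeros_like(z)
```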
4. Dataset Structure, Statistics, and Examples
SATA clips are uniformly preprocessed to a resolution of 224×224 pixels, with frame rates standardized at 30 fps (raw source rates vary from 24 to 60 fps). Per-class clip counts follow the taxonomy in Section 1: needle grasping (689), needle puncture (989), suture pulling (475), and knotting (294).
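The frame-rate standardization to 30 fps can be illustrated with nearest-frame index resampling; this is a plausible sketch, not the released preprocessing code:

```python
import numpy as np

def resample_indices(n_src, src_fps, dst_fps=30.0):
    """Map output timestamps at dst_fps to nearest source frame indices.
    A clip of n_src frames at src_fps keeps its duration but is re-timed."""
    duration = n_src / src_fps
    n_dst = int(round(duration * dst_fps))
    t_out = np.arange(n_dst) / dst_fps          # output timestamps (s)
    idx = np.clip(np.round(t_out * src_fps).astype(int), 0, n_src - 1)
    return idx
```

For a 60 fps source this keeps every second frame; a 30 fps source passes through unchanged.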
Sample annotation entries integrate visual, pose, and linguistic details, e.g.:
- “suture_pulling”: “The right forceps pulls the suture through the anterior rectus sheath.”
- “needle_grasping”: “The left driver approaches and secures the needle tip at a 30° angle.”
This fine-grained alignment supports precise state–action supervision relevant for both human and machine policy learning.
5. Applications in Vision–Language–Action Policy Training
SATA is a core component in large-scale VLA policy development when combined with synthetic data. For policy training, the base model is the GR00T N1.5 vision–language–action transformer, trained with behavior cloning (incorporating a flow-matching head and cross-entropy on gripper states). Training uses both real robot teleoperation demonstrations (60 episodes) and synthetic (video, action) pairs from SurgWorld world-model rollouts processed by the IDM.
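A simplified sketch of the combined behavior-cloning objective follows; the flow-matching action head is reduced to direct regression here, and the scalar weighting `lam` is an assumption:

```python
import numpy as np

def bc_loss(pred_cont, tgt_cont, pred_grip_logit, tgt_grip, lam=1.0):
    """Behavior-cloning loss: regression on continuous action dimensions
    plus binary cross-entropy on the discrete gripper open/close state."""
    mse = np.mean((np.asarray(pred_cont) - np.asarray(tgt_cont)) ** 2)
    p = 1.0 / (1.0 + np.exp(-np.asarray(pred_grip_logit)))  # sigmoid
    tg = np.asarray(tgt_grip)
    eps = 1e-12
    bce = -np.mean(tg * np.log(p + eps) + (1 - tg) * np.log(1 - p + eps))
    return float(mse + lam * bce)
```

In the actual pipeline the continuous head is trained with flow matching rather than plain MSE; the sketch only shows how the two supervision signals combine.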
Empirical results highlight the value of SATA-based augmentation:
- Mean-squared error (MSE) per action dimension (on 40 held-out episodes):
- Real-only (finetuned on 5 demonstrations): baseline reference error
- Real + synthetic (1× rollout): ∼19% MSE reduction over the baseline
- Real + synthetic (10× rollout): ∼37% MSE reduction over the baseline
- Surgical robot task success (Needle Pickup & Hand-Over):
- Zero-shot: 0%
- Finetuned on 5 trajectories: 51.8%
- SATA-pretrained + SurgWorld finetune: 73.2%
A plausible implication is that SATA's expert-aligned action–text pairs, when combined with synthetic augmentation, yield significant gains in both low-level action accuracy and high-level task success rates over real-only policy learning.
6. Significance and Forward-Looking Implications
SATA addresses the data bottleneck in surgical robotics by establishing a robust action–text–video alignment framework validated by domain experts. It enables scalable VLA model training, with demonstrable improvements in both offline error metrics and real-world robotic task execution. The integration of pseudo-kinematics via inverse dynamics represents a practical route for capitalizing on large volumes of unlabeled surgical video.
SATA, in conjunction with SurgWorld and the IDM, exemplifies a model-driven approach to data-efficient, generalizable skill acquisition in surgical robotics. The comprehensive annotation quality and detailed kinematic alignment support a range of research in representation learning, policy transfer, and robust generalization to novel procedures. These results suggest the potential for SATA-based modeling to accelerate the transition from human teleoperation to autonomous physical-AI in surgery (He et al., 29 Dec 2025).