
Objective Structured Assessment of Technical Skills

Updated 24 January 2026
  • OSATS is a multi-domain framework that uses six domain-specific rating scales to objectively assess and quantify surgical skill in operative settings.
  • It employs dual-expert annotation and consensus protocols to generate reliable, procedure-agnostic performance labels.
  • Recent research integrates classical feature engineering and deep learning architectures to enable automated, real-time evaluation of surgical techniques.

The Objective Structured Assessment of Technical Skills (OSATS) is a validated, multi-domain framework for the standardized and objective assessment of operative skill in surgical settings. OSATS was developed to provide reproducible, procedure-agnostic ratings of technical competence, enabling both expert-guided evaluation and machine-driven quantification of performance. Contemporary research operationalizes OSATS as both the gold standard for ground-truth skill labels and as the basis for downstream supervised or weakly supervised learning in automated surgical skill assessment.

1. OSATS Framework: Domains, Scale, and Aggregation

OSATS structures the assessment of surgical skill as a vector of domain-specific ratings. Across the literature, a six-domain version is most prevalent, especially in datasets such as JIGSAWS and SAR-RARP50 (Anastasiou et al., 11 Sep 2025, Hu et al., 17 Jan 2026, Quarez et al., 2024). Each domain is scored by a trained expert on a discrete 1–5 Likert scale, with 1 denoting poor performance and 5 denoting excellence:

  • Respect for Tissue (RT): Atraumatic tissue handling, avoidance of unnecessary damage
  • Suture/Needle Handling (SNH): Efficiency, accuracy, and safety in suture and needle management
  • Time and Motion (TM): Economy, smoothness, and fluidity of operative motion
  • Flow of Operation (FO): Logical progression and continuity of surgical steps
  • Overall Performance (OP): Global impression of technical skill
  • Quality of Final Product (QFP): Integrity of the procedural outcome (e.g., secure knots, closure)

Let $s_d$ denote the integer score in domain $d$, for $d = 1, \ldots, 6$. The sum of the six domain scores defines the Global Rating Score (GRS):

$\text{GRS} = \sum_{d=1}^{6} s_d$

The GRS thus ranges from 6 to 30, though in practice the observed range may be more restricted (e.g., 19–30 in prostatectomy datasets (Anastasiou et al., 11 Sep 2025)). Some approaches retain the full 6-vector for fine-grained analysis (Hu et al., 17 Jan 2026, Quarez et al., 2024), whereas others aggregate GRS into binary labels, e.g. “proficient” (GRS 19–24) versus “expert” (GRS 25–30) to facilitate robust few-shot classification (Anastasiou et al., 11 Sep 2025).
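The aggregation and binarization described above can be sketched in a few lines. This is an illustrative helper, not code from any cited paper; the domain abbreviations follow the table in Section 1, and the GRS bins (19–24 proficient, 25–30 expert) follow Anastasiou et al.

```python
# Illustrative sketch: aggregate six OSATS domain scores into a GRS and the
# coarse two-class label used for few-shot classification. Names are invented.

OSATS_DOMAINS = ["RT", "SNH", "TM", "FO", "OP", "QFP"]

def grs(scores: dict) -> int:
    """Global Rating Score: sum of the six 1-5 domain ratings."""
    assert set(scores) == set(OSATS_DOMAINS), "all six domains required"
    assert all(1 <= s <= 5 for s in scores.values()), "Likert scores are 1-5"
    return sum(scores.values())

def binarize(g: int) -> str:
    """Map a GRS to the binary label used in the few-shot setting."""
    if 25 <= g <= 30:
        return "expert"
    if 19 <= g <= 24:
        return "proficient"
    raise ValueError(f"GRS {g} outside the observed 19-30 range")

trial = dict(zip(OSATS_DOMAINS, [4, 5, 4, 4, 5, 4]))
print(grs(trial), binarize(grs(trial)))  # 26 expert
```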

2. OSATS Annotation Protocols and Rater Agreement

Manual OSATS annotation remains labor-intensive and requires domain expertise. Contemporary protocols, as in (Anastasiou et al., 11 Sep 2025), employ dual-expert annotation: two robotic surgery specialists independently score each trial, followed by consensus adjudication to resolve discrepancies and establish a single "ground-truth" label vector per case. While formal inter-rater reliability metrics (e.g., Cohen's $\kappa$) are sometimes omitted, consensus adjudication yields a single reference standard suitable for model supervision. In the JIGSAWS dataset, all OSATS domains are typically annotated post hoc by a single cardiac surgeon (Hu et al., 17 Jan 2026), and these trial-level labels supervise regression or classification tasks.
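When both raters' scores are retained, agreement can be quantified with Cohen's $\kappa$, the metric mentioned above. A minimal stdlib implementation for nominal labels (the rater score lists below are invented for illustration):

```python
from collections import Counter

def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters scoring the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected by chance from each rater's marginals.
    """
    assert len(r1) == len(r2) and r1, "paired, non-empty ratings required"
    n = len(r1)
    p_o = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    p_e = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / n**2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical per-trial scores from two raters on one OSATS domain:
print(cohens_kappa([3, 4, 4, 5, 3, 4], [3, 4, 5, 5, 3, 4]))  # 0.75
```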

3. Classical Automation: Hand-Crafted Features and Entropy Methods

Initial efforts to automate OSATS scoring relied on hand-crafted time-series features from video and inertial sensor data (Zia et al., 2017). Feature construction encompasses:

  • Sequential Motion Texture (SMT): Statistical descriptors of temporal correlation structures.
  • Frequency Features: Discrete Cosine Transform (DCT) and Discrete Fourier Transform (DFT) representations, encoding repetitiveness and oscillatory content.
  • Entropy Features: Approximate Entropy (ApEn) quantifies the regularity within a univariate time series; Cross-Approximate Entropy (XApEn) captures asynchrony between channels.

Approximate Entropy for a univariate series $T$ is $\mathrm{ApEn}(m, r, \tau) = \Phi^m(r) - \Phi^{m+1}(r)$, with $\Phi^m(r)$ defined via the delay embedding of the series and counts of embedded vectors lying within tolerance $r$ of one another.
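The definition above can be made concrete with a short NumPy sketch following Pincus' standard formulation (self-matches counted, Chebyshev distance between embedded vectors); parameter defaults are illustrative, not those of any cited paper.

```python
import numpy as np

def apen(x, m=2, r=0.2, tau=1):
    """Approximate entropy ApEn(m, r, tau) = Phi^m(r) - Phi^{m+1}(r).

    x   : 1-D time series
    m   : embedding dimension
    r   : tolerance (absolute; often chosen as a fraction of the series std)
    tau : embedding lag
    """
    x = np.asarray(x, dtype=float)

    def phi(m):
        # Delay-embed: rows are vectors (x[i], x[i+tau], ..., x[i+(m-1)tau]).
        n = len(x) - (m - 1) * tau
        emb = np.array([x[i:i + (m - 1) * tau + 1:tau] for i in range(n)])
        # Chebyshev (max-coordinate) distance between all pairs of vectors.
        d = np.max(np.abs(emb[:, None, :] - emb[None, :, :]), axis=2)
        c = np.mean(d <= r, axis=1)  # fraction of vectors within tolerance r
        return np.mean(np.log(c))

    return phi(m) - phi(m + 1)
```

A perfectly regular series yields ApEn near zero, while white noise yields a large value, which is exactly the regularity contrast the feature exploits for skill assessment.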

These feature vectors feed into classical classifiers (e.g., 1-NN), achieving OSATS criterion-wise classification accuracies exceeding 90% on bench-model tasks when combining video-derived and accelerometer-based entropy features. Early fusion improves discriminability, especially when motion is highly synchronized (Zia et al., 2017).
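The 1-NN classifier with leave-one-out cross-validation used in this classical pipeline is simple enough to sketch directly; the toy feature vectors below are invented, standing in for the entropy/DCT features described above.

```python
import numpy as np

def loocv_1nn_accuracy(X, y):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier.

    X : (N, D) feature vectors (e.g., per-trial entropy/DCT features)
    y : (N,) class labels (e.g., novice / intermediate / expert)
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    # Pairwise Euclidean distances between all trials.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # exclude the held-out sample itself
    pred = y[np.argmin(d, axis=1)]
    return float(np.mean(pred == y))

# Two well-separated toy clusters classify perfectly under LOOCV:
print(loocv_1nn_accuracy([[0.0], [0.1], [5.0], [5.1]], [0, 0, 1, 1]))  # 1.0
```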

4. Deep Learning Architectures for OSATS Regression and Classification

Modern research has shifted toward deep learning, leveraging large-scale kinematic and video corpora to directly regress or classify OSATS domains (Wang et al., 2018, Hu et al., 17 Jan 2026, Quarez et al., 2024). Representative pipelines are as follows:

  • Parallel Convolutional and Recurrent Models (e.g., SATR-DL): A one-dimensional CNN models spatial inter-channel dependencies; a stacked GRU branch captures temporal evolution. Outputs fuse into multi-head classifiers, mapping to skill and task labels. Predicted probabilities are thresholded or mapped to the canonical 1–5 OSATS scale (Wang et al., 2018).
  • Vision-Transformer and Cross-Attention Models: ViT-based encoders pre-trained via masked reconstruction on unlabeled video form the backbone of systems for few-shot OSATS classification (Anastasiou et al., 11 Sep 2025).
  • Recursive Attention Networks: Recurrent Transformer-based models (e.g., ReCAP (Quarez et al., 2024)) process windowed kinematic data, maintain state via cross-attention blocks, and emit segment-level pseudo-label predictions for each OSATS domain. These predictions are averaged across segments under a weakly supervised regime, driven by trial-level labels.
  • Multimodal Fusion: Cross-modal architectures combine vision (CNN/ViT) and kinematic (LSTM/Transformer) features, often yielding improved stability and performance in real-time OSATS estimation (Hu et al., 17 Jan 2026).
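The parallel CNN-plus-GRU idea in the first bullet can be sketched in PyTorch. This is a minimal illustration of the two-branch, multi-head pattern, not the published SATR-DL architecture; all layer sizes and head dimensions are invented.

```python
import torch
import torch.nn as nn

class ParallelCNNGRU(nn.Module):
    """Sketch of a parallel spatial/temporal model with two output heads.

    A 1-D CNN branch models inter-channel structure, a stacked GRU branch
    models temporal evolution, and the fused features feed separate skill
    and task classifiers (illustrative sizes throughout).
    """

    def __init__(self, n_channels=8, n_skill=3, n_task=3, hidden=32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(n_channels, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over time -> (B, hidden, 1)
        )
        self.gru = nn.GRU(n_channels, hidden, num_layers=2, batch_first=True)
        self.skill_head = nn.Linear(2 * hidden, n_skill)
        self.task_head = nn.Linear(2 * hidden, n_task)

    def forward(self, x):  # x: (B, T, C) kinematic window
        c = self.cnn(x.transpose(1, 2)).squeeze(-1)  # spatial branch
        _, h = self.gru(x)                           # temporal branch
        z = torch.cat([c, h[-1]], dim=-1)            # late fusion
        return self.skill_head(z), self.task_head(z)

x = torch.randn(4, 100, 8)  # 4 windows, 100 time steps, 8 channels
skill_logits, task_logits = ParallelCNNGRU()(x)
print(skill_logits.shape, task_logits.shape)
```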

5. Quantitative Evaluation: Performance Metrics and Results

Evaluation regimes vary with task formulation:

Classification Settings

  • Accuracy, Precision, Recall, F1: For binarized "proficient" vs. "expert" OSATS GRS classes, few-shot models reach 1-shot/2-shot/5-shot accuracies of ≈60.2%, ≈66.0%, and ≈73.7%, with corresponding F1-scores up to 71.2% (Anastasiou et al., 11 Sep 2025).
  • Leave-One-Out Cross Validation (LOOCV): For entropy-based features, mean accuracy reaches 95.1% on video, 92.2% on knot-tying (Zia et al., 2017).
  • Per-domain Recall: Deep models (SATR-DL) attain class-wise recall 0.96 (novice), 0.77 (intermediate), 0.95 (expert), with trial-level accuracy up to 0.966 (Wang et al., 2018).

Regression Settings

  • Spearman's $\rho$, RMSE: For continuous OSATS domain regression, vision-only CNNs achieve $\overline{\rho} = 0.90$ and $\overline{\mathrm{RMSE}} = 0.43$ under LOSO CV (Hu et al., 17 Jan 2026). In multimodal fusion, complementary strengths are observed: vision captures subtle cues (tissue deformation), whereas kinematics excels at smoothness quantification.
  • Segment-level performance: ReCAP yields per-domain $\rho$ up to 0.95, an average OSATS $\rho$ of ≈0.87 (knot-tying), and a trial-level GRS $\rho$ of 0.85 (Quarez et al., 2024).
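The regression metrics above can be sketched with NumPy. Spearman's $\rho$ is the Pearson correlation of ranks; the version below assumes no tied scores (real OSATS labels often have ties, which would need average ranks). The sample arrays are invented.

```python
import numpy as np

def spearman_rho(a, b):
    """Spearman rank correlation; assumes distinct values (no tie handling)."""
    ra = np.argsort(np.argsort(a)).astype(float)  # ranks 0..n-1
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float(ra @ rb / np.sqrt((ra @ ra) * (rb @ rb)))

def rmse(a, b):
    """Root-mean-square error between predictions and labels."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.sqrt(np.mean((a - b) ** 2)))

pred = [2.1, 3.0, 3.9, 4.8]   # hypothetical predicted domain scores
true = [2.0, 3.0, 4.0, 5.0]   # hypothetical ground-truth scores
print(spearman_rho(pred, true), rmse(pred, true))
```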

6. Weakly-Supervised and Pseudo-Labeling Regimes

A recent trend leverages weak supervision to address the lack of segment-level ground truth. Models such as ReCAP (Quarez et al., 2024) constrain the average of segment-level outputs to match the observed trial-level OSATS scores, producing clinically meaningful pseudo-labels at the segment scale. Segment-wise predictions have been validated by surgical experts, with agreement rates significantly exceeding chance (77% vs. 69%, $p = 0.006$). This supports the use of such outputs for interpretable, real-time feedback during training.
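The core constraint, averaging segment predictions up to the trial level before comparing with the single observed label, reduces to a one-line pooled loss. This is an illustrative sketch of the weak-supervision objective, not the actual ReCAP loss; the MSE choice and array shapes are assumptions.

```python
import numpy as np

def trial_level_loss(segment_preds, trial_osats):
    """Weakly supervised objective: only the trial-level OSATS vector is
    observed, so segment-level predictions (the pseudo-labels) are pooled
    by averaging and compared against that single trial label.

    segment_preds : (S, 6) per-segment predictions for the six domains
    trial_osats   : (6,)   observed trial-level domain scores
    """
    segment_preds = np.asarray(segment_preds, dtype=float)
    trial_osats = np.asarray(trial_osats, dtype=float)
    pooled = segment_preds.mean(axis=0)            # aggregate segments -> trial
    return float(np.mean((pooled - trial_osats) ** 2))  # MSE over domains

# Two hypothetical segments whose average exactly matches the trial label:
segs = np.array([[3.0] * 6, [5.0] * 6])
print(trial_level_loss(segs, [4.0] * 6))  # 0.0
```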

7. Limitations, Recommendations, and Future Directions

Key limitations include the requirement for expert-labeled data, which constrains both temporal annotation fidelity and dataset scale. Single-label-per-trial annotation restricts real-time supervision—the collection of window-level or video chunk OSATS labels has been recommended (Hu et al., 17 Jan 2026). Dataset diversity remains a bottleneck; current benchmarks focus on bench-model tasks or specialized robot-assisted procedures, with generalization to complex operative settings requiring further validation (Zia et al., 2017, Anastasiou et al., 11 Sep 2025). For proxy-based approaches (e.g., SimSurgSkill Challenge (Zia et al., 2022)), the automated metrics do not fully capture domains such as respect for tissue or flow of operation and lack human adjudication.

This suggests that robust OSATS-based automation requires both richer, multi-modal sensing (video, kinematics, force) and expansion of annotation protocols to enable true real-time, domain-specific feedback. Furthermore, the evidence indicates that models trained on high-skill demonstrations generalize more effectively than those based on novice data, reflecting the consistent gesture motifs of expert performers (Hu et al., 17 Jan 2026).

