CAT-SG Dataset: Cataract Surgery Scene Graph

Updated 23 December 2025
  • The CAT-SG dataset is a large-scale, temporally annotated collection that captures detailed tool–tissue interactions and procedural dynamics in cataract surgery videos.
  • The dataset provides per-frame scene graph annotations with 29 object classes and 8 fine-grained surgical relations to model tool–tissue interactions and temporal dependencies.
  • It supports advanced AI-driven applications such as surgical training, intraoperative decision support, and workflow mining through robust baseline evaluations and quantitative benchmarks.

The Cataract Surgery Scene Graph (CAT-SG) dataset is a large-scale, temporally grounded resource for fine-grained modeling of the complex, dynamic workflows in cataract surgery. CAT-SG pioneers the structured annotation of tool–tissue interactions, procedural variations, and temporal dependencies by introducing per-frame, instance-level scene graphs that capture semantic relations among instruments and anatomical structures observed through high-resolution surgical microscope videos. Anchored on the CATARACTS benchmark, this dataset unifies spatial and semantic labels across 50 cataract procedures with rigorous temporal coherence, enabling advanced AI-driven analyses for surgical training, workflow recognition, and intraoperative decision support (Holm et al., 26 Jun 2025).

1. Data Composition and Acquisition

CAT-SG encompasses 50 complete cataract surgery recordings from Brest University Hospital, each captured at 1920×1080 resolution by a fixed surgical microscope. The dataset aggregates approximately 9.2 hours of operative footage (average 11 minutes per procedure, with individual durations spanning 6–40 minutes). Video annotation is standardized at 5 frames per second, balancing annotation density against practical effort, yielding a total of 164,162 annotated frames. The frame-rate reduction makes labeling tractable without forfeiting essential temporal fidelity for surgical events. The camera setup maintains consistent overhead lighting and field of view to optimize downstream model generalization. All included surgeries reflect real-world intraoperative variability but originate from a single site and device context, introducing controlled domain constraints (Holm et al., 26 Jun 2025).
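As a rough consistency check (a back-of-envelope estimate, not a figure from the source), 50 procedures averaging 11 minutes each, annotated at 5 fps, give

$$50 \times 11\,\text{min} \times 60\,\tfrac{\text{s}}{\text{min}} \times 5\,\tfrac{\text{frames}}{\text{s}} = 165{,}000\ \text{frames},$$

in line with the reported total of 164,162 annotated frames.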

2. Scene Graph Schema and Temporal Dynamics

Each annotated frame $t$ is represented by a dynamic scene graph,

$$G_t = (V_t,\, E_t),$$

where $V_t$ is the set of detected entities and $E_t \subset V_t \times V_t$ the set of semantic relations. Nodes $v \in V_t$ denote either anatomical structures (e.g., cornea, iris) or surgical instruments (e.g., phaco handpiece, Bonn forceps), and are enriched with:

  • Segmentation masks $\mathcal{M}_v$ and bounding boxes $\mathrm{bbox}_v = (x, y, w, h)$,
  • Type indicator $c_v \in \{\text{instrument}, \text{anatomy}\}$,
  • Instrument-specific operational states, and (optionally) tissue condition labels.

Edges $(v_i, v_j) \in E_t$ are semantically labeled with one of eight fine-grained surgical relations: Holding, Activation, Inserting, Retracting, Pulling, Cutting, Pushing, and Close to (denoting geometric adjacency). The full dataset sequence forms

$$\mathcal{G} = \{G_1, G_2, \dots, G_T\}$$

with explicit temporal edges

$$E_\mathrm{temp} = \{(v_{t,i}, v_{t+1,i}) \mid v_{t,i} \text{ and } v_{t+1,i} \text{ are the same object}\}$$

to support inter-frame tracking and spatiotemporal reasoning. This representation supports the modeling of nuanced surgical actions and workflow transitions, distinguishing CAT-SG from prior datasets that focus on isolated frame-level or phase annotations (Holm et al., 26 Jun 2025).
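The schema above maps naturally onto a small set of record types. The following is a minimal Python sketch of how per-frame graphs and temporal edges could be represented; the class and field names are hypothetical illustrations, not the dataset's actual API or file format:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

RELATIONS = ["Holding", "Activation", "Inserting", "Retracting",
             "Pulling", "Cutting", "Pushing", "Close to"]

@dataclass
class Node:
    node_id: int                                # stable per-object id, reused across frames
    category: str                               # "instrument" or "anatomy" (type indicator c_v)
    class_name: str                             # one of the 29 object classes, e.g. "phaco handpiece"
    bbox: Tuple[float, float, float, float]     # (x, y, w, h)
    mask: Optional[object] = None               # segmentation mask M_v (e.g., RLE or binary array)
    state: Optional[str] = None                 # operational state or tissue condition label

@dataclass
class Edge:
    subj: int                                   # node_id of the subject (typically an instrument)
    obj: int                                    # node_id of the object (instrument or anatomy)
    relation: str                               # one of RELATIONS

@dataclass
class FrameGraph:
    t: int                                      # frame index (at 5 fps)
    nodes: List[Node] = field(default_factory=list)
    edges: List[Edge] = field(default_factory=list)

def temporal_edges(g_t: FrameGraph, g_next: FrameGraph) -> List[Tuple[int, int]]:
    """Link nodes in consecutive frames that refer to the same object,
    i.e. E_temp = {(v_{t,i}, v_{t+1,i}) | same node_id in both frames}."""
    ids_next = {n.node_id for n in g_next.nodes}
    return [(n.node_id, n.node_id) for n in g_t.nodes if n.node_id in ids_next]
```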

3. Dataset Organization and Label Statistics

CAT-SG comprises 29 object classes: 19 distinct surgical instruments and 10 anatomical structures. The standard split allocates 30 videos for training, 10 for validation, and 10 for testing. Across 164,162 frames, the annotation process yields 1,811,252 labeled tool–tool and tool–anatomy relations. The relation distribution is dominated by “Close to” ($\approx$1,680,000 instances), with considerably fewer instances for Activation (44,552), Inserting (34,016), Retracting (23,886), Holding (13,380), Pulling (11,895), Pushing (3,874), and Cutting (1,925). To address this class imbalance, training pipelines sample video chunks likely to contain non-trivial relations and subsample negative (“none”) relation examples to prevent degenerate learning; a sketch of such a sampler appears below. Table 1 summarizes object class counts and example relations:

| Object Category | Count | Example Relations |
|---|---|---|
| Instruments | 19 | Holding, Cutting |
| Anatomical structures | 10 | Activation, Close to |

This organization facilitates balanced training for multi-class relation prediction under extreme label skew (Holm et al., 26 Jun 2025).
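The paper does not publish its sampling code; the following is a hedged sketch of the strategy described above, biasing chunk selection toward segments containing at least one rare (non-“Close to”) relation and subsampling the remaining negative-heavy chunks. The `pos_ratio` parameter and the non-triviality heuristic are assumptions for illustration:

```python
import random
from typing import List

def is_nontrivial(graph) -> bool:
    """Heuristic: a frame is 'non-trivial' if any edge carries a relation
    other than the dominant geometric 'Close to' label."""
    return any(e.relation != "Close to" for e in graph.edges)

def sample_chunks(video: List, chunk_len: int = 8, pos_ratio: float = 0.8,
                  n_chunks: int = 100, seed: int = 0) -> List[List]:
    """Sample fixed-length frame chunks, favoring those with rare relations."""
    rng = random.Random(seed)
    starts = list(range(0, len(video) - chunk_len + 1))
    pos = [s for s in starts
           if any(is_nontrivial(video[t]) for t in range(s, s + chunk_len))]
    pos_set = set(pos)
    neg = [s for s in starts if s not in pos_set]
    n_pos = min(len(pos), int(n_chunks * pos_ratio))
    n_neg = min(len(neg), n_chunks - n_pos)   # subsample 'none'-heavy chunks
    chosen = rng.sample(pos, n_pos) + rng.sample(neg, n_neg)
    return [video[s:s + chunk_len] for s in chosen]
```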

4. Annotation Protocol and Quality Assurance

Annotation involved nine trained student raters using a custom video-annotation platform to localize tool–anatomy pairs, assign bounding polygons, and label semantic relations at 5 fps. The initial 1,200 annotator-hours were followed by repeated review cycles led by senior clinical experts, ensuring correction of mislabels and resolution of ambiguities (especially short-duration interactions). Although a formal Cohen's $\kappa$ was not published, estimated inter-rater agreement exceeds 0.85 on a randomly selected subset of 5 videos. Final labels reflect majority vote and expert adjudication in ambiguous scenarios. This protocol establishes both granular precision in framewise annotation and clinical plausibility in ambiguous or edge cases (Holm et al., 26 Jun 2025).
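For reference, agreement on a doubly annotated subset could be estimated as follows; this is a generic sketch using scikit-learn, not the authors' evaluation code, and the example labels are fabricated for illustration:

```python
from sklearn.metrics import cohen_kappa_score

# Relation labels assigned by two raters to the same candidate tool-tissue pairs
rater_a = ["Close to", "Holding", "Cutting", "Close to", "Pushing"]
rater_b = ["Close to", "Holding", "Close to", "Close to", "Pushing"]

kappa = cohen_kappa_score(rater_a, rater_b)  # chance-corrected agreement in [-1, 1]
print(f"Cohen's kappa: {kappa:.2f}")
```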

5. Baseline Modeling Approaches and Evaluation Metrics

The CatSGG model establishes the baseline for automated scene graph generation over CAT-SG. The approach splits into two stages: (A) entity localization uses Mask2Former with a VideoSwin backbone, pretrained via the VALOR protocol on $\sim$2,900 cataract surgery videos, to segment per-frame object instances and generate query embeddings; (B) relation prediction forms all pairwise combinations of query vectors, $\mathrm{pair}(i,j) = [q_i; q_j]$, and predicts both edge existence (binary, with sigmoid output $e_{i,j}$) and the multi-class relation label via binary cross-entropy loss. The CatSGG+ variant pools embeddings over 8-frame chunks for improved temporal consistency. A minimal sketch of the relation head follows.
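The sketch below illustrates the pairwise relation predictor in PyTorch. Layer sizes, the shared MLP, and the joint existence/relation output are illustrative assumptions, not the published CatSGG configuration:

```python
import torch
import torch.nn as nn

class RelationHead(nn.Module):
    """Predicts edge existence and relation labels from pairs of query embeddings."""
    def __init__(self, d_query: int = 256, n_relations: int = 8, d_hidden: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * d_query, d_hidden), nn.ReLU(),  # pair(i,j) = [q_i; q_j]
            nn.Linear(d_hidden, 1 + n_relations),         # 1 existence logit + 8 relation logits
        )

    def forward(self, queries: torch.Tensor):
        # queries: (N, d_query) per-frame object query embeddings from the segmenter
        n = queries.size(0)
        qi = queries.unsqueeze(1).expand(n, n, -1)        # subject embedding q_i
        qj = queries.unsqueeze(0).expand(n, n, -1)        # object embedding q_j
        logits = self.mlp(torch.cat([qi, qj], dim=-1))    # (N, N, 1 + n_relations)
        e_ij = torch.sigmoid(logits[..., 0])              # edge-existence probability e_{i,j}
        rel_logits = logits[..., 1:]                      # trained with BCE (multi-label)
        return e_ij, rel_logits

# Usage: 5 detected objects in one frame
head = RelationHead()
e, rel = head(torch.randn(5, 256))   # e: (5, 5); rel: (5, 5, 8)
```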

Key metrics include:

  • Semantic segmentation: mIoU = 92.12% (across 29 classes).
  • Scene graph: micro- and macro-averaged F1 for relation types; CatSGG+ achieves macro-F1 ≈ 43.1% (compared to OracleSV at 34.7%).
  • Surgical phase recognition (19 phases): GATv2 on CAT-SG yields up to 78.6% accuracy, 70.2% F1 (30-frame window).
  • Technique recognition (“Stop & Chop” vs. “Divide & Conquer”): up to 68.8% accuracy, 48.4% F1 for a 10-second temporal window (5 fps) with spatial grounding.

These results indicate robust performance in entity segmentation and semantic relational reasoning, as well as practical relevance to downstream workflow and technique recognition (Holm et al., 26 Jun 2025).
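The phase-recognition baseline runs a GATv2 over the scene graphs. Below is a hedged sketch using PyTorch Geometric; the layer sizes, depth, and mean-pool readout are assumptions for illustration, not the published architecture:

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GATv2Conv, global_mean_pool

class PhaseGNN(nn.Module):
    """Classifies a (windowed) scene graph into one of 19 surgical phases."""
    def __init__(self, d_node: int = 64, d_hidden: int = 128, n_phases: int = 19):
        super().__init__()
        self.conv1 = GATv2Conv(d_node, d_hidden, heads=4, concat=False)
        self.conv2 = GATv2Conv(d_hidden, d_hidden, heads=4, concat=False)
        self.cls = nn.Linear(d_hidden, n_phases)

    def forward(self, x, edge_index, batch):
        # x: (num_nodes, d_node) node features; edge_index: (2, num_edges)
        h = torch.relu(self.conv1(x, edge_index))
        h = torch.relu(self.conv2(h, edge_index))
        return self.cls(global_mean_pool(h, batch))   # graph-level phase logits

# Usage: one graph with 6 nodes and 10 directed edges
x = torch.randn(6, 64)
edge_index = torch.randint(0, 6, (2, 10))
batch = torch.zeros(6, dtype=torch.long)              # all nodes belong to graph 0
logits = PhaseGNN()(x, edge_index, batch)             # shape (1, 19)
```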

6. Canonical and Emerging Applications

CAT-SG’s relational scene graphs underpin several advanced clinical and research applications:

  • Real-time intraoperative support, by detecting hazardous instrument–tissue proximities (e.g., phaco tip near the posterior capsule) and issuing surgeon alerts (see the sketch after this list).
  • Automated training and feedback, by evaluating instrument dynamics (such as pushing/pulling) against reference standards during nucleus splitting or tissue manipulation.
  • Workflow mining, extracting canonical interaction sequences for each surgical phase to inform operating room optimization and best-practices benchmarking.
  • Context-aware alert suppression, leveraging relation graphs to avoid false alarms from non-critical tool–tissue contacts.
  • Quantitative performance assessment, offering spatiotemporal metrics for individualized skill scoring and credentialing.
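As an illustration of the first application, the snippet below sketches a rule-based alert on top of predicted scene graphs, reusing the hypothetical FrameGraph structures from Section 2. The hazard pair list, the 1-second persistence threshold, and the class names in HAZARD_PAIRS are assumptions, not part of CAT-SG:

```python
# Hypothetical rule-based proximity alert over a stream of predicted scene graphs.
HAZARD_PAIRS = {("phaco handpiece", "posterior capsule")}  # (instrument, tissue)

def hazardous(frame_graphs, min_consecutive: int = 5) -> bool:
    """Alert if a hazardous 'Close to' pair persists for >= min_consecutive
    frames (at 5 fps, 5 frames = 1 s) to suppress transient false positives."""
    streak = 0
    for g in frame_graphs:
        names = {n.node_id: n.class_name for n in g.nodes}
        close = {(names[e.subj], names[e.obj])
                 for e in g.edges if e.relation == "Close to"}
        streak = streak + 1 if close & HAZARD_PAIRS else 0
        if streak >= min_consecutive:
            return True
    return False
```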

These use cases exemplify the utility of temporally coherent, relationally grounded datasets in augmenting both AI interpretability and clinical safety (Holm et al., 26 Jun 2025).

7. Limitations and Prospective Directions

CAT-SG’s coverage is constrained by several documented limitations:

  • Long-tail relation classes: Critical but rare interactions (e.g., “Cutting,” $n \approx 1{,}900$) provide too few examples for robust model generalization to low-frequency events.
  • Single-site and device origin: All data is sourced from one hospital and microscope model, producing domain constraints.
  • Modality absence: The dataset omits kinematic data, audio signals, and stereo-depth information that would enhance multi-modal modeling.
  • Suturing task exclusion: Fine manipulation phases such as suturing are not captured, reducing utility for some skill-transfer studies.

Proposed future directions include multi-institutional data aggregation, robotic kinematic integration, higher frame-rate annotation for fast maneuvers, and the addition of nuanced tissue state labels (e.g., capsule tears) to further expand annotation granularity (Holm et al., 26 Jun 2025).
