STANet: Smoke-Type-Aware Desmoking Network
- The paper introduces a novel deep learning architecture that explicitly distinguishes between diffusion and ambient smoke for enhanced desmoking performance in laparoscopic videos.
- It integrates type-specific mask segmentation and cross-attention–based disentanglement to effectively separate overlapping smoke effects and restore fine structural details.
- Empirical results using a large-scale synthetic dataset demonstrate superior restoration quality, segmentation accuracy, and improved generalization to real-world surgical scenarios.
Smoke-Type-Aware Laparoscopic Video Desmoking Network (STANet) is a purpose-built deep neural network architecture for real-time desmoking of laparoscopic videos, explicitly leveraging the spatio-temporal properties and diverse motion dynamics of different surgical smoke types. Unlike prior approaches that treat surgical smoke as a generic haze, STANet introduces a formal discrimination between Diffusion Smoke and Ambient Smoke, each exhibiting unique distribution and motion characteristics within surgical scenes. This explicit smoke-type awareness is operationalized both in its architecture—through type-specific mask segmentation, cross-attention–based disentanglement, and targeted feature processing—and in its use of the first large-scale synthetic laparoscopic video dataset with detailed smoke-type annotations. Empirical evidence demonstrates STANet’s superior restoration quality, segmentation accuracy, and generalization to real-world surgical tasks compared to state-of-the-art baseline methods (Liang et al., 2 Dec 2025).
1. Conceptual Foundation: Smoke-Type Decomposition
Surgical smoke generated by electrocautery or lasers impairs visual guidance in minimally invasive procedures. Empirical observation shows that smoke plumes in laparoscopy display two primary motion patterns:
- Diffusion Smoke: Characterized by turbulent, eddy-rich, locally concentrated plumes emanating from surgical tool tips at the moment of tissue cauterization. Spatially localized, with dynamic spatio-temporal patterns.
- Ambient Smoke: Comprises widespread, slowly drifting clouds that disperse across the cavity, primarily influenced by airflow and cavity pressure rather than active tissue manipulation.
These distinct types yield spatial, temporal, and perceptual variabilities not captured by conventional haze models. In practice, video frames may contain only Diffusion, only Ambient, or an entanglement—where both types coexist and spatially overlap.
2. Network Architecture and Key Modules
STANet comprises two principal sub-networks linked via type-specific guidance:
- Smoke Mask Segmentation Sub-network: Performs joint estimation of binary smoke regions and categorical prediction of smoke type for each frame, relying on an attention-weighted aggregation mechanism. The mask segmentation task is supervised by the per-frame, per-type ground truth from synthetic data.
- Coarse-to-Fine Disentanglement Module: Embedded within the segmentation branch, this module addresses the challenge of overlapping smoke types. It utilizes a smoke-type-aware cross attention mechanism to generate disentangled masks, refining the predictions by modeling correlations and spatial intersections between non-entangled (pure type) and entangled (co-occurring) regions.
- Smokeless Video Reconstruction Sub-network: Conditioned on the predicted masks, this branch reconstructs visually desmoked frames. Targeted feature-level desmoking is performed per smoke type, enabling removal strategies tailored to local physical propagation and plume characteristics.
This architectural synergy ensures that STANet not only demixes regions with entangled smoke but also recovers fine structural details critically needed for downstream clinical vision tasks (Liang et al., 2 Dec 2025).
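The cross-attention step at the heart of the disentanglement module can be illustrated schematically: query features from entangled regions attend to key/value features from pure-type regions, so entangled predictions are refined by their pure-type counterparts. The paper does not specify the module at this level of detail, so the following single-head NumPy sketch is an assumption-laden illustration of the mechanism, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_feats, kv_feats, d_k):
    """Single-head cross-attention: one token set attends to another.

    q_feats  : (N_q, d_k) features, here assumed to come from entangled regions.
    kv_feats : (N_kv, d_k) features, here assumed to come from pure-type regions.
    """
    scores = q_feats @ kv_feats.T / np.sqrt(d_k)   # (N_q, N_kv) similarities
    weights = softmax(scores, axis=-1)             # each row sums to 1
    return weights @ kv_feats                      # refined query features

# Two entangled-region tokens attend to four pure-type tokens.
q = np.ones((2, 4))
kv = np.eye(4)
out = cross_attention(q, kv, d_k=4)
```

In STANet this refinement is applied within the segmentation branch to produce the disentangled per-type masks.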
3. Synthetic Dataset: Smoke-Type-Specific Video Desmoking (STSVD)
The development and evaluation of STANet are supported by the STSVD dataset, designed to overcome the lack of real annotated smoky laparoscopic videos. Notable features include:
- Source Material: 720×1080 high-resolution smoke-free laparoscopic videos from Cholec80, M2CAI16, and the Hamlyn dataset.
- Physically Driven Simulation: Adopts a physics-based rendering engine (AALIDNet-derived) with wavelet-based turbulence from [Kim et al. 2008], cavity pressure modeling, and multi-dimensional parameter randomization for scene diversity.
- Surgical Tool-tip Alignment: The source location for Diffusion Smoke is detected by a small CNN (3 × Conv–ReLU–MaxPool, 2 × FC layers) pinpointing the electrocautery/laser tip in clean frames, used as the plume injection point.
- Soft Compositing Model: Synthetic smoke is composited as I_s = Aug((1 − w·M)·J + w·M·A), where J is the clean frame, M the smoke mask, A atmospheric light, w a per-frame transparency weighting, and Aug(·) a motion-blur post-process.
- Dataset Statistics: 120 video clips, each of 100 consecutive frames, for a total of 12,000 frames. Clips are approximately balanced among Diffusion, Ambient, and Entangled smoke types.
- Mask Annotations: For each frame, two floating-point masks (one for Diffusion smoke, one for Ambient smoke) and a one-hot clip-level smoke-type label.
A table comparing STSVD with prior synthetic smoke datasets highlights advances in temporal coherence, number of smoke types, and physically informed simulation:
| Dataset | Media | Res. | #Frames | #Smoke Types | Phys. Simulation | Tool-Tip Alignment |
|---|---|---|---|---|---|---|
| MARS-GAN | Image | 256×256 | 18,000 | 1 | ✗ | ✗ |
| PFAN | Image | 480×480 | 660 | 1 | ✗ | ✗ |
| PSv2rs | Image | 256×256 | 54,420 | 1 | ✗ | ✗ |
| STSVD | Video | 720×1080 | 12,000 | 3 | ✓ | ✓ |
4. Quantitative and Qualitative Properties of STSVD
- Pixel-wise Coverage: Ambient smoke covers a larger pixel fraction than Diffusion smoke; Entangled clips yield intermediate composite coverage.
- Mean Mask Size: Average area per connected mask is ≈0.2 million px for Diffusion and ≈0.5 million px for Ambient.
- Temporal Structure: In Entangled videos, Diffusion dominates the first 30 frames, with Ambient prevailing in the subsequent 70, mirroring electrocautery event timelines.
- Annotation Reliability and Synchronization: Masks are rendered, requiring no manual correction; synthetic smoky and clean frames are frame-aligned, facilitating supervised evaluation.
- Evaluation Metric: Downstream segmentation quality is measured using intersection-over-union (IoU): IoU = |P ∩ G| / |P ∪ G|, where P is the predicted smoke mask and G the ground-truth mask.
A plausible implication is that these detailed, high-quality annotations support robust smoke-type–specific network supervision and facilitate meaningful performance comparisons.
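The IoU metric for mask evaluation is straightforward to compute; the binarization threshold and the convention for two empty masks below are assumptions, not specified by the paper:

```python
import numpy as np

def iou(pred, gt, thresh=0.5):
    """Intersection-over-union of two masks, binarized at `thresh`."""
    p = pred > thresh
    g = gt > thresh
    union = np.logical_or(p, g).sum()
    if union == 0:
        return 1.0                        # both masks empty: count as agreement
    return np.logical_and(p, g).sum() / union
```

Because STSVD masks are floating-point, thresholding before computing IoU keeps the metric comparable across soft and hard predictions.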
5. Experimental Results and Model Generalization
Extensive evaluation demonstrates that the smoke-type distinction in both data and model fosters improvements over baseline desmoking techniques in multiple metrics and across diverse clinical tasks. The entire STSVD dataset is used for supervised model development and synthetic-domain testing. For external validation, two real video datasets—Vivo (paired real) and STSVD-R (unpaired real)—are utilized to assess transferability.
Key outcomes include:
- STANet achieves higher-quality desmoking and segmentation as quantified by standard metrics.
- The network exhibits superior generalization, notably in downstream surgical scene tasks, attributed to explicit disentanglement and type-specific guidance (Liang et al., 2 Dec 2025).
6. Access, Licensing, and Reuse Considerations
- Availability: The full dataset, including all 120 video clips (MP4, H.264), per-frame mask PNGs, metadata CSV, and compositing code, is downloadable at https://simon-leong.github.io/STSVD/.
- Licensing: No formal license is specified in the primary manuscript; researchers are advised to consult the project page or corresponding author for permitted uses. Common licensing schemes such as CC BY-NC or MIT-style may apply but are not guaranteed.
This infrastructural support, anchored by richly annotated, temporally coherent, type-aware video data, is intended to promote development and rigorous evaluation of advanced laparoscopic smoke removal and segmentation systems under clinically relevant, physically realistic conditions (Liang et al., 2 Dec 2025).