
FASL-Seg: Adaptive Surgical Segmentation

Updated 14 September 2025
  • The paper presents FASL-Seg, a transformer-based model that separates feature extraction into low-level and high-level streams for precise segmentation of surgical anatomy and instruments.
  • It employs LLFP and HLFP streams with multi-head self-attention and upsampling, achieving a notable 9% Dice score improvement compared to state-of-the-art methods.
  • The model demonstrates robust performance on EndoVis18 and EndoVis17 datasets, ensuring reliable and context-aware segmentation in minimally invasive surgeries.

A Feature-Adaptive Spatial Localization Model (FASL-Seg) is a transformer-based semantic segmentation architecture designed for detailed and context-aware partitioning of surgical scenes, specifically targeting robust anatomy and tool segmentation in robotic minimally invasive surgeries. FASL-Seg addresses the challenge of capturing both fine, low-level edge details and high-level contextual representations by structurally separating feature extraction into two distinct streams—Low-Level Feature Projection (LLFP) and High-Level Feature Projection (HLFP)—enabling precise and consistent delineation of surgical instruments and anatomical structures across multiple datasets and use cases. The model integrates multi-head self-attention for local-global context modeling and rigorous feature upsampling for spatial resolution alignment, realizing notable improvements over state-of-the-art methods in benchmark evaluations.

1. Model Architecture

FASL-Seg builds upon a hierarchical transformer backbone, specifically using the SegFormer encoder to extract multi-scale features. The architecture processes encoder outputs through two specialized branches:

  • Low-Level Feature Projection (LLFP) Stream: Processes high-resolution features from early encoder stages (blocks 1 and 2), retaining critical edge and boundary information. Each feature map F_i passes through a point-wise convolution (Conv_1), followed by batch normalization (BN) and Leaky ReLU (LR), forming a "ConvBlock":

ConvBlock(F_i) = LR(BN(Conv_1(F_i)))

Subsequently, a Multi-Head Self-Attention (MHSA) module is applied:

MHA(Q, K, V) = \text{Concat}(head_1, ..., head_h) W^O

head_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V

If required, feature maps are upsampled through an "Up Chain" operation:

\hat{F}_i = UpChain_n(MHSA(ConvBlock(F_i)))

  • High-Level Feature Projection (HLFP) Stream: Processes low-resolution, late-stage encoder features, focusing on semantic and contextual information. Feature maps are refined through a sequence of ConvBlocks ("ConvChain") and upsampling:

ConvChain_n(F_i) = ConvBlock_n(... ConvBlock_1(F_i))

\hat{F}_i = UpChain_n(ConvChain_n(F_i))
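The multi-head self-attention applied in the LLFP stream can be sketched in plain NumPy. This is a minimal illustration of the attention equations only, with toy token counts, dimensions, and random weights; the actual model applies MHSA to flattened convolutional feature maps:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mhsa(F, W_q, W_k, W_v, W_o, num_heads):
    """Multi-head self-attention over a flattened feature map.

    F: (n_tokens, d_model); W_*: (d_model, d_model) projection matrices.
    """
    n, d_model = F.shape
    d_k = d_model // num_heads
    Q, K, V = F @ W_q, F @ W_k, F @ W_v
    # Split each projection into heads: (num_heads, n_tokens, d_k)
    def split(X):
        return X.reshape(n, num_heads, d_k).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, applied per head
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k)
    heads = softmax(scores, axis=-1) @ Vh
    # Concat(head_1, ..., head_h) W^O
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
d = 8                                   # toy model width
tokens = rng.standard_normal((16, d))   # 16 spatial positions as tokens
W = [rng.standard_normal((d, d)) * 0.1 for _ in range(4)]
out = mhsa(tokens, *W, num_heads=2)     # shape (16, 8)
```

The per-head split and the final W^O projection follow the standard transformer formulation; all weights here are stand-ins for learned parameters.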

The outputs from LLFP and HLFP (\hat{F}_1 to \hat{F}_4) are concatenated:

\hat{F}_{EM} = \text{Concat}(\hat{F}_1, \hat{F}_2, \hat{F}_3, \hat{F}_4)

A shallow decoder, employing further ConvBlocks and bilinear upsampling, along with a final Laplacian convolution, produces the final segmentation mask.
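The fusion-and-decode step can be sketched as follows, assuming toy shapes, nearest-neighbour 2x upsampling in place of bilinear, and a single pointwise projection standing in for the shallow decoder (all names and shapes hypothetical):

```python
import numpy as np

def up2x(F):
    """Nearest-neighbour 2x upsampling (stand-in for bilinear upsampling)."""
    return F.repeat(2, axis=0).repeat(2, axis=1)

def up_chain(F, n):
    """UpChain_n: n successive 2x upsampling steps."""
    for _ in range(n):
        F = up2x(F)
    return F

rng = np.random.default_rng(0)
# Toy encoder outputs at strides 4/8/16/32 of a 64x64 input, laid out (H, W, C)
feats = [rng.standard_normal((64 >> s, 64 >> s, 8)) for s in (2, 3, 4, 5)]
# Upsample each stream output onto the stride-4 grid, then concatenate channels
aligned = [up_chain(F, n) for n, F in enumerate(feats)]
fused = np.concatenate(aligned, axis=-1)   # F_EM: (16, 16, 32)
# Pointwise projection to class logits, standing in for the shallow decoder
W_head = rng.standard_normal((32, 3))      # 3 hypothetical classes
logits = fused @ W_head                    # (16, 16, 3)
mask = logits.argmax(axis=-1)              # (16, 16) segmentation mask
```

The point of the sketch is the shape bookkeeping: every stream must be upsampled to a common resolution before channel-wise concatenation, which is what the Up Chain operations guarantee.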

2. Processing Streams

The architectural innovation lies in FASL-Seg's dual-stream approach, which routes feature maps of different resolutions to the stream best suited to their content:

  • The LLFP stream specializes in capturing and refining fine, low-level spatial features—instrument edges, tips, and small structures—where subtle detail is indispensable for accurate boundary delineation.
  • The HLFP stream consolidates higher-level semantic context, necessary for parsing larger entities such as anatomical organs and complete instruments, where global spatial consistency dominates.

This separation ensures that high-resolution detail is preserved and enhanced independently from broader contextual processing, mitigating the common tradeoff between edge fidelity and semantic correctness in traditional segmentation models.

3. Performance Metrics

Benchmarking on EndoVis18 and EndoVis17 surgical datasets demonstrates notable performance gains:

| Dataset / Task | FASL-Seg mIoU (%) | SOTA Improvement | Comments |
|---|---|---|---|
| EndoVis18, Parts & Anatomy | 72.71 | +5 | Robust per-class consistency |
| EndoVis18, Tool Segmentation | 85.61 | Best overall | Superior avg. per-class AP |
| EndoVis17, Tool Segmentation | 72.78 | Best overall | Competitive vs. transformer baselines |

FASL-Seg achieves an approximately 9% increase in Dice score over MedT and TransUNet baselines for anatomy/parts segmentation, with similarly robust performance for instrument segmentation. Per-class analysis shows high accuracy for instrument shaft and kidney parenchyma, while classes such as instrument clasper and covered kidney remain weaker and leave room for further optimization.
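For reference, the per-class Dice and IoU metrics quoted above can be computed from integer label masks as below; the tiny masks are illustrative, not drawn from EndoVis:

```python
import numpy as np

def per_class_dice_iou(pred, target, num_classes):
    """Per-class Dice and IoU from integer label masks of equal shape."""
    dice, iou = [], []
    for c in range(num_classes):
        p, t = pred == c, target == c
        inter = np.logical_and(p, t).sum()
        union = np.logical_or(p, t).sum()
        denom = p.sum() + t.sum()
        # Convention: score 1.0 when the class is absent from both masks
        dice.append(2 * inter / denom if denom else 1.0)
        iou.append(inter / union if union else 1.0)
    return np.array(dice), np.array(iou)

pred   = np.array([[0, 1, 1], [0, 2, 2], [0, 0, 2]])
target = np.array([[0, 1, 1], [0, 1, 2], [0, 0, 2]])
dice, iou = per_class_dice_iou(pred, target, 3)
# mIoU is the mean of the per-class IoU values; mDice likewise
```

Mean IoU (mIoU) is then the unweighted average of the per-class IoU values, which is the form reported in the benchmark table.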

4. Evaluated Use Cases

Three distinct segmentation tasks were assessed:

  • EndoVis18 Parts & Anatomy Segmentation: Simultaneous discrimination of instrument components (shaft, wrist, clasper) and anatomical elements (e.g., kidney parenchyma, small intestine).
  • EndoVis18 Tool Segmentation: Identification of tool types, providing high-resolution partitioning for instruments including Bipolar Forceps and Large Needle Driver.
  • EndoVis17 Tool Segmentation: Extension to a second dataset, demonstrating architectural robustness and transferability.

Consistent performance across these tasks demonstrates the efficacy of FASL-Seg's dual processing streams in handling both granular and contextual segmentation requirements in real surgical imaging scenarios.

5. Technical Details and Mathematical Formulation

Key architectural components and optimization details:

  • LLFP ConvBlock: ConvBlock(F_i) = LR(BN(Conv_1(F_i)))
  • Multi-Head Self-Attention: see the MHA equations above.
  • Upsampling via Up Chain: UpChain_n(F_i) = Up_n(... Up_1(F_i))
  • HLFP ConvChain: ConvChain_n(F_i) = ConvBlock_n(... ConvBlock_1(F_i))
  • Feature Fusion: \hat{F}_{EM} = \text{Concat}(\hat{F}_1, \hat{F}_2, \hat{F}_3, \hat{F}_4)
  • Loss Function: composite of the Tversky loss (L_{tversky}) and cross-entropy (L_{CE}):

L_{total} = \lambda L_{tversky} + (1 - \lambda) L_{CE}

L_{tversky} = 1 - \frac{TP}{TP + \alpha FP + \beta FN}

Here \lambda weights the two loss terms, while \alpha and \beta are the Tversky penalties on false positives and false negatives, respectively.

This configuration supports multi-scale decoding with precise boundary localization and rich contextual extraction.
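A minimal NumPy sketch of the composite loss follows. The Tversky loss is implemented here as one minus the Tversky index (its standard loss form); the hyperparameter values are illustrative, not the paper's:

```python
import numpy as np

def tversky_loss(probs, onehot, alpha=0.7, beta=0.3, eps=1e-6):
    """1 - Tversky index, averaged over classes.

    probs, onehot: (N, C) predicted class probabilities and one-hot targets.
    alpha penalizes false positives, beta false negatives (values illustrative).
    """
    tp = (probs * onehot).sum(axis=0)
    fp = (probs * (1 - onehot)).sum(axis=0)
    fn = ((1 - probs) * onehot).sum(axis=0)
    index = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return float(1 - index.mean())

def cross_entropy(probs, onehot, eps=1e-12):
    """Mean categorical cross-entropy over samples."""
    return float(-(onehot * np.log(probs + eps)).sum(axis=1).mean())

def total_loss(probs, onehot, weight=0.5):
    # Composite loss: weight * L_tversky + (1 - weight) * L_CE,
    # where `weight` balances the Tversky and cross-entropy terms
    return (weight * tversky_loss(probs, onehot)
            + (1 - weight) * cross_entropy(probs, onehot))

perfect = np.eye(3)[[0, 1, 2, 1]]       # 4 samples, 3 classes, exact one-hots
loss = total_loss(perfect, perfect)     # ≈ 0 for perfect predictions
```

Setting alpha > beta biases training against false positives; the reverse emphasizes recall, which is often preferable for thin structures such as instrument tips.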

6. Impact and Implications

FASL-Seg’s dual-stream transformer-based segmentation demonstrates substantial improvements in both global classification accuracy and fine boundary preservation, critical for surgical scene understanding. The model’s capacity to faithfully resolve anatomy and instruments can contribute to reliable surgical workflow analysis, automated skill assessment, context-aware guidance, and intraoperative decision support, furthering the integration of deep learning in clinical and training environments for minimally invasive surgery.

The approach also establishes a foundation for more sophisticated multi-stream and multi-scale models in medical imaging segmentation, where the reconciliation of low-level and high-level features remains a central challenge. The operational separation of feature domains, combined with selective self-attention, suggests avenues for enhanced expressivity and generalizability in future architectural variants.

7. Concluding Remarks

FASL-Seg leverages a principled multi-stream feature processing paradigm to resolve the core segmentation challenges in surgical imaging, outperforming previous state-of-the-art models and delivering consistent accuracy across instruments and anatomy. Its technical design and empirical results support ongoing research in adaptive transformer-based architectures, scalable to broader medical and industrial segmentation problems where both fine detail and contextual awareness are essential.