
Interactive Segmentation Techniques

Updated 3 July 2025
  • Interactive segmentation is a framework that combines user inputs like clicks, scribbles, and contours with algorithmic updates to accurately delineate objects.
  • It employs diverse interaction modalities and encoding methods such as binary maps and distance transforms to iteratively enhance segmentation masks.
  • Applications span medical imaging, dataset creation, and image editing, reducing annotation effort while improving precision in challenging visual contexts.

Interactive segmentation is a paradigm in computer vision that integrates human input directly into the segmentation process, enabling precise, efficient extraction of object or region masks via structured human–machine collaboration. This approach targets scenarios where fully automatic methods struggle—such as ambiguous boundaries, novel object classes, or domain shifts—and seeks to minimize user effort while maximizing segmentation quality and controllability.

1. Foundational Concepts and Principles

Interactive segmentation (IS) operates by iterating between algorithmic prediction and user input—such as clicks, scribbles, or contours—to guide and correct segmentation boundaries. Unlike fully automatic segmentation, IS explicitly models the loop wherein each user action leads to a machine update, which the user then reviews and further refines if necessary. This collaborative structure underpins both classic methods (graph cuts, level sets) and modern learning-based approaches.

Guidance signals from the user are encoded in various formats depending on the method: as binary maps (containing positive/negative clicks), distance transforms, geodesic maps, contour masks, or input channels appended to the image. The IS workflow involves the following high-level loop:

  1. The model predicts a segmentation mask for the input image, optionally using prior user annotations or interactions.
  2. The user inspects the result, then provides additional input (e.g., correcting errors by placing positive clicks inside missed objects, negative clicks to erase false positives, or drawing scribbles/contours for ambiguous regions).
  3. The model incorporates this input, updates the segmentation, and presents the refined mask to the user for further assessment.
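The three-step loop above can be sketched end-to-end in a few lines. This is a minimal illustration, not any paper's method: the thresholded score map stands in for a real segmentation model, the scripted "user" clicks one error pixel per round, and all function names are illustrative.

```python
import numpy as np

def predict(score):
    """Step 1: derive a binary mask from the current evidence map."""
    return score > 0.5

def apply_click(score, y, x, positive, radius=0, strength=1.0):
    """Steps 2-3: fold a corrective click into the evidence as a disk stamp
    (real systems use larger disks or Gaussian heatmaps; radius=0 keeps the
    toy loop deterministic)."""
    yy, xx = np.ogrid[:score.shape[0], :score.shape[1]]
    disk = (yy - y) ** 2 + (xx - x) ** 2 <= radius ** 2
    score[disk] += strength if positive else -strength
    return score

# Simulated session: click the first remaining error each round until the
# mask matches a ground-truth square.
gt = np.zeros((16, 16), dtype=bool)
gt[4:12, 4:12] = True
score = np.zeros(gt.shape, dtype=np.float32)
for _ in range(300):
    mask = predict(score)
    errors = mask ^ gt
    if not errors.any():
        break
    y, x = np.argwhere(errors)[0]
    score = apply_click(score, int(y), int(x), positive=bool(gt[y, x]))
print(np.array_equal(predict(score), gt))  # True
```

The structure mirrors the loop exactly: predict, let the user correct, fold the correction back in, repeat until the user is satisfied.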

2. User Interaction Modalities and Encodings

A central design aspect of IS lies in the nature and encoding of user input. Research has demonstrated that the user can guide the segmentation via multiple interaction modes:

  • Clicks: Discrete points labeled as object (positive) or background (negative), typically encoded as Gaussian heatmaps or disk masks. Click-based approaches are standard due to their simplicity but may be laborious for small or complex objects.
  • Scribbles: Freeform strokes providing broader cues, often more expressive and efficient for complex regions (1709.03450).
  • Contours: Loose or tight closed curves encapsulating a region, enabling rapid selection of single or multiple objects (2302.06353). Contour input reduces effort by matching the accuracy of many clicks with a single gesture.
  • Multi-gesture and Context-Free Interfaces: Recent works support mixtures of clicks, scribbles, lassos, or rectangles without requiring the user to specify intent explicitly (2307.10518).

Interaction signals are transformed into input tensors for the network; for example, click maps are appended as extra channels alongside the RGB image in deep models. In methods inspired by feedback control theory, user corrections are modeled as impulsive control inputs, which provides formal stability guarantees (1606.08008).
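The click-map encoding described above can be sketched as follows; the image size, the sigma value, and the two-channel (positive/negative) layout are illustrative choices, not a specific paper's configuration.

```python
import numpy as np

def click_heatmap(shape, clicks, sigma=5.0):
    """Render a list of (y, x) clicks as one Gaussian heatmap channel."""
    h, w = shape
    heat = np.zeros((h, w), dtype=np.float32)
    yy, xx = np.mgrid[:h, :w]
    for y, x in clicks:
        d2 = (yy - y) ** 2 + (xx - x) ** 2
        heat = np.maximum(heat, np.exp(-d2 / (2 * sigma ** 2)))
    return heat

def build_input(rgb, pos_clicks, neg_clicks, sigma=5.0):
    """Stack RGB + positive map + negative map into one (H, W, 5) tensor."""
    h, w, _ = rgb.shape
    pos = click_heatmap((h, w), pos_clicks, sigma)
    neg = click_heatmap((h, w), neg_clicks, sigma)
    return np.dstack([rgb, pos, neg])

rgb = np.random.rand(64, 64, 3).astype(np.float32)
x = build_input(rgb, pos_clicks=[(20, 20), (40, 30)], neg_clicks=[(5, 60)])
print(x.shape)  # (64, 64, 5)
```

A network then consumes the stacked tensor directly; disk masks or distance transforms would replace `click_heatmap` without changing the overall shape of the pipeline.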

3. Algorithmic Frameworks and Model Architectures

The development of IS algorithms spans from graph-based to deep learning methods, with recent approaches integrating transformers and continual learning. Key frameworks include:

  • Graph-based and Label Propagation Methods: Early and robust IS techniques model the image as a graph (pixels, superpixels) with labels propagated from user seeds via random walks or networks with the small-world property, yielding high accuracy from sparse scribbles while remaining computationally efficient (1901.02573).
  • Attributed Relational Graphs: Modeling both the image and the user-specified region of interest as attributed relational graphs, with segmentation cast as a graph matching problem that seeks to minimize structure-preserving deformation cost (0805.1854).
  • CNN/FCN-based Approaches: Fully convolutional networks integrate user input as auxiliary channels, iteratively refining segmentation as more corrections arrive (1709.03450). User-model-based iterative training further aligns model behavior with actual user correction patterns (1805.04398).
  • Vision Transformers: Transformer backbones, due to their ability to model long-range dependencies, can encode both image and interaction information effectively. Some models transfer guidance from exemplars (already segmented objects) to speed up multi-object segmentation in the same image (2406.11472).
  • Gaussian Process Classification: Treating IS as GP-based pixel-wise binary classification allows explicit, theoretically-guaranteed label propagation, making predictions at user clicks correct by construction and enabling efficient linear-time inference (2302.14578).
  • Quasi-Conformal Mapping and Topology Preservation: Ensures that interactive corrections preserve desired topology in segmentation results, crucial for medical or scientific imaging (2402.14695).

4. Robustness, Adaptation, and Efficiency

IS must remain robust to variable user input and domain shifts:

  • Test-Time Adaptation (TTA): Methods such as DC-TTA partition user clicks into coherent subsets and adapt individual model replicas per subset, merging their specialized knowledge for improved handling of complex (e.g., camouflaged or multi-part) objects (2506.23104).
  • Continual Learning: Models like RAIS maintain a stable global backbone while quickly adapting local parameters in response to user feedback and domain changes, balancing robustness with plasticity (2210.10984).
  • Robustness Evaluation: Recent benchmarks (e.g., TETRIS) explicitly measure sensitivity to user click location by generating both optimal and adversarial click sequences using white-box attacks, emphasizing that high benchmark scores may not translate into real-world robustness (2402.06132).
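The best-case/worst-case comparison above reduces to a simple metric: the gap between the areas under the two IoU-versus-clicks curves. A minimal sketch, with illustrative numbers rather than output of a real white-box attack:

```python
def auc(iou_curve):
    """Area under the IoU-vs-clicks curve via the trapezoidal rule."""
    return sum((a + b) / 2 for a, b in zip(iou_curve, iou_curve[1:]))

best = [0.60, 0.80, 0.90, 0.93, 0.95]   # well-placed (optimal) clicks
worst = [0.30, 0.45, 0.55, 0.62, 0.70]  # adversarially placed clicks
gap = auc(best) - auc(worst)            # large gap = brittle model
print(round(gap, 3))
```

A model that scores well only under optimal clicks will show a large gap, which is exactly the failure mode such benchmarks are designed to expose.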

Efficiency advances include real-time architectures (e.g., InterFormer) that decouple heavy image encoding (performed once, offline or on a server) from lightweight online fusion of user input, enabling deployment on low-power devices (2304.02942). User interaction is further minimized through strategies such as exemplar transfer (multi-object IS), contour-based interfaces, and diversified seed proposals for swipe gestures on touch devices (1812.07260).

5. Evaluation, Benchmarks, and Practical Applications

IS methods are assessed by metrics such as:

  • NoC@IoU: Average number of user actions (clicks, scribbles) required to reach a specified IoU threshold, typically 85% or 90%.
  • Dice coefficient (DSC), mean IoU (mIoU): Overlap metrics commonly reported.
  • Robustness metrics: Difference between area under best- and worst-case IoU–click curves (2402.06132).
  • RICE: A relative improvement metric that quantifies correction rather than just overlap (2307.10518).
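The NoC@IoU metric above is straightforward to compute given per-image IoU curves. In this sketch the curves are hard-coded stand-ins for running a real model under simulated clicks; the 20-click cap is a common but configurable convention.

```python
import numpy as np

def iou(pred, gt):
    """Standard intersection-over-union between two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

def noc(iou_curve, threshold=0.90, max_clicks=20):
    """Clicks needed to reach the threshold; max_clicks if never reached."""
    for k, v in enumerate(iou_curve, start=1):
        if v >= threshold:
            return k
    return max_clicks

print(iou(np.ones((2, 2), dtype=bool), np.eye(2, dtype=bool)))  # 0.5

# Per-image IoU after each successive click:
curves = [
    [0.55, 0.80, 0.92],        # reaches 0.90 at click 3
    [0.70, 0.91],              # reaches 0.90 at click 2
    [0.40, 0.60, 0.75, 0.85],  # never reaches -> capped at 20
]
scores = [noc(c) for c in curves]
print(sum(scores) / len(scores))  # mean NoC@90 over the dataset
```

Note how the cap dominates the mean: a single hard image can outweigh many easy ones, which is why NoC is usually reported alongside overlap metrics.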

Benchmarks span natural images (COCO, Pascal VOC, SBD, Berkeley, GrabCut), multi-object and video segmentation datasets, medical imaging (BraTS, CT/MRI), and noisier or specialized domains (e.g., camouflaged objects).

Applications include:

  • Medical annotation: Reducing expert effort for organ, tumor, or lesion marking (2311.13964).
  • Large-scale dataset creation: Enabling rapid, flexible annotation for supervised learning.
  • Image/video editing: Assisting in precise cut-out and compositing tasks.
  • Mobile and fieldwork settings: Touchscreen-optimized interfaces and low-compute pipelines.

6. Ongoing Challenges and Future Directions

The field remains dynamic, with several prominent challenges and research avenues:

  • Domain Adaptation and Generalization: Ensuring out-of-domain performance with minimum user correction, leveraging continual/adaptive or prompt-based architectures (2210.10984).
  • Multi-gesture, Context-Agnostic Interaction: Developing universal models robust to various gesture types with or without explicit user intent signals (2307.10518).
  • Multi-object and Exemplar Transfer: Transferring knowledge between objects within the same category to minimize repetition and user effort (2406.11472).
  • Standardization and Benchmarking: There is a recognized need for unified protocols, metrics (including user-centric and robustness measures), and public baselines, particularly in specialized domains like medical imaging (2311.13964).
  • Annotation-minimal and Unsupervised Learning: Exploiting self-supervised features and simulated region hierarchies to train IS models without manual masks, achieving strong results with no human labeling (2303.13399).

7. Summary Table: Key Interactive Segmentation Paradigms

| Paradigm | Core Mechanism | Notable Properties |
| --- | --- | --- |
| Graph-based propagation | Label diffusion on graphs, often in 2 stages | Minimal input; efficient, scalable, topology-agnostic |
| CNN/Transformer-based iterative refinement | Deep learning with auxiliary user-input channels | High accuracy; adaptable; supports complex cues |
| Exemplar-based transfer for multi-object IS | Transfers knowledge from one mask to related objects | Saves labor when segmenting similar objects |
| Test-time adaptation & divide-and-conquer | Online optimization per subset of user cues | Robustly handles complex or conflicting corrections |
| Robustness-evaluated / adversarial IS | Stress-testing with adversarial and user-like inputs | Directly quantifies real-world reliability |
| Unsupervised / self-supervised IS | Trains on simulated regions, no manual labels | High label efficiency; rapid deployment |

References

  • Noma et al., A New Algorithm for Interactive Structural Image Segmentation (0805.1854)
  • Duan et al., TETRIS: Towards Exploring the Robustness of Interactive Segmentation (2402.06132)
  • Myers-Dean et al., Interactive Segmentation for Diverse Gesture Types Without Context (2307.10518)
  • Shi, Liu et al., Learning from Exemplars for Interactive Image Segmentation (2406.11472)
  • You Huang et al., InterFormer: Real-time Interactive Image Segmentation (2304.02942)
  • Qi et al., QIS: Interactive Segmentation via Quasi-Conformal Mappings (2402.14695)
  • Yang et al., IFSENet: Harnessing Sparse Iterations for Interactive Few-shot Segmentation Excellence (2403.15089)
  • GPCIS authors, Interactive Segmentation as Gaussian Process Classification (2302.14578)
  • Shi et al., Multi-granularity Interaction Simulation for Unsupervised Interactive Segmentation (2303.13399)
  • Reinke et al., Deep Interactive Segmentation of Medical Images: A Systematic Review and Taxonomy (2311.13964)
  • Zhang et al., RAIS: Robust and Accurate Interactive Segmentation via Continual Learning (2210.10984)
  • Mahadevan et al., Iteratively Trained Interactive Segmentation (1805.04398)
  • Li et al., Divide-and-Conquer Framework for Test-Time Adaptation of Interactive Segmentation (2506.23104)

This corpus of research demonstrates both the rapid evolution of IS and the continued centrality of user–algorithm synergies in real-world segmentation workflows.
