
In-Context Segmentation (ICS)

Updated 2 April 2026
  • In-Context Segmentation (ICS) is a paradigm that leverages a few annotated support examples to guide segmentation across diverse tasks without updating model parameters.
  • ICS architectures, as exemplified by the Iris framework, integrate encoders, task-encoding modules, and query-based decoders to fuse context and query features for effective segmentation.
  • ICS achieves robust performance with high Dice scores and emergent anatomical clustering, streamlining adaptation across various in-distribution and out-of-distribution imaging tasks.

In-Context Segmentation (ICS) is a paradigm that enables segmentation models to adapt to highly diverse, heterogeneous tasks and anatomical targets by conditioning on a few input–output example pairs, typically at inference time, with no model parameter updates. This approach leverages the ability to encode task-specific information from annotated reference examples, guiding segmentation on completely novel objects, structures, or modalities across both in-distribution and out-of-distribution settings. ICS departs fundamentally from conventional segmentation, which requires training or fine-tuning for each task, by learning from reference context—either densely represented (full masks) or in weak form (boxes, points)—and supports flexible strategies such as one-shot inference, in-context ensembling, and retrieval. ICS generalizes the notion of in-context learning (ICL) from NLP and vision-LLMs to the domain of visual segmentation, with technical realizations in both 2D and 3D images, as well as 3D point clouds and complex medical imaging domains (Gao et al., 25 Mar 2025).

1. Foundational Principles and Problem Definition

ICS is formally defined as follows: given a query image x_q (e.g., a 3D CT or MRI volume) and a context set S = {(x_i, y_i)}_{i=1}^L of L support examples, each a reference image and its annotation, the segmentation model predicts a mask ŷ_q = f_θ(x_q, S) that is semantically aligned to the target structure(s) defined by the context. The context can be a single annotated pair (one-shot), multiple pairs (few-shot), or a mixture of annotation types (full masks, boxes, points). Importantly, the model parameters θ remain fixed during inference; all adaptation is achieved through the encoded context.
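The definition above can be sketched as a small inference function: support pairs are encoded into a task representation while the model's weights stay frozen. This is an illustrative sketch, not the Iris implementation; the `model` dict and its `encode`, `encode_task`, and `decode_mask` entries are hypothetical stand-ins for the components described below.

```python
import numpy as np

def in_context_segment(model, x_q, support):
    """Hypothetical ICS inference: prediction = f_theta(x_q, S).
    All adaptation comes from the encoded support set S; no model
    parameters are updated at inference time."""
    # Encode each support pair (x_s, y_s) into a task embedding, then
    # average them (one-shot when len(support) == 1, ensemble otherwise).
    embeddings = [model["encode_task"](x_s, y_s) for x_s, y_s in support]
    task = np.mean(embeddings, axis=0)
    # Decode a query mask conditioned on the task embedding.
    return model["decode_mask"](model["encode"](x_q), task)
```

Any concrete encoder/decoder pair with these signatures can be dropped in; the point is that the task is carried entirely by the embedding, not by the weights.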

ICS is motivated by the need for universal segmentation models capable of segmenting arbitrary anatomical structures, pathological lesions, or object categories without retraining for each new target. This enables robust generalization to new datasets, imaging modalities, class sets, and even previously unseen anatomical or pathological entities (Gao et al., 25 Mar 2025).

2. Model Architecture and Task Encoding

The core architectural challenge in ICS is distilling the task-specific information from a set of context pairs and injecting it into the segmentation pipeline for the query. The Iris framework (Gao et al., 25 Mar 2025) provides a canonical realization in the context of 3D medical imaging:

  • Encoder (E): A standard 3D U-Net with residual blocks maps each input volume to spatial feature maps F(x) ∈ ℝ^{C×d×h×w}.
  • Task-Encoding Module (T): This module receives a reference image–mask pair (x_s, y_s) and produces a compact task embedding T ∈ ℝ^{(m+1)×C}. It comprises:
    • A foreground pooling stream capturing local ROI detail: the reference features F(x_s) are pooled under the mask y_s into a single C-dimensional foreground token.
    • A contextual-attention stream for global context: m learnable query tokens cross-attend to the full reference feature map, producing m context tokens.
    • The final embedding for one support concatenates the foreground token with the m context tokens; for a reference annotated with K classes, the per-class embeddings T_1, …, T_K are concatenated.

  • Mask-Decoding Module (D): A query-based decoder, inspired by Mask2Former, fuses the query image features F(x_q) and the task embedding T via a bidirectional cross-attention layer, in which the task tokens attend to the query features and the query features attend back to the task tokens.

The decoder predicts a binary mask ŷ_q^{(k)} for each class k encoded in the context.
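The task-encoding step can be sketched in numpy. Masked average pooling is an assumed concrete form of the foreground-pooling stream, and the `context_tokens` argument stands in for the output of the attention stream; this is not the verbatim Iris computation, only a shape-faithful sketch of the (m+1)×C embedding described above.

```python
import numpy as np

def foreground_pool(feat, mask, eps=1e-6):
    """Mask-weighted average pooling: collapse a C x d x h x w feature
    map to a single C-dim foreground token using the reference mask.
    (Assumed form of the local-ROI stream.)"""
    C = feat.shape[0]
    f = feat.reshape(C, -1)                        # C x (d*h*w)
    w = mask.reshape(1, -1)                        # 1 x (d*h*w)
    return (f * w).sum(axis=1) / (w.sum() + eps)   # C

def task_embedding(feat, mask, context_tokens):
    """Concatenate the pooled foreground token with the m context tokens,
    giving the (m+1) x C task embedding T for one class."""
    fg = foreground_pool(feat, mask)[None, :]      # 1 x C
    return np.concatenate([fg, context_tokens], axis=0)  # (m+1) x C
```

For a K-class reference, calling `task_embedding` once per class mask and concatenating the results mirrors the per-class concatenation described above.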

By decoupling encoding from decoding, the architecture allows plug-and-play context strategies and efficient inference, supporting new, unseen tasks without retraining (Gao et al., 25 Mar 2025).

3. Training Objectives and Learning Strategies

ICS typically employs an episodic training regime to mimic the in-context inference protocol:

  • Episodic Sampling: Each training step samples a reference–query pair or context set from available data.
  • Loss Function: The principal objective is the sum of Dice and cross-entropy (CE) losses computed between the prediction ŷ_q and the ground-truth mask y_q of the query:

    L = L_Dice(ŷ_q, y_q) + L_CE(ŷ_q, y_q),

    with L_Dice = 1 − 2|ŷ_q ∩ y_q| / (|ŷ_q| + |y_q|) and L_CE the voxel-wise cross-entropy −Σ_v y_q(v) log ŷ_q(v).
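A minimal numpy version of this episodic objective, using the standard soft-Dice and binary cross-entropy forms (equal weighting between the two terms is assumed here):

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    # Soft Dice: 1 - 2|pred ∩ target| / (|pred| + |target|)
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def ce_loss(pred, target, eps=1e-6):
    # Voxel-wise binary cross-entropy, averaged over the volume
    p = np.clip(pred, eps, 1.0 - eps)
    return float(-(target * np.log(p) + (1 - target) * np.log(1 - p)).mean())

def ics_loss(pred, target):
    # Episodic objective on the query prediction: Dice + CE
    return dice_loss(pred, target) + ce_loss(pred, target)
```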

  • Task Decoupling: By separating task encoding (from context) and segmentation, different strategies such as context ensembling, object-level retrieval, and in-context tuning become tractable:
    • One-shot inference: Use a single context embedding for any number of queries.
    • Context ensemble: Average embeddings from multiple references to improve generalization.
    • Object-level retrieval: Identify the most similar reference embedding for each target class in a query.
    • In-context tuning: Adapt embeddings iteratively with gradients on a small corpus without altering the model weights.

Through these strategies, the system can exploit multiple supports, adapt to unusual cases, or improve segmentation for difficult queries (Gao et al., 25 Mar 2025).
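Two of the strategies above, context ensembling and object-level retrieval, reduce to simple operations on task embeddings. The cosine-similarity criterion below is an assumption for illustration; the paper may use a different similarity measure.

```python
import numpy as np

def ensemble_embedding(embeds):
    """Context ensemble: average the task embeddings obtained from
    multiple reference pairs to improve generalization."""
    return np.mean(np.stack(embeds), axis=0)

def retrieve_embedding(query_embed, candidates):
    """Object-level retrieval: return the candidate reference embedding
    most similar to a query-side embedding (cosine similarity assumed)."""
    def cos(a, b):
        return float(a.ravel() @ b.ravel()
                     / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return max(candidates, key=lambda c: cos(query_embed, c))
```

Because the model weights never change, both strategies are pure inference-time choices over how the context is turned into an embedding.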

4. Quantitative and Qualitative Performance

ICS, as instantiated in the Iris framework, achieves strong empirical performance:

  • In-distribution tasks (12 datasets): Iris achieves a mean Dice of 84.52%, matching or surpassing task-specific models such as nnU-Net (83.18%), CLIP-driven universal (84.18%), UniSeg (84.40%), and Multi-Talent (84.47%). In-context baselines (SegGPT, UniverSeg, Tyche-IS) typically achieve only 57–61% (Gao et al., 25 Mar 2025).
  • Out-of-distribution generalization (held-out datasets): On previously unseen domains and classes, Iris attains Dice scores of 86.45% (ACDC), 82.77% (SegTHOR), 64.44% (CSI-inn), 89.13% (CSI-opp), 47.78% (CSI-fat), 28.28% (MSD Pancreas Tumor), and 69.03% (Pelvic bones), exceeding universal and in-context competitors by 10–20 percentage points.
  • The architecture generalizes well to both cross-modality and cross-institutional data, exhibiting robust adaptation without the need for target-specific fine-tuning.

Qualitative analyses reveal that the learned embeddings not only yield accurate masks but also group anatomically related structures across diverse datasets and modalities. For example, abdominal organs and vascular structures are consistently clustered, indicating that the model's embeddings reflect fundamental biomedical relationships without explicit anatomical supervision (Gao et al., 25 Mar 2025).

5. Task Embedding Analysis and Anatomical Structure Discovery

A distinctive outcome of the Iris framework is the emergent anatomical organization within its task embeddings:

  • t-SNE and other projections of per-class task embeddings naturally cluster similar organs, vessels, and tissue types without supervised taxonomy.
    • Abdominal organs (liver, spleen, kidneys) form modality-independent clusters.
    • Vascular structures (e.g., IVC, portal vein) are tightly grouped, reflecting shared morphology and imaging behavior.
    • Soft tissue regions like bladder and prostate are positioned adjacently in the embedding space due to their physical proximity and similar imaging profiles.

This discovery underscores the model's capacity to autonomously learn clinically relevant relationships and may support automated ontology generation or meta-analytic studies, bypassing the need for manual anatomical labeling (Gao et al., 25 Mar 2025).
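The clustering observation can be probed directly on saved per-class task embeddings via pairwise cosine similarity. This is an illustrative numpy sketch; the embedding matrix and class names in the test are synthetic, not values from the paper.

```python
import numpy as np

def cosine_sim_matrix(E):
    """Pairwise cosine similarities between row-wise class embeddings.
    High off-diagonal values indicate anatomically related clusters."""
    En = E / (np.linalg.norm(E, axis=1, keepdims=True) + 1e-12)
    return En @ En.T

def nearest_class(E, names, i):
    """Name of the class whose embedding is closest to class i."""
    sims = cosine_sim_matrix(E)[i].copy()
    sims[i] = -np.inf  # exclude self-similarity
    return names[int(np.argmax(sims))]
```

A dimensionality-reduction step such as t-SNE over the same matrix reproduces the qualitative cluster plots described above.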

6. Impact, Limitations, and Future Directions

ICS frameworks, exemplified by Iris, represent a substantial methodological advance in segmentation:

  • Clinical and research relevance: By obviating the need for extensive task-specific training, ICS reduces annotation and adaptation costs in clinical workflows, accelerating deployability and facilitating personalized or rare-disease segmentation.
  • Flexibility and scalability: The modular task encoding and decoding approach scales readily to new data domains, anatomies, or segmentation challenges.
  • Limitations: Residual caveats include diminished accuracy for extremely out-of-distribution or poorly represented targets, and runtime/memory constraints when ensembling over large context sets.
  • Extensions: Future research directions involve:
    • Extending to weak supervision (boxes, points), further lowering annotation costs.
    • Incorporating automatic context selection and active learning for optimal support set construction.
    • Scaling up to multi-modal and multi-task medical image analysis, leveraging emergent anatomical awareness for integrated disease modeling.

ICS provides both a unifying abstraction for generalist segmentation—encompassing classical, few-shot, interactive, and weakly supervised settings—and a practical, end-to-end solution with compelling accuracy and anatomical interpretability for medical imaging (Gao et al., 25 Mar 2025).
