
Visual In-Context Learning (VICL)

Updated 27 November 2025
  • Visual In-Context Learning (VICL) is a framework where vision models adapt to new tasks at inference by leveraging a few input–output image pairs without updating model parameters.
  • It utilizes generalist representations from pretraining along with prompt construction and fusion techniques to perform multi-task visual reasoning effectively.
  • VICL has demonstrated practical benefits in areas like medical imaging and restoration, significantly cutting data annotation demands and enhancing task adaptability.

Visual In-Context Learning (VICL) refers to the capacity of large, typically frozen, vision or vision–language models to rapidly solve new, previously unseen tasks at inference time by conditioning on a small set of input–output demonstration pairs—known as the “context” or “prompt”—without any parameter updates. The model leverages generalist representations, often acquired via pretraining, to infer task semantics and execute predictions solely from the visual context. VICL extends the successful in-context learning paradigm of LLMs into the vision domain, enabling “learning by example” at inference time across a wide array of visual tasks.

1. Foundations and Motivation

VICL is defined formally as a function $f_\theta$ (with frozen weights $\theta$), which, given a context set $C = \{(x_i, y_i)\}_{i=1}^{K}$ of $K$ input–output image pairs and a query image $x_q$, produces an output $\hat{y}_q = f_\theta(C, x_q)$. This task-programming paradigm is in contrast to traditional supervised learning, where each visual task demands a separate, task-specific dataset and model retraining or fine-tuning. The prime motivation for VICL lies in its ability to:

  • Adapt to new tasks or domains with minimal annotation and no model retraining.
  • Drastically reduce data and annotation demands in settings such as medical imaging, under-represented demographics, or rapidly evolving domains.
  • Provide a unified interface for multitask vision problem-solving, mimicking human learning from analogical demonstrations.

Recent studies have shown that models such as SegGPT, Painter, and various UNet derivatives trained with masked image modeling (MIM) or multi-task objectives can perform VICL across a spectrum of tasks from segmentation to generative image transformation (Kumar et al., 2023, Negrini et al., 18 Jun 2025, Chen et al., 2023).
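
To make the formal interface above concrete, the following is a minimal sketch of $\hat{y}_q = f_\theta(C, x_q)$ as code; the function name, types, and backbone callable are illustrative assumptions rather than any specific model's API.

```python
# Minimal sketch of the VICL inference interface: a frozen backbone is conditioned
# on K demonstration pairs plus a query, with no gradient updates of any kind.
from typing import Callable, List, Tuple
import numpy as np

Image = np.ndarray                     # H x W x C array
Context = List[Tuple[Image, Image]]    # K (input, output) demonstration pairs

def vicl_predict(frozen_model: Callable[[Context, Image], Image],
                 context: Context,
                 query: Image) -> Image:
    """Return y_hat_q = f_theta(C, x_q); adaptation comes purely from the context."""
    return frozen_model(context, query)
```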

2. VICL Architectures and Mechanisms

The standard VICL pipeline involves a model backbone—often a transformer or large encoder–decoder architecture—pretrained on masked prediction or inpainting. Prompts are typically encoded as one or several image–label pairs (sometimes including additional modalities such as text), stitched together with the query image into a larger composite (e.g., 2×2 or 1×K layouts). The prompt structure is central:

  • Prompt Construction: Images and labels from the support set are concatenated or “stitched” spatially, often with a blank mask channel for the query, into a single tensor.
  • Inference: The model processes the composite through its encoder–decoder layers to predict the missing part (e.g., a segmentation mask for the query).
  • No Gradient Update: All layers are frozen; adaptation occurs purely via conditioning on the prompt.
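
As an illustration of the stitching step above, the sketch below arranges one demonstration pair and the query into a 2×2 composite with a blank slot for the prediction; the exact layout and masking conventions vary between models, so this is an assumption rather than a specific model's format.

```python
# Sketch of 2x2 prompt stitching: demo input and label on the top row, query input
# and a blank (to-be-inpainted) slot on the bottom row. A frozen inpainting backbone
# reconstructs the blank quadrant, which is read out as the query's prediction.
import numpy as np

def stitch_2x2(demo_input: np.ndarray, demo_label: np.ndarray,
               query_input: np.ndarray) -> np.ndarray:
    """All inputs are H x W x C arrays of the same shape; returns a 2H x 2W x C composite."""
    blank = np.zeros_like(query_input)                       # masked slot for the prediction
    top = np.concatenate([demo_input, demo_label], axis=1)   # demonstration row
    bottom = np.concatenate([query_input, blank], axis=1)    # query row
    return np.concatenate([top, bottom], axis=0)
```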

Recent advances have introduced several architectural innovations, most notably in how prompts are selected, fused, and condensed; these are detailed in the sections that follow.

3. Prompt Selection: Strategies and Challenges

VICL performance is highly sensitive to which examples are selected as in-context prompts. Empirical studies demonstrate a performance variation of over 70 percentage points in segmentation mIoU depending solely on prompt choice (Zhang et al., 2023). Several approaches address prompt selection:

  • Unsupervised Retrieval: Nearest neighbor search in embedding space (e.g., CLIP, DINO) by cosine similarity to the query achieves robust, if not always optimal, prompt selection (Zhang et al., 2023, Sun et al., 2023).
  • Supervised Retrieval: Training a contrastive retriever to directly maximize downstream VICL performance outperforms unsupervised methods, especially in semantically or spatially diverse datasets.
  • Global Ranking Frameworks: Partial2Global introduces transformer-based list-wise ranking over subsets of candidate prompts, with a consistency-aware aggregator fusing partial preferences into a globally coherent ranking; RH-Partial2Global adds reliability filtering and combinatorial coverage for robust, statistically justified selection (Xu et al., 24 May 2024, Wu et al., 30 Sep 2025).
  • Task-Level Prompts: Empirical findings indicate that a single “task-level” optimal prompt composition serves the majority of test samples, allowing selection search to be amortized over all test queries, reducing computational cost by >98% compared to per-sample search (Zhu et al., 15 Jan 2025).
  • Collaboration and Condensation: Rather than competing for a single best example, Condenser fuses multiple spatially aligned prompts via patch-wise attention, enabling efficient, resolution-preserving context aggregation (Wang et al., 30 Apr 2025).
  • Mitigating Over-reliance: PANICL smooths assignment scores across k patch-level neighbors from multiple prompts to reduce bias and instability induced by dependence on a single prompt (Zhang et al., 26 Sep 2025).

| Prompt Selection Method | Core Algorithm | Main Advantage |
|---|---|---|
| Unsupervised Retrieval | Embedding similarity | Simple, effective |
| Supervised Retrieval | Contrastive performance tuning | Task-optimized |
| Partial2Global / RH-P2G | List-wise/certified ranking | Global, reliable |
| Task-Level Prompting | One prompt per task | Highly efficient |
| Prompt Condensation | Patch-wise cross-attention | Preserves resolution |
| PANICL (Patch k-NN) | Smoothing over top-k neighbors | Robust, training-free |
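
As a concrete illustration of the unsupervised retrieval strategy in the table above, a minimal sketch follows; it assumes frozen image embeddings (e.g., from CLIP or DINO) are already available, and the function name is hypothetical.

```python
# Sketch of unsupervised prompt retrieval: rank candidate demonstrations by cosine
# similarity between their frozen embeddings and the query embedding, and keep the
# top-k as in-context prompts.
import numpy as np

def select_prompts(query_feat: np.ndarray,       # (D,) query embedding
                   candidate_feats: np.ndarray,  # (N, D) embeddings of the candidate pool
                   k: int = 1) -> np.ndarray:
    """Return the indices of the k candidates most similar to the query."""
    q = query_feat / np.linalg.norm(query_feat)
    c = candidate_feats / np.linalg.norm(candidate_feats, axis=1, keepdims=True)
    sims = c @ q                                 # cosine similarities, shape (N,)
    return np.argsort(-sims)[:k]
```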

4. Prompt Fusion, Multi-Prompt Strategies, and Collaboration

Earlier approaches treated prompt selection as a competitive process, selecting only one “best” example or ensembling predictions from multiple forward passes. Recent VICL research emphasizes early fusion and information aggregation:

  • Prompt Fusion: Arranging the prompt and query in all valid 2×2 grids and ensembling predictions increases mIoU by several points (Sun et al., 2023).
  • Patch-wise Cross-Attention (Condenser): At each spatial location, K prompts are fused via attention over corresponding patches, integrating complementary cues (e.g., texture from one prompt, shape from another) without resolution loss, outperforming both single-prompt and ensembling approaches in accuracy and efficiency (Wang et al., 30 Apr 2025).
  • PANICL’s Patch k-NN Smoothing: Rather than relying on ensemble predictions or downsampling, assignment scores for each patch are averaged across their nearest neighbors in the prompt pool, robustly boosting accuracy and stability (Zhang et al., 26 Sep 2025).
  • End-to-End Feedback: Losses on the backbone’s token prediction task allow learning-based condensers to select and fuse prompts adaptively, a significant improvement over naive pooling or voting.
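
A rough sketch of patch-wise cross-attention fusion, in the spirit of Condenser, is shown below; the shapes, similarity scores, and softmax weighting are illustrative assumptions rather than the published architecture.

```python
# Sketch of patch-wise prompt fusion: at each spatial location, features from K
# prompts are combined with attention weights derived from their similarity to the
# query's patch feature at that same location, preserving full spatial resolution.
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_prompts(query_patches: np.ndarray,     # (P, D) query patch features
                 prompt_patches: np.ndarray     # (K, P, D) patch features of K prompts
                 ) -> np.ndarray:
    """Return (P, D) fused context features, one per patch location."""
    # Similarity of each prompt's patch to the query patch at the same location.
    scores = np.einsum("pd,kpd->pk", query_patches, prompt_patches)   # (P, K)
    weights = softmax(scores, axis=-1)                                # attention over prompts
    return np.einsum("pk,kpd->pd", weights, prompt_patches)
```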

A plausible implication is that collaboration via cross-prompt fusion is a dominant paradigm for maximizing VICL performance, especially as the number of available context examples or the diversity of potential tasks increases.
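
Similarly, the patch k-NN smoothing used by PANICL (described above) can be sketched as follows; the score representation and averaging rule are assumptions for illustration, not the paper's exact formulation.

```python
# Sketch of patch k-NN smoothing: rather than taking each query patch's assignment
# from its single nearest prompt patch, average assignment scores over the top-k
# nearest patches drawn from the whole prompt pool.
import numpy as np

def smoothed_assignments(query_patches: np.ndarray,  # (P, D) query patch features
                         pool_patches: np.ndarray,   # (M, D) patches from all prompts
                         pool_scores: np.ndarray,    # (M, C) per-patch assignment scores
                         k: int = 5) -> np.ndarray:
    """Return (P, C) smoothed assignment scores for each query patch."""
    sims = query_patches @ pool_patches.T            # (P, M) similarities
    topk = np.argsort(-sims, axis=1)[:, :k]          # k nearest pool patches per query patch
    return pool_scores[topk].mean(axis=1)            # average their scores
```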

5. Domain Extensions and Generalization

VICL has demonstrated flexibility in a range of domains:

  • Medical Imaging: VICL models, such as Retinalizer, can, with a small set of context pairs, solve a variety of retinal OCT tasks—boundary detection, multi-class fluid segmentation, denoising, and generative inpainting—without retraining. Random recoloring induces adaptation to unseen color schemes, enhancing generalization to new vendors and shifted domains (Negrini et al., 18 Jun 2025).
  • Restoration and Weather Modeling: AWRaCLe leverages paired context (degraded and clean images) to guide test-time restoration, using dedicated context extraction and fusion blocks to outperform previous SOTA restoration networks in all-weather evaluation (Rajagopalan et al., 30 Aug 2024).
  • Vision–LLMs: In LVLMs, transforming visual demonstrations into concise task intent summaries, coupled with retrieval and semantic reranking, enables VICL to boost few-shot performance by up to 40 percentage points. Information flow analysis confirms these summaries anchor textual context and mitigate representation gaps (Zhou et al., 18 Feb 2024). Meta-trained LMs can transfer in-context learning mechanisms to vision–language tasks without further weight updates (Monajatipoor et al., 2023).
  • Video In-Context Learning: Large autoregressive Transformers trained on the next-token prediction of video frames acquire the zero-shot capability to perform semantic video imitation by prepending demonstration clips, with effectiveness scaling with model size (Zhang et al., 10 Jul 2024).
  • Diffusion Models and Analogical Reasoning: Self-attention cloning and cross-attention masking in adapted diffusion models (e.g., Analogist) enable accurate analogical reasoning across editing, enhancement, and translation tasks without retraining or slow text-prompt generation (Gu et al., 16 May 2024).
  • Cross-Task VICL: T2T-VICL demonstrates that vision–LLMs can perform cross-task in-context generalization by generating and selecting implicit textual prompts that capture the difference between two distinct tasks, unlocking generalized cross-task reasoning; evaluation blends pixel-based and semantic-aware scoring (Xia et al., 20 Nov 2025).

6. Limitations, Open Challenges, and Future Directions

Current VICL approaches face several open challenges:

  • Context Sensitivity: Despite advanced selection and fusion, performance remains highly example-sensitive. Under significant distribution shift, gains diminish—even sophisticated selection (e.g., Partial2Global) saturates (Xu et al., 24 May 2024).
  • Computational and Memory Efficiency: While fusion and condensation approaches scale better than ensembles, large prompt pools and multi-layer attention still increase overhead. Efficient sampling, task-level prompting, and sparse attention mechanisms are under active development (Xu et al., 24 May 2024, Zhu et al., 15 Jan 2025, Wang et al., 30 Apr 2025).
  • Robustness: PANICL and cross-validation filtering address over-reliance and unreliable prompts, but generalization to arbitrary, out-of-distribution, open-vocabulary, or completely non-visual tasks (e.g., joining text with image context) remains to be fully addressed (Zhang et al., 26 Sep 2025, Wu et al., 30 Sep 2025, Xia et al., 20 Nov 2025).
  • Prompt Design Automation: Automated, reliable, and explainable prompt design remains partially unsolved. Strategies such as reliability filtering (via conformal prediction), covering design-based sampling, and semantic intent summarization are emerging, but scaling these to large, multi-modal contexts or dynamically evolving tasks is an open question (Zhou et al., 18 Feb 2024, Wu et al., 30 Sep 2025, Xia et al., 20 Nov 2025).
  • Extension to Multimodal and Document-Level Contexts: The challenge of scaling in-context length in multimodal models spawns approaches like VisInContext, which compresses text into visual tokens and shows complementarity with long-context attention methods (Wang et al., 4 Jun 2024).

Anticipated future research directions include: adaptive block sizes for sampling (Wu et al., 30 Sep 2025), continual and online adaptation, integration with meta-learned or cross-modal prompt selectors (Negrini et al., 18 Jun 2025, Monajatipoor et al., 2023), hierarchical and cluster-based condensation strategies (Wang et al., 30 Apr 2025), and explicit balancing of semantic and pixel-fidelity in prompt-driven reasoning (Xia et al., 20 Nov 2025).

7. Impact and Empirical Benchmarks

VICL has fundamentally expanded the operational envelope of vision and vision–language foundation models:

  • Few-Shot and Zero-Shot Learning: Models such as SegGPT, with only K=2 examples, surpass U-Nets trained on hundreds of samples in specialized medical segmentation (Kumar et al., 2023).
  • Robustness and Adaptivity: Test-time tuning (VICT) significantly reduces performance degradation under 15 common corruptions, and can match or outperform few-shot fine-tuning up to k=64 for segmentation (Xie et al., 27 Mar 2025).
  • Medical Imaging Flexibility: Retinal VICL models support arbitrary task composition, with context recoloring driving adaptation to new class+color pairings (Negrini et al., 18 Jun 2025).
  • Efficiency Gains: Task-level prompts yield near-optimal sample-level performance at <2% of the compute, and Condenser enables multi-prompt aggregation with sublinear memory overhead (Zhu et al., 15 Jan 2025, Wang et al., 30 Apr 2025).
  • Cross-Task Generalization: Implicit prompt generation (T2T-VICL) delivers strong cross-task transfer performance, with semantic-aware scores (VIEScore) validating that correct task inference and operation—not just pixel accuracy—are attained (Xia et al., 20 Nov 2025).

These empirical advances are summarized in the following table:

| Model/Method | Domain | Key Benchmark | SOTA Metric (VICL, best) |
|---|---|---|---|
| SegGPT (VICL) | Med. Derm. | Eczema Seg., mIoU | 36.7 (K=2) vs 32.6 (U-Net) |
| AWRaCLe | Restoration | Rain100H, PSNR/SSIM | 27.20 / 0.840 |
| Retinalizer-Rec | Med. Retina | Fluid Seg., IoU | 53.6 vs 38.5 (vanilla) |
| Prompt-SelF | Seg/Det | Pascal-5ᶦ, mIoU | 41.0 (beats meta-learning) |
| Partial2Global | Seg/Det | Pascal-5ᶦ, mIoU | 38.4 → 42.7 (w/ voting) |
| Condenser | Seg/Det | Pascal-5ᶦ, mIoU | 44.1 → 46.6 (K=16) |
| PANICL | Seg/Det | Pascal-5ᶦ, mIoU | 35.9 → 37.9 |
| T2T-VICL | Cross-Task | Task transfer, VIEScore | up to +1.4 (semantic gain) |

This body of evidence establishes VICL as a central method for unlocking rapid “few-shot” and analogical capability in vision and multimodal models, with a growing suite of theoretically principled and empirically validated tools for prompt selection, fusion, and generalization.


References: (Kumar et al., 2023, Rajagopalan et al., 30 Aug 2024, Negrini et al., 18 Jun 2025, Chen et al., 2023, Zhang et al., 2023, Xu et al., 24 May 2024, Wu et al., 30 Sep 2025, Wang et al., 30 Apr 2025, Zhu et al., 15 Jan 2025, Zhang et al., 26 Sep 2025, Zhou et al., 18 Feb 2024, Sun et al., 2023, Xia et al., 20 Nov 2025, Zhang et al., 10 Jul 2024, Xie et al., 27 Mar 2025, Monajatipoor et al., 2023, Wang et al., 4 Jun 2024, Gu et al., 16 May 2024)
