Visual In-Context Learning (V-ICL)

Updated 3 July 2026

Visual In-Context Learning (V-ICL) is a paradigm where models adapt to visual and multimodal tasks using demonstration pairs without updating their parameters.
It leverages optimized demonstration selection and prompt fusion strategies to compose diverse context sequences and enhance task generalization.
Unified transformer architectures with interleaved token sequences, sparse self-attention, and mixture-of-experts blocks underpin its effective multimodal processing.

Visual In-Context Learning (V-ICL) extends the in-context learning (ICL) paradigm, first observed in large language models (LLMs), to vision and multimodal tasks. In V-ICL, a model adapts to new visual or vision-language tasks at inference time by conditioning on a small set of demonstration examples, without updating its parameters. Two factors are central to V-ICL’s empirical performance and generalization: how the context sequence (prompts and demonstrations) is constructed, and how the architecture encodes and uses visual information. V-ICL spans a continuum from purely visual models operating on images or spatial representations to vision-language models handling arbitrary multimodal sequences of images, captions, and complex reasoning instructions.

1. Foundational Principles and Unified V-ICL Formulation

The canonical V-ICL setup defines a task whereby a frozen model is presented with $k$ demonstration pairs—each a combination of a visual input (e.g., image, image patch, segmentation mask) and a corresponding output (e.g., mask, caption, label)—followed by a query input. The objective is to autoregressively predict the correct output for the query, solely by composing the provided demonstrations in context.

A general formalization is:

Given $k$ demonstrations $\{(x^{(i)}_\text{vis}, y^{(i)})\}_{i=1}^k$ and a query $x^{(k+1)}_\text{vis}$, the model generates $y^{(k+1)}$ from the conditional distribution

$P_\theta\big(y^{(k+1)} \mid x^{(k+1)}_\text{vis}, \{(x^{(i)}_\text{vis}, y^{(i)})\}_{i=1}^k\big)$

where $\theta$ denotes the frozen model parameters.

Standard approaches proceed by quantizing images and (if present) text into a unified discrete token space via VQGAN (for images, masks) and BPE tokenization (for text), then embedding these as interleaved token sequences (Sheng et al., 2023).

The unified V-ICL transformer architecture, as exemplified by recent works, operates on interleaved sequences of embedded tokens—spanning both modalities—using sparse attention mechanisms and mixture-of-experts (MoE) feed-forward blocks, with all outputs generated by the same decoder stack (Sheng et al., 2023). In context, the model is evaluated by simply prepending the demonstration pairs to the query in the constructed sequence and generating the required output, with no adaptation or re-weighting.
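The prepend-and-generate procedure described above can be illustrated with a minimal sketch of how an interleaved prompt might be assembled. The vocabulary layout, the marker ids `BOI` and `EOI`, and the helper names are assumptions made for illustration; they are not taken from the cited work.

```python
from typing import List, Sequence, Tuple

# Hypothetical shared vocabulary (an assumption): ids below TEXT_VOCAB are
# text/BPE ids, and image codebook indices are shifted above that range.
TEXT_VOCAB = 1000
BOI, EOI = 0, 1  # hypothetical reserved begin/end-of-image marker ids


def image_span(codes: Sequence[int]) -> List[int]:
    """Shift VQ codebook indices into the shared vocabulary and add markers."""
    return [BOI] + [TEXT_VOCAB + c for c in codes] + [EOI]


def build_prompt(
    demos: Sequence[Tuple[Sequence[int], Sequence[int]]],
    query_codes: Sequence[int],
) -> List[int]:
    """Concatenate k (input, output) demonstrations, then the query input.

    Each demonstration is (image codebook indices, output token ids already
    expressed in the shared vocabulary). The model would continue this
    sequence autoregressively to produce the query's output.
    """
    seq: List[int] = []
    for x_codes, y_tokens in demos:
        seq += image_span(x_codes)
        seq += list(y_tokens)
    seq += image_span(query_codes)
    return seq


prompt = build_prompt(demos=[([3, 4], [10, 11])], query_codes=[5])
print(prompt)
```

No parameters change in this procedure: only the token sequence fed to the frozen decoder differs from task to task.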

2. Demonstration Selection, Prompt Construction, and Retrieval Strategies

The performance of V-ICL is highly sensitive to how in-context demonstrations are selected, ordered, and composed within the prompt. Early approaches relied on k-nearest neighbor (kNN) retrieval in pretrained image-embedding spaces (e.g., CLIP, ViT, DINOv2): the feature similarity between the query and each pool element is computed, and the most similar elements are selected (Sun et al., 2023; Foster et al., 2023). However, similarity-based retrieval can produce redundant contexts, particularly in regression or attribute-diverse tasks.
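As a rough illustration of similarity-based retrieval, the sketch below ranks a demonstration pool by cosine similarity to the query embedding. It assumes the embeddings have already been computed by some pretrained encoder; the function name `knn_select` is hypothetical.

```python
import numpy as np


def knn_select(query_emb: np.ndarray, pool_embs: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k pool items most similar to the query.

    query_emb: shape (d,); pool_embs: shape (n, d). Similarity is cosine.
    """
    q = query_emb / np.linalg.norm(query_emb)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = p @ q                              # cosine similarity per pool item
    order = np.argsort(-sims, kind="stable")  # descending similarity
    return order[:k]


pool = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([1.0, 0.1])
print(knn_select(query, pool, k=2))
```

Because the ranking depends only on closeness to the query, near-duplicate pool items can all be selected together, which is the redundancy issue noted above.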

Recent research reframes demonstration selection as an optimization or sequential decision-making problem. The Learning to Select Demonstrations (LSD) framework formulates demonstration selection as a finite-horizon Markov Decision Process, training a dueling Deep Q-Network to compose demonstration sets that optimize downstream MLLM performance via marginal error reduction (Lee et al., 24 Mar 2026). LSD avoids the pitfall of selecting demonstrations too similar to the query and instead actively maximizes the diversity and label-space coverage of the context. Notably, LSD yields substantial improvements over kNN on objective regression tasks, whereas kNN remains near optimal for subjective, preference-based settings.
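The sequential view of selection can be sketched as follows. The dueling aggregation Q(s, a) = V(s) + A(s, a) - mean_a A(s, a) is the standard dueling-network formula; the reward shape, the greedy rollout, and all function names are illustrative assumptions rather than the LSD implementation.

```python
from typing import Callable, List, Sequence

import numpy as np


def dueling_q(value: float, advantages: np.ndarray) -> np.ndarray:
    """Combine a state value and per-action advantages into Q-values."""
    return value + advantages - advantages.mean()


def marginal_reward(err_before: float, err_after: float) -> float:
    """Assumed reward: error reduction obtained by adding one demonstration."""
    return err_before - err_after


def greedy_episode(
    q_fn: Callable[[List[int]], Sequence[float]], horizon: int
) -> List[int]:
    """Select up to `horizon` distinct demonstrations, one per step.

    q_fn maps the current selection (the state) to one Q-value per pool item.
    Items that were already chosen are masked out before taking the argmax.
    """
    chosen: List[int] = []
    for _ in range(horizon):
        q = np.asarray(q_fn(chosen), dtype=float).copy()
        if len(chosen) == len(q):
            break  # pool exhausted
        for i in chosen:
            q[i] = -np.inf
        chosen.append(int(np.argmax(q)))
    return chosen


# Toy Q-function that ignores the state, used only to exercise the loop.
print(greedy_episode(lambda chosen: [0.1, 0.9, 0.5], horizon=2))
```

In a trained system the Q-function would depend on the demonstrations already selected, which is what lets the policy favor items that add coverage instead of repeating the query's nearest neighbors.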

Prompt fusion—how demonstrations and queries are spatially and positionally composed—matters as much as prompt selection. Pixel-level fusion and grid-based positional arrangements, along with ensembling over multiple permutations, further improve robustness and accuracy (Sun et al., 2023). Flexible memory approaches, such as those adapted from Video Object Segmentation (VOS), allow variable-sized in-context supports and avoid grid resolution bottlenecks (Foster et al., 2023).
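A grid-based arrangement and a simple ensemble over arrangements might look like the sketch below. The 2×2 layout with a blank cell for the prediction, the two layout variants, and the averaging rule are assumptions chosen for illustration; the cited work may arrange and fuse prompts differently.

```python
from typing import Sequence

import numpy as np


def compose_grid(
    demo_img: np.ndarray, demo_out: np.ndarray, query_img: np.ndarray, layout: int = 0
) -> np.ndarray:
    """Place a demonstration pair and a query on a 2x2 canvas.

    All inputs must share the same shape. The cell next to the query is left
    blank (zeros) for the model to fill in. layout=0 puts the demonstration
    row on top; any other value puts the query row on top.
    """
    blank = np.zeros_like(query_img)
    demo_row = np.concatenate([demo_img, demo_out], axis=1)
    query_row = np.concatenate([query_img, blank], axis=1)
    rows = [demo_row, query_row] if layout == 0 else [query_row, demo_row]
    return np.concatenate(rows, axis=0)


def ensemble(predictions: Sequence[np.ndarray]) -> np.ndarray:
    """Average predictions obtained from different arrangements."""
    return np.mean(np.stack(predictions), axis=0)


canvas = compose_grid(np.full((2, 2), 1.0), np.full((2, 2), 2.0), np.full((2, 2), 3.0))
print(canvas.shape)
```

The fixed canvas is also where the resolution bottleneck comes from: every extra demonstration must share the same canvas, which memory-based approaches avoid by storing support pairs separately.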

3. Architectures, Unification, and Generative Modeling

Unified V-ICL architectures are based on a decoder-only transformer, with a single token vocabulary and embedding for both visual and textual modalities (Sheng et al., 2023). Visual inputs (images, patches, masks) are first passed through an encoder (typically convolutional or vision transformer), quantized via learned codebooks (e.g., VQGAN), and then embedded identically to text tokens. This enables the model to process and relate image→image, image→text, and potentially other modalities in a single pipeline.
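The quantization step can be sketched as a nearest-neighbor lookup in a codebook, followed by an ordinary embedding lookup. This is a minimal illustration under stated assumptions: the encoder features are taken as given, the codebook and embedding table are random placeholders, and the function name `quantize` is hypothetical.

```python
import numpy as np


def quantize(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each feature vector to the index of its nearest codebook entry.

    features: shape (n, d); codebook: shape (K, d). Distance is squared L2.
    """
    diffs = features[:, None, :] - codebook[None, :, :]  # (n, K, d)
    dist2 = (diffs ** 2).sum(axis=-1)                     # (n, K)
    return dist2.argmin(axis=1)


codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
features = np.array([[0.1, 0.0], [1.9, 2.1]])
ids = quantize(features, codebook)
print(ids)

# Once discretized, visual tokens are embedded like any other token id.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(codebook), 4))  # placeholder table
token_embeddings = embedding_table[ids]                # shape (n, 4)
```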

Key architectural features include:

Sparse Self-Attention: Block-local and strided sparsity patterns are used for scalability (Sheng et al., 2023).
Mixture-of-Experts (MoE): Every other transformer block uses an MoE FFN, which is critical for reducing task interference (e.g., preventing segmentation and captioning signals from degrading each other’s performance) (Sheng et al., 2023).
Autoregressive Next-Token Loss: All supervision occurs via next-token prediction over every token in the interleaved sequence (including query output tokens), optionally with an auxiliary load-balancing loss for MoE blocks.
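Two of the features listed above can be made concrete with a short sketch: a causal attention mask that combines block-local and strided patterns, and an auxiliary load-balancing term for top-1 expert routing. The exact sparsity pattern and the Switch-Transformer-style loss used here are assumptions; the text above does not specify either formulation.

```python
import numpy as np


def sparse_causal_mask(n: int, block: int, stride: int) -> np.ndarray:
    """Boolean (n, n) mask; entry [i, j] is True if position i may attend to j.

    A position attends to earlier positions in its own block (block-local)
    and to earlier positions at multiples of `stride` behind it (strided).
    """
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    causal = j <= i
    local = (i // block) == (j // block)
    strided = ((i - j) % stride) == 0
    return causal & (local | strided)


def load_balance_loss(router_probs: np.ndarray) -> float:
    """Assumed auxiliary loss: E * sum_e f_e * p_e for top-1 routing.

    router_probs: shape (tokens, experts), rows sum to 1.
    f_e is the fraction of tokens routed to expert e, p_e the mean router
    probability for expert e. Perfectly even routing gives 1.0.
    """
    n_tokens, n_experts = router_probs.shape
    top1 = router_probs.argmax(axis=1)
    f = np.bincount(top1, minlength=n_experts) / n_tokens
    p = router_probs.mean(axis=0)
    return float(n_experts * np.sum(f * p))


print(sparse_causal_mask(6, block=2, stride=3).astype(int))
print(load_balance_loss(np.array([[0.9, 0.1], [0.1, 0.9]])))
```

The loss grows when most tokens collapse onto one expert, which is the failure mode a balancing term is meant to discourage.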

For particular tasks, additional design considerations apply. In instance-level object localization, attention regularizers are employed to focus transformer attention from the query image to support bounding-box tokens, and reinforcement learning objectives directly reward high IoU (Karim et al., 29 May 2026).
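An IoU-based reward of the kind mentioned above depends on the overlap between a predicted and a ground-truth box. The sketch below computes that overlap for axis-aligned boxes in `(x1, y1, x2, y2)` form; using the raw IoU directly as the reward is an assumption, and the cited method may shape the reward differently.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), with x2 >= x1, y2 >= y1


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Assumed reward: the IoU between the predicted and reference boxes.
reward = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
print(reward)
```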

In the context of segmentation, prompt-supporting learners based on flexible, memory-based architectures—where image/mask pairs are stored as memory without grid concatenation—offer improved generalization to unseen classes and better support set scaling (Foster et al., 2023). Other works leverage learnable prompt perturbations (PEFT) to shift the latent representation of the prompt closer to the target task distribution, providing consistent gains, especially in cases of poor dataset coverage (Zhang et al., 25 Apr 2025).

4. Empirical Benchmarks, Task Coverage, and Evaluation Regimes

V-ICL has been systematically evaluated across a broad spectrum of tasks: semantic segmentation, object detection, image captioning, visual regression, classification, and vision-language reasoning. Benchmarks such as VL-ICL Bench present challenges from perception and recognition to multimodal operator induction and long-context interleaving (Zong et al., 2024). Key findings include:

On class-aware segmentation, unified V-ICL achieves mean IoU (MIoU) of 58.04 on MS-COCO (256×256) with a 309M parameter model, surpassing task-specific baselines under matching resolution (Sheng et al., 2023).
In joint captioning tasks, V-ICL’s unified pipeline yields competitive BLEU4, METEOR, CIDEr, and mAP scores compared to specialized models (Sheng et al., 2023).
Quantitative ablations demonstrate that prompt interleaving, MoE structure, and multi-task sampling are all essential for stable, high-quality multimodal ICL (Sheng et al., 2023).
On multimodal reasoning, state-of-the-art vision-language models (e.g., GPT-4V) exhibit genuine in-context learning, but only up to a limited number of examples: higher shot counts can exceed the usable context length and degrade performance (Zong et al., 2024).
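For readers unfamiliar with the segmentation metric quoted above, the sketch below computes a mean IoU over classes from integer label maps. It is one common formulation (per-class IoU averaged over classes that appear in the prediction or the target); the averaging protocol of the cited benchmark may differ.

```python
import numpy as np


def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """Average per-class IoU over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both maps; skip it
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious)) if ious else 0.0


pred = np.array([[0, 1], [1, 1]])
target = np.array([[0, 1], [0, 1]])
print(mean_iou(pred, target, num_classes=2))
```

Reported figures such as 58.04 are this kind of quantity expressed as a percentage.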

5. Strengths, Limitations, and Open Challenges

Strengths

Unified Multimodal Pipeline: A single backbone handles image→image, image→text, and potentially arbitrary quantized modalities (Sheng et al., 2023).
True Zero-Update Adaptation: New tasks can be solved by demonstrating a few context pairs without any fine-tuning or gradient steps.
Scalability Across Tasks: With appropriate architectural choices (MoE, interleaving, memory-based prompting), the model can co-train or adapt across multiple tasks.
Demonstration Optimization: Reinforcement learning-based demonstration selection markedly improves objective regression tasks over baseline retrieval (Lee et al., 24 Mar 2026).

Limitations

Prompt Length and Compute: Each additional demonstration lengthens the input roughly linearly, and inference cost rises with it; where attention is dense, that cost scales quadratically with sequence length (Sheng et al., 2023).
Modality Imbalance: Long visual sequences can overwhelm shorter text outputs, requiring careful loss reweighting or auxiliary objectives (Sheng et al., 2023).
Context Utilization: Many current VLMs remain strongly text-driven even in multimodal contexts, failing to fully exploit visual cues from demonstrations (Santos et al., 28 Oct 2025).
Class/Task Generalization: Current frameworks typically operate on single-class, single-task per query; true multi-task and multi-class ICL requires extension of output encoding or architectural modifications.
Data and Sequence Limits: Token limits of modern transformers (2048–8192 tokens) cap the number of demonstration pairs that fit in context and restrict the complexity of few-shot adaptation (Zong et al., 2024).
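The prompt-length and sequence-limit items above can be made concrete with some back-of-the-envelope arithmetic. The token counts are assumptions for illustration: a 256×256 image tokenized on a 16×16 grid yields 256 tokens, and the output of an image→image task is assumed to cost the same.

```python
def max_demonstrations(
    context_limit: int, tokens_per_input: int, tokens_per_output: int
) -> int:
    """Largest k such that k demonstrations, the query, and its output fit."""
    per_demo = tokens_per_input + tokens_per_output
    budget = context_limit - tokens_per_input - tokens_per_output  # query + answer
    return max(0, budget // per_demo)


def dense_attention_pairs(seq_len: int) -> int:
    """Number of query-key pairs scored by dense self-attention."""
    return seq_len * seq_len


TOKENS_PER_IMAGE = 16 * 16  # assumed 16x16 token grid per image

for limit in (2048, 8192):
    k = max_demonstrations(limit, TOKENS_PER_IMAGE, TOKENS_PER_IMAGE)
    print(limit, k)

# Doubling the sequence length quadruples the dense attention work.
print(dense_attention_pairs(2048) // dense_attention_pairs(1024))
```

Under these assumptions a 2048-token window holds 3 image→image demonstrations and an 8192-token window holds 15, which is why shot counts in visual settings stay small.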

6. Directions for Enhancement and Future Research

Several key avenues have emerged:

Explicit Cross-Modal Fusion: Training objectives and attention mechanisms that explicitly ground visual outputs in demonstration images, rather than relying solely on textual similarity, are central to further gains (Santos et al., 28 Oct 2025).
Dynamic Prompting and Curriculum Learning: Any-shot, multi-turn instruction tuning with semantically coherent conversation structures improves both few-shot and zero-shot visual reasoning (Doveh et al., 2024).
Hybrid Vector and Sequence Summarization: Learnable in-context vectors (LIVE) that distill demonstration set knowledge into compact shift vectors reduce computation and maintain high performance, particularly in vision-language question answering (Peng et al., 2024).
Higher-Level Task Unification: Ongoing work extends V-ICL paradigms to cross-task adaptation (e.g., from restoration to enhancement to segmentation) and to new modalities (e.g., 3D, audio, web), leveraging generalized token and transformer frameworks (Sheng et al., 2023).
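The idea of replacing demonstrations with compact shift vectors, mentioned in the list above, can be illustrated as an additive edit to hidden states. The per-layer layout, the scaling weights `alpha`, and the function name are assumptions for illustration only; they do not reproduce the LIVE method's training or exact parameterization.

```python
import numpy as np


def apply_shift(hidden: np.ndarray, shift: np.ndarray, alpha: np.ndarray) -> np.ndarray:
    """Add a scaled shift vector to each layer's hidden state.

    hidden: shape (layers, d), hidden states for the query alone.
    shift:  shape (layers, d), vectors standing in for the demonstrations.
    alpha:  shape (layers,), per-layer scaling weights.
    """
    return hidden + alpha[:, None] * shift


hidden = np.zeros((2, 3))
shift = np.ones((2, 3))
alpha = np.array([0.5, 2.0])
print(apply_shift(hidden, shift, alpha))
```

Since the query is processed without any demonstration tokens in the sequence, the inference-time input length stays that of a zero-shot prompt.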

Overall, V-ICL is a rapidly evolving research area at the intersection of vision, multimodal learning, and foundation-model prompting. It combines unified architectures, demonstration-aware adaptation, and principled retrieval and fusion to enable training-free, flexible visual reasoning across a wide range of domains and task types.

References

(Sheng et al., 2023) Towards More Unified In-context Visual Understanding
(Sun et al., 2023) Exploring Effective Factors for Improving Visual In-Context Learning
(Foster et al., 2023) Flexible visual prompts for in-context learning in computer vision
(Lee et al., 24 Mar 2026) Learning to Select Visual In-Context Demonstrations
(Santos et al., 28 Oct 2025) What do vision-language models see in the context? Investigating multimodal in-context learning
(Zong et al., 2024) VL-ICL Bench: The Devil in the Details of Multimodal In-Context Learning
(Peng et al., 2024) LIVE: Learnable In-Context Vector for Visual Question Answering
(Doveh et al., 2024) Towards Multimodal In-Context Learning for Vision & Language Models
(Zhang et al., 25 Apr 2025) E-InMeMo: Enhanced Prompting for Visual In-Context Learning
(Karim et al., 29 May 2026) FOCUS: Forcing In-Context Object Localization through Visual Support Constraints and Policy Optimization