Few-Shot Learning with Multi-Visual Support

Updated 16 April 2026

This article surveys multi-image episodic frameworks that fuse and align a limited set of support images, using mutual and cross-modal attention, for classification and segmentation.
Key methods include prototype-based meta-learners, adaptive fusion mechanisms, and pixel-wise metric losses for extracting class-discriminative features.
Reported results show accuracy and mIoU gains on few-shot benchmarks, including cross-domain and many-class settings.

Few-shot learning with multi-visual input support encompasses a family of machine learning techniques that address recognition, segmentation, and reasoning tasks where only a handful of visual examples (support images) per class are provided in each episodic task. The defining feature is the explicit architectural and algorithmic support for ingesting, processing, and aligning multiple images or visual modalities at training and test time, in order to extract class-discriminative representations from highly limited data. This article surveys state-of-the-art frameworks as documented in recent literature, spanning classification, segmentation, detection, and multimodal reasoning, with an emphasis on paradigms capable of exploiting multi-image cues, mutual attention, cross-modal prompting, and adaptive fusion for robust generalization.

1. Multi-Image Episodic Frameworks

Few-shot learning benchmarks typically adopt the $N$-way $K$-shot episodic paradigm, where, in each episode, $N$ classes are sampled alongside $K$ support images per class. The common challenge is to build a model $f_\theta$ that, given a support set $S=\{(x^s_{i,j}, y_i)\}_{i=1,\dots,N,\; j=1,\dots,K}$ and a set of query images $Q$, can accurately assign labels to the queries using knowledge distilled from the few available support examples.
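The episode construction described above can be sketched as follows. This is a minimal illustration, not the sampler of any cited framework: the function name `sample_episode`, the `n_query` parameter (queries per class), and the representation of the dataset as a flat list of labels are all assumptions made for the example.

```python
import random
from collections import defaultdict


def sample_episode(labels, n_way, k_shot, n_query, rng=None):
    """Sample one N-way K-shot episode from a list of class labels.

    Returns (support, query): lists of (sample_index, episode_label) pairs,
    where episode_label is an integer in range(n_way).
    """
    rng = rng or random.Random()

    # Group sample indices by class.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Only classes with enough samples can supply K supports plus the queries.
    eligible = sorted(c for c, idxs in by_class.items()
                      if len(idxs) >= k_shot + n_query)
    if len(eligible) < n_way:
        raise ValueError("not enough classes with k_shot + n_query samples")

    support, query = [], []
    for episode_label, cls in enumerate(rng.sample(eligible, n_way)):
        # Draw supports and queries together so they never overlap.
        picked = rng.sample(by_class[cls], k_shot + n_query)
        support += [(i, episode_label) for i in picked[:k_shot]]
        query += [(i, episode_label) for i in picked[k_shot:]]
    return support, query
```

Classes are relabelled to `0..N-1` inside each episode, so the model cannot rely on global class identities and must read the class definition from the support set.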

Architectures differ in their management of multi-visual input. Prototype-based meta-learners (e.g., IMAFormer (Jiang et al., 2024), MFNet (Zhang et al., 2021), Label Anything (Marinis et al., 2024)) fuse $K$ support images per class into class prototypes via averaging or learned attention mechanisms. Segmentation and detection models extend the episode structure to pixel or region level, aligning or aggregating support features for pixel-wise or box-level metric comparisons. Multimodal extensions (Flamingo (Alayrac et al., 2022), VT-FSL (Li et al., 29 Sep 2025)) adapt the episodic structure to accommodate interleaved visual and textual supports, processing arbitrary sequences of images and language.
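The simplest form of prototype fusion mentioned above, averaging the $K$ support embeddings of a class and assigning each query to the nearest prototype, can be sketched as below. The helper names and the toy two-dimensional embeddings are hypothetical; the cited models use learned backbones and, in several cases, attention-weighted rather than uniform averaging.

```python
import math


def mean_prototype(embeddings):
    """Average K support embeddings (equal-length lists) into one prototype."""
    k = len(embeddings)
    return [sum(dim) / k for dim in zip(*embeddings)]


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def classify(query, support_by_class):
    """Return the class whose mean prototype is most cosine-similar to the query."""
    prototypes = {c: mean_prototype(embs) for c, embs in support_by_class.items()}
    return max(prototypes, key=lambda c: cosine(query, prototypes[c]))


# Toy 2-way 2-shot episode with hand-written embeddings.
support = {
    "cat": [[1.0, 0.1], [0.9, 0.2]],
    "dog": [[0.1, 1.0], [0.2, 0.8]],
}
print(classify([0.8, 0.3], support))  # prints "cat"
```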

2. Support Set Encoding and Fusion Mechanisms

Efficiently aggregating and fusing multiple support images is critical for extracting robust representations, especially as $K$ and $N$ increase.

Mutual Attention (IMAFormer (Jiang et al., 2024)): Support and query images are partitioned into non-overlapping 16×16 patches and encoded using a pre-trained Vision Transformer (ViT). The class (CLS) tokens and patch tokens from support and query are “crossed” and processed through a Transformer mutual attention layer: for each support prototype and query, the CLS token attends to the other’s patch tokens, resulting in enhanced, task-conditioned representations. This approach both strengthens intra-class features and disambiguates classes, with final classification based on the cosine similarities of the enhanced CLS tokens.
Multi-Class Prototyping and Attention (MFNet (Zhang et al., 2021)): Each support image is converted to a feature vector via masked pooling, followed by a multi-level attention mechanism: (i) relational attention to modulate and fuse $K$-shot features into per-class prototypes, and (ii) multi-scale attention to merge query and support representations at several resolutions. The model additionally applies a pixel-wise metric learning loss to sharpen the embedding space.
Prompt Pool and Cross-Attention (Label Anything (Marinis et al., 2024)): The framework supports dense (masks), sparse (points/boxes), and hybrid prompts. Each is encoded and fused via multi-headed self-attention. For $N$-way $K$-shot segmentation, class tokens are mixed via attention, and cross-attention decoders link the resulting prototypes to query features for pixel classification.
Unified Gaussian Dense-Anchoring (UGDA (Zhou et al., 2021)): In few-shot settings with partial multi-view input, UGDA estimates per-view feature distributions for each support, samples dense anchors, and aggregates them into a latent space by optimizing the reconstruction error across views. Anchor distribution rectification then imposes a regularized geometry for robust nearest-prototype inference.
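To make the "crossing" in the mutual attention item above concrete, the following sketch lets each CLS token attend to the other image's patch tokens. It is an illustration under strong simplifying assumptions, not IMAFormer's layer: it uses a single head, identity projections in place of learned query/key/value matrices, and a plain residual connection, and all function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(cls_token, patch_tokens):
    """Single-head scaled dot-product attention: one CLS query over patch keys/values.

    Identity projections are used (no learned W_q, W_k, W_v), which is a
    simplifying assumption for illustration.
    """
    scale = math.sqrt(len(cls_token))
    scores = [sum(q * k for q, k in zip(cls_token, p)) / scale
              for p in patch_tokens]
    weights = softmax(scores)
    context = [sum(w * p[d] for w, p in zip(weights, patch_tokens))
               for d in range(len(cls_token))]
    # Residual connection: keep the original CLS content and add the context.
    return [c + x for c, x in zip(cls_token, context)]


def mutual_attention(support_cls, support_patches, query_cls, query_patches):
    """Each CLS token attends to the *other* image's patch tokens."""
    enhanced_support = attend(support_cls, query_patches)
    enhanced_query = attend(query_cls, support_patches)
    return enhanced_support, enhanced_query
```

Because the support representation is recomputed for every query (and vice versa), the resulting CLS tokens are conditioned on the pair being compared rather than fixed per image; classification would then compare the two enhanced tokens, for example by cosine similarity.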

3. Multimodal and Cross-Modal Extensions

Recent advances extend multi-visual input support to multimodal or textual settings for richer class semantics and better generalization.

Cross-Modal Prompting (VT-FSL (Li et al., 29 Sep 2025)): A Cross-modal Iterative Prompting (CIP) protocol conditions an LLM on class names and all $K$ support images, producing visually grounded semantic descriptions. Text-to-image synthesis then augments the support set with high-fidelity, diverse visuals. Cross-modal Geometric Alignment (CGA) unifies class prototype, synthesized visual, and textual embeddings by minimizing the kernelized volume of the parallelotope they span, ensuring global multimodal consistency.
Meta-Learning with LVLMs (Liu et al., 2024): Large vision-language models (LVLMs) are tuned by repackaging vision-language datasets into $N$-way $K$-shot meta-tasks. Meta-learning strategies teach the LVLM to extract class-discriminative information from multiple support images, with label augmentation and candidate selection to avoid positional bias and shortcut learning.
Multi-Modal Detection with Cross-Modal Prompting (Han et al., 2022): This approach unifies a metric-based visual classifier (supporting many-shot input) with a prompt-based text classifier. A meta-learned prompt generator synthesizes semantic prompts directly from support images (without class names), and visual and semantic prototypes are fused for two-stage object detection.
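The volume-based alignment idea in the VT-FSL item above rests on a standard identity: the volume of the parallelotope spanned by vectors $v_1, \dots, v_m$ is $\sqrt{\det G}$, where $G_{ij} = k(v_i, v_j)$ is their Gram matrix. The sketch below computes that quantity. The choice of a linear kernel as the default and the function names are assumptions for illustration; the source does not specify which kernel or normalization VT-FSL uses.

```python
import math


def linear_kernel(u, v):
    """Plain inner product; stands in for whatever kernel a method adopts."""
    return sum(a * b for a, b in zip(u, v))


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [list(map(float, row)) for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0  # singular matrix
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def parallelotope_volume(vectors, kernel=linear_kernel):
    """Volume spanned by the vectors: sqrt(det(G)), G[i][j] = kernel(v_i, v_j)."""
    gram = [[kernel(u, v) for v in vectors] for u in vectors]
    # Clamp tiny negative values caused by floating-point round-off.
    return math.sqrt(max(determinant(gram), 0.0))
```

Mutually orthogonal unit embeddings give volume 1, while embeddings that point in the same direction give volume 0, so driving the volume down pulls the prototype, synthesized-visual, and textual embeddings of a class toward a common direction.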

4. Objective Functions and Training Protocols

Architectures for multi-visual input FSL adopt specialized objectives and episodic training:

Self-Supervised Pre-Training: ViT and convolutional backbones are pre-trained by masked image modeling (MAE (Jiang et al., 2024)), contrastive learning (Flamingo (Alayrac et al., 2022)), or similar objectives to yield semantically meaningful representations.
Episodic Meta-Training: During meta-training episodes, support and query splits are sampled from base-class pools, and models are trained end-to-end to minimize query cross-entropy loss, using prototypes, mutual attention outputs, or metric classifiers based on embedded support sets.
Auxiliary and Regularization Losses: Pixel-wise triplet loss (MFNet (Zhang et al., 2021)), anchor entropy maximization (UGDA (Zhou et al., 2021)), cross-modal contrastive loss (VT-FSL (Li et al., 29 Sep 2025)), and knowledge distillation (Han et al., 2022) are employed to regularize and align the embedding spaces.
Universal Training Recipes: Frameworks such as Label Anything (Marinis et al., 2024) mix episodes of varying $N$ and $K$ in a single batch, with architectural components that automatically accommodate dynamic support set sizes and types (mask, point, or bounding-box prompts).
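Two of the objectives listed above, the query cross-entropy used in episodic meta-training and a triplet-style metric loss, can be written compactly as below. This is a generic sketch: the squared Euclidean distance, the margin value, and the function names are assumptions, and the exact pixel-wise formulation used by MFNet is not given in this article.

```python
import math


def cross_entropy(logits, target):
    """Negative log-softmax of the target class (numerically stable)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def episode_loss(query_logits, query_targets):
    """Mean cross-entropy over all query samples of one episode."""
    losses = [cross_entropy(z, y) for z, y in zip(query_logits, query_targets)]
    return sum(losses) / len(losses)


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge on the gap between anchor-positive and anchor-negative distances."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)
```

In a prototype-based model the logits passed to `episode_loss` would typically be scaled similarities between each query embedding and the $N$ class prototypes; applied per pixel, the triplet term pulls same-class pixel embeddings together and pushes different-class ones at least `margin` apart.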

5. Empirical Performance and Scaling Properties

Few-shot methods with multi-visual input support report the following results on few-shot benchmarks (mean accuracy for classification, mIoU for segmentation; all figures are taken from the cited literature):

Model	Task	Backbone	1-shot	5-shot	20-way 1-shot (mIoU)
IMAFormer (Jiang et al., 2024)	cls	ViT-B	85.68%	93.28%	–
MFNet (Zhang et al., 2021)	segm	ResNet-50	54.5 mIoU*	59.7*	–
Label Anything (Marinis et al., 2024)	segm	ViT-B/16	43.1 mIoU	45.1	13.7 (N=20)
VT-FSL (Li et al., 29 Sep 2025)	cls	CNN/CLIP	83.66%	88.38%	–
Flamingo (Alayrac et al., 2022)	multi	VLM-70B	VQA: 67.6%	–	–

cls: classification, segm: segmentation, multi: multimodal/MLM.

Scaling experiments with Label Anything (Marinis et al., 2024) report 13.7 mIoU at $N = 20$ (20-way 1-shot), outpacing affinity-based methods as the number of supported classes increases. VT-FSL (Li et al., 29 Sep 2025) and IMAFormer (Jiang et al., 2024) achieve strong results on cross-domain and fine-grained transfer settings.
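For reference, the mIoU figures quoted for the segmentation models are class-averaged intersection-over-union scores. A minimal sketch of the metric follows; the function name and the convention of skipping classes absent from both prediction and ground truth are assumptions, and individual benchmarks differ in details such as background handling and averaging across folds.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target.

    pred and target are flat sequences of integer class ids, one per pixel.
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Class 0: IoU 1/2; class 1: IoU 2/3; mean = 7/12.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```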

6. Limitations, Open Challenges, and Compatibility

Several open challenges persist:

Missing Views and Heterogeneity: For partial multi-view data, methods like UGDA (Zhou et al., 2021) handle missing modalities by imputation and aggregation, showing less degradation than naive concatenation or partial-view learners that require abundant data.
Robustness to Prompt Quality and Data: Model performance is sensitive to prompt informativeness (Label Anything (Marinis et al., 2024)). Architectures can exploit multiple prompt types, but masks consistently outperform sparse cues such as points and boxes.
Scalability and Universality: Universal training strategies and modular encoders (Label Anything (Marinis et al., 2024), MFNet (Zhang et al., 2021)) grant architectural flexibility across arbitrary $N$-way $K$-shot episodic regimes, essential for real-world deployment.
Compatibility: UGDA (Zhou et al., 2021) and related aggregation schemes are externally compatible with existing metric-based few-shot models, providing a drop-in framework for boosting resilience to data scarcity and heterogeneity.

In sum, modern few-shot learning with multi-visual input support integrates episodic multi-image training, mutual and cross-modal attention, universal support encoding, and robust adaptation schemes, delivering state-of-the-art results across classification, segmentation, detection, and multimodal reasoning tasks (Jiang et al., 2024, Zhang et al., 2021, Marinis et al., 2024, Alayrac et al., 2022, Li et al., 29 Sep 2025, Liu et al., 2024, Zhou et al., 2021, Han et al., 2022).