Pre-trained Vision Models

Updated 24 June 2025

Pre-trained vision models are neural architectures whose parameters are optimized in advance, typically on large-scale image or vision-language datasets, before being adapted to specific downstream tasks. These models include both unimodal vision encoders—such as convolutional neural networks (CNNs) and vision transformers (ViTs)—and multimodal vision-language models (VLMs) like CLIP and LLaVA, which jointly align visual and textual representations via multimodal pre-training. Their emergence has enabled flexible and efficient transfer learning across numerous computer vision domains, facilitating new paradigms in annotation, robustness, and weakly supervised learning.

1. Foundations and Training of Pre-trained Vision Models

Pre-trained vision models are trained, often in a self-supervised or weakly supervised manner, on vast datasets to acquire general-purpose visual features. In recent years, ViTs and VLMs have surpassed classic CNNs in pre-training regimes due to their high capacity and ability to align information across modalities. Examples of base models and pretraining strategies include:

  • CLIP (Contrastive Language-Image Pre-training): Trained to maximize the cosine similarity between paired image and text embeddings, enabling strong cross-modal alignment (a minimal loss sketch follows this list).
  • LLaVA and GPT-4V: Large-scale multimodal models that couple a powerful language model (e.g., a GPT-4-class transformer) with visual understanding, capable of detailed image interpretation and instruction following.
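
To make the CLIP objective concrete, the following is a minimal sketch of a symmetric contrastive (InfoNCE-style) loss over a batch of paired image and text embeddings; the fixed temperature and the assumption that embeddings arrive pre-computed are simplifications (CLIP itself learns the temperature).

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (B, D) tensors from placeholder encoders;
    the temperature here is a fixed placeholder (CLIP learns it).
    """
    # L2-normalize so dot products equal cosine similarities
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (B, B) similarity matrix; diagonal entries are the true pairs
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image-to-text and text-to-image)
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```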

The resulting representations are readily adaptable to downstream tasks, either by full fine-tuning, parameter-efficient tuning (e.g., prompt tuning or adapter layers), or zero-/few-shot learning protocols.
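
As a hedged example of a lightweight adaptation route, the sketch below freezes a pre-trained backbone and trains only a linear classification head; the ResNet-18 backbone, 512-dimensional features, and 10-class setup are stand-ins for illustration, not tied to any specific model discussed here.

```python
import torch
import torch.nn as nn
from torchvision import models

# Placeholder backbone: an ImageNet-pretrained ResNet-18 with its classifier removed
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = nn.Identity()             # expose 512-d pooled features
for p in backbone.parameters():
    p.requires_grad = False
backbone.eval()

num_classes = 10                        # assumed downstream label count
head = nn.Linear(512, num_classes)      # the only trainable parameters
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

def training_step(images, labels):
    with torch.no_grad():               # frozen backbone: features only
        feats = backbone(images)
    loss = nn.functional.cross_entropy(head(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```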

2. Vision-Language Model Annotation for Noisy Partial Label Learning

The application of pre-trained vision-language models (VLMs) as weak annotators has opened new directions in partial label learning—termed "manual-annotation-free" training. In this paradigm, instead of expert-generated labels, models like CLIP, LLaVA, and GPT-4V are prompted with various templates to predict candidate class labels for each image. The predictions from multiple prompts for a sample are aggregated into a candidate label set $y = (y_1, y_2, \ldots, y_C)$, with $y_j = 1$ if label $j$ is a candidate, forming an instance-dependent noisy partial label. This differs fundamentally from symmetric (random) label noise: the errors produced by VLMs systematically reflect their pretraining biases and are often "plausible," sharing underlying semantic patterns.

This automatic annotation process enables scalable dataset construction without human labeling effort, thereby facilitating downstream task training using only publicly available or weakly supervised vision-language resources.
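
A minimal sketch of this annotation step, assuming zero-shot CLIP as the annotator: each prompt template votes for its top-1 class, and the union of votes becomes the candidate label set $y$. The checkpoint, class names, templates, and top-1 voting rule are illustrative assumptions rather than the exact protocol of the original work.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")        # assumed checkpoint
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class_names = ["cat", "dog", "car"]                                      # illustrative classes
templates = ["a photo of a {}.", "a blurry photo of a {}.", "art of a {}."]

def candidate_label_set(image):
    """Return a binary vector y with y[j] = 1 if class j is a candidate."""
    y = torch.zeros(len(class_names))
    for template in templates:
        prompts = [template.format(c) for c in class_names]
        inputs = processor(text=prompts, images=image,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            logits = model(**inputs).logits_per_image                    # (1, C) similarities
        y[logits.argmax(dim=-1)] = 1.0      # each template votes for its top-1 class
    return y                                 # instance-dependent noisy partial label
```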

3. Collaborative Consistency Regularization (Co-Reg) and Instance-Dependent Noise

The Co-Reg method addresses the challenge of learning with VLM-generated, instance-dependent noisy partial labels by implementing a dual-network collaborative strategy with consistency regularization in both label and representation spaces.

Core mechanisms:

  • Co-Teaching via Dual Networks: Two separate neural networks are trained in parallel, each providing pseudo-labels for the other via distributional predictions. This co-pseudo-labeling reduces the risk of each network confirming its own errors (confirmation bias) and better purifies candidate label sets.
  • Instance-Adaptive Label Partition: A "warm-up" phase with partial cross-entropy and negative-entropy regularization lets the networks assess, for each sample, how reliable its annotation is, partitioning the training set into "partial" and "unlabeled" subsets using loss-based heuristics fit via a Gaussian Mixture Model (GMM); a minimal sketch of this partition follows the list.
  • Consistency Regularization:
    • In the Label Space: The model is trained to match soft pseudo-labels across different augmentation strengths, using cross-entropy for the partial set and mean squared error for the unlabeled set.
    • In the Feature Space: Maintains class prototypes in the embedding space, updated by an exponential moving average (momentum), and encourages alignment between a sample’s feature and its fused pseudo-label distribution via a contrastive or KL-divergence loss.
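
The loss-based partition mentioned above can be sketched as follows, assuming per-sample warm-up losses have already been recorded: a two-component Gaussian mixture separates low-loss (reliable "partial") samples from high-loss ("unlabeled") ones. The threshold and the use of scikit-learn are illustrative choices, not the paper's exact settings.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def partition_by_loss(per_sample_losses, threshold=0.5):
    """Split samples into a reliable 'partial' set and an 'unlabeled' set.

    per_sample_losses: 1-D array of warm-up losses (one value per sample).
    threshold: assumed cut-off on the posterior of the low-loss component.
    """
    losses = np.asarray(per_sample_losses).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(losses)

    # Posterior probability of belonging to the lower-mean (cleaner) component
    low_loss_component = np.argmin(gmm.means_.ravel())
    p_clean = gmm.predict_proba(losses)[:, low_loss_component]

    partial_idx = np.where(p_clean >= threshold)[0]    # keep candidate label sets
    unlabeled_idx = np.where(p_clean < threshold)[0]   # drop labels, rely on consistency
    return partial_idx, unlabeled_idx
```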

The overall loss combines regularized label consistency, prototype alignment, and a supervised contrastive loss over model-generated pseudo-labels and features.
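
As a rough sketch of the feature-space term, the class below maintains exponential-moving-average prototypes and measures how well a sample's prototype similarities match its fused pseudo-label; the momentum, temperature, and the way the terms would be weighted into the total loss are assumptions, not the published configuration.

```python
import torch
import torch.nn.functional as F

class PrototypeBank:
    """Momentum-updated class prototypes in the embedding space (illustrative values)."""

    def __init__(self, num_classes, dim, momentum=0.99):
        self.protos = F.normalize(torch.randn(num_classes, dim), dim=-1)
        self.momentum = momentum  # assumed EMA coefficient

    @torch.no_grad()
    def update(self, features, pseudo_labels):
        # Move each sample's predicted class prototype toward that sample's feature
        for feat, cls in zip(features, pseudo_labels.argmax(dim=-1)):
            self.protos[cls] = F.normalize(
                self.momentum * self.protos[cls] + (1 - self.momentum) * feat, dim=-1)

    def alignment_loss(self, features, pseudo_labels, temperature=0.1):
        # KL divergence between the feature-to-prototype similarity distribution
        # and the fused pseudo-label distribution
        sims = F.normalize(features, dim=-1) @ self.protos.t() / temperature
        return F.kl_div(F.log_softmax(sims, dim=-1), pseudo_labels,
                        reduction="batchmean")

# The total objective would then combine the terms described above, e.g.
# (weights are placeholders, not the paper's values):
# total = ce_partial + mse_unlabeled \
#       + 0.5 * bank.alignment_loss(feats, fused_labels) \
#       + 0.5 * supervised_contrastive
```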

4. Integration with Few-Shot Supervision

The collaborative consistency regularization framework is naturally extensible to scenarios where a limited number of ground-truth labels are available ("few-shot"). In this case, a small set of manually annotated labels is integrated into training, which further improves performance. Empirical results show that even one or a few true labels per class, in addition to plentiful VLM-generated noisy partial labels, lead to consistently higher accuracy than few-shot fine-tuning baselines such as CoOp (prompt tuning) for CLIP and LoRA for LLaVA.
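
One simple way to fold such few-shot labels into the same pipeline, shown purely as an assumed sketch, is to replace a sample's VLM-generated candidate set with a one-hot vector whenever a manual label is available, so the rest of the training loop is unchanged.

```python
import torch

def merge_few_shot_labels(candidate_sets, few_shot_labels):
    """Overwrite noisy partial labels with one-hot vectors where ground truth exists.

    candidate_sets: (N, C) binary tensor of VLM-generated candidate label sets.
    few_shot_labels: dict mapping sample index -> true class index (the few-shot set).
    """
    merged = candidate_sets.clone()
    for idx, cls in few_shot_labels.items():
        merged[idx] = 0.0
        merged[idx, cls] = 1.0   # a singleton candidate set acts as a clean label (assumed rule)
    return merged
```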

5. Empirical Evaluation: Benchmarks, Baselines, and Results

Comprehensive experiments span popular image classification datasets (e.g., CIFAR-10, CIFAR-100, SVHN, Fashion-MNIST, EuroSAT, GTSRB), using various VLMs and downstream model architectures. Results are benchmarked against established noisy-label learning methods (DivideMix), partial-label learning methods (PiCO, ALIM-Onehot, ALIM-Scale, CR-DPLL), and classic fine-tuning baselines (CoOp for CLIP, LoRA for LLaVA).

Key results:

  • State-of-the-art accuracy across synthetic and VLM-annotated noisy partial label datasets (e.g., 71.04% on CLIP-annotated CIFAR-100, outperforming DivideMix’s 66.03%).
  • Substantial improvement in low Partial-Acc/high noise settings and on difficult VLM annotation scenarios (e.g., LLaVA-labeled SVHN where ground truth is missing from candidates in ~18% of instances).
  • Few-shot settings: Hybrid Co-Reg (manual plus VLM partial labels) surpasses few-shot prompt tuning and LoRA baselines, especially noticeable with very few human annotations.
  • Ablation studies confirm the contribution of each component: dual-network co-pseudo-labeling, prototype alignment, and contrastive regularization each provide measurable gains.

These findings validate both the efficacy of Co-Reg as a denoising strategy for instance- and model-dependent noise and the broader feasibility of annotation-free, VLM-powered model adaptation.

Table: Characteristics of Partial Label Learning Paradigms

| Paradigm | Manual Annotations | Target Model Size | Accuracy Gain | Annotation-Free |
|---|---|---|---|---|
| Zero-shot | No | Large | — | Yes |
| Few-shot FT | Few | Large | Marginal | No |
| Unsupervised KD | No | Small | Minor | Yes |
| Supervised KD | Yes | Small | Significant | No |
| Full FT | Yes | Large | Significant | No |
| NPLL (Co-Reg) | No | Small | Significant | Yes |

6. Broader Implications and Future Directions

The integration of large pre-trained vision-language models as annotators, in conjunction with collaborative, consistency-regularized noisy partial label learning, provides a scalable strategy for developing accurate, lightweight, and specialized vision models without manual annotation. This advances the practical utility of foundation models beyond inference—positioning them as scalable knowledge distillers for efficient model specialization in arbitrary domains.

Potential implications include:

  • Scalability to New Domains: As VLMs continue to improve, annotation-free frameworks become practical for a wide range of domains—including those with limited or ambiguous human annotations.
  • Deployment on Resource-Constrained Devices: Enables high-accuracy, low-cost model training for specialized downstream applications with limited compute, bandwidth, or labeling.
  • Bridging Foundation and Lightweight Models: Co-Reg exemplifies an emerging pattern where massive, high-capacity pre-trained models seed the training of compact production models, especially when paired with weakly or noisily supervised learning strategies.

For further details and reproducibility, datasets, source code, and trained models corresponding to these paradigms are publicly released as described in the original paper.