Pre-trained Vision Models

Updated 24 June 2025

Pre-trained vision models are neural architectures whose parameters are optimized in advance, typically on large-scale image or vision-language datasets, before being adapted to specific downstream tasks. These models include both unimodal vision encoders—such as convolutional neural networks (CNNs) and vision transformers (ViTs)—and multimodal vision-language models (VLMs) like CLIP and LLaVA, which jointly align visual and textual representations via multimodal pre-training. Their emergence has enabled flexible and efficient transfer learning across numerous computer vision domains, facilitating new paradigms in annotation, robustness, and weakly supervised learning.

1. Foundations and Training of Pre-trained Vision Models

Pre-trained vision models are trained, often in a self-supervised or weakly supervised manner, on vast datasets to acquire general-purpose visual features. In recent years, ViTs and VLMs have surpassed classic CNNs in pre-training regimes due to their high capacity and ability to align information across modalities. Examples of base models and pretraining strategies include:

  • CLIP (Contrastive Language-Image Pre-training): Trained to maximize the cosine similarity between paired image and text embeddings, enabling strong cross-modal alignment (a minimal loss sketch follows this list).
  • LLaVA and GPT-4V: Large-scale multimodal models that couple a powerful language model (e.g., a GPT-4-class transformer) with visual understanding, capable of detailed image interpretation and instruction following.
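
To make the CLIP objective concrete, the following is a minimal sketch of a symmetric contrastive (InfoNCE-style) loss over a batch of paired image and text embeddings; the fixed temperature and the assumption that embeddings arrive pre-computed are simplifications (CLIP itself learns the temperature).

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (B, D) tensors from placeholder encoders;
    the temperature here is a fixed placeholder (CLIP learns it).
    """
    # L2-normalize so dot products equal cosine similarities
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (B, B) similarity matrix; diagonal entries are the true pairs
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image-to-text and text-to-image)
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```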

The resulting representations are readily adaptable to downstream tasks, either by full fine-tuning, parameter-efficient tuning (e.g., prompt tuning or adapter layers), or zero-/few-shot learning protocols.
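
As a hedged example of a lightweight adaptation route, the sketch below freezes a pre-trained backbone and trains only a linear classification head; the ResNet-18 backbone, 512-dimensional features, and 10-class setup are stand-ins for illustration, not tied to any specific model discussed here.

```python
import torch
import torch.nn as nn
from torchvision import models

# Placeholder backbone: an ImageNet-pretrained ResNet-18 with its classifier removed
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = nn.Identity()             # expose 512-d pooled features
for p in backbone.parameters():
    p.requires_grad = False
backbone.eval()

num_classes = 10                        # assumed downstream label count
head = nn.Linear(512, num_classes)      # the only trainable parameters
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

def training_step(images, labels):
    with torch.no_grad():               # frozen backbone: features only
        feats = backbone(images)
    loss = nn.functional.cross_entropy(head(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```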

2. Vision-Language Model Annotation for Noisy Partial Label Learning

The application of pre-trained vision-language models (VLMs) as weak annotators has opened new directions in partial label learning—termed "manual-annotation-free" training. In this paradigm, instead of expert-generated labels, models like CLIP, LLaVA, and GPT-4V are prompted with various templates to predict candidate class labels for each image. The predictions from multiple prompts for a sample are aggregated into a candidate label set $y = (y_1, y_2, \ldots, y_C)$, with $y_j = 1$ if label $j$ is a candidate, forming an instance-dependent noisy partial label. This differs fundamentally from symmetric (random) label noise: the errors produced by VLMs systematically reflect their pretraining biases and are often "plausible," sharing underlying semantic patterns.

This automatic annotation process enables scalable dataset construction without human labeling effort, thereby facilitating downstream task training using only publicly available or weakly supervised vision-language resources.
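
A minimal sketch of this annotation step, assuming zero-shot CLIP as the annotator: each prompt template votes for its top-1 class, and the union of votes becomes the candidate label set $y$. The checkpoint, class names, templates, and top-1 voting rule are illustrative assumptions rather than the exact protocol of the original work.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")        # assumed checkpoint
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class_names = ["cat", "dog", "car"]                                      # illustrative classes
templates = ["a photo of a {}.", "a blurry photo of a {}.", "art of a {}."]

def candidate_label_set(image):
    """Return a binary vector y with y[j] = 1 if class j is a candidate."""
    y = torch.zeros(len(class_names))
    for template in templates:
        prompts = [template.format(c) for c in class_names]
        inputs = processor(text=prompts, images=image,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            logits = model(**inputs).logits_per_image                    # (1, C) similarities
        y[logits.argmax(dim=-1)] = 1.0      # each template votes for its top-1 class
    return y                                 # instance-dependent noisy partial label
```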

3. Collaborative Consistency Regularization (Co-Reg) and Instance-Dependent Noise

The Co-Reg method addresses the challenge of learning with VLM-generated, instance-dependent noisy partial labels by implementing a dual-network collaborative strategy with consistency regularization in both label and representation spaces.

Core mechanisms:

  • Co-Teaching via Dual Networks: Two separate neural networks are trained in parallel, each providing pseudo-labels for the other via distributional predictions. This co-pseudo-labeling reduces the risk of each network confirming its own errors (confirmation bias) and better purifies candidate label sets.
  • Instance-Adaptive Label Partition: A "warm-up" phase with partial cross-entropy and negative-entropy regularization lets the networks assess, for each sample, how reliable its annotation is, partitioning the training set into "partial" and "unlabeled" subsets using loss-based heuristics fit via a Gaussian Mixture Model (GMM); a minimal sketch of this partition follows the list.
  • Consistency Regularization:
    • In the Label Space: The model is trained to match soft pseudo-labels across different augmentation strengths, using cross-entropy for the partial set and mean squared error for the unlabeled set.
    • In the Feature Space: Maintains class prototypes in the embedding space, updated by an exponential moving average (momentum), and encourages alignment between a sample’s feature and its fused pseudo-label distribution via a contrastive or KL-divergence loss.
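
The loss-based partition mentioned above can be sketched as follows, assuming per-sample warm-up losses have already been recorded: a two-component Gaussian mixture separates low-loss (reliable "partial") samples from high-loss ("unlabeled") ones. The threshold and the use of scikit-learn are illustrative choices, not the paper's exact settings.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def partition_by_loss(per_sample_losses, threshold=0.5):
    """Split samples into a reliable 'partial' set and an 'unlabeled' set.

    per_sample_losses: 1-D array of warm-up losses (one value per sample).
    threshold: assumed cut-off on the posterior of the low-loss component.
    """
    losses = np.asarray(per_sample_losses).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(losses)

    # Posterior probability of belonging to the lower-mean (cleaner) component
    low_loss_component = np.argmin(gmm.means_.ravel())
    p_clean = gmm.predict_proba(losses)[:, low_loss_component]

    partial_idx = np.where(p_clean >= threshold)[0]    # keep candidate label sets
    unlabeled_idx = np.where(p_clean < threshold)[0]   # drop labels, rely on consistency
    return partial_idx, unlabeled_idx
```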

The overall loss combines regularized label consistency, prototype alignment, and a supervised contrastive loss over model-generated pseudo-labels and features.
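
As a rough sketch of the feature-space term, the class below maintains exponential-moving-average prototypes and measures how well a sample's prototype similarities match its fused pseudo-label; the momentum, temperature, and the way the terms would be weighted into the total loss are assumptions, not the published configuration.

```python
import torch
import torch.nn.functional as F

class PrototypeBank:
    """Momentum-updated class prototypes in the embedding space (illustrative values)."""

    def __init__(self, num_classes, dim, momentum=0.99):
        self.protos = F.normalize(torch.randn(num_classes, dim), dim=-1)
        self.momentum = momentum  # assumed EMA coefficient

    @torch.no_grad()
    def update(self, features, pseudo_labels):
        # Move each sample's predicted class prototype toward that sample's feature
        for feat, cls in zip(features, pseudo_labels.argmax(dim=-1)):
            self.protos[cls] = F.normalize(
                self.momentum * self.protos[cls] + (1 - self.momentum) * feat, dim=-1)

    def alignment_loss(self, features, pseudo_labels, temperature=0.1):
        # KL divergence between the feature-to-prototype similarity distribution
        # and the fused pseudo-label distribution
        sims = F.normalize(features, dim=-1) @ self.protos.t() / temperature
        return F.kl_div(F.log_softmax(sims, dim=-1), pseudo_labels,
                        reduction="batchmean")

# The total objective would then combine the terms described above, e.g.
# (weights are placeholders, not the paper's values):
# total = ce_partial + mse_unlabeled \
#       + 0.5 * bank.alignment_loss(feats, fused_labels) \
#       + 0.5 * supervised_contrastive
```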

4. Integration with Few-Shot Supervision

The collaborative consistency regularization framework is naturally extensible to scenarios where a limited number of ground-truth labels are available ("few-shot"). In this case, a small set of manually annotated labels is integrated into training, which further improves performance. Empirical results show that even one or a few true labels per class, in addition to plentiful VLM-generated noisy partial labels, lead to consistently higher accuracy than few-shot fine-tuning baselines such as CoOp (prompt tuning) for CLIP and LoRA for LLaVA.
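
One simple way to fold such few-shot labels into the same pipeline, shown purely as an assumed sketch, is to replace a sample's VLM-generated candidate set with a one-hot vector whenever a manual label is available, so the rest of the training loop is unchanged.

```python
import torch

def merge_few_shot_labels(candidate_sets, few_shot_labels):
    """Overwrite noisy partial labels with one-hot vectors where ground truth exists.

    candidate_sets: (N, C) binary tensor of VLM-generated candidate label sets.
    few_shot_labels: dict mapping sample index -> true class index (the few-shot set).
    """
    merged = candidate_sets.clone()
    for idx, cls in few_shot_labels.items():
        merged[idx] = 0.0
        merged[idx, cls] = 1.0   # a singleton candidate set acts as a clean label (assumed rule)
    return merged
```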

5. Empirical Evaluation: Benchmarks, Baselines, and Results

Comprehensive experiments span popular image classification datasets (e.g., CIFAR-10, CIFAR-100, SVHN, Fashion-MNIST, EuroSAT, GTSRB), using various VLMs and downstream model architectures. Results are benchmarked against established noisy-label learning methods (DivideMix), partial-label learning methods (PiCO, ALIM-Onehot, ALIM-Scale, CR-DPLL), and classic fine-tuning baselines (CoOp for CLIP, LoRA for LLaVA).

Key results:

  • State-of-the-art accuracy across synthetic and VLM-annotated noisy partial label datasets (e.g., 71.04% on CLIP-annotated CIFAR-100, outperforming DivideMix’s 66.03%).
  • Substantial improvement in low Partial-Acc/high noise settings and on difficult VLM annotation scenarios (e.g., LLaVA-labeled SVHN where ground truth is missing from candidates in ~18% of instances).
  • Few-shot settings: Hybrid Co-Reg (manual plus VLM partial labels) surpasses few-shot prompt tuning and LoRA baselines, especially noticeable with very few human annotations.
  • Ablation studies confirm the contribution of each component: dual-network co-pseudo-labeling, prototype alignment, and contrastive regularization each provide measurable gains.

These findings validate both the efficacy of Co-Reg as a denoising strategy for instance- and model-dependent noise and the broader feasibility of annotation-free, VLM-powered model adaptation.

Table: Characteristics of Partial Label Learning Paradigms

| Paradigm | Manual Annotations | Target Model Size | Accuracy Gain | Annotation-Free |
|---|---|---|---|---|
| Zero-shot | No | Large | — | Yes |
| Few-shot FT | Few | Large | Marginal | No |
| Unsupervised KD | No | Small | Minor | Yes |
| Supervised KD | Yes | Small | Significant | No |
| Full FT | Yes | Large | Significant | No |
| NPLL (Co-Reg) | No | Small | Significant | Yes |

6. Broader Implications and Future Directions

The integration of large pre-trained vision-language models as annotators, in conjunction with collaborative, consistency-regularized noisy partial label learning, provides a scalable strategy for developing accurate, lightweight, and specialized vision models without manual annotation. This advances the practical utility of foundation models beyond inference—positioning them as scalable knowledge distillers for efficient model specialization in arbitrary domains.

Potential implications include:

  • Scalability to New Domains: As VLMs continue to improve, annotation-free frameworks become practical for a wide range of domains—including those with limited or ambiguous human annotations.
  • Deployment on Resource-Constrained Devices: Enables high-accuracy, low-cost model training for specialized downstream applications with limited compute, bandwidth, or labeling.
  • Bridging Foundation and Lightweight Models: Co-Reg exemplifies an emerging pattern where massive, high-capacity pre-trained models seed the training of compact production models, especially when paired with weakly or noisily supervised learning strategies.

For further details and reproducibility, datasets, source code, and trained models corresponding to these paradigms are publicly released as described in the original paper.