EVA-CLIP-18B: Open-Source 18B Parameter CLIP Model
- EVA-CLIP-18B is an open-source contrastive language-image pretraining model based on a dual-encoder CLIP framework that scales to 18B parameters.
- The paper introduces a weak-to-strong visual training strategy and demonstrates state-of-the-art zero-shot accuracy of 80.7% across 27 image classification benchmarks.
- Extensive ablation studies and evaluations show significant gains in image classification, video classification, and image–text retrieval.
EVA-CLIP-18B is an open-source contrastive language-image pretraining (CLIP) model that scales the dual-encoder paradigm to 18 billion parameters. It is trained on a fixed corpus of approximately 2 billion publicly available image–text pairs from LAION-2B and COYO-700M, with ≈6 billion total image–text "views" seen during training. It achieves an average zero-shot top-1 accuracy of 80.7% across 27 image classification benchmarks, outperforming all previous open CLIP models, and its accuracy improves consistently with model scale with no observed saturation. EVA-CLIP-18B was developed using the EVA-style weak-to-strong visual model scaling approach, and all associated model weights, code, and training recipes are openly released (Sun et al., 2024).
1. Model Architecture
EVA-CLIP-18B implements a dual-encoder design based on the CLIP framework, consisting of a Vision Transformer (ViT) for image encoding, a Transformer-based text encoder, and linear projection heads that map both modalities into a shared embedding space:
- Vision Encoder:
- 48 transformer layers, hidden dimension of 5120, 20480-dim MLP inner layer, 40 attention heads (per-head dimension 128).
- Inputs are images partitioned into patches.
- RMSNorm replaces LayerNorm, and unlike standard ViT, query, key, and value projections have no bias terms (following LLaMA conventions).
- Text Encoder:
- 32 transformer layers, hidden size 1280, 5120-dim MLP inner layer, 20 attention heads (per-head dimension 64).
- Initialized from EVA-CLIP-E/14+.
- Projection Heads:
- Linear projections without activation, mapping image and text [CLS] representations to a shared 1024-dimensional “CLIP space.”
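- Both outputs undergo $\ell_2$ normalization.
- Architectural Notation:
$$z^{I} = \frac{W_I\, f_I(x)}{\lVert W_I\, f_I(x) \rVert_2}, \qquad z^{T} = \frac{W_T\, f_T(t)}{\lVert W_T\, f_T(t) \rVert_2},$$
where $f_I$ and $f_T$ denote the vision and text encoders applied to an image $x$ and a text $t$, $W_I$ and $W_T$ are the linear projection heads, and $z^{I}$ and $z^{T}$ are $\ell_2$-normalized.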
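The shape hyperparameters listed above can be collected into a small configuration object, which also makes the per-head dimensions easy to check. This is a minimal sketch in plain Python, not the released implementation: `TowerConfig` and `project_and_normalize` are hypothetical names, and the projection is written without a bias term as a simplifying assumption (the text only states that it has no activation).

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TowerConfig:
    """Shape hyperparameters for one encoder tower (hypothetical container)."""
    layers: int
    hidden: int
    mlp: int
    heads: int

    @property
    def head_dim(self) -> int:
        # The hidden width is split evenly across attention heads.
        assert self.hidden % self.heads == 0
        return self.hidden // self.heads


# Values copied from the architecture list above.
VISION = TowerConfig(layers=48, hidden=5120, mlp=20480, heads=40)
TEXT = TowerConfig(layers=32, hidden=1280, mlp=5120, heads=20)
CLIP_DIM = 1024  # width of the shared embedding space


def project_and_normalize(features, weight):
    """Linear projection (no activation; bias omitted by assumption), then L2 normalization.

    `features` is a list of d floats; `weight` is a list of rows, one per
    output dimension, each holding d floats.
    """
    projected = [sum(w * f for w, f in zip(row, features)) for row in weight]
    norm = math.sqrt(sum(p * p for p in projected))
    return [p / norm for p in projected]
```

With these values, `VISION.head_dim` evaluates to 128 and `TEXT.head_dim` to 64, matching the per-head dimensions quoted in the list.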
2. Training Objective and Protocol
Contrastive Loss
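EVA-CLIP-18B uses the symmetric InfoNCE contrastive loss over mini-batches of $N$ image–text pairs $\{(x_i, t_i)\}_{i=1}^{N}$:
$$\mathcal{L} = -\frac{1}{2N} \sum_{i=1}^{N} \left[ \log \frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N} \exp(s_{ij}/\tau)} + \log \frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N} \exp(s_{ji}/\tau)} \right],$$
with $s_{ij} = (z^{I}_i)^{\top} z^{T}_j$ (cosine similarity between the normalized embeddings of image $i$ and text $j$), temperature $\tau$, and both terms (image-to-text, text-to-image) averaged.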
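To make the two directions of the loss concrete, the following is a minimal pure-Python sketch of a symmetric InfoNCE loss over already-normalized embeddings. The function names are hypothetical, and the default `tau=0.07` is only an illustrative placeholder, not a value taken from the paper.

```python
import math


def mean_matched_log_prob(sim, tau):
    """Mean over i of log softmax(sim[i] / tau)[i], i.e. the log-probability
    that row i assigns to its matched column i."""
    total = 0.0
    for i, row in enumerate(sim):
        scaled = [s / tau for s in row]
        m = max(scaled)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
        total += scaled[i] - log_z
    return total / len(sim)


def symmetric_info_nce(image_emb, text_emb, tau=0.07):
    """Symmetric contrastive loss for L2-normalized embeddings.

    image_emb[i] and text_emb[i] are assumed to form a matched pair.
    """
    # sim[i][j] is the cosine similarity between image i and text j.
    sim = [[sum(a * b for a, b in zip(u, v)) for v in text_emb] for u in image_emb]
    sim_t = [list(col) for col in zip(*sim)]  # transpose: texts as queries
    image_to_text = -mean_matched_log_prob(sim, tau)
    text_to_image = -mean_matched_log_prob(sim_t, tau)
    return 0.5 * (image_to_text + text_to_image)
```

When every image embedding equals every text embedding, all similarities are identical and the loss reduces to $\log N$, which is a convenient sanity check; correctly aligned pairs give a lower loss than the same batch with the captions permuted.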
Training Datasets
- Merged-2B: 1.6B LAION-2B pairs plus 0.4B COYO-700M pairs.
- Merged-2B+ (EVA-CLIP-18B+): Merged-2B plus 20M LAION-COCO synthetic captions and 23M Merged-Video samples (VideoCC, InternVid, WebVid-10M) in final epochs.
No dataset filtering was performed apart from public CLIP filtering and download failure exclusions.
Curriculum and Optimization
- Phase 1: 5.4B image–text samples (patch-dropout; 50% of patches masked).
- Phase 2: 0.6B mixed modality (image–text, LAION-COCO, video; varied patch-dropout).
- Optimizer: LAMB with no weight decay.
- Peak learning rates: set separately for the vision and text encoders, with cosine decay and layer-wise LR decay ($0.9$ for vision, $0.75$ for text).
- Batching and Precision: Global batch size 108k (224×224), mixed-precision bf16, ZeRO Stage-3, gradient checkpointing, FlashAttention.
- Hardware: Hundreds of A100-80GB equivalents, coordinated via DeepSpeed ZeRO.
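The schedule described in this list can be sketched in a few helper functions. This is an illustration under stated assumptions rather than the authors' training code: "108k" is read as exactly 108,000, no warmup is modeled because none is described here, the layer-wise decay uses the common convention that the last layer has scale 1 and each earlier layer is multiplied by the decay factor once more, and patch-dropout is modeled as keeping a uniformly random subset of patch indices. All function names are hypothetical, and any peak learning rate passed in is a placeholder.

```python
import math
import random

TOTAL_SAMPLES = 5.4e9 + 0.6e9  # phase 1 + phase 2 samples
GLOBAL_BATCH = 108_000         # "108k", assumed to mean exactly 108,000


def total_steps(samples=TOTAL_SAMPLES, batch=GLOBAL_BATCH):
    """Number of optimizer steps needed to consume `samples` examples."""
    return math.ceil(samples / batch)


def cosine_lr(step, num_steps, peak_lr, min_lr=0.0):
    """Cosine decay from peak_lr at step 0 to min_lr at num_steps (no warmup)."""
    progress = min(max(step / num_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def layerwise_scales(num_layers, decay):
    """LR multiplier per layer; index 0 is the layer closest to the input.

    Assumed convention: the last layer keeps scale 1.0, and each earlier
    layer is scaled down by one more factor of `decay`.
    """
    return [decay ** (num_layers - 1 - i) for i in range(num_layers)]


def patch_dropout(num_patches, keep_ratio=0.5, rng=random):
    """Indices of the patches kept after randomly dropping the rest."""
    keep = max(1, int(num_patches * keep_ratio))
    return sorted(rng.sample(range(num_patches), keep))
```

Under these assumptions, 6 billion samples at a global batch of 108,000 correspond to roughly 55,600 optimizer steps. The per-layer learning rate at a given step would then be `cosine_lr(step, num_steps, peak_lr) * layerwise_scales(48, 0.9)[layer]` for the vision tower.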
3. Empirical Scaling Behavior
Scaling EVA-CLIP models from 5B to 18B parameters on a fixed 2B-image–text pair dataset consistently yielded +0.7–1.0% average zero-shot accuracy per doubling in parameter count, exhibiting a power-law-like trend and no evidence of saturation.
| Model | Parameters | Avg. Zero-Shot Acc. (27 benchmarks) |
|---|---|---|
| EVA-CLIP-E/14+ | 5B | 79.2% |
| DFN-5B (H/14) | 5B | 79.2% |
| WebLI-10B | 10B | 79.8% |
| InternVL-C | 12B | 78.0% |
| EVA-CLIP-18B | 18B | 80.7% |
A plausible implication is that further scaling—either in model size or dataset scale—may yield continued gains (Sun et al., 2024).
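The per-doubling figure can be checked against the two EVA-CLIP rows of the table. The sketch below assumes accuracy is roughly linear in the logarithm of the parameter count between the two points, which is a simplification; `gain_per_doubling` is a hypothetical helper.

```python
import math


def gain_per_doubling(params_small, acc_small, params_large, acc_large):
    """Average accuracy gain (in percentage points) per 2x increase in parameters."""
    doublings = math.log2(params_large / params_small)
    return (acc_large - acc_small) / doublings


# EVA-CLIP-E/14+ (5B, 79.2%) -> EVA-CLIP-18B (18B, 80.7%), from the table above.
rate = gain_per_doubling(5e9, 79.2, 18e9, 80.7)
```

Going from 5B to 18B is about 1.85 doublings, so the 1.5-point improvement works out to roughly 0.8 points per doubling, inside the quoted +0.7–1.0% range.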
4. Zero-Shot Performance Across Modalities
Image Classification
EVA-CLIP-18B attains an average zero-shot top-1 accuracy of 80.7% on 27 benchmark datasets. Key per-dataset results include:
- ImageNet-1K: 85.3%
- ImageNet-V2: 83.1%
- ImageNet-Adversarial: 65.4%
- ObjectNet: 70.2%
- CIFAR-100: 92.5%
These results establish a new state-of-the-art among open CLIP models.
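Zero-shot classification with a CLIP-style model amounts to picking the class whose text embedding is most similar to the image embedding. The sketch below shows that step and a simple benchmark average on toy data; it assumes embeddings are already L2-normalized, that class embeddings come from text prompts computed elsewhere, and that the multi-benchmark average is an unweighted mean over datasets. All names are hypothetical.

```python
def dot(a, b):
    """Inner product; equals cosine similarity for L2-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))


def zero_shot_predict(image_emb, class_embs):
    """Index of the class whose text embedding is most similar to the image."""
    scores = [dot(image_emb, c) for c in class_embs]
    return max(range(len(scores)), key=scores.__getitem__)


def top1_accuracy(image_embs, labels, class_embs):
    """Fraction of images whose top-scoring class matches the label."""
    correct = sum(
        zero_shot_predict(e, class_embs) == y for e, y in zip(image_embs, labels)
    )
    return correct / len(labels)


def benchmark_average(per_dataset_accuracy):
    """Unweighted mean over datasets, so every benchmark counts equally."""
    return sum(per_dataset_accuracy.values()) / len(per_dataset_accuracy)
```

An unweighted mean treats a small dataset the same as ImageNet-1K, which is why a single headline number can hide large per-dataset differences such as those listed above.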
Video Classification
Classification using a single center frame:
- UCF-101: 86.0%
- Kinetics-400: 72.9% (avg. top-1/top-5)
- Kinetics-600: 72.9%
- Kinetics-700: 68.2%
Sampling 8 video frames yields an average performance gain of +4.7% across video benchmarks.
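The difference between the single-frame and 8-frame settings can be illustrated with frame-index selection and a simple pooling step. This is a sketch under assumptions: the text does not say how frames are chosen or aggregated, so uniform sampling at segment midpoints and mean pooling of frame embeddings followed by renormalization are illustrative choices, and the function names are hypothetical.

```python
import math


def center_frame(num_frames):
    """Index of the middle frame of a clip."""
    return num_frames // 2


def uniform_frames(num_frames, k=8):
    """k frame indices taken at the midpoints of k equal-length segments."""
    return [min(num_frames - 1, int((i + 0.5) * num_frames / k)) for i in range(k)]


def mean_pool(frame_embs):
    """Average per-frame embeddings and L2-normalize the result."""
    dim = len(frame_embs[0])
    mean = [sum(e[d] for e in frame_embs) / len(frame_embs) for d in range(dim)]
    norm = math.sqrt(sum(m * m for m in mean))
    return [m / norm for m in mean]
```

The pooled video embedding can then be compared against class text embeddings in the same way as a single image embedding.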
Image–Text Retrieval
On Flickr30K and COCO, EVA-CLIP-18B achieves an average recall (R@1,5,10) of 87.8%, outperforming prior open models by a significant margin.
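Recall@K measures how often the correct match appears among the K highest-scoring candidates. The sketch below assumes a simplified one-to-one setup in which query `i` matches candidate `i`; the actual Flickr30K and COCO protocols pair each image with several captions, and the exact way the reported average combines datasets and retrieval directions is not spelled out here. Function names are hypothetical.

```python
def recall_at_k(sim, k):
    """Fraction of queries whose matched candidate ranks in the top k.

    sim[i][j] is the similarity of query i to candidate j; the correct
    candidate for query i is assumed to be index i.
    """
    hits = 0
    for i, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)


def mean_recall(sim, ks=(1, 5, 10)):
    """Average of recall@k over the given cutoffs."""
    return sum(recall_at_k(sim, k) for k in ks) / len(ks)
```

Running the same functions on the transposed similarity matrix gives the opposite retrieval direction (text-to-image instead of image-to-text).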
5. Ablations and Robustness
Multiple ablation studies provide insight into the contributions of various design and training decisions:
- EVA-Style Weak-to-Strong Scaling: Training follows a two-stage procedure: a “weak teacher” (EVA-CLIP-E/14+) distills knowledge to the large 18B-parameter model through masked reconstruction, then proceeds with final contrastive pretraining. This stabilizes large-scale training without requiring an increase in data scale.
- Video Data: Inclusion of 23M video samples at the end of training results in +0.7% (1-frame) and +0.8% (8-frame) improvements in zero-shot video classification, with negligible effect (–0.1%) on average image–text retrieval.
- Resolution: Raising the input resolution from the base setting to two successively higher settings improves accuracy by approximately +0.5% and +0.9%, respectively, on select benchmarks.
- Transform Robustness: EVA-CLIP-18B exhibits minimal sensitivity (±0.1%) to changes between direct resize and center-crop evaluation pipelines, surpassing prior models in this robustness aspect.
6. Significance, Limitations, and Future Directions
EVA-CLIP-18B, as the largest open-model instantiation of CLIP to date, demonstrates that consistent performance improvements are achievable via model scaling under a fixed open dataset, challenging the presupposition that further performance gains require substantially larger or proprietary data. The approach exemplifies the effectiveness of EVA-style weak-to-strong progression for stabilizing the training of massive vision–language models, and it extends generalization gains across image and video modalities as well as cross-modal retrieval.
All model artifacts are openly available, supporting reproducibility and further research (Sun et al., 2024). Potential directions include scaling model size and/or dataset further, multimodal finetuning, instruction tuning, and integration in larger multimodal systems that combine vision encoders with LLMs.