EVA-CLIP-18B: Open-Source 18B Parameter CLIP Model

Updated 12 March 2026

EVA-CLIP-18B is an open-source contrastive language-image pretraining model based on a dual-encoder CLIP framework that scales to 18B parameters.
The paper introduces a weak-to-strong visual training strategy and demonstrates state-of-the-art zero-shot accuracy of 80.7% across 27 image classification benchmarks.
Extensive ablation studies accompany the results, which show significant gains in image classification, video classification, and image-text retrieval.

EVA-CLIP-18B is an open-source contrastive language-image pretraining (CLIP) model that scales the dual-encoder paradigm to 18 billion parameters. It is trained on a fixed corpus of approximately 2 billion publicly available image–text pairs from LAION-2B and COYO-700M, with ≈6 billion image–text samples seen over the course of training. It achieves an average zero-shot top-1 accuracy of 80.7% across 27 image classification benchmarks, outperforming all previous open CLIP models, and its accuracy improves consistently with model size with no observed saturation. EVA-CLIP-18B was developed using the EVA-style weak-to-strong visual model scaling approach, and all associated model weights, code, and training recipes are openly released (Sun et al., 2024).

1. Model Architecture

EVA-CLIP-18B implements a dual-encoder design based on the CLIP framework, consisting of a Vision Transformer (ViT) for image encoding, a Transformer-based text encoder, and linear projection heads that map both modalities into a shared embedding space:

Vision Encoder:
- 48 transformer layers ( $L_v = 48$ ), hidden dimension of 5120 ( $D_v = 5120$ ), 20480-dim MLP inner layer, 40 attention heads (per head dimension 128).
- Inputs are $224 \times 224$ images partitioned into patches of $14 \times 14$ pixels, giving a $16 \times 16$ grid of 256 patches.
- RMSNorm replaces LayerNorm, and unlike standard ViT, query, key, and value projections have no bias terms (following LLaMA conventions).
Text Encoder:
- 32 transformer layers ( $L_t = 32$ ), hidden size 1280 ( $D_t = 1280$ ), 5120-dim MLP inner layer, 20 attention heads (per head dimension 64).
- Initialized from EVA-CLIP-E/14+.
Projection Heads:
- Linear projections without activation, mapping image and text [CLS] representations to a shared 1024-dimensional “CLIP space.”
- Both outputs undergo $\ell_2$ normalization.
Architectural Notation:

$\begin{align*} x_i &= \text{raw image},\quad y_i = \text{text caption} \\ z_v(i) &= \text{VisionTransformer}(x_i) \in \mathbb{R}^{D_v} \xrightarrow{\text{Proj}_v} v_i \in \mathbb{R}^d \\ z_t(i) &= \text{TextTransformer}(y_i) \in \mathbb{R}^{D_t} \xrightarrow{\text{Proj}_t} t_i \in \mathbb{R}^d \end{align*}$

where $d = 1024$ , and $v_i, t_i$ are $\ell_2$ -normalized.
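
As a quick consistency check on the numbers above, the following sketch derives the per-head dimension and the patch count from the listed hyperparameters. The `EncoderConfig` class and the `num_patches` helper are hypothetical names introduced for illustration; they are not taken from the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    """Hypothetical container for the hyperparameters listed above."""
    layers: int
    hidden: int
    mlp: int
    heads: int

    @property
    def head_dim(self) -> int:
        # The hidden size must split evenly across attention heads.
        assert self.hidden % self.heads == 0
        return self.hidden // self.heads


# Values copied from the Vision Encoder and Text Encoder lists.
VISION = EncoderConfig(layers=48, hidden=5120, mlp=20480, heads=40)
TEXT = EncoderConfig(layers=32, hidden=1280, mlp=5120, heads=20)


def num_patches(image_size: int = 224, patch_size: int = 14) -> int:
    """Number of non-overlapping square patches in a square image."""
    assert image_size % patch_size == 0
    side = image_size // patch_size
    return side * side


print(VISION.head_dim, TEXT.head_dim)  # 128 64
print(num_patches())                   # 256
```

Both head dimensions match the values stated in the lists (128 and 64), and a $224 \times 224$ input yields 256 patch tokens before any class token is added.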

2. Training Objective and Protocol

Contrastive Loss

EVA-CLIP-18B uses the symmetric InfoNCE contrastive loss over mini-batches of $N$ image–text pairs $\{(v_i, t_i)\}_{i=1}^{N}$ :

$L = -\frac{1}{2N} \left[ \sum_{i} \log \frac{\exp(\operatorname{sim}(v_i, t_i)/\tau)}{\sum_j \exp(\operatorname{sim}(v_i, t_j)/\tau)} + \sum_{i} \log \frac{\exp(\operatorname{sim}(v_i, t_i)/\tau)}{\sum_j \exp(\operatorname{sim}(v_j, t_i)/\tau)} \right]$

with $\operatorname{sim}(a, b) = a^\top b$ (which equals cosine similarity because $v_i$ and $t_i$ are $\ell_2$ -normalized), temperature $\tau = 0.01$ , and the two terms (image-to-text and text-to-image) averaged.
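
The loss can be sketched in a few lines of plain Python. This is a minimal illustration on small lists of floats, assuming one matching text per image at the same index; it is not the authors' training code, and the function names are hypothetical.

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax_at(logits, index):
    """log(exp(logits[index]) / sum_j exp(logits[j])), computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return logits[index] - log_sum


def symmetric_info_nce(image_embs, text_embs, tau=0.01):
    """Symmetric InfoNCE over N pairs; pair i is (image_embs[i], text_embs[i])."""
    n = len(image_embs)
    v = [l2_normalize(e) for e in image_embs]
    t = [l2_normalize(e) for e in text_embs]
    # sim[i][j] = (v_i . t_j) / tau
    sim = [[sum(a * b for a, b in zip(v[i], t[j])) / tau for j in range(n)]
           for i in range(n)]
    total = 0.0
    for i in range(n):
        # Image-to-text term: normalize over all texts for image i (row i).
        total += log_softmax_at(sim[i], i)
        # Text-to-image term: normalize over all images for text i (column i).
        total += log_softmax_at([sim[j][i] for j in range(n)], i)
    return -total / (2 * n)


aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
print(symmetric_info_nce(aligned, aligned))  # close to 0
print(symmetric_info_nce(aligned, swapped))  # close to 100
```

With $\tau = 0.01$ the similarities are multiplied by 100, so perfectly matched orthogonal pairs give a loss near zero while fully mismatched pairs give a loss near $1/\tau$.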

Training Datasets

Merged-2B: 1.6B LAION-2B pairs plus 0.4B COYO-700M pairs.
Merged-2B+ (EVA-CLIP-18B+): Merged-2B plus 20M LAION-COCO synthetic captions and 23M Merged-Video samples (VideoCC, InternVid, WebVid-10M) in final epochs.

No dataset filtering was performed apart from public CLIP filtering and download failure exclusions.

Curriculum and Optimization

Phase 1: 5.4B image–text samples (patch-dropout; 50% of patches masked).
Phase 2: 0.6B mixed modality (image–text, LAION-COCO, video; varied patch-dropout).
Optimizer: LAMB ( $\beta_1=0.9$ , $\beta_2=0.95$ , no weight decay).
Peak learning rates: $4 \times 10^{-4}$ (vision), $4 \times 10^{-5}$ (text), with cosine decay and layer-wise LR decay ($0.9$ for vision, $0.75$ for text).
Batching and Precision: Global batch size 108k (224×224), mixed-precision bf16, ZeRO Stage-3, gradient checkpointing, FlashAttention.
Hardware: Hundreds of A100-80GB equivalents, coordinated via DeepSpeed ZeRO.
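
To make the learning-rate settings concrete, here is a small sketch of cosine decay combined with layer-wise LR decay. The exact conventions are assumptions, since the text above does not spell them out: the schedule decays to zero with no warmup, and the last transformer block receives the full peak rate while each earlier block is scaled down by one more factor of the decay. The function names are hypothetical.

```python
import math

VISION_PEAK_LR = 4e-4  # from the text above
TEXT_PEAK_LR = 4e-5    # from the text above


def cosine_lr(step, total_steps, peak_lr):
    """Cosine decay from peak_lr at step 0 to 0 at total_steps (no warmup assumed)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


def layer_scale(layer_index, num_layers, decay):
    """Assumed convention: the last block has scale 1, earlier blocks decay geometrically."""
    return decay ** (num_layers - 1 - layer_index)


def block_lr(step, total_steps, peak_lr, layer_index, num_layers, decay):
    """Learning rate of one transformer block at a given step."""
    return cosine_lr(step, total_steps, peak_lr) * layer_scale(layer_index, num_layers, decay)


# Vision tower: 48 blocks, layer-wise decay 0.9.
first = block_lr(0, 1000, VISION_PEAK_LR, layer_index=0, num_layers=48, decay=0.9)
last = block_lr(0, 1000, VISION_PEAK_LR, layer_index=47, num_layers=48, decay=0.9)
print(first < last)
```

Under this convention the earliest vision blocks train with a rate more than two orders of magnitude below the peak, which is the usual motivation for layer-wise decay when the lower layers start from a pretrained initialization.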

3. Empirical Scaling Behavior

Scaling EVA-CLIP models from 5B to 18B parameters on a fixed dataset of about 2B image–text pairs consistently yielded +0.7–1.0 percentage points of average zero-shot accuracy per doubling of parameter count, a power-law-like trend with no evidence of saturation.

Model	Parameters	Avg. Zero-Shot Acc. (27 benchmarks)
EVA-CLIP-E/14+	5B	79.2%
DFN-5B (H/14)	5B	79.2%
WebLI-10B	10B	79.8%
InternVL-C	12B	78.0%
EVA-CLIP-18B	18B	80.7%

A plausible implication is that further scaling—either in model size or dataset scale—may yield continued gains (Sun et al., 2024).
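
The per-doubling figure can be checked against the endpoints of the table with a short calculation. This sketch uses only the EVA-CLIP-E/14+ and EVA-CLIP-18B rows as reported above; the helper name is hypothetical.

```python
import math


def gain_per_doubling(params_small, acc_small, params_large, acc_large):
    """Accuracy gain (percentage points) per doubling of parameter count."""
    doublings = math.log2(params_large / params_small)
    return (acc_large - acc_small) / doublings


# EVA-CLIP-E/14+ (5B, 79.2%) to EVA-CLIP-18B (18B, 80.7%), values from the table.
gain = gain_per_doubling(5e9, 79.2, 18e9, 80.7)
print(round(gain, 2))  # 0.81
```

Going from 5B to 18B parameters is about 1.85 doublings, so a 1.5-point gain works out to roughly 0.8 points per doubling, inside the quoted +0.7–1.0 range.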

4. Zero-Shot Performance Across Modalities

Image Classification

EVA-CLIP-18B attains an average zero-shot top-1 accuracy of 80.7% on 27 benchmark datasets. Key per-dataset results include:

ImageNet-1K: 85.3%
ImageNet-V2: 83.1%
ImageNet-Adversarial: 65.4%
ObjectNet: 70.2%
CIFAR-100: 92.5%

These results establish a new state-of-the-art among open CLIP models.
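
For readers unfamiliar with the evaluation protocol, zero-shot classification with a CLIP-style model amounts to picking the class whose text-prompt embedding is most similar to the image embedding. The sketch below uses invented 3-dimensional toy embeddings rather than real model outputs, and the function names are hypothetical.

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_predict(image_emb, class_text_embs):
    """Return the class whose prompt embedding has the highest cosine similarity."""
    v = l2_normalize(image_emb)
    scores = {}
    for name, text_emb in class_text_embs.items():
        t = l2_normalize(text_emb)
        scores[name] = sum(a * b for a, b in zip(v, t))
    return max(scores, key=scores.get)


def top1_accuracy(image_embs, labels, class_text_embs):
    correct = sum(
        zero_shot_predict(emb, class_text_embs) == label
        for emb, label in zip(image_embs, labels)
    )
    return correct / len(labels)


# Toy embeddings, invented for illustration only.
CLASS_EMBS = {"cat": [1.0, 0.1, 0.0], "dog": [0.0, 1.0, 0.1], "car": [0.1, 0.0, 1.0]}
IMAGES = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.3], [0.7, 0.6, 0.0]]
LABELS = ["cat", "dog", "dog"]

print(top1_accuracy(IMAGES, LABELS, CLASS_EMBS))
```

In this toy set the third image sits closer to the "cat" prompt than to its "dog" label, so two of three predictions are correct. A benchmark average such as the 80.7% figure is then the mean of per-dataset top-1 accuracies.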

Video Classification

Classification using a single center frame:

UCF-101: 86.0%
Kinetics-400: 72.9% (avg. top-1/top-5)
Kinetics-600: 72.9%
Kinetics-700: 68.2%

Sampling 8 video frames yields an average performance gain of +4.7% across video benchmarks.
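
One simple way to extend an image encoder to video is sketched below. The text above does not state how frames are selected or how their embeddings are combined, so both choices here are assumptions: frames are taken at the midpoints of equal-length segments, and the per-frame embeddings are mean-pooled and re-normalized. All function names are hypothetical.

```python
import math


def center_frame_index(num_frames):
    """Index of the single center frame."""
    return num_frames // 2


def uniform_frame_indices(num_frames, k=8):
    """Midpoint of each of k equal-length segments of the clip (assumed scheme)."""
    return [int((i + 0.5) * num_frames / k) for i in range(k)]


def pool_frame_embeddings(frame_embs):
    """Average per-frame embeddings, then re-normalize to unit length (assumed scheme)."""
    dim = len(frame_embs[0])
    mean = [sum(e[d] for e in frame_embs) / len(frame_embs) for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean]


print(center_frame_index(80))     # 40
print(uniform_frame_indices(80))  # [5, 15, 25, 35, 45, 55, 65, 75]
```

The pooled vector can then be compared with class-prompt embeddings exactly as in the image case.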

Image–Text Retrieval

On Flickr30K and COCO, EVA-CLIP-18B achieves an average recall (R@1,5,10) of 87.8%, outperforming prior open models by a significant margin.
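
Recall@K and its average can be computed from a query-by-candidate similarity matrix, as in the sketch below. It assumes exactly one correct candidate per query at the same index, which simplifies the real benchmarks (Flickr30K and COCO pair each image with several captions), and it averages over both retrieval directions; how the 87.8% figure is aggregated is not specified above. The function names are hypothetical.

```python
def recall_at_k(sim, k):
    """sim[i][j] is the similarity of query i to candidate j; the match is j == i."""
    hits = 0
    for i, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)


def mean_recall(sim, ks=(1, 5, 10)):
    """Average of Recall@K over both retrieval directions and all K in ks."""
    transposed = [list(col) for col in zip(*sim)]
    values = [recall_at_k(sim, k) for k in ks]
    values += [recall_at_k(transposed, k) for k in ks]
    return sum(values) / len(values)


# Toy image-by-text similarity matrix, invented for illustration only.
SIM = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.1, 0.2, 0.7],
]
print(recall_at_k(SIM, 1))  # image-to-text R@1: 2 of 3 queries hit
print(recall_at_k(SIM, 2))  # 1.0
```

In the toy matrix the second image ranks the wrong caption first, so image-to-text R@1 is 2/3 while R@2 is already 1.0.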

5. Ablations and Robustness

Multiple ablation studies provide insight into the contributions of various design and training decisions:

EVA-Style Weak-to-Strong Scaling: Training follows a two-stage procedure. First, a “weak teacher” (EVA-CLIP-E/14+) distills knowledge into the 18B-parameter model through masked reconstruction; the model then undergoes final contrastive pretraining. This stabilizes large-scale training without requiring an increase in data scale.
Video Data: Inclusion of 23M video samples at the end of training results in +0.7% (1-frame) and +0.8% (8-frame) improvements in zero-shot video classification, with negligible effect (–0.1%) on average image–text retrieval.
Resolution: Elevating input resolution from $224^2$ to $336^2$ and $448^2$ improves accuracy by approximately +0.5% and +0.9%, respectively, on select benchmarks.
Transform Robustness: EVA-CLIP-18B exhibits minimal sensitivity (±0.1%) to changes between direct resize and center-crop evaluation pipelines, surpassing prior models in this robustness aspect.

6. Significance, Limitations, and Future Directions

EVA-CLIP-18B, the largest open CLIP model to date, demonstrates that consistent performance improvements are achievable through model scaling on a fixed open dataset, challenging the presupposition that further gains require substantially larger or proprietary data. The approach illustrates the effectiveness of EVA-style weak-to-strong progression for stabilizing very large vision–language models, and its gains generalize across image and video classification as well as cross-modal retrieval.

All model artifacts are openly available, supporting reproducibility and further research (Sun et al., 2024). Potential directions include scaling model size and/or dataset further, multimodal finetuning, instruction tuning, and integration in larger multimodal systems that combine vision encoders with LLMs.

References

Sun et al. (2024). EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters.
