
EVA-CLIP-18B: Open-Source 18B Parameter CLIP Model

Updated 12 March 2026
  • EVA-CLIP-18B is an open-source contrastive language-image pretraining model based on a dual-encoder CLIP framework that scales to 18B parameters.
  • The paper introduces a weak-to-strong visual training strategy and reports an average zero-shot top-1 accuracy of 80.7% across 27 image classification benchmarks, the best among open CLIP models at the time.
  • Evaluations and ablation studies show gains in image classification, video classification, and image-text retrieval.

EVA-CLIP-18B is an open-source contrastive language-image pretraining (CLIP) model that scales the dual-encoder paradigm to 18 billion parameters. It is trained on a fixed corpus of approximately 2 billion publicly available image–text pairs from LAION-2B and COYO-700M, with about 6 billion training samples seen in total. It achieves an average zero-shot top-1 accuracy of 80.7% across 27 image classification benchmarks, outperforming all previous open CLIP models. Performance improves consistently with model size, with no observed saturation. The model was developed using the EVA-style weak-to-strong visual model scaling approach, and all associated model weights, code, and training recipes are openly released (Sun et al., 2024).

1. Model Architecture

EVA-CLIP-18B implements a dual-encoder design based on the CLIP framework, consisting of a Vision Transformer (ViT) for image encoding, a Transformer-based text encoder, and linear projection heads that map both modalities into a shared embedding space:

  • Vision Encoder:
    • 48 transformer layers ($L_v = 48$), hidden dimension of 5120 ($D_v = 5120$), 20480-dim MLP inner layer, 40 attention heads (per head dimension 128).
    • Inputs are $224 \times 224$ images partitioned into patches of $14 \times 14$ pixels, giving a $16 \times 16$ grid of 256 patches.
    • RMSNorm replaces LayerNorm, and unlike standard ViT, query, key, and value projections have no bias terms (following LLaMA conventions).
  • Text Encoder:
    • 32 transformer layers ($L_t = 32$), hidden size 1280 ($D_t = 1280$), 5120-dim MLP inner layer, 20 attention heads (per head dimension 64).
    • Initialized from EVA-CLIP-E/14+.
  • Projection Heads:
    • Linear projections without activation, mapping image and text [CLS] representations to a shared 1024-dimensional “CLIP space.”
    • Both outputs undergo $\ell_2$ normalization.
  • Architectural Notation:

$$
\begin{aligned}
x_i &= \text{raw image}, \quad y_i = \text{text caption} \\
z_v(i) &= \text{VisionTransformer}(x_i) \in \mathbb{R}^{D_v} \xrightarrow{\text{Proj}_v} v_i \in \mathbb{R}^d \\
z_t(i) &= \text{TextTransformer}(y_i) \in \mathbb{R}^{D_t} \xrightarrow{\text{Proj}_t} t_i \in \mathbb{R}^d
\end{aligned}
$$

where $d = 1024$, and $v_i, t_i$ are $\ell_2$-normalized.
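The projection and normalization step can be illustrated with a minimal sketch. It uses toy dimensions instead of $D_v = 5120$, $D_t = 1280$, $d = 1024$, random weights in place of trained ones, and assumes bias-free projection matrices; the helper names (`project`, `embed`) are hypothetical and not taken from the released code.

```python
import math
import random

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def project(weights, features):
    """Linear map with no activation (bias omitted here as an assumption)."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

def embed(features, weights):
    """Encoder output -> shared space -> unit length."""
    return l2_normalize(project(weights, features))

# Toy sizes standing in for D_v = 5120, D_t = 1280, d = 1024.
D_V, D_T, D = 8, 4, 3
rng = random.Random(0)
proj_v = [[rng.gauss(0, 1) for _ in range(D_V)] for _ in range(D)]
proj_t = [[rng.gauss(0, 1) for _ in range(D_T)] for _ in range(D)]

z_v = [rng.gauss(0, 1) for _ in range(D_V)]  # stand-in for the vision encoder output
z_t = [rng.gauss(0, 1) for _ in range(D_T)]  # stand-in for the text encoder output

v = embed(z_v, proj_v)
t = embed(z_t, proj_t)

# After normalization the dot product is the cosine similarity.
similarity = sum(a * b for a, b in zip(v, t))
```

Because both vectors have unit length, `similarity` always lies in $[-1, 1]$, which is what allows the contrastive loss to treat a plain dot product as cosine similarity.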

2. Training Objective and Protocol

Contrastive Loss

EVA-CLIP-18B uses the symmetric InfoNCE contrastive loss over mini-batches of $N$ image–text pairs $\{(v_i, t_i)\}_{i=1}^{N}$:

$$
L = -\frac{1}{2N} \left[ \sum_{i=1}^{N} \log \frac{\exp(\operatorname{sim}(v_i, t_i)/\tau)}{\sum_{j=1}^{N} \exp(\operatorname{sim}(v_i, t_j)/\tau)} + \sum_{i=1}^{N} \log \frac{\exp(\operatorname{sim}(v_i, t_i)/\tau)}{\sum_{j=1}^{N} \exp(\operatorname{sim}(v_j, t_i)/\tau)} \right]
$$

with $\operatorname{sim}(a, b) = a^\top b$, which equals cosine similarity because the embeddings are $\ell_2$-normalized, temperature $\tau = 0.01$, and the two terms (image-to-text and text-to-image) averaged.
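The loss above can be written out directly. The following is a minimal pure-Python sketch for tiny batches, assuming the inputs are already $\ell_2$-normalized; it is meant to mirror the formula term by term, not to reflect the released training code, and the function names are illustrative.

```python
import math

def log_softmax_at(logits, index):
    """Numerically stable log-softmax evaluated at one position."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[index] - lse

def symmetric_info_nce(image_embs, text_embs, tau=0.01):
    """Symmetric InfoNCE over N matched (image, text) pairs.

    Assumes both inputs are lists of unit-length vectors, where pair i
    is the positive and all other pairings in the batch are negatives.
    """
    n = len(image_embs)
    # logits[i][j] = sim(v_i, t_j) / tau
    logits = [[sum(a * b for a, b in zip(v, t)) / tau for t in text_embs]
              for v in image_embs]
    # Image-to-text: normalize over texts j for each image i (rows).
    i2t = sum(log_softmax_at(logits[i], i) for i in range(n))
    # Text-to-image: normalize over images j for each text i (columns).
    t2i = sum(log_softmax_at([logits[j][i] for j in range(n)], i)
              for i in range(n))
    return -(i2t + t2i) / (2 * n)

# Perfectly aligned pairs give a loss near zero.
aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]],
                             [[1.0, 0.0], [0.0, 1.0]])
# Swapped pairs are penalized heavily at tau = 0.01.
swapped = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]],
                             [[0.0, 1.0], [1.0, 0.0]])
```

With $\tau = 0.01$ a similarity gap of 1 becomes a logit gap of 100, so the swapped batch yields a loss of about 100 while the aligned batch is essentially 0. This shows how sharply a small temperature separates positives from negatives.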

Training Datasets

  • Merged-2B: 1.6B LAION-2B pairs plus 0.4B COYO-700M pairs.
  • Merged-2B+ (EVA-CLIP-18B+): Merged-2B plus 20M LAION-COCO synthetic captions and 23M Merged-Video samples (VideoCC, InternVid, WebVid-10M) in final epochs.

No dataset filtering was performed apart from public CLIP filtering and download failure exclusions.

Curriculum and Optimization

  • Phase 1: 5.4B image–text samples (patch-dropout; 50% of patches masked).
  • Phase 2: 0.6B mixed modality (image–text, LAION-COCO, video; varied patch-dropout).
  • Optimizer: LAMB ($\beta_1 = 0.9$, $\beta_2 = 0.95$, no weight decay).
  • Peak learning rates: $4 \times 10^{-4}$ (vision), $4 \times 10^{-5}$ (text), with cosine decay and layer-wise LR decay ($0.9$ for vision, $0.75$ for text).
  • Batching and Precision: Global batch size 108k (224×224), mixed-precision bf16, ZeRO Stage-3, gradient checkpointing, FlashAttention.
  • Hardware: Hundreds of A100-80GB equivalents, coordinated via DeepSpeed ZeRO.
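Two items in the list above, layer-wise learning-rate decay with a cosine schedule and patch dropout, can be sketched as follows. The exact formulas are assumptions: the decay exponent follows a common convention in which the last layer receives the full peak rate, warmup is omitted, and `total_steps` is a hypothetical value that the source does not give.

```python
import math
import random

def cosine_factor(step, total_steps):
    """Cosine decay from 1.0 at step 0 to 0.0 at total_steps (no warmup)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

def layer_lr(peak_lr, layer_index, num_layers, layer_decay, step, total_steps):
    """Learning rate for one layer (1-based index).

    Assumed convention: the top layer trains at peak_lr and each layer
    below it is scaled by one further factor of layer_decay.
    """
    depth_scale = layer_decay ** (num_layers - layer_index)
    return peak_lr * depth_scale * cosine_factor(step, total_steps)

def patch_dropout(num_patches, keep_ratio, rng):
    """Return sorted indices of the patch tokens kept for this sample."""
    num_keep = max(1, int(num_patches * keep_ratio))
    return sorted(rng.sample(range(num_patches), num_keep))

TOTAL_STEPS = 1000  # hypothetical; not stated in the source

# Vision tower: peak 4e-4, 48 layers, decay 0.9.
top_vision = layer_lr(4e-4, 48, 48, 0.9, step=0, total_steps=TOTAL_STEPS)
first_vision = layer_lr(4e-4, 1, 48, 0.9, step=0, total_steps=TOTAL_STEPS)
final_vision = layer_lr(4e-4, 48, 48, 0.9, step=TOTAL_STEPS, total_steps=TOTAL_STEPS)

# Phase 1: half of the 256 patches (224x224 image, 14x14-pixel patches) are kept.
kept = patch_dropout(256, 0.5, random.Random(0))
```

Under this convention the first vision layer trains at $0.9^{47}$ of the peak rate, so the lowest layers change very slowly. Dropping half of the patches roughly halves the number of tokens the vision encoder processes per image.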

3. Empirical Scaling Behavior

Scaling EVA-CLIP models from 5B to 18B parameters on a fixed dataset of 2B image–text pairs consistently yielded +0.7–1.0% average zero-shot accuracy per doubling in parameter count, exhibiting a power-law-like trend and no evidence of saturation.

| Model | Parameters | Avg. Zero-Shot Acc. (27 benchmarks) |
|---|---|---|
| EVA-CLIP-E/14+ | 5B | 79.2% |
| DFN-5B (H/14) | 5B | 79.2% |
| WebLI-10B | 10B | 79.8% |
| InternVL-C | 12B | 78.0% |
| EVA-CLIP-18B | 18B | 80.7% |

A plausible implication is that further scaling—either in model size or dataset scale—may yield continued gains (Sun et al., 2024).
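As a quick arithmetic check on the per-doubling figure, the sketch below takes the two EVA-CLIP rows from the table (5B at 79.2%, 18B at 80.7%) and divides the accuracy gain by the number of parameter doublings. The function name is illustrative, and using only two endpoints is a simplification of any real scaling fit.

```python
import math

def gain_per_doubling(params_small, acc_small, params_large, acc_large):
    """Accuracy gain (percentage points) per doubling of parameter count."""
    doublings = math.log2(params_large / params_small)
    return (acc_large - acc_small) / doublings

# EVA-CLIP-E/14+ (5B, 79.2%) -> EVA-CLIP-18B (18B, 80.7%)
gain = gain_per_doubling(5e9, 79.2, 18e9, 80.7)
print(round(gain, 2))  # 0.81
```

Going from 5B to 18B is about 1.85 doublings, so a 1.5-point gain works out to roughly 0.8 points per doubling, which falls within the stated 0.7 to 1.0 range.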

4. Zero-Shot Performance Across Modalities

Image Classification

EVA-CLIP-18B attains an average zero-shot top-1 accuracy of 80.7% on 27 benchmark datasets. Key per-dataset results include:

  • ImageNet-1K: 85.3%
  • ImageNet-V2: 83.1%
  • ImageNet-Adversarial: 65.4%
  • ObjectNet: 70.2%
  • CIFAR-100: 92.5%

These results establish a new state-of-the-art among open CLIP models.
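Zero-shot classification with a CLIP-style model is typically done by embedding one text prompt per class and choosing the class whose embedding is most similar to the image embedding. The sketch below shows that decision rule and the top-1 accuracy metric on toy, pre-normalized vectors. The embeddings and class names are made up for illustration, and prompt templates or prompt ensembling are left out.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def zero_shot_classify(image_emb, class_text_embs):
    """Return the class whose text embedding has the highest cosine similarity.

    Assumes all embeddings are already l2-normalized.
    """
    scores = {name: dot(image_emb, emb) for name, emb in class_text_embs.items()}
    return max(scores, key=scores.get)

def top1_accuracy(samples, class_text_embs):
    """Fraction of (image_emb, label) pairs whose top prediction matches."""
    correct = sum(1 for emb, label in samples
                  if zero_shot_classify(emb, class_text_embs) == label)
    return correct / len(samples)

# Toy 2-D embeddings standing in for 1024-D CLIP-space vectors.
class_embs = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
predicted = zero_shot_classify([0.8, 0.6], class_embs)

samples = [([0.8, 0.6], "cat"), ([0.6, 0.8], "dog"), ([0.6, 0.8], "cat")]
accuracy = top1_accuracy(samples, class_embs)
```

The 80.7% figure is an average of this kind of top-1 accuracy taken over 27 separate benchmarks, each with its own class list.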

Video Classification

Classification using a single center frame:

  • UCF-101: 86.0%
  • Kinetics-400: 72.9% (avg. top-1/top-5)
  • Kinetics-600: 72.9%
  • Kinetics-700: 68.2%

Sampling 8 video frames yields an average performance gain of +4.7% across video benchmarks.
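One way to apply an image encoder to video is sketched below: pick frame indices uniformly, embed each frame, then average and re-normalize the frame embeddings before comparing against class text embeddings. Both the sampling rule and the mean-pooling step are assumptions made for illustration. The source states only that one center frame or 8 sampled frames are used, not how multiple frames are combined.

```python
import math

def sample_frame_indices(total_frames, num_samples):
    """Centers of num_samples equal segments; one sample gives the center frame."""
    return [int((i + 0.5) * total_frames / num_samples)
            for i in range(num_samples)]

def pool_frame_embeddings(frame_embs):
    """Mean-pool per-frame embeddings, then rescale to unit length.

    Assumes the mean is not the zero vector.
    """
    dim = len(frame_embs[0])
    mean = [sum(e[k] for e in frame_embs) / len(frame_embs) for k in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean]

center = sample_frame_indices(80, 1)   # single center frame
spread = sample_frame_indices(80, 8)   # eight evenly spaced frames

# Toy per-frame embeddings standing in for image-encoder outputs.
video_emb = pool_frame_embeddings([[1.0, 0.0], [0.6, 0.8]])
```

For an 80-frame clip, `center` is `[40]` and `spread` is `[5, 15, 25, 35, 45, 55, 65, 75]`.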

Image–Text Retrieval

On Flickr30K and COCO, EVA-CLIP-18B achieves an average recall (R@1,5,10) of 87.8%, outperforming prior open models by a significant margin.
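The retrieval metric can be made concrete with a small sketch of Recall@K and its mean over several values of K. It assumes a one-to-one setup in which query $i$ has exactly one correct candidate at index $i$; real Flickr30K and COCO evaluation has several captions per image, and the reported figure may additionally average over both retrieval directions and both datasets.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose correct candidate ranks in the top k.

    similarity[i][j] is the score of query i against candidate j, and
    the correct candidate for query i is assumed to be j == i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)

def mean_recall(similarity, ks=(1, 5, 10)):
    """Average of Recall@K over the given cutoffs."""
    return sum(recall_at_k(similarity, k) for k in ks) / len(ks)

# Toy 3x3 score matrix; query 1 ranks its correct candidate second.
scores = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.1, 0.2, 0.7],
]
r1 = recall_at_k(scores, 1)
r2 = recall_at_k(scores, 2)
avg = mean_recall(scores, ks=(1, 2))
```

Here two of three queries succeed at K=1 and all three at K=2, so the mean over those two cutoffs is about 0.83.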

5. Ablations and Robustness

Multiple ablation studies provide insight into the contributions of various design and training decisions:

  • EVA-Style Weak-to-Strong Scaling: Training follows a two-stage procedure: a “weak teacher” (EVA-CLIP-E/14+) distills knowledge to the large 18B-parameter model through masked reconstruction, then proceeds with final contrastive pretraining. This stabilizes large-scale training without requiring an increase in data scale.
  • Video Data: Inclusion of 23M video samples at the end of training results in +0.7% (1-frame) and +0.8% (8-frame) improvements in zero-shot video classification, with negligible effect (–0.1%) on average image–text retrieval.
  • Resolution: Elevating input resolution from $224^2$ to $336^2$ and $448^2$ improves accuracy by approximately +0.5% and +0.9%, respectively, on select benchmarks.
  • Transform Robustness: EVA-CLIP-18B exhibits minimal sensitivity (±0.1%) to changes between direct resize and center-crop evaluation pipelines, surpassing prior models in this robustness aspect.

6. Significance, Limitations, and Future Directions

EVA-CLIP-18B, the largest open CLIP model at the time of its release, demonstrates that consistent performance improvements are achievable via model scaling under a fixed open dataset, challenging the presupposition that further gains require substantially larger or proprietary data. The approach exemplifies the effectiveness of EVA-style weak-to-strong progression for stabilizing the training of very large vision-language models, and it extends generalization gains across image and video classification as well as cross-modal retrieval.

All model artifacts are openly available, supporting reproducibility and further research (Sun et al., 2024). Potential directions include scaling model size and/or dataset further, multimodal finetuning, instruction tuning, and integration in larger multimodal systems that combine vision encoders with LLMs.

References (1)
