Persistence of CLIMP advantages at large data and model scales

Determine whether the performance advantages of CLIMP, a fully Mamba-based contrastive vision-language model that uses VMamba for vision and Mamba-1/2 for text, persist when the training data is scaled to LAION-2B and/or the vision backbone is scaled to ViT-L/H-class model sizes.

Background

The paper introduces CLIMP, a fully Mamba-based alternative to Transformer-based CLIP, demonstrating improved retrieval performance, out-of-distribution robustness, and efficiency when trained on CC12M with base-sized VMamba vision encoders and Mamba text encoders.
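
For concreteness, below is a minimal sketch of the CLIP-style training objective that CLIMP's contrastive setup implies. The symmetric InfoNCE loss is the standard CLIP formulation, which CLIMP is assumed to share; the random tensors stand in for embeddings from the paper's VMamba vision and Mamba-1/2 text encoders, and are purely illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    This is the standard CLIP objective; CLIMP is assumed to use the same
    contrastive formulation, with Mamba-based encoders producing the
    embeddings instead of Transformers.
    """
    # L2-normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity logits; matching pairs sit on the diagonal.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image->text and text->image cross-entropy terms.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

if __name__ == "__main__":
    batch, dim = 8, 512
    image_emb = torch.randn(batch, dim)  # stand-in for VMamba outputs
    text_emb = torch.randn(batch, dim)   # stand-in for Mamba text outputs
    loss = clip_style_contrastive_loss(image_emb, text_emb)
    print(f"contrastive loss: {loss.item():.4f}")
```

Note that in CLIP the temperature is a learned parameter rather than a constant; a fixed value is used here only to keep the sketch self-contained.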

Despite positive results from scaling experiments on smaller datasets and models, the authors explicitly note that it is uncertain whether CLIMP's advantages will hold at substantially larger data scales (e.g., LAION-2B) and model sizes (e.g., ViT-L/H-class backbones). Resolving this question would establish whether CLIMP's observed benefits generalize to the industry-scale regimes typical of state-of-the-art CLIP variants.

References

"While our scaling experiments suggest continued improvements, it remains to be verified whether CLIMP's advantages persist at the scale of LAION-2B or ViT-L/H architectures."

CLIMP: Contrastive Language-Image Mamba Pretraining (arXiv:2601.06891, Shabtay et al., 11 Jan 2026), Limitations section.