Scaling Laws for Robust Comparison of Open Foundation Language-Vision Models and Datasets (2506.04598v1)

Published 5 Jun 2025 in cs.LG, cs.AI, and cs.CV

Abstract: In studies of transferable learning, scaling laws are obtained for various important foundation models to predict their properties and performance at larger scales. We show here how scaling law derivation can also be used for model and dataset comparison, allowing to decide which procedure is to be preferred for pre-training. For the first time, full scaling laws based on dense measurements across a wide span of model and samples seen scales are derived for two important language-vision learning procedures, CLIP and MaMMUT, that use either contrastive only or contrastive and captioning text generative loss. Ensuring sufficient prediction accuracy for held out points, we use derived scaling laws to compare both models, obtaining evidence for MaMMUT's stronger improvement with scale and better sample efficiency than standard CLIP. To strengthen validity of the comparison, we show scaling laws for various downstream tasks, classification, retrieval, and segmentation, and for different open datasets, DataComp, DFN and Re-LAION, observing consistently the same trends. We show that comparison can also be performed when deriving scaling laws with a constant learning rate schedule, reducing compute cost. Accurate derivation of scaling laws provides thus means to perform model and dataset comparison across scale spans, avoiding misleading conclusions based on measurements from single reference scales only, paving the road for systematic comparison and improvement of open foundation models and datasets for their creation. We release all the pre-trained models with their intermediate checkpoints, including openMaMMUT-L/14, which achieves $80.3\%$ zero-shot ImageNet-1k accuracy, trained on 12.8B samples from DataComp-1.4B. Code for reproducing experiments in the paper and raw experiments data can be found at https://github.com/LAION-AI/scaling-laws-for-comparison.

Summary

  • The paper derives scaling laws that predict the performance of CLIP and MaMMUT as compute and data scale.
  • It shows that while CLIP excels at smaller scales, MaMMUT achieves superior scalability with increased compute.
  • It highlights how dataset choice and learning rate schedules critically impact model efficiency and outcomes.

Overview of "Scaling Laws for Robust Comparison of Open Foundation Language-Vision Models and Datasets"

The paper presents a detailed study of how scaling laws can be used to evaluate and compare foundation language-vision models. It focuses on two model families, CLIP and MaMMUT, and assesses their performance when trained on several open datasets, namely DataComp, DFN, and Re-LAION. The paper's significance lies in its systematic use of scaling laws to derive insights into model performance across a wide span of compute scales and sample sizes.

Key Contributions

  1. Scaling Law Derivation: The authors derive scaling laws for CLIP and MaMMUT, offering a predictive framework for how model performance changes as compute or data scales. This is pivotal for understanding not just current performance but also expected performance at larger model and data scales (a minimal fitting sketch follows this list).
  2. Model and Dataset Comparison: Through detailed analysis, the paper reveals that MaMMUT demonstrates superior scalability over CLIP when sufficient compute resources are available, despite CLIP's better performance at smaller scales. This is consistently observed across various datasets and downstream tasks.
  3. Dataset Impact on Model Performance: The work explores how the choice of dataset impacts the scalability of models. DataComp-1.4B, for instance, showcases superior scalability for zero-shot classification tasks compared to Re-LAION-1.4B, while the opposite trend is seen for retrieval tasks. DFN-1.4B consistently outperforms both datasets on multiple tasks, demonstrating its efficacy in enhancing model performance.
  4. Learning Rate Influence: A comparative analysis of learning rate schedules, specifically cosine and constant, shows that the choice of schedule considerably affects the compute cost of scaling law derivation: constant learning rate schedules significantly reduce compute while maintaining prediction accuracy (see the schedule sketch after this list).
  5. openMaMMUT-L/14 Model: Informed by the scaling analysis, the authors train openMaMMUT-L/14 on 12.8B samples from DataComp-1.4B, reaching 80.3% zero-shot ImageNet-1k accuracy and state-of-the-art performance on zero-shot classification and retrieval benchmarks among models trained strictly on open datasets.
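
The core mechanics behind contributions 1-3 can be illustrated with a small curve-fitting sketch. The snippet below is a minimal, illustrative example and not the paper's actual code: the functional form, initial guesses, compute values, and error measurements are assumptions chosen only to show how fitted scaling laws can be extrapolated and compared for two training procedures.

```python
# Illustrative scaling-law fit and comparison (assumed functional form and synthetic data).
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(C, e, a, alpha):
    """Saturating power law: predicted downstream error at scale C (e.g. compute or samples seen)."""
    return e + a * np.power(C, -alpha)

# Illustrative (compute, downstream-error) pairs for two training procedures;
# real values would come from dense measurement runs across model and data scales.
compute = np.array([1e9, 3e9, 1e10, 3e10, 1e11, 3e11])
err_proc_a = np.array([0.52, 0.47, 0.42, 0.38, 0.35, 0.33])  # e.g. contrastive-only
err_proc_b = np.array([0.55, 0.48, 0.42, 0.37, 0.33, 0.30])  # e.g. contrastive + captioning

fits = {}
for name, err in [("procedure_A", err_proc_a), ("procedure_B", err_proc_b)]:
    # Rough initial guesses and non-negativity bounds keep the fit well behaved.
    popt, _ = curve_fit(scaling_law, compute, err, p0=(0.3, 50.0, 0.3), bounds=(0, np.inf))
    fits[name] = popt

# Extrapolate both fitted laws to a larger compute budget and compare predictions,
# rather than comparing the procedures at a single reference scale.
C_target = 1e13
for name, (e, a, alpha) in fits.items():
    pred = scaling_law(C_target, e, a, alpha)
    print(f"{name}: fitted (e={e:.3f}, a={a:.2f}, alpha={alpha:.3f}), "
          f"predicted error at C={C_target:.0e}: {pred:.3f}")
```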

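For contribution 4, the practical difference between the two schedules is that a cosine schedule decays toward a final learning rate over a fixed total run length, so each samples-seen budget typically needs its own full run, whereas a constant schedule is independent of run length, so intermediate checkpoints of a single run can supply measurements for several budgets. The sketch below illustrates this with assumed warmup length and learning rate values, not the paper's training configuration.

```python
# Cosine vs. constant learning-rate schedules (assumed hyperparameters, illustrative only).
import math

def cosine_lr(step, total_steps, peak_lr=1e-3, min_lr=0.0, warmup=2000):
    """Linear warmup, then cosine decay to min_lr; tied to a fixed total_steps."""
    if step < warmup:
        return peak_lr * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def constant_lr(step, peak_lr=1e-3, warmup=2000):
    """Linear warmup, then a constant learning rate; independent of run length."""
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr

# Because cosine_lr depends on total_steps, every samples-seen budget needs its own
# full-length run; constant_lr does not, so intermediate checkpoints of one long run
# can stand in for shorter-budget runs when fitting scaling laws.
for step in (1_000, 10_000, 50_000, 100_000):
    print(step, f"cosine={cosine_lr(step, total_steps=100_000):.2e}",
          f"constant={constant_lr(step):.2e}")
```
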
Implications and Future Directions

  • Reproducibility and Transparency: This work highlights the critical need for open datasets and models to facilitate reproducible research and meaningful comparisons in AI. By providing open-source code, models, and datasets, the paper sets a precedent for transparency and collaborative improvement in the field.
  • Guidance for Future Research: The findings provide a roadmap for researchers seeking to optimize training investments in foundational models. The scaling law framework can guide future developments by indicating the potential returns on scaling compute resources or data size.
  • Potential for Broader Applications: While this work focuses on language-vision models, the methodologies employed can be generalized to other domains within AI. This opens avenues for cross-disciplinary scaling analysis, enabling broader adaptability and innovation.

In conclusion, the paper's holistic approach to evaluating and comparing foundational models via scaling laws presents a robust framework for understanding and predicting model performance. Its findings emphasize the critical role of open resources in advancing reproducible and scalable AI research. As foundational models continue to grow in complexity and application, this paper offers essential insights into optimizing and deploying these models effectively across varied computational landscapes.
