
Rationale for LAION cosine-similarity threshold change

Ascertain the rationale for lowering the CLIP image–alt-text cosine-similarity filtering threshold to 0.28 during LAION-5B dataset curation, compared with the 0.3 threshold previously adopted for LAION-400M, including the decision-making criteria, empirical evaluations, and any human-assessment procedures that justified the departure from the earlier "conservative" value.


Background

The paper audits visio-linguistic models trained on LAION datasets and highlights that LAION-5B used a CLIP-based filtering threshold of 0.28 for image–text cosine similarity, whereas LAION-400M used 0.3, which had been described as a conservative choice informed by trying different values and human evaluations.

Because filtering thresholds directly affect dataset composition, scale, and downstream model behavior, the authors emphasize that such curation hyperparameters should be justified and documented. They explicitly note that the reasoning behind the change from 0.3 to 0.28 was not made clear, raising an unresolved question about the decision process and its justification.
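To make the filtering step concrete, the following is a minimal sketch of CLIP-based image–alt-text similarity filtering, assuming OpenAI's `clip` package and the ViT-B/32 checkpoint named in the LAION documentation; the image path, alt text, and the way the threshold is applied here are illustrative placeholders, not the actual LAION curation code.

```python
import torch
import clip  # OpenAI CLIP: https://github.com/openai/CLIP
from PIL import Image

# Placeholder inputs; the real pipeline scores billions of scraped pairs.
IMAGE_PATH = "example.jpg"
ALT_TEXT = "a photo of a dog on a beach"
THRESHOLD = 0.28  # LAION-5B value; LAION-400M used the "conservative" 0.3

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open(IMAGE_PATH)).unsqueeze(0).to(device)
text = clip.tokenize([ALT_TEXT], truncate=True).to(device)

with torch.no_grad():
    image_emb = model.encode_image(image)
    text_emb = model.encode_text(text)

# Cosine similarity is the dot product of the L2-normalised embeddings.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
similarity = (image_emb @ text_emb.T).item()

keep = similarity >= THRESHOLD
print(f"cosine similarity = {similarity:.3f}, keep sample: {keep}")
```

Because the cutoff is applied directly in this cosine-similarity space, even a small shift such as 0.3 to 0.28 changes which image–text pairs survive filtering, which is why the authors treat the threshold as a curation hyperparameter that needs explicit justification.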

References

In the CLIP inference at the post-processing stage section of the LAION-5B dataset announcement, we encounter the fact that the dataset curators estimated the cosine similarity between an image and its alt-text description using the ViT-B/32 CLIP model and discarded all images whose cosine-similarity score fell below the manually set threshold of 0.28. This is a marked departure from the procedure published during the LAION-400M release, where the curators stated that "We use OpenAI's CLIP model (the 'ViT-B-32' version) to compute the image and alt text embeddings. Then we calculate the cosine similarity of both embedding vectors and drop all samples with a similarity below 0.3. We chose this threshold after trying different values and using human evaluations of how well the texts fit the images. Lower values like 0.28 or 0.29 also seemed okay in many cases, but after further inspections, we decided to choose the conservative value of 0.3". The reasoning behind this decision is not clear.

The Dark Side of Dataset Scaling: Evaluating Racial Classification in Multimodal Models (arXiv:2405.04623, Birhane et al., 7 May 2024), Section 6 (Discussion and Recommendations), paragraph "Avoid ad-hoc decision-making for dataset curation hyperparameters"