Rationale for LAION cosine-similarity threshold change
Ascertain the rationale for lowering the CLIP image–alt-text cosine-similarity filtering threshold to 0.28 during LAION-5B dataset curation, compared with the 0.3 threshold adopted for LAION-400M, including the decision-making criteria, empirical evaluations, and any human-assessment procedures that justified departing from the earlier "conservative" value.
References
In the "CLIP inference at the post-processing stage" section of the LAION-5B dataset announcement, the curators state that they estimated the cosine similarity between each image and its alt-text description using the ViT-B/32 CLIP model and discarded all image–text pairs with a cosine-similarity score below a manually set threshold of 0.28. This is a marked departure from the procedure published for the LAION-400M release, where the curators stated: "We use OpenAI’s CLIP model (the ‘ViT-B-32‘ version) to compute the image and alt text embeddings. Then we calculate the cosine similarity of both embedding vectors and drop all samples with a similarity below 0.3. We chose this threshold after trying different values and using human evaluations of how well the texts fit the images. Lower values like 0.28 or 0.29 also seemed okay in many cases, but after further inspections, we decided to choose the conservative value of 0.3". The reasoning behind lowering the threshold to 0.28 is not made explicit in either release.
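For concreteness, the filtering step described in both releases can be sketched as follows. This is a minimal illustration, not the curators' actual pipeline code: it assumes image and alt-text embeddings have already been computed (e.g. with CLIP ViT-B/32), and the function names are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def filter_pairs(image_embs, text_embs, threshold=0.28):
    """Keep indices of (image, alt-text) pairs whose cosine similarity
    meets the threshold: 0.28 for LAION-5B, 0.3 for LAION-400M."""
    return [
        i
        for i, (img, txt) in enumerate(zip(image_embs, text_embs))
        if cosine_similarity(img, txt) >= threshold
    ]

# Toy 3-dimensional "embeddings" for illustration only
# (real CLIP ViT-B/32 embeddings are 512-dimensional).
imgs = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
txts = np.array([[0.9, 0.1, 0.0], [0.0, 1.0, 0.0]])
print(filter_pairs(imgs, txts))  # → [0]: the second pair falls below 0.28
```

Lowering the threshold from 0.3 to 0.28 admits pairs in the 0.28–0.3 band, which the LAION-400M curators had judged "okay in many cases" but ultimately excluded; at web scale this band can contain a substantial number of samples.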