
FFF: Fixing Flawed Foundations in contrastive pre-training results in very strong Vision-Language models (2405.10286v1)

Published 16 May 2024 in cs.CV and cs.AI

Abstract: Despite noise and caption quality having been acknowledged as important factors impacting vision-language contrastive pre-training, in this paper, we show that the full potential of improving the training process by addressing such issues is yet to be realized. Specifically, we firstly study and analyze two issues affecting training: incorrect assignment of negative pairs, and low caption quality and diversity. Then, we devise effective solutions for addressing both problems, which essentially require training with multiple true positive pairs. Finally, we propose training with sigmoid loss to address such a requirement. We show very large gains over the current state-of-the-art for both image recognition ($\sim +6\%$ on average over 11 datasets) and image retrieval ($\sim +19\%$ on Flickr30k and $\sim +15\%$ on MSCOCO).


Summary

  • The paper introduces dynamic correction of negative pair assignments, using image-text, image-image, and text-text similarities to re-assign false negatives during pre-training.
  • It enriches training data with multiple pseudo-captions from BLIP2, enhancing caption diversity and quality in noisy web datasets.
  • Trained with a noise-robust sigmoid loss, the method achieves average recognition gains of roughly 6% across 11 datasets and retrieval gains of roughly 19% on Flickr30k and 15% on MSCOCO.

Essay on "FFF: Fixing Flawed Foundations in contrastive pre-training results in very strong Vision-Language models"

The paper "FFF: Fixing Flawed Foundations in contrastive pre-training results in very strong Vision-Language models" by Bulat, Ouali, and Tzimiropoulos presents a meticulous examination of two crucial impediments in vision-language contrastive pre-training: the erroneous assignment of negative pairs and the low quality and diversity of captions. The authors propose effective solutions to these problems, achieving significant performance improvements in image recognition and retrieval tasks.

Key Contributions

1. Analysis of Flaws in Pre-training Data

The paper begins with an in-depth analysis of the common pitfalls in vision-language contrastive pre-training, particularly focusing on noise and caption quality. The authors identify two primary issues: the incorrect assignment of negative pairs due to the presence of near-duplicate samples and the low quality and diversity of captions extracted from web-collected datasets. These issues, while recognized in previous literature, have not been fully addressed.

2. Correcting Negative Pair Assignments

To address the incorrect assignment of negative pairs, the authors propose an algorithm that dynamically adjusts negative assignments by considering image-text, image-image, and text-text similarities. This multi-faceted approach ensures that semantically similar pairs that are incorrectly treated as negatives are re-assigned as positives. This on-the-fly correction is a significant improvement over traditional methods that statically assign negatives without accounting for semantic similarities.
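
The paper describes this correction at a high level; a minimal PyTorch-style sketch of the idea is given below, assuming L2-normalised image and text embeddings and an illustrative similarity threshold. The function name, threshold value, and exact re-assignment criteria are assumptions for illustration, not the authors' implementation.

```python
import torch

def correct_negatives(img_emb, txt_emb, threshold=0.9):
    """Illustrative sketch: mark likely false negatives as positives.

    img_emb, txt_emb: L2-normalised embeddings of shape (B, D) for a batch
    of B image-text pairs. Returns a (B, B) 0/1 target matrix whose diagonal
    (the original pairings) is positive, and whose off-diagonal entries are
    flipped to positive whenever image-image, text-text, or image-text
    similarity exceeds `threshold`.
    """
    sim_ii = img_emb @ img_emb.t()   # image-image similarities
    sim_tt = txt_emb @ txt_emb.t()   # text-text similarities
    sim_it = img_emb @ txt_emb.t()   # image-text similarities

    extra_pos = (sim_ii > threshold) | (sim_tt > threshold) | (sim_it > threshold)
    diagonal = torch.eye(img_emb.size(0), dtype=torch.bool, device=img_emb.device)
    return (diagonal | extra_pos).float()
```

In this sketch the resulting target matrix simply replaces the identity matrix that a standard contrastive objective would use, so pairs re-labelled as positives are no longer pushed apart during training.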

3. Improving Caption Quality and Diversity

The authors tackle the problem of low caption quality and diversity by generating multiple pseudo-captions for each image using the state-of-the-art captioning model, BLIP2. These synthetic captions are used to augment the training batches, providing a richer and more diverse set of positive samples. This approach effectively mitigates the impact of noisy and repetitive captions that are prevalent in web-collected datasets.
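
As a rough illustration of how several pseudo-captions per image might be generated offline, the snippet below samples captions from a BLIP-2 checkpoint distributed through Hugging Face transformers. The checkpoint name, decoding settings, and number of captions are assumptions for illustration, not the authors' exact configuration.

```python
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Assumed checkpoint and decoding settings; illustrative only.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

def pseudo_captions(image_path, num_captions=5):
    """Sample several diverse pseudo-captions for one image with BLIP-2."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        do_sample=True,              # sampling (rather than greedy decoding) encourages diversity
        top_p=0.9,
        max_new_tokens=30,
        num_return_sequences=num_captions,
    )
    return [processor.decode(ids, skip_special_tokens=True).strip() for ids in output_ids]
```

Each image then contributes its original web caption plus the sampled pseudo-captions to the batch, yielding multiple true positive pairs per image.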

4. Use of Sigmoid Loss for Training

Given the requirement to handle a variable number of positive pairs per image, the authors propose training the model using a sigmoid loss, which is more robust to noise compared to the traditional contrastive loss. This adaptation allows the model to dynamically adjust to the varying number of positives and enhances its robustness against errors in the mining process.
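
A minimal sketch of such a multi-positive sigmoid objective, in the spirit of the SigLIP loss the paper builds on, is given below. The target matrix would come from the negative-correction step plus the added pseudo-caption positives; the temperature and bias values shown are assumptions rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

def multi_positive_sigmoid_loss(img_emb, txt_emb, targets, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over all image-text pairs in a batch.

    Unlike the softmax contrastive loss, each pair is scored independently,
    so any number of entries per row of `targets` (a (B, B) matrix of 0s and
    1s) can be marked positive.
    """
    logits = img_emb @ txt_emb.t() * temperature + bias
    labels = 2.0 * targets - 1.0          # map {0, 1} targets to {-1, +1} labels
    return -F.logsigmoid(labels * logits).mean()
```

Because each pair contributes an independent binary term, adding or removing positives only changes individual entries of the target matrix, which is what makes a variable number of positives per image straightforward to handle.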

Numerical Results

The proposed method demonstrates substantial improvements over the state-of-the-art across multiple benchmark datasets. For image recognition, the authors report an average gain of approximately 6% over 11 datasets, with image retrieval performance improving by 19% and 15% on Flickr30k and MSCOCO, respectively. These numerical results underscore the efficacy of the proposed solutions in enhancing vision-language representations.

Analysis of Theoretical and Practical Implications

The introduction of dynamic negative pair correction and enhanced caption generation techniques holds significant theoretical and practical implications. Theoretically, the work advances the understanding of how semantic similarities can be leveraged to improve contrastive learning. Practically, the proposed methods can be integrated into existing vision-language models to achieve higher accuracy with minimal additional computational cost. By alleviating the noise inherent in web-collected datasets, the proposed solutions pave the way for more robust and generalizable vision-language models.

Future Directions

The findings presented in this paper open several avenues for future research. One potential direction is the exploration of alternative caption generation models and their impact on the diversity and quality of training data. Additionally, further investigation into the scalability of the proposed methods across larger and more diverse datasets could provide deeper insights into their generalizability. Finally, exploring the integration of these techniques with more advanced architectures could yield even more potent vision-language models.

In conclusion, the paper by Bulat, Ouali, and Tzimiropoulos makes a substantial contribution to the field of vision-language pre-training. By addressing the critical issues of negative pair assignment and caption quality, the authors achieve significant performance gains, demonstrating the potential of their methods to advance the state of the art in image recognition and retrieval tasks. The proposed solutions are both theoretically sound and practically viable, offering a robust framework for building stronger vision-language models.
