ECCV Caption: Correcting False Negatives by Collecting Machine-and-Human-verified Image-Caption Associations for MS-COCO (2204.03359v5)
Abstract: Image-Text matching (ITM) is a common task for evaluating the quality of Vision and Language (VL) models. However, existing ITM benchmarks have a significant limitation: they miss many correspondences because of how the data was constructed. For example, a caption is matched with only one image, although it could also match other similar images, and vice versa. To correct these massive false negatives, we construct the Extended COCO Validation (ECCV) Caption dataset by supplying the missing associations with machine and human annotators. We employ five state-of-the-art ITM models with diverse properties for our annotation process. Our dataset provides 3.6 times as many positive image-to-caption associations and 8.5 times as many caption-to-image associations as the original MS-COCO. We also propose to use an informative ranking-based metric, mAP@R, rather than the popular Recall@K (R@K). We re-evaluate 25 existing VL models on existing and proposed benchmarks. We find that the existing benchmarks, such as COCO 1K R@K, COCO 5K R@K, and CxC R@1, are highly correlated with each other, while the rankings change when we shift to the ECCV mAP@R. Lastly, we delve into the effect of the bias introduced by the choice of machine annotator. Source code and dataset are available at https://github.com/naver-ai/eccv-caption
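To illustrate why a ranking-based metric can be more informative than Recall@K once a query has several positives, the following is a minimal sketch. It assumes the common metric-learning formulation of mAP@R: for a query with R positives, average the precision at each rank within the top R where a positive appears, then divide by R. The function names and the toy item IDs are hypothetical and are not taken from the released evaluation code.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[int], positives: Set[int], k: int) -> float:
    """Return 1.0 if at least one positive appears in the top-k results."""
    return float(any(item in positives for item in ranked[:k]))


def ap_at_r(ranked: Sequence[int], positives: Set[int]) -> float:
    """Average precision over the top-R results, where R = number of positives.

    Assumed formulation: (1 / R) * sum over ranks i <= R of P@i * rel(i).
    """
    r = len(positives)
    if r == 0:
        return 0.0
    hits = 0
    total = 0.0
    for i, item in enumerate(ranked[:r], start=1):
        if item in positives:
            hits += 1
            total += hits / i  # precision at rank i, counted only on a hit
    return total / r


def map_at_r(rankings: Sequence[Sequence[int]],
             positives_per_query: Sequence[Set[int]]) -> float:
    """Mean of ap_at_r over all queries."""
    scores = [ap_at_r(rk, pos) for rk, pos in zip(rankings, positives_per_query)]
    return sum(scores) / len(scores)


# Toy query with three positives (IDs 1, 2, 3) and two retrieval results.
positives = {1, 2, 3}
ranking_a = [1, 9, 8, 2, 3]  # one positive first, the others ranked low
ranking_b = [1, 2, 3, 9, 8]  # all positives ranked first

print(recall_at_k(ranking_a, positives, 1), recall_at_k(ranking_b, positives, 1))
print(ap_at_r(ranking_a, positives), ap_at_r(ranking_b, positives))
```

In this toy case both rankings score the same under R@1, because each places one positive at the top, while the mAP@R-style score separates them because it also rewards retrieving the remaining positives early.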