Polos: Multimodal Metric Learning from Human Feedback for Image Captioning (2402.18091v1)
Abstract: Establishing an automatic evaluation metric that closely aligns with human judgments is essential for effectively developing image captioning models. Recent data-driven metrics have demonstrated a stronger correlation with human judgments than classic metrics such as CIDEr; however, they lack sufficient capabilities to handle hallucinations and to generalize across diverse images and texts, partly because they compute scalar similarities merely from embeddings learned on tasks unrelated to image captioning evaluation. In this study, we propose Polos, a supervised automatic evaluation metric for image captioning models. Polos computes scores from multimodal inputs using a parallel feature extraction mechanism that leverages embeddings trained through large-scale contrastive learning. To train Polos, we introduce Multimodal Metric Learning from Human Feedback (M$^2$LHF), a framework for developing metrics based on human feedback. We constructed the Polaris dataset, which comprises 131K human judgments collected from 550 evaluators and is approximately ten times larger than standard datasets. Our approach achieved state-of-the-art performance on the Composite, Flickr8K-Expert, Flickr8K-CF, PASCAL-50S, FOIL, and Polaris datasets, demonstrating its effectiveness and robustness.
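The abstract does not specify Polos's exact architecture, but the "parallel feature extraction" it names is a known pattern from learned metrics such as RUSE and COMET: a candidate embedding is combined with an image or reference embedding via element-wise product and absolute difference, and a trained regressor maps the combined features to a scalar score. The sketch below illustrates that pattern in plain NumPy under loose assumptions; the sigmoid head with random weights is a hypothetical stand-in for a regressor trained on human judgments, and `polos_style_score` is an illustrative name, not the paper's API.

```python
import numpy as np

def parallel_features(cand, other):
    """Combine a candidate-caption embedding with another embedding
    (image or reference) via concatenation of both vectors, their
    element-wise product, and their absolute difference."""
    return np.concatenate([cand, other, cand * other, np.abs(cand - other)])

def polos_style_score(img_emb, cand_emb, ref_embs, regressor):
    """Score a caption by regressing over image-side and reference-side
    parallel features, taking the max over the available references."""
    img_feat = parallel_features(cand_emb, img_emb)
    scores = []
    for ref in ref_embs:
        ref_feat = parallel_features(cand_emb, ref)
        scores.append(regressor(np.concatenate([img_feat, ref_feat])))
    return max(scores)

# Toy stand-in for a trained regression head (hypothetical weights);
# in the actual metric this would be learned from human judgments.
rng = np.random.default_rng(0)
d = 8                                   # embedding dimension (toy)
W = rng.normal(size=4 * d * 2)          # 4*d features per side, two sides
regressor = lambda x: float(1.0 / (1.0 + np.exp(-W @ x)))  # sigmoid -> [0, 1]

img = rng.normal(size=d)                # stand-ins for contrastively
cand = rng.normal(size=d)               # trained (e.g. CLIP-style)
refs = [rng.normal(size=d) for _ in range(3)]  # embeddings
score = polos_style_score(img, cand, refs, regressor)
print(score)  # a scalar in [0, 1]
```

Taking the maximum over references mirrors how multi-reference caption metrics typically credit a caption for matching any one of its references; the real system would replace the random embeddings and weights with contrastively pretrained encoders and a head trained on the 131K Polaris judgments.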
- From Images to Sentences through Scene Description Graphs using Commonsense Reasoning and Knowledge. arXiv preprint arXiv:1511.03292, 2015.
- SemEval-2014 Task 10: Multilingual Semantic Textual Similarity. In SemEval, pages 81–91, 2014.
- SemEval-2015 Task 2: Semantic Textual Similarity, English, Spanish and Pilot on Interpretability. In SemEval, pages 252–263, 2015.
- SemEval-2016 Task 1: Semantic Textual Similarity, Monolingual and Cross-Lingual Evaluation. In SemEval, pages 497–511, 2016.
- SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity. In SemEval, pages 385–393, 2012.
- SEM 2013 Shared Task: Semantic Textual Similarity. In SEM, pages 32–43, 2013.
- nocaps: Novel Object Captioning at Scale. In ICCV, pages 8948–8957, 2019.
- Multi-Modal Image Captioning for the Visually Impaired. In NAACL-HLT, pages 53–60, 2021.
- Flamingo: A Visual Language Model for Few-shot Learning. In NeurIPS, volume 35, pages 23716–23736, 2022.
- SPICE: Semantic Propositional Image Caption Evaluation. In ECCV, pages 382–398, 2016.
- Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In CVPR, pages 6077–6086, 2018.
- Satanjeev Banerjee et al. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In ACL, pages 65–72, 2005.
- SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation. In SemEval, pages 1–14, 2017.
- UNITER: Universal Image-text Representation Learning. In ECCV, pages 104–120, 2020.
- Scaling Instruction-finetuned Language Models. arXiv preprint arXiv:2210.11416, 2022.
- Unsupervised Cross-lingual Representation Learning at Scale. In ACL, pages 8440–8451, 2020.
- Meshed-Memory Transformer for Image Captioning. In CVPR, pages 10578–10587, 2020.
- Learning to Evaluate Image Captioning. In CVPR, pages 5804–5812, 2018.
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT, pages 4171–4186, 2019.
- SMURF: SeMantic and linguistic UndeRstanding Fusion for Caption Evaluation via Typicality Analysis. In IJCNLP, pages 2250–2260, 2021.
- CapWAP: Image Captioning with a Purpose. In EMNLP, pages 8755–8768, 2020.
- SimCSE: Simple Contrastive Learning of Sentence Embeddings. In EMNLP, pages 6894–6910, 2021.
- Open-vocabulary Object Detection via Vision and Language Knowledge Distillation. In ICLR, 2022.
- Captioning Images Taken by People Who Are Blind. In ECCV, pages 417–434, 2020.
- Image Captioning: Transforming Objects into Words. In NeurIPS, volume 32, pages 11137–11147, 2019.
- CLIPScore: A Reference-free Evaluation Metric for Image Captioning. In EMNLP, pages 7514–7528, 2021.
- Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics. JAIR, 47:853–899, 2013.
- Attention on Attention for Image Captioning. In ICCV, pages 4634–4643, 2019.
- TIGEr: Text-to-image Grounding For Image Caption Evaluation. In EMNLP, 2019.
- Motonari Kambara et al. Case Relation Transformer: A Crossmodal Language Generation Model for Fetching Instructions. IEEE RAL, 6:8371–8378, 2021.
- Jin Kim et al. Mutual Information Divergence: A Unified Metric for Multimodal Generative Models. In NeurIPS, volume 35, pages 35072–35086, 2022.
- PR-MCS: Perturbation Robust Metric for MultiLingual Image Captioning. In EMNLP, pages 12237–12258, 2023.
- From Word Embeddings To Document Distances. PMLR, 37:957–966, 2015.
- ViLBERTScore: Evaluating Image Caption Using Vision-and-Language BERT. In Eval4NLP, pages 34–39, 2020.
- UMIC: An Unreferenced Metric for Image Captioning via Contrastive Learning. In ACL, pages 220–226, 2021.
- Quality Estimation for Image Captions Based on Large-scale Human Evaluations. In NAACL, pages 3157–3166, 2021.
- Junnan Li et al. BLIP: Bootstrapping Language-image Pre-training for Unified Vision-language Understanding and Generation. In ICML, pages 12888–12900, 2022.
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In ICML, 2023.
- ER-SAN: Enhanced-Adaptive Relation Self-Attention Network for Image Captioning. In IJCAI, pages 1081–1087, 2022.
- Chin-Yew Lin. ROUGE: A Package for Automatic Evaluation of Summaries. In ACL, pages 74–81, 2004.
- Microsoft COCO: Common Objects in Context. In ECCV, pages 740–755, 2014.
- RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692, 2019.
- Dual-Level Collaborative Transformer for Image Captioning. In AAAI, volume 35, pages 2286–2293, 2021.
- LENS: A learnable evaluation metric for text simplification. In ACL, pages 16383–16408, July 2023.
- Multimodal Attention Branch Network for Perspective-Free Sentence Generation. In CoRL, pages 76–85, 2019.
- A SICK Cure for the Evaluation of Compositional Distributional Semantic Models. In LREC, pages 216–223, 2014.
- Visuals to Text: A Comprehensive Review on Automatic Image Captioning. JAS, 9(8):1339–1365, 2022.
- CIDEr-R: Robust Consensus-based Image Description Evaluation. In W-NUT, pages 351–360, 2021.
- BLEU: a Method for Automatic Evaluation of Machine Translation. In ACL, pages 311–318, 2002.
- Learning Transferable Visual Models from Natural Language Supervision. In ICML, pages 8748–8763, 2021.
- COMET: A Neural Framework for MT Evaluation. In EMNLP, pages 2685–2702, 2020.
- Self-critical Sequence Training for Image Captioning. In CVPR, pages 7008–7024, 2017.
- Sara Sarto et al. Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation. In CVPR, pages 6914–6924, 2023.
- BLEURT: Learning Robust Metrics for Text Generation. In ACL, pages 7881–7892, 2020.
- Piyush Sharma et al. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset for Automatic Image Captioning. In ACL, pages 2556–2565, 2018.
- FOIL it! Find One Mismatch Between Image and Language caption. In ACL, pages 255–265, 2017.
- Hiroki Shimanaka et al. RUSE: Regressor Using Sentence Embeddings for Automatic Machine Translation Evaluation. In WMT18, pages 751–758, 2018.
- TextCaps: A Dataset for Image Captioning with Reading Comprehension. In ECCV, pages 742–758, 2020.
- From Show to Tell: A Survey on Deep Learning-based Image Captioning. PAMI, 45(1):539–559, 2022.
- GRIT: Faster and Better Image Captioning Transformer Using Dual Visual Features. In ECCV, pages 167–184, 2022.
- Attention Is All You Need. In NIPS, volume 30, pages 5998–6008, 2017.
- CIDEr: Consensus-based Image Description Evaluation. In CVPR, pages 4566–4575, 2015.
- JaSPICE: Automatic Evaluation Metric Using Predicate-Argument Structures for Image Captioning Models. In CoNLL, 2023.
- GIT: A Generative Image-to-text Transformer for Vision and Language. TMLR, 2022.
- OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-sequence Learning Framework. In ICML, pages 23318–23340, 2022.
- Open-domain Clarification Question Generation Without Question Examples. In EMNLP, pages 563–570, 2021.
- Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In ICML, pages 2048–2057, 2015.
- SESCORE2: Learning Text Generation Evaluation via Synthesizing Realistic Mistakes. In ACL, pages 5166–5183, 2023.
- Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis. In EMNLP, pages 6559–6574, 2022.
- INSTRUCTSCORE: Towards Explainable Text Generation Evaluation with Automatic Feedback. In EMNLP, pages 5967–5994, 2023.
- Peter Young et al. From Image Descriptions to Visual Denotations: New Similarity Metrics for Semantic Inference over Event Descriptions. TACL, 2:67–78, 2014.
- BARTScore: Evaluating Generated Text as Text Generation. In NeurIPS, volume 34, pages 27263–27277, 2021.
- VinVL: Revisiting Visual Representations in Vision-language Models. In CVPR, pages 5579–5588, 2021.
- OPT: Open Pre-trained Transformer Language Models. arXiv preprint arXiv:2205.01068, 2022.
- BERTScore: Evaluating Text Generation with BERT. In ICLR, 2020.
- Wei Zhao et al. MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance. In EMNLP-IJCNLP, pages 563–578, 2019.
- RegionCLIP: Region-based Language-image Pretraining. In CVPR, pages 16793–16803, 2022.