Improving Reference-based Distinctive Image Captioning with Contrastive Rewards (2306.14259v1)
Abstract: Distinctive Image Captioning (DIC) -- generating distinctive captions that describe the unique details of a target image -- has received considerable attention over the last few years. A recent DIC method proposes to generate distinctive captions by comparing the target image with a set of semantically similar reference images, i.e., reference-based DIC (Ref-DIC). It aims to force the generated captions to distinguish the target image from the reference images. To ensure that Ref-DIC models truly perceive the unique objects (or attributes) in target images, we propose two new Ref-DIC benchmarks and develop a Transformer-based Ref-DIC baseline, TransDIC. The model not only extracts visual features from the target image but also encodes the differences between objects in the target and reference images. Going one step further, we propose a stronger TransDIC++, which adds an extra contrastive learning module to make full use of the reference images. This new module is model-agnostic and can be easily incorporated into various Ref-DIC architectures. Finally, for more trustworthy benchmarking, we propose a new Ref-DIC evaluation metric named DisCIDEr, which evaluates both the accuracy and the distinctiveness of the generated captions. Experimental results demonstrate that our TransDIC++ can generate distinctive captions. Moreover, it outperforms several state-of-the-art models on the two new benchmarks across different metrics.
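The abstract does not detail the contrastive learning module, but its stated goal (making full use of the reference images) admits a simple illustration. Below is a minimal InfoNCE-style sketch in PyTorch of how a generated caption's embedding could be pulled toward the target image and pushed away from the reference images. All names (`contrastive_reward`, `temperature`, the embedding shapes) are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of a reference-contrastive objective in the spirit of
# TransDIC++. Names and hyper-parameters are assumptions, not the paper's.
import torch
import torch.nn.functional as F


def contrastive_reward(caption_emb, target_emb, reference_embs, temperature=0.1):
    """InfoNCE-style score: the caption should be closer to the target
    image than to any of its semantically similar reference images.

    caption_emb:    (d,)   embedding of the generated caption
    target_emb:     (d,)   embedding of the target image
    reference_embs: (k, d) embeddings of the k reference images
    """
    # Cosine similarities between the caption and each image.
    cap = F.normalize(caption_emb, dim=-1)
    pos = (cap * F.normalize(target_emb, dim=-1)).sum() / temperature
    negs = (cap @ F.normalize(reference_embs, dim=-1).T) / temperature

    # Softmax over {target} U {references}; the log-probability assigned
    # to the target acts as a distinctiveness reward (its negation as a
    # loss), so captions that also match the references score lower.
    logits = torch.cat([pos.unsqueeze(0), negs])
    return torch.log_softmax(logits, dim=0)[0]
```

Because such an objective depends only on caption and image embeddings, it is consistent with the abstract's claim that the module is model-agnostic: any Ref-DIC encoder producing those embeddings could plug into it.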
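The exact DisCIDEr formulation is likewise not given in the abstract; the sketch below only illustrates the stated idea of scoring both accuracy (agreement with the target image's ground-truth captions) and distinctiveness (low similarity to the reference images' captions). The function name, the `alpha` trade-off weight, and the choice of CIDEr as the underlying similarity are all assumptions, not the paper's definition.

```python
def discider_like(cider_target, cider_refs, alpha=0.5):
    """Hypothetical DisCIDEr-style score (NOT the paper's formula).

    cider_target: CIDEr of the caption against the target image's
                  ground-truth captions (accuracy term).
    cider_refs:   list of CIDEr scores of the same caption against each
                  reference image's captions (low if the caption is
                  distinctive).
    alpha:        assumed trade-off weight between the two terms.
    """
    distinctiveness = cider_target - sum(cider_refs) / len(cider_refs)
    return alpha * cider_target + (1 - alpha) * distinctiveness
```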
- Yangjun Mao
- Jun Xiao
- Dong Zhang
- Meng Cao
- Jian Shao
- Yueting Zhuang
- Long Chen