Self-supervised Cross-view Representation Reconstruction for Change Captioning (2309.16283v1)
Abstract: Change captioning aims to describe the difference between a pair of similar images. Its key challenge is how to learn a stable difference representation under pseudo changes caused by viewpoint change. In this paper, we address this by proposing a self-supervised cross-view representation reconstruction (SCORER) network. Concretely, we first design a multi-head token-wise matching scheme to model relationships between cross-view features from similar/dissimilar images. Then, by maximizing cross-view contrastive alignment of two similar images, SCORER learns two view-invariant image representations in a self-supervised way. Based on these, we reconstruct the representations of unchanged objects by cross-attention, thus learning a stable difference representation for caption generation. Further, we devise a cross-modal backward reasoning module to improve caption quality. This module works in reverse: it models a "hallucination" representation from the caption and the "before" representation. By pushing it closer to the "after" representation, we enforce the caption to be informative about the difference in a self-supervised manner. Extensive experiments show that our method achieves state-of-the-art results on four datasets. The code is available at https://github.com/tuyunbin/SCORER.
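The abstract names token-wise matching and cross-view contrastive alignment without giving formulas, so the following is only a minimal sketch of the general idea, not the SCORER implementation. It assumes a single head, a max-over-tokens cosine matching score, and a symmetric InfoNCE-style loss. The function names (`token_wise_similarity`, `contrastive_alignment_loss`) and the `temperature` value are hypothetical choices made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def token_wise_similarity(tokens_a, tokens_b):
    """Assumed matching score: each token picks its best cosine match in the
    other image, the matches are averaged, and both directions are combined."""
    a = [l2_normalize(t) for t in tokens_a]
    b = [l2_normalize(t) for t in tokens_b]

    def directed(src, dst):
        best = [max(sum(x * y for x, y in zip(s, d)) for d in dst) for s in src]
        return sum(best) / len(best)

    return 0.5 * (directed(a, b) + directed(b, a))


def contrastive_alignment_loss(before_batch, after_batch, temperature=0.5):
    """Symmetric InfoNCE-style loss: the i-th "before" image should match the
    i-th "after" image better than any other "after" image in the batch."""
    n = len(before_batch)
    sim = [
        [token_wise_similarity(before_batch[i], after_batch[j]) / temperature
         for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy(logits, target):
        # Numerically stable log-sum-exp minus the target logit.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(v - m) for v in logits))
        return log_z - logits[target]

    before_to_after = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    after_to_before = sum(
        cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (before_to_after + after_to_before)


# Toy data: pair 0 has the same two tokens in a different order, standing in
# for a viewpoint shift; pair 1 contains unrelated content.
before_batch = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
                [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]]
after_batch = [[[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]],
               [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]]

aligned_loss = contrastive_alignment_loss(before_batch, after_batch)
shuffled_loss = contrastive_alignment_loss(before_batch, after_batch[::-1])
```

In this toy setup the matching score does not depend on token order, so the reordered pair still scores as a perfect match. The loss is lower when the true "before"/"after" pairs are aligned than when the "after" images are shuffled.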