Self-supervised Cross-view Representation Reconstruction for Change Captioning (2309.16283v1)

Published 28 Sep 2023 in cs.CV and cs.CL

Abstract: Change captioning aims to describe the difference between a pair of similar images. Its key challenge is how to learn a stable difference representation under pseudo changes caused by viewpoint change. In this paper, we address this by proposing a self-supervised cross-view representation reconstruction (SCORER) network. Concretely, we first design a multi-head token-wise matching to model relationships between cross-view features from similar/dissimilar images. Then, by maximizing cross-view contrastive alignment of two similar images, SCORER learns two view-invariant image representations in a self-supervised way. Based on these, we reconstruct the representations of unchanged objects by cross-attention, thus learning a stable difference representation for caption generation. Further, we devise a cross-modal backward reasoning module to improve the quality of the caption. This module reversely models a "hallucination" representation from the caption and the "before" representation. By pushing it closer to the "after" representation, we enforce the caption to be informative about the difference in a self-supervised manner. Extensive experiments show our method achieves state-of-the-art results on four datasets. The code is available at https://github.com/tuyunbin/SCORER.
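As a rough illustration of two components described in the abstract, the cross-view contrastive alignment and the cross-attention reconstruction of unchanged objects, the following PyTorch sketch shows one plausible way to wire them together. It is not the authors' implementation (see the linked repository for that); the module name `CrossViewReconstruction`, the pooled InfoNCE-style loss, and the simple subtraction used to form the difference representation are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of two ideas from the abstract:
# (1) cross-view contrastive alignment of "before"/"after" image features, and
# (2) reconstructing unchanged content via cross-attention before differencing.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossViewReconstruction(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8, temperature: float = 0.07):
        super().__init__()
        self.temperature = temperature
        # Cross-attention used to reconstruct one view's unchanged content
        # from the other view's tokens.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def contrastive_alignment(self, x_before: torch.Tensor, x_after: torch.Tensor) -> torch.Tensor:
        """InfoNCE-style loss pulling pooled features of the same scene pair together."""
        z1 = F.normalize(x_before.mean(dim=1), dim=-1)  # (B, D)
        z2 = F.normalize(x_after.mean(dim=1), dim=-1)   # (B, D)
        logits = z1 @ z2.t() / self.temperature          # (B, B) similarity matrix
        targets = torch.arange(z1.size(0), device=z1.device)
        # Symmetric cross-entropy: matched pairs are positives, the rest negatives.
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

    def forward(self, x_before: torch.Tensor, x_after: torch.Tensor):
        # Reconstruct the unchanged part of the "after" view by attending to "before".
        recon_after, _ = self.cross_attn(query=x_after, key=x_before, value=x_before)
        # A stable difference representation: what cross-view attention cannot explain.
        diff = x_after - recon_after
        loss_align = self.contrastive_alignment(x_before, x_after)
        return diff, loss_align


# Usage on dummy features (e.g. flattened grid features of an image pair).
if __name__ == "__main__":
    feats_before = torch.randn(4, 196, 512)
    feats_after = torch.randn(4, 196, 512)
    model = CrossViewReconstruction()
    diff_repr, align_loss = model(feats_before, feats_after)
    print(diff_repr.shape, align_loss.item())
```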
