
Embedded Heterogeneous Attention Transformer for Cross-lingual Image Captioning (2307.09915v2)

Published 19 Jul 2023 in cs.CV and cs.MM

Abstract: Cross-lingual image captioning is a challenging task that requires addressing both cross-lingual and cross-modal obstacles in multimedia analysis. The crucial issue in this task is modeling both the global and the local matching between an image and different languages. Existing cross-modal embedding methods based on the transformer architecture overlook the local matching between image regions and monolingual words, especially when dealing with diverse languages. To overcome these limitations, we propose an Embedded Heterogeneous Attention Transformer (EHAT) that establishes cross-domain relationships and local correspondences between images and different languages using a heterogeneous network. EHAT comprises Masked Heterogeneous Cross-attention (MHCA), a Heterogeneous Attention Reasoning Network (HARN), and Heterogeneous Co-attention (HCA). HARN, the core network, captures cross-domain relationships by leveraging visual bounding-box representation features to connect word features from two languages and to learn heterogeneous maps. MHCA and HCA facilitate cross-domain integration in the encoder through specialized heterogeneous attention mechanisms, enabling a single model to generate captions in two languages. We evaluate our approach on the MSCOCO dataset, generating captions in English and Chinese, two languages from markedly different language families. The experimental results demonstrate the superior performance of our method compared to advanced monolingual methods. The proposed EHAT framework effectively addresses the challenges of cross-lingual image captioning, paving the way for improved multilingual image analysis and understanding.
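
The abstract describes linking word features from two languages through shared visual bounding-box features. As a rough illustration of that idea only (not the authors' implementation; all names, dimensions, and the toy data below are assumptions), the following minimal Python/NumPy sketch grounds English and Chinese word features in a common set of region features via masked scaled dot-product cross-attention:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values, mask=None):
    """Single-head scaled dot-product cross-attention.

    If a boolean mask is given, disallowed query-key pairs are blocked,
    loosely analogous to the "masked" part of MHCA.
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)      # (n_q, n_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)   # suppress masked pairs
    return softmax(scores) @ values             # (n_q, d_v)

# Toy setup (hypothetical sizes): 5 visual region (bounding-box) features
# act as the shared anchor; each language has its own word features.
rng = np.random.default_rng(0)
d_model = 64
regions = rng.standard_normal((5, d_model))   # visual region features
words_en = rng.standard_normal((8, d_model))  # English word features
words_zh = rng.standard_normal((6, d_model))  # Chinese word features

# Each language's words attend to the shared visual regions, so the two
# languages are connected through vision rather than attending to each
# other directly -- a crude stand-in for the heterogeneous map in HARN.
grounded_en = cross_attention(words_en, regions, regions)
grounded_zh = cross_attention(words_zh, regions, regions)
print(grounded_en.shape, grounded_zh.shape)   # (8, 64) (6, 64)
```

In the paper's actual architecture these heterogeneous attention modules sit inside a transformer encoder and are trained end to end; the sketch above only shows the vision-anchored cross-lingual grounding pattern in isolation.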

Authors (6)
  1. Zhenzhen Hu (16 papers)
  2. Yuanen Zhou (4 papers)
  3. Ye Zhao (63 papers)
  4. Richang Hong (117 papers)
  5. Meng Wang (1063 papers)
  6. ZiJie Song (6 papers)
Citations (1)

