Rule-driven News Captioning (2403.05101v3)

Published 8 Mar 2024 in cs.CL and cs.AI

Abstract: The news captioning task aims to generate sentences that describe the named entities or concrete events in an image, given its accompanying news article. Existing methods have achieved remarkable results by relying on large-scale pre-trained models, which primarily focus on the correlations between the input news content and the output predictions. However, news captioning also requires adherence to fundamental rules of news reporting, such as accurately describing the individuals and actions associated with the event. In this paper, we propose a rule-driven news captioning method that can generate image descriptions following a designated rule signal. Specifically, we first design a news-aware semantic rule for the descriptions. This rule incorporates the primary action depicted in the image (e.g., "performing") and the roles played by the named entities involved in that action (e.g., "Agent" and "Place"). Second, we inject this semantic rule into the large-scale pre-trained model BART with a prefix-tuning strategy, in which multiple encoder layers are embedded with the news-aware semantic rule. Finally, this guidance effectively steers BART toward generating news sentences that comply with the designated rule. Extensive experiments on two widely used datasets (i.e., GoodNews and NYTimes800k) demonstrate the effectiveness of our method.
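To make the described pipeline concrete, below is a minimal, hypothetical Python sketch of the idea: a news-aware semantic rule (the main action plus the semantic roles of the named entities) is serialized and prepended to the article text before generating a caption with an off-the-shelf BART checkpoint. The helper name build_rule_signal, the example rule, and the plain-text concatenation are illustrative assumptions; the authors' method instead injects learned rule embeddings into multiple BART encoder layers via prefix-tuning, which is not reproduced here.

# A minimal sketch (not the authors' implementation): serialize a news-aware
# semantic rule and condition an off-the-shelf BART model on it as plain text.
from transformers import BartTokenizer, BartForConditionalGeneration


def build_rule_signal(action: str, roles: dict) -> str:
    """Hypothetical helper: flatten the semantic rule (action + roles) into text."""
    role_str = " | ".join(f"{role}: {entity}" for role, entity in roles.items())
    return f"Action: {action} | {role_str}"


tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# Example rule: the main action and the roles of the named entities involved.
rule = build_rule_signal("performing", {"Agent": "The violinist", "Place": "Carnegie Hall"})
article = "The renowned violinist gave a sold-out recital in New York on Sunday..."

# Prepend the rule signal to the article so generation is conditioned on it.
inputs = tokenizer(rule + " </s> " + article, return_tensors="pt",
                   truncation=True, max_length=1024)
output_ids = model.generate(**inputs, num_beams=4, max_length=60)
caption = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(caption)

In the paper, the rule instead conditions BART through prefix vectors attached to several encoder layers, so the rule constrains generation without rewriting the input text; the text-concatenation above is only a stand-in to illustrate how a rule signal can steer decoding.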

Authors (4)
  1. Ning Xu (151 papers)
  2. Tingting Zhang (53 papers)
  3. Hongshuo Tian (4 papers)
  4. An-An Liu (20 papers)
