Transform and Tell: Entity-Aware News Image Captioning (2004.08070v2)

Published 17 Apr 2020 in cs.CV and cs.CL

Abstract: We propose an end-to-end model which generates captions for images embedded in news articles. News images present two key challenges: they rely on real-world knowledge, especially about named entities; and they typically have linguistically rich captions that include uncommon words. We address the first challenge by associating words in the caption with faces and objects in the image, via a multi-modal, multi-head attention mechanism. We tackle the second challenge with a state-of-the-art transformer language model that uses byte-pair-encoding to generate captions as a sequence of word parts. On the GoodNews dataset, our model outperforms the previous state of the art by a factor of four in CIDEr score (13 to 54). This performance gain comes from a unique combination of language models, word representation, image embeddings, face embeddings, object embeddings, and improvements in neural network design. We also introduce the NYTimes800k dataset, which is 70% larger than GoodNews, has higher article quality, and includes the locations of images within articles as an additional contextual cue.
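The abstract mentions generating captions as a sequence of word parts via byte-pair encoding. As a rough illustration of that general idea only, here is a toy sketch in Python. It is not the authors' tokenizer: the function names (learn_bpe, merge_pair, encode), the end-of-word marker, and the tiny word list are all assumptions made for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Each word starts as a tuple of characters plus a hypothetical end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken lexicographically for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Split a word into word parts by applying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    merges = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
    print(merges)
    print(encode("slow", merges))
```

The point of the sketch is that a word never seen during merge learning, such as "slow" here, still decomposes into known parts, which is why subword vocabularies help with the uncommon words the abstract describes.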

Authors (3)

Alasdair Tran (6 papers)
Alexander Mathews (9 papers)
Lexing Xie (54 papers)

Citations (93)
