Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets (2111.12727v3)

Published 24 Nov 2021 in cs.CV, cs.AI, cs.CL, and cs.MM

Abstract: This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources, containing both human-annotated and web-collected captions. Large-scale datasets with noisy image-text pairs, indeed, provide a sub-optimal source of supervision because of their low-quality descriptive style, while human-annotated datasets are cleaner but smaller in scale. To get the best of both worlds, we propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component. The proposed model avoids the need for object detectors, is trained with a single objective of prompt language modeling, and can replicate the style of human-collected captions while training on sources with different input styles. Experimentally, the model shows a strong capability of recognizing real-world concepts and producing high-quality captions. Extensive experiments are performed on different image captioning datasets, including CC3M, nocaps, and the competitive COCO dataset, where our model consistently outperforms baselines and state-of-the-art approaches.

Authors (4)

Marcella Cornia (61 papers)
Lorenzo Baraldi (68 papers)
Giuseppe Fiameni (18 papers)
Rita Cucchiara (142 papers)

Citations (11)
