
Length-Controllable Image Captioning

Published 19 Jul 2020 in cs.CV, cs.CL, and cs.LG | arXiv:2007.09580v1

Abstract: The last decade has witnessed remarkable progress in the image captioning task; however, most existing methods cannot control their captions, e.g., choosing to describe the image either roughly or in detail. In this paper, we propose to use a simple length level embedding to endow them with this ability. Moreover, due to their autoregressive nature, the computational complexity of existing models increases linearly as the length of the generated captions grows. Thus, we further devise a non-autoregressive image captioning approach that can generate captions in a length-irrelevant complexity. We verify the merit of the proposed length level embedding on three models: two state-of-the-art (SOTA) autoregressive models with different types of decoder, as well as our proposed non-autoregressive model, to show its generalization ability. In the experiments, our length-controllable image captioning models not only achieve SOTA performance on the challenging MS COCO dataset but also generate length-controllable and diverse image captions. Specifically, our non-autoregressive model outperforms the autoregressive baselines in terms of controllability and diversity, and also significantly improves the decoding efficiency for long captions. Our code and models are released at https://github.com/bearcatt/LaBERT.

Citations (52)

Summary

  • The paper introduces length-level embeddings that adjust caption length, enhancing model flexibility and decoding efficiency.
  • It integrates these embeddings into state-of-the-art models such as AoANet and VLP, leading to improved metrics such as CIDEr-D and SPICE.
  • It presents a novel non-autoregressive model, LaBERT, which uses iterative refinement to generate captions efficiently without sequential dependency.

Length-Controllable Image Captioning

The paper "Length-Controllable Image Captioning" introduces a novel approach to image captioning by integrating length level embeddings, allowing for controlled variation in the length and detail of the generated captions. This method enhances both autoregressive and non-autoregressive models, presenting improvements in flexibility and decoding efficiency that are particularly relevant for long captions.

The research outlines the implementation and integration of length level embeddings into existing image captioning frameworks, demonstrating their applicability to two state-of-the-art autoregressive models: AoANet and VLP. The technique enables these models to generate captions of varying lengths, from concise to detailed, simply by adding a length-level-specific embedding to the input token embeddings. With this approach, the models retain or even improve performance on standard metrics such as CIDEr-D and SPICE for typical caption lengths (10-14 tokens), and also produce high-quality results for longer captions.
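The mechanism is straightforward to sketch: the desired length level is looked up in a small learned embedding table and added to each token embedding, alongside the usual position embedding. The snippet below is a minimal illustration of this idea rather than the authors' released code; the level boundaries, module names, and dimensions are assumptions.

```python
import torch
import torch.nn as nn

# Illustrative length-level buckets (assumed; the paper partitions caption
# lengths into a few discrete levels along these lines).
LENGTH_LEVELS = [(1, 9), (10, 14), (15, 19), (20, 25)]

def length_to_level(length: int) -> int:
    """Map a caption length to its length level index."""
    for level, (lo, hi) in enumerate(LENGTH_LEVELS):
        if lo <= length <= hi:
            return level
    return len(LENGTH_LEVELS) - 1

class LengthAwareEmbedding(nn.Module):
    """Token + position + length-level embedding (hypothetical module)."""
    def __init__(self, vocab_size, hidden_size, max_len=25, num_levels=4):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden_size)
        self.pos = nn.Embedding(max_len, hidden_size)
        self.lvl = nn.Embedding(num_levels, hidden_size)  # one vector per length level

    def forward(self, token_ids, level):
        # token_ids: (batch, seq_len); level: (batch,) desired length level
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.tok(token_ids)
                + self.pos(positions)[None, :, :]
                + self.lvl(level)[:, None, :])
```

At inference time, changing the `level` input is all that is needed to steer the model toward shorter or longer captions.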

A significant contribution is the introduction of a non-autoregressive model, termed LaBERT (Length-aware BERT). This model reuses BERT's architecture with a modified input embedding layer that incorporates both image features and length level information. The non-autoregressive structure offers substantial gains in caption generation efficiency, since it produces all tokens in parallel rather than through the sequential token-by-token generation typical of autoregressive models. LaBERT employs iterative refinement to improve caption quality and keeps decoding complexity independent of caption length, giving it a marked speed advantage over autoregressive counterparts, especially for long captions.
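In the spirit of mask-predict style decoding, the refinement loop can be sketched as follows: all positions start as [MASK], the model predicts every token in parallel, and at each subsequent pass the lowest-confidence tokens are re-masked and re-predicted. This is a simplified sketch under assumed interfaces; the model signature, the re-mask schedule, and how LaBERT selects the final length within a level are not taken from the paper verbatim.

```python
import torch

@torch.no_grad()
def refine_decode(model, image_feats, target_len, mask_id, num_iters=10):
    """Mask-predict style non-autoregressive decoding (hypothetical interface).

    All positions start as [MASK]; every token is predicted in parallel, and at
    each pass the least confident tokens are re-masked and re-predicted.
    """
    tokens = torch.full((1, target_len), mask_id, dtype=torch.long)
    for step in range(num_iters):
        logits = model(image_feats, tokens)      # (1, target_len, vocab_size)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        tokens = pred                            # parallel update of all tokens
        if step + 1 == num_iters:
            break
        # Linearly decaying re-mask schedule: fewer tokens are revised later on.
        n_mask = int(target_len * (num_iters - 1 - step) / num_iters)
        if n_mask > 0:
            worst = conf.topk(n_mask, largest=False).indices
            tokens[0, worst[0]] = mask_id
    return tokens
```

Because the number of passes is fixed, the cost of decoding does not grow with the caption length, which is the source of the efficiency gains reported for long captions.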

Key numerical results indicate the effectiveness of the length level embedding across models. The length-aware versions of AoANet and VLP outperform their original designs in CIDEr-D and SPICE within specific length ranges, with more pronounced gains in caption detail and diversity when generating longer captions. LaBERT posts competitive scores across metrics and achieves higher control precision and semantic diversity than the autoregressive methods, as measured by the SelfCIDEr, Div-1, and Div-2 metrics.
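For context, Div-n is commonly computed as the ratio of distinct n-grams to total n-grams over the set of captions generated for a single image, so higher values indicate more varied wording. The helper below illustrates that common definition; the paper's exact evaluation protocol may differ.

```python
def div_n(captions, n):
    """Distinct n-grams divided by total n-grams over a set of captions
    generated for the same image (a common definition of Div-n)."""
    ngrams = []
    for cap in captions:
        toks = cap.split()
        ngrams.extend(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return len(set(ngrams)) / max(len(ngrams), 1)

captions = ["a man riding a wave on a surfboard",
            "a surfer rides a large wave in the ocean"]
print(div_n(captions, 1), div_n(captions, 2))
```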

The implications of this research are twofold: practically, it provides an efficient method for generating diverse captions suited to specific user requirements without compromising quality; theoretically, it suggests an avenue for future image captioning models to incorporate modular, user-driven constraints on output structure, enhancing their applicability and user interactivity in AI systems. Looking forward, the integration of varied constraint-oriented embeddings in model architectures poses considerable potential for advancements in AI-driven language and vision tasks.
