Length-Controllable Image Captioning (2007.09580v1)

Published 19 Jul 2020 in cs.CV, cs.CL, and cs.LG

Abstract: The last decade has witnessed remarkable progress in the image captioning task; however, most existing methods cannot control their captions, e.g., choosing to describe the image either roughly or in detail. In this paper, we propose to use a simple length level embedding to endow them with this ability. Moreover, due to their autoregressive nature, the computational complexity of existing models increases linearly as the length of the generated captions grows. Thus, we further devise a non-autoregressive image captioning approach that can generate captions in a length-irrelevant complexity. We verify the merit of the proposed length level embedding on three models: two state-of-the-art (SOTA) autoregressive models with different types of decoder, as well as our proposed non-autoregressive model, to show its generalization ability. In the experiments, our length-controllable image captioning models not only achieve SOTA performance on the challenging MS COCO dataset but also generate length-controllable and diverse image captions. Specifically, our non-autoregressive model outperforms the autoregressive baselines in terms of controllability and diversity, and also significantly improves the decoding efficiency for long captions. Our code and models are released at https://github.com/bearcatt/LaBERT.

Length-Controllable Image Captioning

The paper "Length-Controllable Image Captioning" introduces a novel approach to image captioning by integrating length level embeddings, allowing for controlled variation in the length and detail of the generated captions. This method enhances both autoregressive and non-autoregressive models, presenting improvements in flexibility and decoding efficiency that are particularly relevant for long captions.

The research outlines the implementation and integration of length level embeddings into existing image captioning frameworks. It demonstrates their applicability to two state-of-the-art autoregressive models: AoANet and VLP. The technique enables these models to generate captions of varying lengths, from concise to detailed, simply by adding a length-level-specific embedding to the input token embeddings. With it, the models retain or even improve performance on standard metrics such as CIDEr-D and SPICE across typical caption lengths (10-14 tokens), while also producing high-quality results for longer captions.
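To make the mechanism concrete, the following is a minimal PyTorch sketch of how a length level embedding could be added to the token embeddings before they enter a caption decoder. The class name, the level boundaries, and the interface are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class LengthLevelEmbedding(nn.Module):
    """Sketch: add a length-level vector to every token embedding.

    The level boundaries below are illustrative buckets, not necessarily
    the ones used in the paper.
    """
    def __init__(self, hidden_size: int,
                 level_bounds=((1, 9), (10, 14), (15, 19), (20, 25))):
        super().__init__()
        self.level_bounds = level_bounds
        self.level_embed = nn.Embedding(len(level_bounds), hidden_size)

    def level_of(self, caption_len: int) -> int:
        # Map a target caption length to its discrete length level.
        for i, (lo, hi) in enumerate(self.level_bounds):
            if lo <= caption_len <= hi:
                return i
        return len(self.level_bounds) - 1

    def forward(self, token_embeds: torch.Tensor, caption_len: int) -> torch.Tensor:
        # token_embeds: (batch, seq_len, hidden).
        # Broadcast one length-level vector over all positions in the sequence.
        level = torch.tensor([self.level_of(caption_len)], device=token_embeds.device)
        return token_embeds + self.level_embed(level).unsqueeze(1)
```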

A significant contribution is the introduction of a non-autoregressive model, termed LaBERT (Length-aware BERT). This model uses BERT's architecture with modified input embedding layers that incorporate both image features and length level information. The non-autoregressive structure offers substantial gains in caption generation efficiency, producing all caption tokens in parallel rather than sequentially token by token as in autoregressive models. LaBERT employs iterative refinement to improve caption quality and keeps decoding complexity independent of caption length, giving it a marked speed advantage over autoregressive counterparts for long captions.
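Below is a hedged sketch of the kind of mask-predict style, length-irrelevant decoding loop described above: every position starts as [MASK], each iteration predicts all tokens in parallel, and the least confident positions are re-masked for the next pass. The `model` signature, the re-masking schedule, and the variable names are assumptions for illustration, not LaBERT's exact procedure.

```python
import torch

@torch.no_grad()
def nonautoregressive_decode(model, image_feats, max_len, level_id, mask_id, num_iters=10):
    """Sketch of parallel, iteratively refined caption decoding.

    `model` is assumed to return per-position token logits given image
    features, a (partially masked) caption, and a length level id.
    """
    batch = image_feats.size(0)
    # Start from an all-[MASK] caption of the target length.
    tokens = torch.full((batch, max_len), mask_id, dtype=torch.long,
                        device=image_feats.device)
    for it in range(num_iters):
        logits = model(image_feats, tokens, level_id)      # (batch, max_len, vocab)
        probs, preds = logits.softmax(-1).max(-1)          # per-position confidence and argmax
        tokens = preds
        if it + 1 < num_iters:
            # Re-mask the least confident positions; the masked fraction
            # shrinks linearly over iterations.
            n_mask = int(max_len * (1.0 - (it + 1) / num_iters))
            if n_mask > 0:
                worst = probs.topk(n_mask, dim=-1, largest=False).indices
                tokens.scatter_(1, worst, mask_id)
    return tokens
```

Because every iteration updates all positions at once, the number of model calls depends only on the refinement schedule, not on the caption length.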

Key numerical results indicate the effectiveness of the length level embedding across models. The length-aware versions of AoANet and VLP outperform their original designs in CIDEr-D and SPICE scores for specific length ranges, with more pronounced improvements in caption detail and diversity when generating longer captions. LaBERT exhibits competitive scores across metrics and achieves higher control precision and semantic diversity than the autoregressive methods, as evidenced by the SelfCIDEr, Div-1, and Div-2 metrics.
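As a point of reference for the diversity metrics, here is a small sketch of a Div-n style score, computed as distinct n-grams over the total number of generated words for one image's caption set; this is a common formulation, and the paper's exact normalization may differ.

```python
def div_n(captions, n):
    """Ratio of distinct n-grams to total generated words for one image."""
    ngrams = set()
    total_words = 0
    for cap in captions:
        toks = cap.split()
        total_words += len(toks)
        ngrams.update(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return len(ngrams) / max(total_words, 1)

# Example: Div-1 and Div-2 over two captions of different lengths for one image.
caps = ["a dog runs on the grass",
        "a brown dog is running across a grassy field near a fence"]
print(div_n(caps, 1), div_n(caps, 2))
```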

The implications of this research are twofold. Practically, it provides an efficient method for generating diverse captions suited to specific user requirements without compromising quality. Theoretically, it suggests an avenue for future image captioning models to incorporate modular, user-driven constraints on output structure, enhancing their applicability and interactivity in AI systems. Looking forward, integrating varied constraint-oriented embeddings into model architectures holds considerable potential for advances in AI-driven language and vision tasks.

Authors (4)
  1. Chaorui Deng (12 papers)
  2. Ning Ding (122 papers)
  3. Mingkui Tan (124 papers)
  4. Qi Wu (323 papers)
Citations (52)