Bounding and Filling: A Fast and Flexible Framework for Image Captioning (2310.09876v1)

Published 15 Oct 2023 in cs.CV and cs.CL

Abstract: Most image captioning models decode autoregressively and therefore suffer from significant inference latency. Several models have adopted a non-autoregressive manner to speed up the process. However, the vanilla non-autoregressive manner results in subpar performance: because it generates all words simultaneously, it fails to capture the relationships between words in a description. The semi-autoregressive manner employs a partially parallel method to preserve performance, but it sacrifices inference speed. In this paper, we introduce BoFiCap, a fast and flexible framework for image captioning based on bounding and filling techniques. The BoFiCap model leverages the inherent characteristics of image captioning tasks to pre-define bounding boxes for image regions and their relationships. Subsequently, the BoFiCap model fills the corresponding words into each box using two generation manners. Leveraging the box hints, our filling process allows each word to better perceive other words. Additionally, our model offers flexible image description generation by 1) employing different generation manners based on speed or performance requirements, and 2) producing varied sentences based on user-specified boxes. Experimental evaluations on the MS-COCO benchmark dataset demonstrate that, in a non-autoregressive manner, our framework achieves state-of-the-art performance on the task-specific metric CIDEr (125.6) while running 9.22x faster than the baseline model with an autoregressive manner; in a semi-autoregressive manner, our method reaches 128.4 on CIDEr with a 3.69x speedup. Our code and data are available at https://github.com/ChangxinWang/BoFiCap.
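
To make the speed trade-off in the abstract concrete, here is a minimal sketch that counts sequential decoding steps for a caption already split into boxes. It is not taken from the BoFiCap code. The function name `decoding_steps`, the example boxes, and the assumption that the semi-autoregressive manner fills one box per step while the non-autoregressive manner fills all boxes in a single step are illustrative guesses, and the counts ignore the cost of predicting the boxes themselves.

```python
from typing import List


def decoding_steps(boxes: List[List[str]], manner: str) -> int:
    """Count sequential decoding steps for a caption split into boxes.

    Hypothetical helper for illustration only; the step model is an assumption.
    """
    if manner == "autoregressive":
        # One step per word, left to right.
        return sum(len(box) for box in boxes)
    if manner == "semi-autoregressive":
        # Assumed: one step per box, words inside a box filled in parallel.
        return len(boxes)
    if manner == "non-autoregressive":
        # Assumed: every box is filled in a single parallel step.
        return 1
    raise ValueError(f"unknown generation manner: {manner}")


# Made-up caption, grouped into three boxes (two regions and one relationship).
example_boxes = [["a", "man"], ["riding"], ["a", "brown", "horse"]]

for manner in ("autoregressive", "semi-autoregressive", "non-autoregressive"):
    print(manner, decoding_steps(example_boxes, manner))
```

Under these assumptions, fewer sequential steps generally mean lower latency. The measured speedups reported in the abstract (9.22x and 3.69x) come from the authors' experiments and cannot be derived from step counts alone.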

Authors (5)

Zheng Ma (110 papers)
Changxin Wang (7 papers)
Bo Huang (66 papers)
Zixuan Zhu (8 papers)
Jianbing Zhang (29 papers)

Citations (1)

GitHub

GitHub - ChangxinWang/BoFiCap: Bounding and Filling: A Fast and Flexible Framework for Image Captioning (9 stars)

Tweets

https://twitter.com/BuildUmmah/status/1934838696454574398

https://twitter.com/BuildUmmah/status/1904669927522656478
