CLID: Controlled-Length Image Descriptions with Limited Data (2211.14835v2)

Published 27 Nov 2022 in cs.CV

Abstract: Controllable image captioning models generate human-like image descriptions while allowing a degree of control over the generated captions. This paper focuses on controlling the caption length, i.e., producing either a short and concise description or a long and detailed one. Since existing image captioning datasets consist mostly of short captions, generating long captions is challenging. To address the shortage of long training examples, we propose to enrich the dataset with self-generated captions of varying lengths. These, however, may be of varying quality and are thus unsuitable for conventional training. We introduce a novel training strategy that selects which data points to use at different stages of training. Our method dramatically improves length-control ability while achieving state-of-the-art caption quality. Our approach is general and is shown to apply to paragraph generation as well.
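
The abstract describes two ingredients: enriching the training set with self-generated captions of varying lengths, and a training schedule that selects which data points to use at different stages of training. The following is a minimal Python sketch of one such schedule, under stated assumptions: the Example fields, the quality scores, and the linear threshold decay are illustrative placeholders, since the abstract does not specify the paper's actual selection criterion.

    # Hypothetical curriculum-style data selection: human-written captions are
    # always used, while self-generated captions are admitted gradually as
    # training progresses, so the model eventually sees enough long captions.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Example:
        image_id: str
        caption: str
        length_bin: int     # coarse target-length level this caption represents
        quality: float      # estimated quality in [0, 1]; 1.0 for human captions
        is_synthetic: bool  # True for self-generated captions

    def select_for_epoch(pool: List[Example], epoch: int, num_epochs: int) -> List[Example]:
        """Return the subset of the pool to train on in a given epoch."""
        # Assumed schedule: the quality threshold decays linearly from 1.0
        # to 0.0, so early epochs use mostly trusted data and later epochs
        # accept noisier synthetic captions.
        threshold = 1.0 - epoch / max(1, num_epochs - 1)
        return [ex for ex in pool if not ex.is_synthetic or ex.quality >= threshold]

In this sketch, controlling length at inference time would additionally require conditioning the captioner on length_bin; that conditioning mechanism is outside the scope of the snippet.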

Authors (2)
  1. Elad Hirsch
  2. Ayellet Tal
Citations (3)