Set Prediction Guided by Semantic Concepts for Diverse Video Captioning (2312.15720v1)

Published 25 Dec 2023 in cs.CV

Abstract: Diverse video captioning aims to generate a set of sentences that describe a given video from various aspects. Mainstream methods are trained on independent pairs of a video and a single caption from its ground-truth set, without exploiting the intra-set relationship, which results in low diversity among the generated captions. In contrast, we formulate diverse captioning as a semantic-concept-guided set prediction (SCG-SP) problem that fits the predicted caption set to the ground-truth set, so the set-level relationship is fully captured. Specifically, our set prediction consists of two synergistic tasks: caption generation and an auxiliary concept-combination prediction task that provides extra semantic supervision. Each caption in the set is attached to a concept combination that indicates the caption's primary semantic content and facilitates element alignment in set prediction. Furthermore, we apply a diversity regularization term on concepts to encourage the model to generate semantically diverse captions with various concept combinations. The two tasks share multiple semantics-specific encodings as input, obtained by iterative interaction between visual features and conceptual queries. The correspondence between generated captions and specific concept combinations further guarantees the interpretability of our model. Extensive experiments on benchmark datasets show that the proposed SCG-SP achieves state-of-the-art (SOTA) performance under both relevance and diversity metrics.
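
Fitting a predicted caption set to a ground-truth set presupposes a one-to-one alignment between the two sets before per-element losses can be computed; set-prediction models of this kind typically solve that alignment as a bipartite matching (the Hungarian method). The sketch below illustrates such an alignment step, not the authors' implementation: the function name match_caption_sets, the concept_weight hyperparameter, and the toy random costs are all assumptions, and the per-pair caption and concept cost matrices are assumed to be precomputed.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_caption_sets(caption_costs: np.ndarray,
                       concept_costs: np.ndarray,
                       concept_weight: float = 1.0):
    """Align predicted captions with ground-truth captions (set prediction).

    caption_costs[i, j]: caption loss of prediction i against ground-truth
        caption j (e.g., negative log-likelihood of the reference words).
    concept_costs[i, j]: loss of prediction i's predicted concept
        combination against the combination extracted from ground truth j.
    Returns (pred_idx, gt_idx), the assignment minimizing the total cost.
    """
    total_cost = caption_costs + concept_weight * concept_costs
    # linear_sum_assignment solves the bipartite matching (Hungarian method).
    pred_idx, gt_idx = linear_sum_assignment(total_cost)
    return pred_idx, gt_idx


# Toy usage: match a set of 4 predicted captions to 4 references.
rng = np.random.default_rng(0)
pred_idx, gt_idx = match_caption_sets(rng.random((4, 4)), rng.random((4, 4)))
print(list(zip(pred_idx, gt_idx)))
```

Under such a matching, each predicted caption would be supervised only by its assigned ground-truth caption and concept combination, which is what lets training operate on sets rather than on independent video-caption pairs.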

Authors (7)
  1. Yifan Lu (38 papers)
  2. Ziqi Zhang (64 papers)
  3. Chunfeng Yuan (35 papers)
  4. Peng Li (390 papers)
  5. Yan Wang (733 papers)
  6. Bing Li (374 papers)
  7. Weiming Hu (91 papers)
Citations (3)