Implicit and Explicit Commonsense for Multi-sentence Video Captioning (2303.07545v2)

Published 14 Mar 2023 in cs.CV

Abstract: Existing dense or paragraph video captioning approaches rely on holistic representations of videos, possibly coupled with learned object/action representations, to condition hierarchical language decoders. However, they fundamentally lack the commonsense knowledge of the world required to reason about the progression of events, causality, and even the function of certain objects within a scene. To address this limitation we propose a novel Transformer-based video captioning model that takes into account both implicit (visuo-lingual and purely linguistic) and explicit (knowledge-base) commonsense knowledge. We show that these forms of knowledge, in isolation and in combination, enhance the quality of the produced captions. Further, inspired by imitation learning, we propose a new task of instruction generation, where the goal is to produce a set of linguistic instructions from a video demonstration of its performance. We formalize the task using the ALFRED dataset [54], generated using the AI2-THOR environment. While instruction generation is conceptually similar to paragraph captioning, it differs in that it exhibits stronger object persistence, as well as spatially-aware and causal sentence structure. We show that our commonsense-knowledge-enhanced approach produces significant improvements on this task (up to 57% in METEOR and 8.5% in CIDEr), as well as state-of-the-art results on more traditional video captioning on the ActivityNet Captions dataset [29].
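
The abstract describes conditioning a Transformer caption decoder on video features together with implicit (language-model-derived) and explicit (knowledge-base) commonsense representations. The PyTorch sketch below illustrates one simple way such conditioning could be wired up, by projecting each input stream to a shared width and concatenating them into a single cross-attention memory. The module names, feature dimensions, and concatenation-based fusion are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch (assumed design, not the authors' implementation): a caption
# decoder that cross-attends over video features plus implicit and explicit
# commonsense embeddings.
import torch
import torch.nn as nn


class CommonsenseConditionedCaptioner(nn.Module):
    def __init__(self, d_model=512, vocab_size=10000, n_heads=8, n_layers=3):
        super().__init__()
        # Project heterogeneous inputs into a shared d_model space.
        # All input dimensions below are assumed for illustration.
        self.video_proj = nn.Linear(1024, d_model)    # e.g. clip-level video features
        self.implicit_proj = nn.Linear(768, d_model)  # e.g. LM-derived commonsense vectors
        self.explicit_proj = nn.Linear(300, d_model)  # e.g. knowledge-base entity embeddings
        self.embed = nn.Embedding(vocab_size, d_model)
        decoder_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, video_feats, implicit_cs, explicit_cs, caption_tokens):
        # Build one memory sequence the decoder cross-attends over:
        # [video tokens ; implicit commonsense tokens ; explicit commonsense tokens]
        memory = torch.cat(
            [self.video_proj(video_feats),
             self.implicit_proj(implicit_cs),
             self.explicit_proj(explicit_cs)], dim=1)
        tgt = self.embed(caption_tokens)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.decoder(tgt, memory, tgt_mask=causal_mask)
        return self.lm_head(out)  # per-token vocabulary logits


# Toy usage: batch of 2 videos with 16 clip features, 4 implicit and
# 6 explicit commonsense vectors, and a 12-token caption prefix.
model = CommonsenseConditionedCaptioner()
logits = model(torch.randn(2, 16, 1024), torch.randn(2, 4, 768),
               torch.randn(2, 6, 300), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])
```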

References (76)
  1. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, 2018.
  2. VQA: Visual Question Answering. In ICCV, 2015.
  3. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
  4. G3raphground: Graph-based language grounding. In CVPR, 2019.
  5. Mutan: Multimodal Tucker fusion for visual question answering. In ICCV, 2017.
  6. Gpt-neo: Large scale autoregressive language modeling with mesh-tensorflow. 2021.
  7. Language models are few-shot learners. NeurIPS, 2020.
  8. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, 2017.
  9. iPerceive: Applying common-sense reasoning to multi-modal dense video captioning and video question answering. WACV, 2021.
  10. It’s not rocket science: Interpreting figurative language in narratives. TACL, 2022.
  11. Ratthachat Chatpatanasiri. Gpt3 and commonsense reasoning. https://agi.miraheze.org/, 2021.
  12. Motion guided spatial attention for video captioning. In AAAI, 2019.
  13. Semi-supervised grounding alignment for multi-modal feature learning. In CRV, 2022.
  14. A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching. In CVPR, 2013.
  15. Visual grounding via accumulated attention. In CVPR, 2018.
  16. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the ninth workshop on statistical machine translation, 2014.
  17. The pile: An 800gb dataset of diverse text for language modeling. ArXiv, 2020.
  18. Conceptbert: Concept-aware representation for visual question answering. In EMNLP, 2020.
  19. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017.
  20. Deep residual learning for image recognition. In CVPR, 2016.
  21. Comet-atomic 2020: On symbolic and neural commonsense knowledge graphs. In AAAI, 2021.
  22. A better use of audio-visual cues: Dense video captioning with bi-modal transformer. In BMVC, 2020.
  23. Multi-modal dense video captioning. In CVPRW, 2020.
  24. How can we know what language models know? In TACL, 2020.
  25. Densecap: Fully convolutional localization networks for dense captioning. In CVPR, 2016.
  26. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.
  27. Natural language description of human activities from video images based on concept hierarchy of actions. IJCV, 2002.
  28. Ai2-thor: An interactive 3d environment for visual ai. ArXiv, 2017.
  29. Dense-captioning events in videos. In ICCV, 2017.
  30. Mart: Memory-augmented recurrent transformer for coherent video paragraph captioning. In ACL, 2020.
  31. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In ACL, 2020.
  32. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. ArXiv, 2023.
  33. Jointly localizing and describing events for dense video captioning. In CVPR, 2018.
  34. Kagnet: Knowledge-aware graph networks for commonsense reasoning. In EMNLP-IJCNLP, 2019.
  35. Video paragraph captioning as a text summarization task. In ACL-IJCNLP, 2021.
  36. Like hiking? you probably enjoy nature: Persona-grounded dialog with commonsense expansions. In EMNLP, 2020.
  37. Middle-out decoding. NeurIPS, 2018.
  38. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In ICCV, 2019.
  39. Streamlined dense video captioning. In CVPR, 2019.
  40. Bleu: a method for automatic evaluation of machine translation. In ACL, 2002.
  41. Adversarial inference for multi-sentence video description. In CVPR, 2019.
  42. Does pre-training induce systematic inference? how masked language models acquire commonsense knowledge. ACL, 2022.
  43. Learning transferable visual models from natural language supervision. In ICML, 2021.
  44. Improving language understanding by generative pre-training. 2018.
  45. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 2020.
  46. Watch, listen and tell: Multi-modal weakly supervised dense event captioning. In ICCV, 2019.
  47. Vlc-bert: Visual question answering with contextualized commonsense knowledge. ArXiv, 2022.
  48. Sentence-bert: Sentence embeddings using siamese bert-networks. In EMNLP, 2019.
  49. Translating video content to natural language descriptions. In ICCV, 2013.
  50. Grounding of textual phrases in images by reconstruction. In ECCV, 2016.
  51. Atomic: An atlas of machine commonsense for if-then reasoning. In AAAI, 2019.
  52. End-to-end generative pretraining for multimodal video captioning. In CVPR, 2022.
  53. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, 2018.
  54. Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In CVPR, 2020.
  55. Unsupervised commonsense question answering with self-talk. In EMNLP, 2020.
  56. Towards diverse paragraph captioning for untrimmed videos. In CVPR, 2021.
  57. Conceptnet 5.5: An open multilingual graph of general knowledge. In AAAI, 2017.
  58. Sequence to sequence learning with neural networks. NeurIPS, 2014.
  59. Rethinking the inception architecture for computer vision. In CVPR, 2016.
  60. HypoGen: Hyperbole generation with commonsense and counterfactual knowledge. In EMNLP, 2021.
  61. A simple method for commonsense reasoning. 2018.
  62. Cider: Consensus-based image description evaluation. In CVPR, 2015.
  63. Sequence to sequence-video to text. In ICCV, 2015.
  64. Superglue: A stickier benchmark for general-purpose language understanding systems. NeurIPS, 2019.
  65. Bidirectional attentive fusion with context gating for dense video captioning. In CVPR, 2018.
  66. Video captioning via hierarchical reinforcement learning. In CVPR, 2018.
  67. Watch, listen, and describe: Globally and locally aligned cross-modal attentions for video captioning. NAACL HLT, 2018.
  68. Neural text generation with unlikelihood training. In ICLR, 2020.
  69. Weakly-supervised visual grounding of phrases with linguistic structures. In CVPR, 2017.
  70. Rethinking spatiotemporal feature learning for video understanding. ArXiv, 2017.
  71. Move forward and tell: A progressive generator of video descriptions. In ECCV, 2018.
  72. Joint event detection and description in continuous video streams. In WACV, 2019.
  73. Describing videos by exploiting temporal structure. In ICCV, 2015.
  74. Unifying event detection and captioning as sequence generation via pre-training. In ECCV. Springer, 2022.
  75. End-to-end dense video captioning with masked transformer. In CVPR, 2018.
  76. Pre-training text-to-text transformers for concept-centric common sense. In ICLR, 2020.
Authors (3)
  1. Shih-Han Chou (9 papers)
  2. James J. Little (24 papers)
  3. Leonid Sigal (102 papers)
Citations (1)