
Relational Graph Learning for Grounded Video Description Generation (2112.00967v1)

Published 2 Dec 2021 in cs.CV, cs.AI, and cs.CL

Abstract: Grounded video description (GVD) encourages captioning models to dynamically attend to appropriate video regions (e.g., objects) and generate a description. Such a setting helps explain the decisions of captioning models and prevents them from hallucinating object words in their descriptions. However, this design focuses mainly on object-word generation and may therefore ignore fine-grained information and suffer from missing visual concepts. Moreover, relational words (e.g., "jump left or right") are usually the result of spatio-temporal inference, i.e., they cannot be grounded in specific spatial regions. To tackle these limitations, we design a novel relational graph learning framework for GVD, in which a language-refined scene graph representation is designed to explore fine-grained visual concepts. Furthermore, the refined graph can be regarded as relational inductive knowledge that assists the captioning model in selecting the relevant information it needs to generate correct words. We validate the effectiveness of our model through automatic metrics and human evaluation; the results indicate that our approach generates more fine-grained and accurate descriptions and alleviates the problem of object hallucination to some extent.
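The abstract's key idea, using the refined scene graph as relational inductive knowledge that the decoder consults when generating each word, can be illustrated with a minimal attention-over-graph-nodes decoder. The PyTorch sketch below is an illustrative assumption, not the paper's actual architecture: all names (`GraphAttnCaptioner`, `node_feats`, etc.) are hypothetical, and the graph nodes are assumed to be pre-extracted embeddings of objects and relation phrases.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttnCaptioner(nn.Module):
    """Illustrative sketch: a GRU decoder that attends over scene-graph
    node embeddings (objects and relation phrases) at each decoding step.
    This is a hedged approximation of graph-conditioned grounding, not
    the exact model proposed in the paper."""

    def __init__(self, vocab_size, embed_dim=512, node_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRUCell(embed_dim + node_dim, hidden_dim)
        self.attn_q = nn.Linear(hidden_dim, node_dim)  # query from decoder state
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, node_feats, tokens):
        # node_feats: (B, N, node_dim) -- embeddings of scene-graph nodes
        # tokens:     (B, T)           -- caption tokens (teacher forcing)
        B, N, _ = node_feats.shape
        h = node_feats.new_zeros(B, self.rnn.hidden_size)
        logits = []
        for t in range(tokens.size(1)):
            q = self.attn_q(h)                                          # (B, node_dim)
            scores = torch.bmm(node_feats, q.unsqueeze(2)).squeeze(2)   # (B, N)
            alpha = F.softmax(scores, dim=1)                            # attention over nodes
            ctx = torch.bmm(alpha.unsqueeze(1), node_feats).squeeze(1)  # (B, node_dim)
            x = torch.cat([self.embed(tokens[:, t]), ctx], dim=1)
            h = self.rnn(x, h)
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)  # (B, T, vocab_size)
```

In this sketch the attention weights `alpha` play the grounding role: each generated word is tied to the graph nodes it attended to, which is how a relation node could support a relational word (e.g., "jump left") that no single spatial region can ground on its own.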

Authors (8)
  1. Wenqiao Zhang (51 papers)
  2. Xin Eric Wang (74 papers)
  3. Siliang Tang (116 papers)
  4. Haizhou Shi (25 papers)
  5. Haocheng Shi (1 paper)
  6. Jun Xiao (134 papers)
  7. Yueting Zhuang (164 papers)
  8. William Yang Wang (254 papers)
Citations (31)
