Open-Ended Long-Form Video Question Answering via Hierarchical Convolutional Self-Attention Networks (1906.12158v1)

Published 28 Jun 2019 in cs.CV and cs.LG

Abstract: Open-ended video question answering aims to automatically generate a natural-language answer from referenced video content according to a given question. Most existing approaches focus on short-form video question answering with multi-modal recurrent encoder-decoder networks. Although these works achieve promising performance, they can be ineffective for long-form video question answering because they lack long-range dependency modeling and incur heavy computational cost. To tackle these problems, we propose a fast Hierarchical Convolutional Self-Attention encoder-decoder network (HCSA). Concretely, we first develop a hierarchical convolutional self-attention encoder to efficiently model long-form video content, which builds a hierarchical structure over the video sequence and captures question-aware long-range dependencies from the video context. We then devise a multi-scale attentive decoder that incorporates multi-layer video representations for answer generation, avoiding the information loss that occurs when only the top encoder layer is used. Extensive experiments show the effectiveness and efficiency of our method.
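The paper's code is not reproduced here, but the abstract's two core ideas can be illustrated with a minimal PyTorch sketch: an encoder layer that pairs 1D convolution (local context) with self-attention (long-range dependencies), stacked with temporal pooling to form a hierarchy, and a decoder step that attends over every encoder scale rather than only the top layer. All class names, dimensions, and hyperparameters below are assumptions for illustration, not the authors' implementation; in particular, the question-aware conditioning described in the abstract is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvSelfAttentionLayer(nn.Module):
    """Hypothetical encoder layer: 1D conv for local context,
    then self-attention for long-range dependencies."""

    def __init__(self, dim, heads=4, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):  # x: (batch, time, dim)
        # Convolution over the temporal axis captures local context.
        c = self.conv(x.transpose(1, 2)).transpose(1, 2)
        x = self.norm1(x + c)
        # Self-attention captures long-range dependencies.
        a, _ = self.attn(x, x, x)
        return self.norm2(x + a)


class HierarchicalEncoder(nn.Module):
    """Stack of conv-attention layers with temporal pooling between
    layers, so higher layers see a coarser, longer-horizon view."""

    def __init__(self, dim, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList(
            ConvSelfAttentionLayer(dim) for _ in range(num_layers)
        )

    def forward(self, x):
        outputs = []  # one representation per temporal scale
        for layer in self.layers:
            x = layer(x)
            outputs.append(x)
            # Halve the temporal resolution before the next layer.
            x = F.max_pool1d(x.transpose(1, 2), 2).transpose(1, 2)
        return outputs


class MultiScaleAttentiveDecoderStep(nn.Module):
    """One decoding step that attends over every encoder scale and
    fuses the per-scale contexts, keeping information from layers
    below the top of the encoder."""

    def __init__(self, dim, num_scales=3, heads=4):
        super().__init__()
        self.attns = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True)
            for _ in range(num_scales)
        )
        self.fuse = nn.Linear(num_scales * dim, dim)

    def forward(self, query, encoder_outputs):  # query: (batch, 1, dim)
        contexts = [
            attn(query, mem, mem)[0]
            for attn, mem in zip(self.attns, encoder_outputs)
        ]
        return self.fuse(torch.cat(contexts, dim=-1))


if __name__ == "__main__":
    frames = torch.randn(2, 64, 256)  # 64 frame features of dim 256
    memories = HierarchicalEncoder(256)(frames)
    step = MultiScaleAttentiveDecoderStep(256)
    print(step(torch.randn(2, 1, 256), memories).shape)  # (2, 1, 256)
```

The pooling between layers is what makes the hierarchy cheap relative to a flat recurrent encoder: each successive layer attends over a sequence half as long, while the decoder's per-scale attention recovers the fine-grained detail that would otherwise be lost at the top layer.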

Authors (5)
  1. Zhu Zhang (39 papers)
  2. Zhou Zhao (219 papers)
  3. Zhijie Lin (30 papers)
  4. Jingkuan Song (115 papers)
  5. Xiaofei He (70 papers)
Citations (14)
