Mounting Video Metadata on Transformer-based Language Model for Open-ended Video Question Answering (2108.05158v1)

Published 11 Aug 2021 in cs.CV and cs.AI

Abstract: Video question answering has recently received substantial attention from multimodal video researchers. Most video question answering datasets take the form of multiple-choice questions, but a model trained for the multiple-choice task does not infer the answer; rather, it compares the answer candidates to pick the correct one, which also makes it difficult to extend to other tasks. In this paper, we move beyond existing multiple-choice video question answering by reformulating it as open-ended video question answering. To tackle open-ended question answering, we use a pretrained GPT-2 model, fine-tuned with video inputs and subtitles. An ablation study, performed by converting the existing DramaQA dataset to open-ended question answering, shows that performance can be improved using video metadata.
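The setup described above relies on serializing multimodal context (subtitles plus video metadata) into a single token sequence that a GPT-2 style decoder can be fine-tuned on. A minimal, hypothetical sketch of such prompt construction follows; the field names (speaker, behavior, emotion) echo DramaQA-style annotations, and the exact serialization format is an assumption, not the paper's published scheme:

```python
# Hypothetical sketch: flatten subtitles and per-character video metadata
# into one text prompt ending with an answer cue, ready for a decoder-only
# LM. Field names and layout are assumptions for illustration.

def build_prompt(subtitles, metadata, question):
    """Concatenate subtitles, character metadata, and the question
    into a single flat prompt string ending with 'answer:'."""
    sub_text = " ".join(f"{s['speaker']}: {s['utterance']}" for s in subtitles)
    meta_text = " ".join(
        f"{m['character']} ({m['behavior']}, {m['emotion']})" for m in metadata
    )
    return f"subtitle: {sub_text} metadata: {meta_text} question: {question} answer:"

prompt = build_prompt(
    subtitles=[{"speaker": "Haeyoung", "utterance": "I'm leaving now."}],
    metadata=[{"character": "Haeyoung", "behavior": "stand up", "emotion": "sad"}],
    question="What is Haeyoung doing?",
)
```

During fine-tuning, the gold answer text would be appended after the `answer:` cue and the model trained with the usual language-modeling loss; at inference, generation continues from the cue, which is what makes the task open-ended rather than candidate-ranking.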

Authors (4)
  1. Donggeon Lee (9 papers)
  2. Seongho Choi (9 papers)
  3. Youwon Jang (4 papers)
  4. Byoung-Tak Zhang (83 papers)
Citations (2)
