Location-aware Graph Convolutional Networks for Video Question Answering
The paper proposes a novel method for video question answering (VQA) built on Location-aware Graph Convolutional Networks (L-GCNs). The approach aims to strengthen the reasoning needed to answer natural-language questions about videos by modeling the relationships and interactions between objects across time, rather than relying solely on spatio-temporal features of individual video frames.
Overview of L-GCNs
Video question answering presents unique challenges, such as understanding dynamic interactions and temporal sequences in lengthy videos that are often cluttered with irrelevant background content and actions scattered across frames. The paper argues that existing methods, which emphasize spatio-temporal attention mechanisms, fall short when reasoning about object interactions and the location information that is critical for comprehensive action recognition and question answering.
The proposed method builds on an object-based graph representation in which each node combines an object's appearance and location features. Appearance features are extracted with a pre-trained object detector, and location information is encoded across both spatial coordinates and temporal position using sine and cosine functions. Together, these features yield a graph that captures not only the objects themselves but also their spatial and temporal attributes, giving the graph its location awareness.
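To make the location encoding concrete, below is a minimal PyTorch-style sketch of how such location-aware node features could be assembled. The function names (`sinusoidal_encoding`, `build_node_features`), the feature dimensions, and the exact inputs (normalized box coordinates plus a frame index) are illustrative assumptions rather than the paper's reference implementation.

```python
import torch

def sinusoidal_encoding(values: torch.Tensor, d_model: int) -> torch.Tensor:
    """Map scalar positions to d_model-dim sine/cosine features.

    A Transformer-style frequency schedule is assumed here, not taken from the paper.
    values: (N,) scalars, e.g. normalized box coordinates or frame indices.
    """
    i = torch.arange(d_model // 2, dtype=torch.float32)
    freqs = 10000.0 ** (-2.0 * i / d_model)                 # (d_model/2,)
    angles = values.float().unsqueeze(-1) * freqs           # (N, d_model/2)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)  # (N, d_model)

def build_node_features(appearance, boxes, frame_idx, d_loc=64):
    """Concatenate detector appearance features with spatial/temporal location codes.

    appearance: (N, d_app) region features from a pre-trained object detector
    boxes:      (N, 4) normalized (x1, y1, x2, y2) box coordinates
    frame_idx:  (N,) index of the frame each object was detected in
    """
    spatial = torch.cat([sinusoidal_encoding(boxes[:, k], d_loc) for k in range(4)], dim=-1)
    temporal = sinusoidal_encoding(frame_idx, d_loc)
    return torch.cat([appearance, spatial, temporal], dim=-1)  # (N, d_app + 5 * d_loc)
```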
Methodological Contributions
- Graph-based Representation: The contents of a video are represented as a fully connected graph whose nodes correspond to objects detected by an object detector; edges denote relationships between objects, from which interactions are extracted.
- Location Encoding: Spatial and temporal location features are embedded into the nodes, giving the network access to the positional context of actions and enabling reasoning with temporal cues, which is crucial for accurate answer prediction.
- Graph Convolution: Graph convolution models the interactions between objects, allowing direct communication between all nodes regardless of their adjacency in the video sequence, which yields robust action representations and stronger reasoning (see the sketch after this list).
- Attention Mechanism: An attention module merges the question with the graph node outputs, helping the model focus on relevant visual content by aligning question features with the location-aware graph outputs (also covered in the sketch below).
- Comprehensive Evaluation: The paper demonstrates the method's efficacy through extensive experiments on several benchmark datasets (TGIF-QA, Youtube2Text-QA, and MSVD-QA), showing competitive accuracy and improvements over existing state-of-the-art methods.
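As a rough illustration of the graph-convolution and attention components above, the sketch below wires a single graph-convolution layer over a fully connected object graph into a question-guided attention pooling step. The module names, the similarity-based adjacency, and the residual update are assumptions made for illustration; the paper's exact layer definitions may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FullyConnectedGCN(nn.Module):
    """One graph-convolution layer over a fully connected object graph.

    Edge weights come from a learned pairwise similarity, so every node can
    exchange information with every other node regardless of frame adjacency.
    """
    def __init__(self, dim):
        super().__init__()
        self.theta = nn.Linear(dim, dim)    # projections for the similarity kernel
        self.phi = nn.Linear(dim, dim)
        self.update = nn.Linear(dim, dim)   # node update after aggregation

    def forward(self, nodes):                               # nodes: (N, dim)
        sim = self.theta(nodes) @ self.phi(nodes).t()       # (N, N) pairwise scores
        adj = F.softmax(sim, dim=-1)                        # row-normalized adjacency
        messages = adj @ nodes                              # aggregate features from all nodes
        return F.relu(self.update(messages)) + nodes        # residual node update

class QuestionGuidedAttention(nn.Module):
    """Pool graph outputs into a single visual summary, weighted by the question."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, nodes, question):                     # nodes: (N, dim), question: (dim,)
        q = question.unsqueeze(0).expand(nodes.size(0), -1)
        logits = self.score(torch.cat([nodes, q], dim=-1))  # (N, 1) relevance of each object
        weights = F.softmax(logits, dim=0)
        return (weights * nodes).sum(dim=0)                 # question-aware visual feature
```

A usage sketch under the same assumptions: pass the location-aware node features through `FullyConnectedGCN(dim)` and pool the result with `QuestionGuidedAttention(dim)` conditioned on an encoded question vector, before feeding the pooled feature to an answer decoder.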
Experimental Results
The empirical results show that L-GCNs achieve superior performance across the datasets, notably surpassing prior methods on the TGIF-QA tasks of action, transition, FrameQA, and count. This corroborates the value of leveraging location information and object-interaction modeling for video QA. Notably, L-GCNs outperform multiple baselines even without additional dynamic features such as C3D or optical flow, substantiating the efficacy of location-infused graphs for action inference and question reasoning. The method is particularly successful on open-ended questions in Youtube2Text-QA, where it significantly improves accuracy on 'who' questions thanks to effective object localization.
Implications and Future Directions
The proposed methodology marks a significant advance in understanding video content for question answering. By directly modeling object interactions and locating objects both temporally and spatially, it opens pathways toward richer, contextually aware models for AI-driven video analysis. Future work may integrate more complex temporal reasoning or extend the method to other domains requiring intricate spatio-temporal understanding, such as autonomous systems and interactive media generation.
Overall, the paper takes a valuable step forward in leveraging graph-based methods within deep learning frameworks to enhance machine comprehension of video data, with potential impact on a wide array of applications including automated storytelling, human-robot interaction, and intelligent surveillance systems.