Relational Long Short-Term Memory for Video Action Recognition (1811.07059v2)

Published 16 Nov 2018 in cs.CV

Abstract: Spatial and temporal relationships between objects in videos, both short-range and long-range, are key cues for recognizing actions, and modeling them jointly is a challenging problem. In this paper, we first present a new variant of Long Short-Term Memory, namely Relational LSTM, to address the challenge of relation reasoning across space and time between objects. In our Relational LSTM module, we utilize a non-local operation, similar in spirit to the recently proposed non-local network, to replace the fully connected operation in the vanilla LSTM. By doing this, our Relational LSTM can capture long- and short-range spatio-temporal relations between objects in videos in a principled way. We then propose a two-branch neural architecture consisting of the Relational LSTM module as the non-local branch and a spatio-temporal-pooling-based local branch. The local branch captures local spatial appearance and/or short-term motion features. The two branches are concatenated to learn video-level features from snippet-level ones, which are then used for classification. Experimental results on the UCF-101 and HMDB-51 datasets show that our model achieves state-of-the-art results among LSTM-based methods, while obtaining performance comparable to other state-of-the-art methods (which use schemes that are not directly comparable). Further, on the more complex large-scale Charades dataset, we obtain a large 3.2% gain over state-of-the-art methods, verifying the effectiveness of our method in complex video understanding.
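
The core mechanism can be illustrated with a short sketch. The snippet below is a minimal, illustrative PyTorch implementation of a Relational-LSTM-style cell: an embedded-Gaussian non-local block (in the spirit of Wang et al.'s non-local network) replaces the fully connected gate computation, operating on the concatenation of the input and a ConvLSTM-style spatial hidden state. The channel sizes, 1x1-convolution gates, and residual connection are assumptions for readability, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock(nn.Module):
    """Embedded-Gaussian non-local operation over spatial positions.
    Input/output shape: (B, C, H, W)."""
    def __init__(self, channels, reduced=None):
        super().__init__()
        reduced = reduced or channels // 2
        self.theta = nn.Conv2d(channels, reduced, 1)
        self.phi = nn.Conv2d(channels, reduced, 1)
        self.g = nn.Conv2d(channels, reduced, 1)
        self.out = nn.Conv2d(reduced, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)  # (B, HW, C')
        k = self.phi(x).flatten(2)                    # (B, C', HW)
        v = self.g(x).flatten(2).transpose(1, 2)      # (B, HW, C')
        attn = F.softmax(q @ k, dim=-1)               # pairwise relations between positions
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                        # residual connection

class RelationalLSTMCell(nn.Module):
    """LSTM cell whose gate pre-activations come from a non-local
    operation on [x_t, h_{t-1}] instead of a fully connected layer.
    Shapes and layer choices are illustrative assumptions."""
    def __init__(self, in_ch, hid_ch):
        super().__init__()
        self.nonlocal_op = NonLocalBlock(in_ch + hid_ch)
        # 1x1 conv maps relational features to the four LSTM gates
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, 1)
        self.hid_ch = hid_ch

    def forward(self, x, state):
        h, c = state
        z = self.nonlocal_op(torch.cat([x, h], dim=1))
        i, f, o, g = torch.chunk(self.gates(z), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)
```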

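The two-branch design described in the abstract can be sketched in the same spirit. Below, the Relational LSTM branch is unrolled over snippet-level backbone features and its final hidden state is concatenated with a simple spatio-temporal average pool standing in for the local branch; the feature dimensions, the pooling choice, and the single-layer classifier are all illustrative assumptions rather than the paper's exact design.

```python
class TwoBranchVideoModel(nn.Module):
    """Two-branch head sketch: a Relational LSTM (non-local) branch for
    long-range relations plus a spatio-temporal pooling (local) branch;
    the branch outputs are concatenated for classification."""
    def __init__(self, feat_ch=1024, hid_ch=512, num_classes=101):
        super().__init__()
        self.rlstm = RelationalLSTMCell(feat_ch, hid_ch)
        self.fc = nn.Linear(hid_ch + feat_ch, num_classes)

    def forward(self, snippets):               # (B, T, C, H, W) backbone features
        b, t, c, hgt, wid = snippets.shape
        h = snippets.new_zeros(b, self.rlstm.hid_ch, hgt, wid)
        state = (h, h.clone())
        for step in range(t):                  # non-local branch over time
            h, state = self.rlstm(snippets[:, step], state)
        nonlocal_feat = h.mean(dim=(2, 3))           # (B, hid_ch)
        local_feat = snippets.mean(dim=(1, 3, 4))    # (B, C) spatio-temporal pool
        return self.fc(torch.cat([nonlocal_feat, local_feat], dim=1))
```

With these hypothetical dimensions, a forward pass on dummy features such as `TwoBranchVideoModel()(torch.randn(2, 8, 1024, 7, 7))` yields `(2, 101)` class logits, matching the 101 action classes of UCF-101.
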
Authors (4)
  1. Zexi Chen (26 papers)
  2. Bharathkumar Ramachandra (8 papers)
  3. Tianfu Wu (63 papers)
  4. Ranga Raju Vatsavai (11 papers)
Citations (5)
