
An End-to-End Visual-Audio Attention Network for Emotion Recognition in User-Generated Videos (2003.00832v1)

Published 12 Feb 2020 in cs.CV, cs.HC, and cs.MM

Abstract: Emotion recognition in user-generated videos plays an important role in human-centered computing. Existing methods mainly employ a traditional two-stage shallow pipeline, i.e., extracting visual and/or audio features and training classifiers. In this paper, we propose to recognize video emotions in an end-to-end manner based on convolutional neural networks (CNNs). Specifically, we develop a deep Visual-Audio Attention Network (VAANet), a novel architecture that integrates spatial, channel-wise, and temporal attentions into a visual 3D CNN and temporal attention into an audio 2D CNN. Further, we design a special classification loss, i.e., polarity-consistent cross-entropy loss, based on the polarity-emotion hierarchy constraint to guide the attention generation. Extensive experiments conducted on the challenging VideoEmotion-8 and Ekman-6 datasets demonstrate that the proposed VAANet outperforms the state-of-the-art approaches for video emotion recognition. Our source code is released at: https://github.com/maysonma/VAANet.
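
The polarity-consistent cross-entropy loss is described only at a high level in the abstract. Below is a minimal PyTorch sketch of one plausible formulation, assuming the loss augments standard cross-entropy with a penalty on predictions whose polarity (positive vs. negative) disagrees with the ground-truth emotion's polarity. The class name `PolarityConsistentCrossEntropy`, the `polarity_of_class` mapping, and the `penalty_weight` hyperparameter are illustrative assumptions, not the authors' exact implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolarityConsistentCrossEntropy(nn.Module):
    """Sketch of a polarity-consistent cross-entropy loss.

    Adds a penalty to the standard cross-entropy whenever the
    arg-max predicted emotion falls in a different polarity group
    (positive vs. negative) than the ground-truth emotion,
    reflecting the polarity-emotion hierarchy constraint.
    """

    def __init__(self, polarity_of_class: torch.Tensor, penalty_weight: float = 0.5):
        super().__init__()
        # polarity_of_class: LongTensor of shape (num_classes,),
        # mapping each emotion class to 0 (negative) or 1 (positive).
        self.register_buffer("polarity_of_class", polarity_of_class)
        self.penalty_weight = penalty_weight

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # Standard cross-entropy over the emotion classes.
        ce = F.cross_entropy(logits, targets)

        # Polarity of the arg-max prediction vs. the true label.
        pred_polarity = self.polarity_of_class[logits.argmax(dim=1)]
        true_polarity = self.polarity_of_class[targets]

        # Indicator: 1 where the predicted polarity disagrees.
        mismatch = (pred_polarity != true_polarity).float()

        # Re-weight the per-sample negative log-likelihood for
        # polarity-violating predictions only.
        nll = F.cross_entropy(logits, targets, reduction="none")
        penalty = (mismatch * nll).mean()

        return ce + self.penalty_weight * penalty
```

A usage sketch for an 8-class setting such as VideoEmotion-8; the grouping of classes into polarities here is illustrative only:

```python
# Hypothetical polarity map for 8 emotion classes (0 = negative, 1 = positive).
polarity = torch.tensor([0, 1, 0, 0, 1, 0, 1, 1])
criterion = PolarityConsistentCrossEntropy(polarity, penalty_weight=0.5)
loss = criterion(model_logits, labels)  # model_logits: (B, 8), labels: (B,)
```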

Authors (9)
  1. Sicheng Zhao (53 papers)
  2. Yunsheng Ma (26 papers)
  3. Yang Gu (18 papers)
  4. Jufeng Yang (21 papers)
  5. Tengfei Xing (9 papers)
  6. Pengfei Xu (57 papers)
  7. Runbo Hu (8 papers)
  8. Hua Chai (13 papers)
  9. Kurt Keutzer (200 papers)
Citations (88)
