Temporal Convolution Networks with Positional Encoding for Evoked Expression Estimation (2106.08596v1)

Published 16 Jun 2021 in cs.CV and cs.HC

Abstract: This paper presents an approach to the Evoked Expressions from Videos (EEV) challenge, which aims to predict evoked facial expressions from video. We take advantage of models pre-trained on large-scale computer-vision and audio datasets to extract deep representations of the timestamps in the video. A temporal convolution network, rather than an RNN-like architecture, is used to model temporal relationships, owing to its advantages in memory consumption and parallelism. Furthermore, to address missing annotations at some timestamps, positional encoding is employed to preserve the continuity of the input data when those timestamps are discarded during training. We achieved state-of-the-art results on the EEV challenge with a Pearson correlation coefficient of 0.05477, the first-ranked performance in the EEV 2021 challenge.
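The two core ideas in the abstract can be illustrated together: sinusoidal positional encoding computed from the *original* timestamp indices (so that dropping unannotated timestamps still leaves each surviving frame stamped with its true position in time), followed by a dilated causal 1D convolution of the kind used in temporal convolution networks. The sketch below is a minimal NumPy illustration under assumed shapes, not the paper's implementation; the function names and the toy convolution are hypothetical.

```python
import numpy as np

def positional_encoding(positions, d_model):
    """Sinusoidal positional encoding (Transformer-style) evaluated at
    absolute timestamp indices. Encoding the original indices, rather than
    the positions after filtering, keeps the temporal continuity of the
    sequence when unannotated timestamps are discarded.
    positions: iterable of ints; d_model: even embedding size."""
    positions = np.asarray(positions, dtype=np.float64)[:, None]   # (T, 1)
    i = np.arange(d_model // 2, dtype=np.float64)[None, :]         # (1, d/2)
    angles = positions / np.power(10000.0, 2.0 * i / d_model)      # (T, d/2)
    pe = np.empty((positions.shape[0], d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def causal_conv1d(x, weight, dilation=1):
    """Dilated causal 1D convolution: the output at step t depends only on
    inputs at steps <= t (the TCN building block).
    x: (T, C_in), weight: (K, C_in, C_out); left-pads with zeros."""
    T, c_in = x.shape
    K, _, c_out = weight.shape
    pad = (K - 1) * dilation
    xp = np.concatenate([np.zeros((pad, c_in)), x], axis=0)
    out = np.zeros((T, c_out))
    for t in range(T):
        for k in range(K):
            # weight[K-1] multiplies the current step, earlier taps reach back
            out[t] += xp[t + pad - k * dilation] @ weight[K - 1 - k]
    return out

# Toy usage: only timestamps 0, 2, 3, 7 are annotated; the rest are dropped,
# but the encoding still reflects the gaps between surviving frames.
kept = [0, 2, 3, 7]
features = np.random.default_rng(0).standard_normal((len(kept), 8))
inputs = features + positional_encoding(kept, 8)
weights = np.random.default_rng(1).standard_normal((3, 8, 4))
outputs = causal_conv1d(inputs, weights, dilation=2)  # (4, 4)
```

Stacking such causal convolutions with growing dilation gives the exponentially increasing receptive field that lets a TCN match RNN-style context while remaining parallel over time, which is the memory/parallelism advantage the abstract cites.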
