Temporal Convolution Networks with Positional Encoding for Evoked Expression Estimation (2106.08596v1)
Abstract: This paper presents an approach for the Evoked Expressions from Videos (EEV) challenge, which aims to predict evoked facial expressions from video. We take advantage of models pre-trained on large-scale computer vision and audio datasets to extract a deep representation of each timestamp in the video. A temporal convolution network, rather than an RNN-like architecture, is used to explore temporal relationships because of its advantages in memory consumption and parallelism. Furthermore, to address the missing annotations of some timestamps, positional encoding is employed to preserve the continuity of the input data when these timestamps are discarded during training. We achieved state-of-the-art results on the EEV challenge with a Pearson correlation coefficient of 0.05477, the first-ranked performance in the EEV 2021 challenge.
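The role of positional encoding when unlabeled timestamps are discarded can be illustrated with a minimal sketch. It assumes a Transformer-style sinusoidal encoding that is concatenated to each feature vector; the abstract does not state the exact encoding or how it is combined with the features, so `positional_encoding`, `encode_labeled_timestamps`, and the `dim` parameter are hypothetical names chosen for illustration, not the authors' implementation.

```python
import math


def positional_encoding(position, dim):
    """Sinusoidal encoding of one integer position (assumed form)."""
    enc = []
    for i in range(0, dim, 2):
        angle = position / (10000 ** (i / dim))
        enc.append(math.sin(angle))
        if i + 1 < dim:
            enc.append(math.cos(angle))
    return enc


def encode_labeled_timestamps(features, labels, dim=4):
    """Drop timestamps whose label is None, keeping the ORIGINAL index.

    Each kept feature vector is extended with the encoding of its
    original position, so gaps left by dropped timestamps remain
    visible to the temporal model (concatenation is an assumption).
    """
    kept = []
    for t, (feat, label) in enumerate(zip(features, labels)):
        if label is None:
            continue  # missing annotation: discard this timestamp
        kept.append((t, list(feat) + positional_encoding(t, dim), label))
    return kept


# Toy input: four timestamps, two of them without annotations.
features = [[0.1], [0.2], [0.3], [0.4]]
labels = [0.5, None, None, 0.7]
kept = encode_labeled_timestamps(features, labels)
```

In this toy input, timestamps 1 and 2 are removed, but the two remaining samples still carry the encodings of positions 0 and 3, so a downstream temporal model can tell that they are not adjacent in the original video.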