NAC-TCN: Temporal Convolutional Networks with Causal Dilated Neighborhood Attention for Emotion Understanding (2312.07507v2)
Abstract: In the task of emotion recognition from videos, a key improvement has been to model emotions over time rather than from a single frame. Many architectures address this task, such as GRUs, LSTMs, Self-Attention, Transformers, and Temporal Convolutional Networks (TCNs); however, these methods suffer from high memory usage, large numbers of operations, or poor gradients. We propose Neighborhood Attention with Convolutions TCN (NAC-TCN), a method that combines the benefits of attention and Temporal Convolutional Networks while preserving causal relationships, resulting in lower computation and memory costs. We accomplish this by introducing a causal version of Dilated Neighborhood Attention and integrating it with convolutions. Our model achieves comparable, better, or state-of-the-art performance relative to TCNs, TCAN, LSTMs, and GRUs while requiring fewer parameters on standard emotion recognition datasets. We publish our code online for easy reproducibility and use in other projects.
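To illustrate the core idea of a causal Dilated Neighborhood Attention layer, the sketch below restricts each query position to a fixed-size neighborhood of past positions only, spaced by a dilation factor. This is a minimal single-head NumPy sketch under assumed conventions (identity projections instead of learned `W_q`, `W_k`, `W_v`; neighborhood clipped at the sequence start), not the paper's actual implementation:

```python
import numpy as np

def causal_dilated_neighborhood_attention(x, k=3, d=2):
    """Single-head sketch: position t attends only to the k past
    positions {t, t-d, ..., t-(k-1)d} (clipped at 0), so no future
    information leaks into the output. x has shape (T, C)."""
    T, C = x.shape
    # Toy identity projections; a real layer would learn W_q, W_k, W_v.
    q, key, v = x, x, x
    out = np.zeros_like(x)
    for t in range(T):
        # Causal dilated neighborhood: only indices <= t are kept.
        idx = [t - j * d for j in range(k) if t - j * d >= 0]
        scores = q[t] @ key[idx].T / np.sqrt(C)
        w = np.exp(scores - scores.max())
        w /= w.sum()
        out[t] = w @ v[idx]
    return out
```

Because the neighborhood never includes indices greater than `t`, perturbing a future frame leaves all earlier outputs unchanged, which is the causality property the abstract refers to.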