AUD-TGN: Advancing Action Unit Detection with Temporal Convolution and GPT-2 in Wild Audiovisual Contexts (2403.13678v1)
Abstract: Leveraging the synergy of audio and visual data is essential for understanding human emotions and behaviors, especially in in-the-wild settings. Traditional methods for integrating such multimodal information often fall short, yielding suboptimal results on the task of facial action unit (AU) detection. To overcome these shortcomings, we propose a novel approach built on audio-visual multimodal data. The method enhances audio feature extraction by combining Mel Frequency Cepstral Coefficients (MFCC) and Log-Mel spectrogram features with a pre-trained VGGish network. It then adaptively captures cross-modal fusion features by modeling temporal relationships with temporal convolutions, and utilizes a pre-trained GPT-2 model for context-aware fusion of the multimodal information. By modeling the temporal and contextual nuances of the data, our method notably improves AU detection accuracy, demonstrating clear gains on intricate in-the-wild scenarios. These findings underscore the potential of integrating temporal dynamics with contextual interpretation, paving the way for future research.
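The pipeline the abstract describes can be sketched concretely. The following is a minimal illustration, not the authors' released implementation: the `librosa` and `transformers` calls are standard library APIs, but the module names (`TemporalBlock`, `AUDetector`), the feature dimensions, the 12-AU output (as in the ABAW AU task), the concatenation-based fusion, and the choice to freeze GPT-2 are all assumptions made for this sketch; the VGGish embedding stage is omitted for brevity.

```python
# Illustrative sketch of the pipeline described in the abstract.
# NOT the authors' implementation: fusion-by-concatenation, dimensions,
# module names, and the frozen GPT-2 are assumptions for illustration.
import librosa
import numpy as np
import torch
import torch.nn as nn
from transformers import GPT2Model

def audio_features(wav_path, sr=16000, n_mfcc=40, n_mels=64):
    """Extract the MFCC + Log-Mel features named in the abstract."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)          # (n_mfcc, T)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)                              # (n_mels, T)
    return np.concatenate([mfcc, log_mel], axis=0).T                # (T, n_mfcc + n_mels)

class TemporalBlock(nn.Module):
    """A dilated 1-D convolution with a residual connection,
    the usual building block of a temporal convolutional network."""
    def __init__(self, dim, kernel=3, dilation=1):
        super().__init__()
        pad = (kernel - 1) * dilation // 2  # keep sequence length fixed
        self.conv = nn.Conv1d(dim, dim, kernel, padding=pad, dilation=dilation)
        self.act = nn.ReLU()
    def forward(self, x):                    # x: (B, T, dim)
        h = self.conv(x.transpose(1, 2))     # convolve over the time axis
        return self.act(h).transpose(1, 2) + x

class AUDetector(nn.Module):
    """Concatenate per-frame audio/visual features, model time with a
    small TCN, then pass the sequence through a pre-trained GPT-2."""
    def __init__(self, audio_dim, visual_dim, n_aus=12, gpt_dim=768):
        super().__init__()
        self.proj = nn.Linear(audio_dim + visual_dim, gpt_dim)
        self.tcn = nn.Sequential(TemporalBlock(gpt_dim, dilation=1),
                                 TemporalBlock(gpt_dim, dilation=2))
        self.gpt2 = GPT2Model.from_pretrained("gpt2")
        for p in self.gpt2.parameters():     # freezing is an assumption here
            p.requires_grad = False
        self.head = nn.Linear(gpt_dim, n_aus)  # per-frame AU logits
    def forward(self, audio_feats, visual_feats):  # each: (B, T, *)
        x = self.proj(torch.cat([audio_feats, visual_feats], dim=-1))
        x = self.tcn(x)                              # temporal modeling
        h = self.gpt2(inputs_embeds=x).last_hidden_state  # contextual fusion
        return self.head(h)                          # (B, T, n_aus)
```

At inference, a sigmoid over the per-frame logits would give multi-label AU probabilities; the actual fusion mechanism, losses, and training details follow the paper itself.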
Authors: Jun Yu, Zerui Zhang, Zhihong Wei, Gongpeng Zhao, Zhongpeng Cai, Yongqi Wang, Guochen Xie, Jichao Zhu, Wangyuan Zhu