SVFAP: Self-supervised Video Facial Affect Perceiver (2401.00416v2)

Published 31 Dec 2023 in cs.CV, cs.HC, and cs.MM

Abstract: Video-based facial affect analysis has recently attracted increasing attention owing to its critical role in human-computer interaction. Previous studies mainly focus on developing various deep learning architectures and training them in a fully supervised manner. Although significant progress has been achieved by these supervised methods, the longstanding lack of large-scale, high-quality labeled data severely hinders their further improvements. Motivated by the recent success of self-supervised learning in computer vision, this paper introduces a self-supervised approach, termed Self-supervised Video Facial Affect Perceiver (SVFAP), to address the dilemma faced by supervised methods. Specifically, SVFAP leverages masked facial video autoencoding to perform self-supervised pre-training on massive unlabeled facial videos. Considering that large spatiotemporal redundancy exists in facial videos, we propose a novel temporal pyramid and spatial bottleneck Transformer as the encoder of SVFAP, which not only substantially reduces computational costs but also achieves excellent performance. To verify the effectiveness of our method, we conduct experiments on nine datasets spanning three downstream tasks, including dynamic facial expression recognition, dimensional emotion recognition, and personality recognition. Comprehensive results demonstrate that SVFAP can learn powerful affect-related representations via large-scale self-supervised pre-training and that it significantly outperforms previous state-of-the-art methods on all datasets. Code is available at https://github.com/sunlicai/SVFAP.
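
The pre-training objective described in the abstract follows the masked-autoencoding recipe: mask most spatiotemporal patches of a face clip, encode only the visible ones, and train a lightweight decoder to reconstruct the masked content. Below is a minimal, hypothetical PyTorch sketch of that objective. The dimensions, the 90% mask ratio, and the plain Transformer blocks are illustrative assumptions, not the paper's design: SVFAP's actual encoder is the temporal pyramid and spatial bottleneck Transformer described above, so refer to the linked repository for the authors' implementation.

```python
# Illustrative sketch of masked video autoencoding pre-training (VideoMAE-style).
# All module shapes and hyperparameters here are assumptions, not SVFAP's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedVideoAutoencoder(nn.Module):
    def __init__(self, num_tokens=1568, patch_dim=1536, enc_dim=384,
                 dec_dim=192, mask_ratio=0.9):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.patch_embed = nn.Linear(patch_dim, enc_dim)   # tube patches -> tokens
        self.enc_pos = nn.Parameter(torch.zeros(1, num_tokens, enc_dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(enc_dim, nhead=6, batch_first=True),
            num_layers=4)
        self.enc_to_dec = nn.Linear(enc_dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.dec_pos = nn.Parameter(torch.zeros(1, num_tokens, dec_dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dec_dim, nhead=3, batch_first=True),
            num_layers=2)
        self.head = nn.Linear(dec_dim, patch_dim)          # predict raw pixels

    def forward(self, patches):
        # patches: (B, N, patch_dim), flattened spatiotemporal tubes of a clip
        B, N, _ = patches.shape
        n_vis = int(N * (1 - self.mask_ratio))

        # Random per-sample masking: keep the first n_vis indices of a shuffle.
        shuffle = torch.rand(B, N, device=patches.device).argsort(dim=1)
        vis_idx, mask_idx = shuffle[:, :n_vis], shuffle[:, n_vis:]

        # Encoder runs on visible tokens only (the source of the compute savings).
        tokens = self.patch_embed(patches) + self.enc_pos
        vis = torch.gather(tokens, 1,
                           vis_idx[..., None].expand(-1, -1, tokens.size(-1)))
        vis = self.encoder(vis)

        # Decoder sees visible tokens plus shared mask tokens, in original order.
        full = torch.cat([self.enc_to_dec(vis),
                          self.mask_token.expand(B, N - n_vis, -1)], dim=1)
        unshuffle = shuffle.argsort(dim=1)
        full = torch.gather(full, 1,
                            unshuffle[..., None].expand(-1, -1, full.size(-1)))
        pred = self.head(self.decoder(full + self.dec_pos))

        # Reconstruction loss is computed on the masked positions only.
        idx = mask_idx[..., None]
        target = torch.gather(patches, 1, idx.expand(-1, -1, patches.size(-1)))
        output = torch.gather(pred, 1, idx.expand(-1, -1, pred.size(-1)))
        return F.mse_loss(output, target)

# Toy pre-training step: a 16-frame 224x224 clip cut into 2x16x16 tubes yields
# 1568 patches of 1536 values each (these numbers are assumptions).
model = MaskedVideoAutoencoder()
loss = model(torch.randn(2, 1568, 1536))
loss.backward()
```

The high mask ratio is what makes this pre-training cheap: with 90% of tokens hidden, the encoder processes only a tenth of the sequence, while the small decoder and pixel-reconstruction head are discarded after pre-training, leaving the encoder as the affect-representation backbone for the downstream tasks.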
