Deep functional multiple index models with an application to SER (2403.17562v1)
Published 26 Mar 2024 in cs.SD, eess.AS, and stat.AP
Abstract: Speech Emotion Recognition (SER) plays a crucial role in advancing human-computer interaction and speech processing capabilities. We introduce a novel deep-learning architecture designed specifically for the functional data model known as the multiple-index functional model. Our key innovation lies in integrating adaptive basis layers and an automated data transformation search within the deep learning framework. Simulation studies of this new model show good performance. This allows us to extract features tailored to chunk-level SER, based on Mel-Frequency Cepstral Coefficients (MFCCs). We demonstrate the effectiveness of our approach on the benchmark IEMOCAP database, achieving performance competitive with existing methods.
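In the functional data analysis literature, a multiple-index functional model typically takes the form Y = g(⟨X, β₁⟩, …, ⟨X, βₚ⟩) + ε, where each index ⟨X, βₖ⟩ = ∫ X(t) βₖ(t) dt projects the functional covariate X onto a learned direction βₖ. The sketch below is a minimal illustration of this idea, not the authors' implementation: it assumes PyTorch, a uniform sampling grid, and hypothetical names and layer sizes. The βₖ are parameterized by a small network (in the spirit of adaptive basis layers), the indices are computed by a Riemann-sum quadrature, and an MLP plays the role of the nonlinear link g.

```python
# Minimal sketch (assumed names and sizes, not the paper's code): a
# multiple-index functional model Y ≈ g(<X, beta_1>, ..., <X, beta_p>),
# with the index functions beta_k produced by a learned basis network.
import torch
import torch.nn as nn

class AdaptiveBasisLayer(nn.Module):
    """Learns p index functions beta_k(t), evaluated on a fixed grid."""
    def __init__(self, n_indices: int, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden), nn.Tanh(), nn.Linear(hidden, n_indices)
        )

    def forward(self, x: torch.Tensor, grid: torch.Tensor) -> torch.Tensor:
        # x: (batch, T) curve values on grid; grid: (T,) time points in [0, 1]
        basis = self.net(grid.unsqueeze(-1))   # (T, p): beta_k evaluated on grid
        dt = 1.0 / (grid.numel() - 1)          # uniform-grid quadrature weight
        return x @ basis * dt                  # (batch, p): <X, beta_k> by Riemann sum

class MultipleIndexFNN(nn.Module):
    def __init__(self, n_indices: int = 3, n_classes: int = 4):
        super().__init__()
        self.basis = AdaptiveBasisLayer(n_indices)
        # Nonlinear link g applied to the index scores (here: an MLP classifier).
        self.link = nn.Sequential(
            nn.Linear(n_indices, 32), nn.ReLU(), nn.Linear(32, n_classes)
        )

    def forward(self, x: torch.Tensor, grid: torch.Tensor) -> torch.Tensor:
        return self.link(self.basis(x, grid))

# Toy usage: 8 curves sampled at 128 points (e.g., one MFCC trajectory per
# chunk), classified into 4 emotion categories.
grid = torch.linspace(0.0, 1.0, 128)
x = torch.randn(8, 128)
logits = MultipleIndexFNN()(x, grid)           # shape (8, 4)
```

Because the basis network takes the time point t as input, the learned βₖ are genuine functions that can be evaluated on any grid, which is what makes this formulation suitable for functional covariates such as framewise MFCC trajectories.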