Prompting Audios Using Acoustic Properties For Emotion Representation (2310.02298v3)

Published 3 Oct 2023 in cs.SD, cs.AI, and eess.AS

Abstract: Emotions lie on a continuum, but current models treat emotion as a finite-valued discrete variable. This representation does not capture the diversity in the expression of emotion. To better represent emotions, we propose the use of natural language descriptions (or prompts). In this work, we address the challenge of automatically generating these prompts and training a model to better learn emotion representations from audio-prompt pairs. We use acoustic properties that are correlated with emotion, such as pitch, intensity, speech rate, and articulation rate, to automatically generate prompts, i.e., 'acoustic prompts'. We use a contrastive learning objective to map speech to its respective acoustic prompt. We evaluate our model on Emotion Audio Retrieval (EAR) and Speech Emotion Recognition (SER). Our results show that the acoustic prompts significantly improve the model's performance on EAR across various Precision@K metrics. In SER, we observe a 3.8% relative accuracy improvement on the RAVDESS dataset.
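The abstract describes two core components: automatically generating "acoustic prompts" from measurable speech properties, and a contrastive objective that aligns audio with those prompts. The sketch below illustrates the first step under stated assumptions: pitch is estimated with librosa's pYIN and intensity is approximated by frame-level RMS; the bin thresholds and the prompt template are illustrative placeholders, not the paper's actual values (the paper also uses speech rate and articulation rate, which typically require syllable-level segmentation and are omitted here).

```python
import numpy as np
import librosa

# Illustrative thresholds for binning measurements into "low/medium/high";
# the paper derives its own bins, so these numbers are placeholders.
PITCH_BINS_HZ = [(0.0, 120.0, "low"), (120.0, 200.0, "medium"), (200.0, float("inf"), "high")]
RMS_BINS = [(0.0, 0.02, "low"), (0.02, 0.06, "medium"), (0.06, float("inf"), "high")]


def bin_value(value, bins):
    """Map a scalar measurement to its descriptive label."""
    for lo, hi, label in bins:
        if lo <= value < hi:
            return label
    return "medium"


def acoustic_prompt(wav_path):
    """Build a natural-language 'acoustic prompt' from pitch and intensity."""
    y, sr = librosa.load(wav_path, sr=16000)

    # Fundamental frequency via probabilistic YIN; unvoiced frames come back as NaN.
    f0, _, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
    )
    mean_pitch = float(np.nanmean(f0)) if np.any(~np.isnan(f0)) else 0.0

    # Frame-level RMS energy as a simple proxy for intensity.
    mean_rms = float(np.mean(librosa.feature.rms(y=y)))

    pitch_label = bin_value(mean_pitch, PITCH_BINS_HZ)
    intensity_label = bin_value(mean_rms, RMS_BINS)

    # Template sentence in the spirit of the paper's prompts.
    return f"This speech has {pitch_label} pitch and {intensity_label} intensity."
```

For the second step, a CLAP-style symmetric contrastive loss is one plausible reading of "a contrastive learning objective to map speech to its respective acoustic prompt". The minimal sketch below assumes batched audio and text embeddings produced by separate encoders and a fixed temperature; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F


def audio_text_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (audio, acoustic-prompt) embeddings."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Cosine similarities between every audio clip and every prompt in the batch.
    logits = audio_emb @ text_emb.t() / temperature
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)

    # Matched pairs sit on the diagonal; penalize both retrieval directions.
    loss_audio_to_text = F.cross_entropy(logits, targets)
    loss_text_to_audio = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_audio_to_text + loss_text_to_audio)
```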
