
When LLMs Meets Acoustic Landmarks: An Efficient Approach to Integrate Speech into Large Language Models for Depression Detection (2402.13276v2)

Published 17 Feb 2024 in eess.AS, cs.AI, and cs.SD

Abstract: Depression is a critical concern in global mental health, prompting extensive research into AI-based detection methods. Among various AI technologies, LLMs stand out for their versatility in mental healthcare applications. However, their primary limitation is their exclusive dependence on textual input, which constrains their overall capabilities. Moreover, the use of LLMs to identify and analyze depressive states remains relatively untapped. In this paper, we present an efficient approach to multimodal depression detection that integrates acoustic speech information into the LLM framework via acoustic landmarks. Because acoustic landmarks are specific to the pronunciation of spoken words, they add critical dimensions to text transcripts and offer insights into the unique speech patterns of individuals, revealing their potential mental states. Evaluations of the proposed approach on the DAIC-WOZ dataset show state-of-the-art results compared with existing audio-text baselines. Beyond depression detection, this approach also offers a new perspective on enhancing the ability of LLMs to comprehend and process speech signals.
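To make the landmark-text integration concrete, below is a minimal sketch (not the authors' code) of one way landmark symbols could be interleaved with a time-aligned transcript before being handed to a text-only LLM. The `detect_landmarks` stub, the landmark symbols, the word timings, and the prompt wording are all illustrative assumptions; the paper's actual pipeline and landmark detector may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Landmark:
    symbol: str   # e.g. "g+" (glottal onset), "b-" (burst offset), "s+" (syllabic peak)
    time: float   # time in seconds

def detect_landmarks(wav_path: str) -> List[Landmark]:
    """Hypothetical landmark detector (stands in for a real tool such as a
    SpeechMark-style system); returns landmarks sorted by time."""
    raise NotImplementedError

def merge_transcript_and_landmarks(
    words: List[str],
    word_times: List[Tuple[float, float]],
    landmarks: List[Landmark],
) -> str:
    """Interleave each landmark symbol after the word whose time span
    contains it, yielding a single string a text-only LLM can consume."""
    merged: List[str] = []
    lm_iter = iter(sorted(landmarks, key=lambda lm: lm.time))
    lm = next(lm_iter, None)
    for word, (_, end) in zip(words, word_times):
        merged.append(word)
        while lm is not None and lm.time <= end:
            merged.append(f"<{lm.symbol}>")  # landmark rendered as a pseudo-token
            lm = next(lm_iter, None)
    return " ".join(merged)

# Illustrative example: the merged text is wrapped in a classification prompt
# for a fine-tuned LLM (prompt wording is an assumption, not from the paper).
words = ["I", "feel", "tired", "lately"]
word_times = [(0.0, 0.2), (0.2, 0.6), (0.6, 1.1), (1.1, 1.6)]
landmarks = [Landmark("g+", 0.05), Landmark("s+", 0.4), Landmark("b-", 0.9)]
text = merge_transcript_and_landmarks(words, word_times, landmarks)
prompt = (
    "Transcript annotated with acoustic landmarks:\n"
    f"{text}\n\n"
    "Does the speaker show signs of depression? Answer yes or no."
)
print(prompt)  # -> "I <g+> feel <s+> tired <b-> lately ..."
```

Rendering landmarks as textual pseudo-tokens is attractive because a frozen or parameter-efficiently fine-tuned LLM (e.g. via LoRA) can then consume the speech-derived information without any architectural change to the model.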

Authors (7)
  1. Xiangyu Zhang (328 papers)
  2. Hexin Liu (35 papers)
  3. Kaishuai Xu (16 papers)
  4. Qiquan Zhang (20 papers)
  5. Daijiao Liu (3 papers)
  6. Beena Ahmed (14 papers)
  7. Julien Epps (15 papers)
Citations (4)
