
HeAR -- Health Acoustic Representations (2403.02522v1)

Published 4 Mar 2024 in cs.LG and cs.AI

Abstract: Health acoustic sounds such as coughs and breaths are known to contain useful health signals with significant potential for monitoring health and disease, yet are underexplored in the medical machine learning community. The existing deep learning systems for health acoustics are often narrowly trained and evaluated on a single task, which is limited by data and may hinder generalization to other tasks. To mitigate these gaps, we develop HeAR, a scalable self-supervised learning-based deep learning system using masked autoencoders trained on a large dataset of 313 million two-second long audio clips. Through linear probes, we establish HeAR as a state-of-the-art health audio embedding model on a benchmark of 33 health acoustic tasks across 6 datasets. By introducing this work, we hope to enable and accelerate further health acoustics research.

References (56)
  1. FluSense: A contactless syndromic surveillance platform for influenza-like illness in hospital waiting areas. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 4(1):1–28, 2020.
  2. Cough sound detection and diagnosis using artificial intelligence techniques: challenges and opportunities. IEEE Access, 9:102327–102344, 2021.
  3. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460, 2020.
  4. Can machine learning be used to recognize and diagnose coughs? In 2020 International Conference on e-Health and Bioengineering (EHB), pages 1–4. IEEE, 2020.
  5. A cookbook of self-supervised learning. arXiv preprint arXiv:2304.12210, 2023.
  6. Multimodal llms for health grounded in individual-specific data. arXiv preprint arXiv:2307.09018, 2023.
  7. Coswara: A respiratory sounds and symptoms dataset for remote screening of SARS-CoV-2 infection. Scientific Data, 10(1):397, 2023.
  8. Connected speech in neurodegenerative language disorders: a review. Frontiers in Psychology, 8:269, 2017.
  9. Detection of tuberculosis by automatic cough sound analysis. Physiological Measurement, 39(4):045005, 2018.
  10. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.
  11. Self-supervised learning with random-projection quantizer for speech recognition. In International Conference on Machine Learning, pages 3915–3924. PMLR, 2022.
  12. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
  13. End-to-end convolutional neural network enables COVID-19 detection from breath and cough audio: a pilot study. BMJ Innovations, 7(2), 2021.
  14. Underspecification presents challenges for credibility in modern machine learning. The Journal of Machine Learning Research, 23(1):10237–10297, 2022.
  15. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics, pages 837–845, 1988.
  16. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  17. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  18. CLAP: Learning audio concepts from natural language supervision. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.
  19. FSD50K: An open dataset of human-labeled sound events. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:829–852, 2021.
  20. Jake Garrison. Spiro AI: Smartphone Based Pulmonary Function Testing. PhD thesis, 2018.
  21. A survey of quantization methods for efficient neural network inference. In Low-Power Computer Vision, pages 291–326. Chapman and Hall/CRC, 2022.
  22. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
  23. Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33:21271–21284, 2020.
  24. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022.
  25. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  26. Masked autoencoders that listen. arXiv preprint arXiv:2207.06405, 2022.
  27. Slow-fast auditory streams for audio recognition. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 855–859. IEEE, 2021.
  28. The impact of positional encoding on length generalization in transformers. Advances in Neural Information Processing Systems, 36, 2024.
  29. Arne Köhn. What’s in an embedding? Analyzing word embeddings through multilingual evaluation. EMNLP, 2015.
  30. COVID-19 artificial intelligence diagnosis using only cough recordings. IEEE Open Journal of Engineering in Medicine and Biology, 1:275–281, 2020.
  31. Validation of an automated cough detection algorithm for tracking recovery of pulmonary tuberculosis patients. 2012.
  32. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  33. The COUGHVID crowdsourcing dataset, a corpus for the study of large-scale cough analysis algorithms. Scientific Data, 8(1):156, 2021.
  34. Automatic cough classification for tuberculosis screening in a real-world environment. Physiological Measurement, 42(10):105014, 2021.
  35. FRILL: A non-semantic speech embedding for mobile devices. arXiv preprint arXiv:2011.04609, 2020.
  36. A cough-based algorithm for automatic diagnosis of pertussis. PLoS ONE, 11(9):e0162128, 2016.
  37. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  38. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
  39. Cough sound analysis and objective correlation with spirometry and clinical diagnosis. Informatics in Medicine Unlocked, 19:100319, 2020.
  40. Detecting COVID-19 from breathing and coughing sounds using deep neural networks. arXiv preprint arXiv:2012.14553, 2020.
  41. TBscreen: A passive cough classifier for tuberculosis screening with a controlled dataset. Science Advances, 10(1):eadi0282, 2024.
  42. TRILLsson: Distilled universal paralinguistic speech representations. arXiv preprint arXiv:2203.00236, 2022.
  43. Towards learning a universal non-semantic representation of speech. arXiv preprint arXiv:2002.12764, 2020.
  44. Universal paralinguistic speech representations using self-supervised conformers. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3169–3173. IEEE, 2022.
  45. Large language models encode clinical knowledge. Nature, pages 1–9, 2023.
  46. Conformer-based self-supervised learning for non-speech audio tasks. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8862–8866. IEEE, 2022.
  47. Cough detection algorithm for monitoring patient recovery from pulmonary tuberculosis. In 2011 Annual international conference of the IEEE engineering in medicine and biology society, pages 6017–6020. IEEE, 2011.
  48. Towards learning universal audio representations. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4593–4597. IEEE, 2022.
  49. Trainable frontend for robust and far-field keyword spotting. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5670–5674. IEEE, 2017.
  50. An intentional approach to managing bias in general purpose embedding models. The Lancet Digital Health, 6(2):e126–e130, 2024.
  51. Whosecough: In-the-wild cougher verification using multitask learning. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 896–900. IEEE, 2020.
  52. ELIXR: Towards a general purpose X-ray artificial intelligence system through alignment of large language models and radiology vision encoders. arXiv preprint arXiv:2308.01317, 2023.
  53. CoCa: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.
  54. BigSSL: Exploring the frontier of large-scale semi-supervised learning for automatic speech recognition. IEEE Journal of Selected Topics in Signal Processing, 16(6):1519–1532, 2022.
  55. Google USM: Scaling automatic speech recognition beyond 100 languages. arXiv preprint arXiv:2303.01037, 2023.
  56. Making cough count in tuberculosis care. Communications Medicine, 2(1):83, 2022.

Summary

  • The paper introduces a self-supervised masked autoencoder framework that learns robust representations of respiratory sounds for health monitoring.
  • It leverages 313 million two-second audio clips to generalize across a benchmark of 33 health acoustic tasks spanning 6 datasets, outperforming prior embedding models, particularly on cough and spirometry tasks.
  • The study demonstrates favorable data scaling and strong data efficiency, setting the stage for non-invasive diagnostic tools in respiratory care.

An Evaluation of HeAR: Health Acoustic Representations for Machine Learning Applications in Health Monitoring

The paper introduces HeAR, a self-supervised deep learning framework aimed at advancing health acoustics by analyzing non-semantic sounds such as coughs and breaths for health monitoring and disease detection. HeAR addresses a limitation of existing machine learning systems in this space, which are often narrowly trained on a single task and therefore generalize poorly to others. The approach trains masked autoencoders on a substantial dataset of acoustic health signals, specifically 313 million two-second audio clips, highlighting the potential of scale to enrich the domain of health acoustics.

Methodology and Objectives

The framework is composed of multiple components: a health acoustic event detector, an audio encoder based on masked autoencoders, and a task-specific evaluation module for various health acoustic tasks. The audio encoder is trained on a large, unlabeled dataset harvested from non-copyrighted content on YouTube. This scale of data is expected to foster generalization across the benchmark's health acoustic tasks, which span health acoustic event detection, cough-based disease/condition inference, and spirometry estimation.
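
HeAR's pretraining recipe follows the masked-autoencoder family: spectrogram patches are randomly hidden, a Transformer encoder sees only the visible patches, and a lightweight decoder reconstructs the hidden ones. The sketch below illustrates that objective in miniature; the patch size, mask ratio, and model dimensions are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch of masked-autoencoder pretraining on spectrogram patches.
# All sizes, names, and the mask rate are illustrative assumptions.
import torch
import torch.nn as nn

PATCH = 16          # square patch size on the log-mel spectrogram (assumed)
EMBED = 256         # encoder width (assumed)
MASK_RATIO = 0.75   # fraction of patches hidden from the encoder (assumed)

class TinyAudioMAE(nn.Module):
    def __init__(self, n_patches: int):
        super().__init__()
        self.patch_embed = nn.Linear(PATCH * PATCH, EMBED)
        self.pos = nn.Parameter(torch.zeros(1, n_patches, EMBED))
        enc_layer = nn.TransformerEncoderLayer(EMBED, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, EMBED))
        self.decoder = nn.Linear(EMBED, PATCH * PATCH)  # reconstruct patches

    def forward(self, patches: torch.Tensor):
        # patches: (B, N, PATCH*PATCH) flattened spectrogram patches
        B, N, _ = patches.shape
        x = self.patch_embed(patches) + self.pos

        # Randomly keep a subset of patches; the encoder never sees the rest.
        n_keep = int(N * (1 - MASK_RATIO))
        perm = torch.rand(B, N).argsort(dim=1)
        keep = perm[:, :n_keep]
        visible = torch.gather(x, 1, keep.unsqueeze(-1).expand(-1, -1, EMBED))
        latent = self.encoder(visible)

        # Scatter encoded patches back; fill masked slots with a mask token.
        full = self.mask_token.expand(B, N, EMBED).clone()
        full.scatter_(1, keep.unsqueeze(-1).expand(-1, -1, EMBED), latent)
        recon = self.decoder(full)

        # The reconstruction loss is computed only on the masked positions.
        masked = torch.ones(B, N, dtype=torch.bool)
        masked.scatter_(1, keep, False)
        return ((recon - patches) ** 2)[masked].mean()

# Example: a batch of 2-second clips rendered as 48 flattened patches each.
loss = TinyAudioMAE(n_patches=48)(torch.randn(4, 48, PATCH * PATCH))
loss.backward()
```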

The chosen self-supervised architecture, inspired by masked autoencoders (MAEs), aims to learn acoustic representations that are both robust and transferable across tasks. These representations are benchmarked, via linear probes, against established systems such as TRILL, FRILL, and BigSSL-CAP12, among others. A diverse benchmark of 33 tasks across six datasets, including FSD50K, FluSense, and proprietary datasets from CIDRZ in Zambia, is employed to assess the efficacy of HeAR.
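
The linear-probe protocol itself is simple: the pretrained encoder is frozen and only a linear classifier is fit on its embeddings for each downstream task. A minimal sketch follows, with a random stand-in `embed` function and synthetic labels in place of the real encoder and health datasets.

```python
# Hedged sketch of linear-probe evaluation on frozen embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def embed(clips: np.ndarray) -> np.ndarray:
    """Stand-in for a frozen encoder mapping 2-s clips to fixed vectors."""
    return rng.normal(size=(len(clips), 512))

# Toy binary task, e.g. "does this cough indicate TB?" (labels are synthetic).
X_train, y_train = embed(np.empty(200)), rng.integers(0, 2, 200)
X_test, y_test = embed(np.empty(50)), rng.integers(0, 2, 50)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
auc = roc_auc_score(y_test, probe.predict_proba(X_test)[:, 1])
print(f"linear-probe AUC: {auc:.3f}")  # chance-level on synthetic data
```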

Results and Key Findings

HeAR's performance is notably strong across the benchmark, achieving the best result on 17 of the 33 tasks evaluated and performing especially well on cough inference and spirometry estimation. It demonstrates robust classification of respiratory diseases such as tuberculosis from cough audio, and accurate spirometry estimation in COPD patient monitoring scenarios. Under cross-device evaluation, HeAR maintains consistently high performance, underscoring its potential utility in real-world applications where varied audio recording equipment is used.

The significance of training data size is also evident in this work: as the pretraining data pool grows, the performance and robustness of the audio encoder improve, demonstrating that scaling the data positively influences results. Moreover, HeAR attains high mean reciprocal rank scores when models are ranked across tasks, and it exhibits strong data efficiency, maintaining high accuracy even when downstream probes are trained on significantly reduced labeled data.
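
Mean reciprocal rank here aggregates per-task rankings of competing models: on each task the models are ranked by score, each model receives the reciprocal of its rank, and those reciprocals are averaged over tasks, so a model that is consistently near the top scores close to 1. A small illustration with made-up numbers:

```python
# Sketch of mean-reciprocal-rank (MRR) aggregation across tasks.
# Scores below are fabricated purely for illustration.
import numpy as np

# rows: tasks, columns: models (e.g. HeAR, TRILL, FRILL, BigSSL-CAP12)
scores = np.array([
    [0.91, 0.85, 0.83, 0.88],
    [0.78, 0.80, 0.74, 0.79],
    [0.88, 0.82, 0.81, 0.86],
])

# Rank 1 = best score on a task; reciprocal rank rewards being near the top.
ranks = (-scores).argsort(axis=1).argsort(axis=1) + 1
mrr = (1.0 / ranks).mean(axis=0)
print(dict(zip(["HeAR", "TRILL", "FRILL", "BigSSL-CAP12"], mrr.round(3))))
```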

Implications and Future Directions

This paper demonstrates the potential of large-scale self-supervised learning in health acoustics, especially in underserved domains such as respiratory health monitoring and diagnostics. Although the evaluation relies on linear probes, future work could investigate fine-tuning the entire model to further optimize performance. Additionally, questions of performance generalization, demographic bias, and clinically relevant operating thresholds require rigorous clinical validation before HeAR can be integrated into healthcare systems.

Looking ahead, the research paves the way for comprehensive studies in health acoustics, improving upon foundational technologies like HeAR. Innovations such as model distillation or quantization may further optimize these encoders for real-time processing on mobile platforms, which is crucial for their deployment in under-resourced settings. With continued advancements, such systems could potentially aid healthcare practitioners worldwide, particularly in regions where respiratory diseases pose significant public health challenges, providing an efficient, cost-effective, and non-invasive diagnostic tool.
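
As one concrete illustration of the compression route mentioned above, the snippet below applies post-training dynamic quantization to a toy encoder. This is a generic PyTorch technique sketched under stated assumptions, not an optimization the paper reports performing, and the encoder here is a stand-in, not HeAR itself.

```python
# Minimal sketch of post-training dynamic quantization for on-device use.
import torch
import torch.nn as nn

# Toy stand-in encoder (assumed shapes; not the HeAR architecture).
encoder = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 256))

# Convert Linear weights to int8; activations are quantized dynamically.
quantized = torch.ao.quantization.quantize_dynamic(
    encoder, {nn.Linear}, dtype=torch.qint8
)
emb = quantized(torch.randn(1, 512))  # same interface, smaller weights
print(emb.shape)
```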
