HeAR -- Health Acoustic Representations (2403.02522v1)
Abstract: Health acoustic sounds such as coughs and breaths are known to contain useful health signals with significant potential for monitoring health and disease, yet are underexplored in the medical machine learning community. The existing deep learning systems for health acoustics are often narrowly trained and evaluated on a single task, which is limited by data and may hinder generalization to other tasks. To mitigate these gaps, we develop HeAR, a scalable self-supervised learning-based deep learning system using masked autoencoders trained on a large dataset of 313 million two-second long audio clips. Through linear probes, we establish HeAR as a state-of-the-art health audio embedding model on a benchmark of 33 health acoustic tasks across 6 datasets. By introducing this work, we hope to enable and accelerate further health acoustics research.
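The abstract evaluates HeAR "through linear probes", meaning a linear classifier is trained on top of frozen embeddings while the encoder itself is left unchanged. The following is a minimal sketch of that general idea, not the authors' evaluation code. The embeddings here are synthetic stand-ins drawn from two Gaussian clusters, and the function names (`train_linear_probe`, `predict`), the embedding dimension, and the hyperparameters are all assumptions chosen for illustration.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(embeddings, labels, lr=0.1, epochs=200, l2=1e-3):
    """Fit a logistic-regression probe on frozen embeddings.

    Only the probe weights are learned; the embeddings are treated as fixed
    inputs, which is what makes this a linear probe rather than fine-tuning.
    """
    dim = len(embeddings[0])
    n = len(embeddings)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(embeddings, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the log loss with respect to the logit
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * (g / n + l2 * wi) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return the predicted binary label for one embedding."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


def make_synthetic_embeddings(rng, n, dim):
    """Hypothetical stand-in for encoder outputs: two Gaussian clusters."""
    xs, ys = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        center = 1.0 if y == 1 else -1.0
        xs.append([rng.gauss(center, 1.0) for _ in range(dim)])
        ys.append(y)
    return xs, ys


rng = random.Random(0)
train_x, train_y = make_synthetic_embeddings(rng, 200, 8)
test_x, test_y = make_synthetic_embeddings(rng, 100, 8)

weights, bias = train_linear_probe(train_x, train_y)
correct = sum(predict(weights, bias, x) == y for x, y in zip(test_x, test_y))
test_accuracy = correct / len(test_y)
print(f"probe test accuracy: {test_accuracy:.2f}")
```

In practice the synthetic vectors would be replaced by embeddings computed once from the audio clips of a downstream task. Because only a linear layer is trained, the probe's performance reflects how much task-relevant information the frozen representation already carries.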