- The paper introduces low-complexity OVR systems leveraging an FT-JNF architecture to enhance speech clarity in noisy environments.
- It utilizes phoneme-dependent data augmentation to reduce the need for extensive device-specific training data.
- Experimental results show substantial improvements over baseline models, with the largest variant achieving PESQ 2.58, ESTOI 0.78, and LSD 1.08.
Low-Complexity Own Voice Reconstruction for Hearables with an In-Ear Microphone
This paper investigates the development of low-complexity Own Voice Reconstruction (OVR) systems for hearable devices equipped with in-ear microphones. The authors propose variants of a Frequency-Time Joint Non-linear Filter (FT-JNF) architecture designed to operate within the computational constraints typical of hearable devices. The research addresses the essential requirement of enhancing the quality and intelligibility of the user's own voice captured amidst environmental noise, while minimizing the amount of device-specific data required for training.
Context and Objectives
The central challenge addressed in the paper is the enhancement of speech signals captured by hearable devices in noisy environments. These devices typically use an outer microphone, which captures the own voice along with environmental noise, and an in-ear microphone, which benefits from the occlusion effect but suffers from low-frequency amplification and band-limitation issues. Prior work leveraged deep learning models for OVR, often at high computational complexity. The objective of this research is to develop low-complexity OVR systems that achieve significant speech quality improvements while using only a limited amount of device-specific training data.
Proposed Methodologies
The paper details the design and training of several low-complexity variants of the FT-JNF architecture. This architecture employs frequency-direction and time-direction LSTM layers to process the short-time Fourier transform (STFT) coefficients of the outer- and in-ear-microphone signals. Key features of the system include:
- Phoneme-Dependent Data Augmentation: To address the scarcity of device-specific recordings, the authors adapt a phoneme-specific relative transfer function (RTF) approach for generating simulated in-ear signals from clean speech datasets. This augmentation method allows for training with minimal device-specific data.
- Multiple Complexity Variants: Five variants of the FT-JNF system (XL, L, M, S, XS) are proposed, differing in the number of hidden units in their LSTM layers. These variants trade off size and computational complexity against performance to suit varying hardware capabilities.
- Low-Complexity Solutions: The emphasis is on creating models with low computational demands in terms of parameter count, multiply-accumulate operations per second (MACs/s), and real-time factor, making the models feasible for implementation on hearable devices.
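The phoneme-dependent augmentation described above can be sketched as a frame-wise application of per-phoneme relative transfer functions to a clean-speech STFT. The sketch below is a minimal illustration, not the paper's implementation: the RTF table and the frame-level phoneme alignment are hypothetical placeholders standing in for the measured device-specific RTFs and a forced-alignment step.

```python
import numpy as np

def simulate_in_ear_stft(clean_stft, frame_phonemes, rtf_table):
    """Simulate an in-ear microphone STFT from a clean-speech (outer-mic) STFT.

    clean_stft     : complex array (num_freqs, num_frames)
    frame_phonemes : sequence of phoneme labels, one per frame
    rtf_table      : dict mapping phoneme -> complex RTF vector (num_freqs,)
    """
    in_ear = np.empty_like(clean_stft)
    for t, ph in enumerate(frame_phonemes):
        # Each frame is filtered with the relative transfer function
        # associated with its phoneme class.
        in_ear[:, t] = rtf_table[ph] * clean_stft[:, t]
    return in_ear

# Toy example: 4 frequency bins, 3 frames, two hypothetical phoneme classes
rng = np.random.default_rng(0)
clean = rng.standard_normal((4, 3)) + 1j * rng.standard_normal((4, 3))
rtfs = {"a": np.full(4, 0.5 + 0j), "s": np.full(4, 0.1 + 0j)}
sim = simulate_in_ear_stft(clean, ["a", "a", "s"], rtfs)
```

In the paper's pipeline, pairs of such simulated in-ear signals and the original clean signals would serve as training material, reducing the need for device-specific recordings.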
Experimental Setup
Evaluation of the proposed systems was carried out using a dataset recorded with the Hearpiece prototype hearable device. The dataset comprised recordings from 18 talkers, divided into training, validation, and test sets. Augmented training utilized a corpus of phoneme-labeled clean speech signals from the CommonVoice dataset. Performance metrics included wideband PESQ, ESTOI, and log-spectral distance (LSD), evaluated across varying SNR levels.
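Of the three metrics, the log-spectral distance is straightforward to compute directly from spectrograms. The sketch below uses one common definition (per-frame RMS difference of dB-scale magnitude spectra, averaged over frames); the paper's exact implementation may differ in details such as frame selection or flooring.

```python
import numpy as np

def log_spectral_distance(ref_stft, est_stft, eps=1e-8):
    """Log-spectral distance (dB) between two spectrograms.

    ref_stft, est_stft : complex or magnitude arrays (num_freqs, num_frames)
    Returns the per-frame RMS log-magnitude difference, averaged over frames.
    """
    ref_db = 20.0 * np.log10(np.abs(ref_stft) + eps)
    est_db = 20.0 * np.log10(np.abs(est_stft) + eps)
    # RMS over frequency for each frame, then mean over frames
    per_frame = np.sqrt(np.mean((ref_db - est_db) ** 2, axis=0))
    return float(np.mean(per_frame))

# Identical spectrograms yield zero distance
spec = np.abs(np.random.default_rng(1).standard_normal((257, 100))) + 1.0
print(log_spectral_distance(spec, spec))  # → 0.0
```

Lower LSD is better, which is why the reported 1.08 (against an unprocessed or baseline reference) indicates a closer spectral match to the clean target.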
Results
The proposed FT-JNF variants demonstrated substantial improvements over unprocessed signals and baseline systems. Key findings include:
- Performance: The FT-JNF XL variant achieved the highest scores, with a PESQ of 2.58, ESTOI of 0.78, and an LSD of 1.08.
- Complexity: The lower-complexity variants (M, S, XS) also performed favorably, offering a better trade-off between performance and computational load than baseline models such as GCBFSNet.
- Training Data: The experiments showed that the proposed systems can achieve comparable performance improvements with only a small amount of device-specific recordings, indicating practical usability when training data is limited.
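The parameter/MACs trade-off behind the five variants can be illustrated with the standard cost formulas for a single LSTM layer. The hidden sizes, input size, and frame rate below are hypothetical placeholders for illustration only, not the paper's actual configurations, and the MACs estimate counts only the gate matrix products.

```python
def lstm_params(input_size, hidden_size):
    """Parameter count of one LSTM layer: 4 gates, each with
    input-to-hidden weights, hidden-to-hidden weights, and a bias."""
    return 4 * (hidden_size * (input_size + hidden_size) + hidden_size)

def lstm_macs_per_step(input_size, hidden_size):
    """Approximate multiply-accumulates per time step (matrix products only,
    ignoring element-wise gate activations)."""
    return 4 * hidden_size * (input_size + hidden_size)

# Hypothetical hidden sizes for the five variants (illustrative only)
variants = {"XL": 128, "L": 96, "M": 64, "S": 32, "XS": 16}
input_size = 10          # e.g. stacked real/imag STFT features (assumed)
frames_per_second = 125  # e.g. 8 ms frame shift (assumed)

for name, h in variants.items():
    p = lstm_params(input_size, h)
    macs_s = lstm_macs_per_step(input_size, h) * frames_per_second
    print(f"{name}: {p} params, {macs_s / 1e6:.2f} MMACs/s")
```

Because both cost terms grow roughly quadratically in the hidden size, halving the hidden units shrinks parameters and MACs/s by nearly a factor of four, which is what makes the smaller variants attractive for hearable hardware.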
Implications and Future Directions
The findings from this research support the feasibility of low-complexity OVR systems for real-world deployment in hearable devices. The successful incorporation of phoneme-dependent data augmentation and efficient filter architectures can drive future advancements in personalized and robust speech enhancement technologies. Future work might explore further reducing computational requirements through model pruning, quantization, and optimization specific to diverse hearable hardware platforms. Additionally, extending the approach to handle broader classes of noise and real-time adaptability can further enhance practical applicability.
Overall, this paper presents a significant step towards practical, low-complexity solutions for improving speech communication in challenging acoustic environments using hearable devices.