- The paper introduces low-complexity OVR systems leveraging an FT-JNF architecture to enhance speech clarity in noisy environments.
- It utilizes phoneme-dependent data augmentation to reduce the need for extensive device-specific training data.
- Experimental results show substantial improvements over baseline models, with the largest variant achieving PESQ 2.58, ESTOI 0.78, and LSD 1.08.
Low-Complexity Own Voice Reconstruction for Hearables with an In-Ear Microphone
This paper investigates the development of low-complexity Own Voice Reconstruction (OVR) systems for hearable devices equipped with in-ear microphones. The authors propose variants of a Frequency-Time Joint Non-linear Filter (FT-JNF) architecture designed to operate within the computational constraints typical of hearable devices. The research addresses the essential requirement of enhancing the quality and intelligibility of the user's own voice captured amidst environmental noise, while minimizing the amount of device-specific data required for training.
Context and Objectives
The central challenge addressed in the paper is the enhancement of speech signals captured by hearable devices in noisy environments. These devices typically use an outer microphone, which captures the own voice along with environmental noise, and an in-ear microphone, which benefits from the occlusion effect but suffers from low-frequency amplification and band-limitation issues. Prior work leveraged deep learning models for OVR, often at high computational complexity. The objective of this research is to develop low-complexity OVR systems that achieve significant speech quality improvements while using only a limited amount of device-specific training data.
Proposed Methodologies
The paper details the design and training of several low-complexity variants of the FT-JNF architecture. This architecture employs frequency-direction and time-direction LSTM layers to process the short-time Fourier transform (STFT) coefficients of the outer- and in-ear-microphone signals. Key features of the system include:
- Phoneme-Dependent Data Augmentation: To address the scarcity of device-specific recordings, the authors adapt a phoneme-specific relative transfer function (RTF) approach for generating simulated in-ear signals from clean speech datasets. This augmentation method allows for training with minimal device-specific data.
- Multiple Complexity Variants: Five variants of the FT-JNF system (XL, L, M, S, XS) are proposed, differing in the number of hidden units in their LSTM layers. These variants trade off size and computational complexity against performance to suit varying hardware capabilities.
- Low-Complexity Solutions: The emphasis is on creating models with low computational demands in terms of parameter count, multiply-accumulate operations per second (MACs/s), and real-time factor, making the models feasible for implementation on hearable devices.
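The phoneme-dependent augmentation described above can be sketched as a frame-wise application of per-phoneme relative transfer functions to a clean-speech STFT. The sketch below is a minimal illustration, not the paper's implementation: the RTF table and the frame-level phoneme alignment are hypothetical placeholders standing in for the measured device-specific RTFs and a forced-alignment step.

```python
import numpy as np

def simulate_in_ear_stft(clean_stft, frame_phonemes, rtf_table):
    """Simulate an in-ear microphone STFT from a clean-speech (outer-mic) STFT.

    clean_stft     : complex array (num_freqs, num_frames)
    frame_phonemes : sequence of phoneme labels, one per frame
    rtf_table      : dict mapping phoneme -> complex RTF vector (num_freqs,)
    """
    in_ear = np.empty_like(clean_stft)
    for t, ph in enumerate(frame_phonemes):
        # Each frame is filtered with the relative transfer function
        # associated with its phoneme class.
        in_ear[:, t] = rtf_table[ph] * clean_stft[:, t]
    return in_ear

# Toy example: 4 frequency bins, 3 frames, two hypothetical phoneme classes
rng = np.random.default_rng(0)
clean = rng.standard_normal((4, 3)) + 1j * rng.standard_normal((4, 3))
rtfs = {"a": np.full(4, 0.5 + 0j), "s": np.full(4, 0.1 + 0j)}
sim = simulate_in_ear_stft(clean, ["a", "a", "s"], rtfs)
```

In the paper's pipeline, pairs of such simulated in-ear signals and the original clean signals would serve as training material, reducing the need for device-specific recordings.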
Experimental Setup
Evaluation of the proposed systems was carried out using a dataset recorded with the Hearpiece prototype hearable device. The dataset comprised recordings from 18 talkers, divided into training, validation, and test sets. Augmented training utilized a corpus of phoneme-labeled clean speech signals from the CommonVoice dataset. Performance metrics included wideband PESQ, ESTOI, and log-spectral distance (LSD), evaluated across varying SNR levels.
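Of the three metrics, the log-spectral distance is straightforward to compute directly from spectrograms. The sketch below uses one common definition (per-frame RMS difference of dB-scale magnitude spectra, averaged over frames); the paper's exact implementation may differ in details such as frame selection or flooring.

```python
import numpy as np

def log_spectral_distance(ref_stft, est_stft, eps=1e-8):
    """Log-spectral distance (dB) between two spectrograms.

    ref_stft, est_stft : complex or magnitude arrays (num_freqs, num_frames)
    Returns the per-frame RMS log-magnitude difference, averaged over frames.
    """
    ref_db = 20.0 * np.log10(np.abs(ref_stft) + eps)
    est_db = 20.0 * np.log10(np.abs(est_stft) + eps)
    # RMS over frequency for each frame, then mean over frames
    per_frame = np.sqrt(np.mean((ref_db - est_db) ** 2, axis=0))
    return float(np.mean(per_frame))

# Identical spectrograms yield zero distance
spec = np.abs(np.random.default_rng(1).standard_normal((257, 100))) + 1.0
print(log_spectral_distance(spec, spec))  # → 0.0
```

Lower LSD is better, which is why the reported 1.08 (against an unprocessed or baseline reference) indicates a closer spectral match to the clean target.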
Results
The proposed FT-JNF variants demonstrated substantial improvements over unprocessed signals and baseline systems. Key findings include:
- Performance: The FT-JNF XL variant achieved the highest scores, with a PESQ of 2.58, ESTOI of 0.78, and an LSD of 1.08.
- Complexity: The lower-complexity variants (M, S, XS) also performed favorably, offering a better trade-off between performance and computational load than baseline models such as GCBFSNet.
- Training Data: The experiments showed that the proposed systems can achieve comparable performance improvements with only a small amount of device-specific recordings, indicating practical usability when training data is limited.
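The parameter/MACs trade-off behind the five variants can be illustrated with the standard cost formulas for a single LSTM layer. The hidden sizes, input size, and frame rate below are hypothetical placeholders for illustration only, not the paper's actual configurations, and the MACs estimate counts only the gate matrix products.

```python
def lstm_params(input_size, hidden_size):
    """Parameter count of one LSTM layer: 4 gates, each with
    input-to-hidden weights, hidden-to-hidden weights, and a bias."""
    return 4 * (hidden_size * (input_size + hidden_size) + hidden_size)

def lstm_macs_per_step(input_size, hidden_size):
    """Approximate multiply-accumulates per time step (matrix products only,
    ignoring element-wise gate activations)."""
    return 4 * hidden_size * (input_size + hidden_size)

# Hypothetical hidden sizes for the five variants (illustrative only)
variants = {"XL": 128, "L": 96, "M": 64, "S": 32, "XS": 16}
input_size = 10          # e.g. stacked real/imag STFT features (assumed)
frames_per_second = 125  # e.g. 8 ms frame shift (assumed)

for name, h in variants.items():
    p = lstm_params(input_size, h)
    macs_s = lstm_macs_per_step(input_size, h) * frames_per_second
    print(f"{name}: {p} params, {macs_s / 1e6:.2f} MMACs/s")
```

Because both cost terms grow roughly quadratically in the hidden size, halving the hidden units shrinks parameters and MACs/s by nearly a factor of four, which is what makes the smaller variants attractive for hearable hardware.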
Implications and Future Directions
The findings from this research support the feasibility of low-complexity OVR systems for real-world deployment in hearable devices. The successful incorporation of phoneme-dependent data augmentation and efficient filter architectures can drive future advancements in personalized and robust speech enhancement technologies. Future work might explore further reducing computational requirements through model pruning, quantization, and optimization specific to diverse hearable hardware platforms. Additionally, extending the approach to handle broader classes of noise and real-time adaptability can further enhance practical applicability.
Overall, this paper presents a significant step towards practical, low-complexity solutions for improving speech communication in challenging acoustic environments using hearable devices.