- The paper introduces TokenSE, the first SE framework for cochlear implants that leverages a Mamba state-space model for efficient, real-time discrete token prediction.
- It integrates joint fine-tuning of the neural audio codec encoder with residual vector quantization to optimize token representations for degraded speech conditions.
- Experimental results show significant gains in speech quality, intelligibility, and computational efficiency compared to Transformer-based and traditional SE methods.
TokenSE: Mamba-Based Discrete Token Speech Enhancement for Cochlear Implants
Motivation and Context
Speech enhancement (SE) is a critical challenge in the context of cochlear implants (CIs), where both additive noise and reverberation drastically impact intelligibility and perceived quality. Traditional signal-processing and even many supervised deep learning (DL) methods often fail to generalize robustly across diverse acoustic conditions or introduce distortions detrimental to CI users, who have limited spectro-temporal resolution. While recent approaches have explored discrete-token speech modeling in the neural audio codec (NAC) space, the needs of CI recipients have remained largely unaddressed in such frameworks. This paper introduces TokenSE, the first discrete token-based SE framework explicitly designed for cochlear implant users and leveraging a Mamba-based sequence model for codec token prediction (2604.12246).
Methodological Framework
Mamba-Based Sequence Modeling
At the core of TokenSE lies Mamba, a state-space model (SSM) that replaces the traditional time-invariant formulation with a selective, input-dependent, time-varying one. Mamba achieves linear computational complexity in sequence length, in contrast to the Transformer's quadratic complexity. By parameterizing the state-transition, input, and output matrices as functions of the input sequence, it processes sequences dynamically and selectively, efficiently capturing the long-range dependencies crucial to modeling speech temporal structure.
Three variants are explored:
- Mamba (Uni): Unidirectional, causal variant for real-time operation.
- Mamba (Bi): Bidirectional model for leveraging both past and future context.
- Transformer-MHSA+Mamba (Bi): Hybrid variant that replaces multi-head self-attention (MHSA) with bidirectional Mamba inside Transformer layers.
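The selective recurrence described above can be sketched in a few lines. The following is a minimal, illustrative NumPy implementation of a selective SSM scan, not the paper's actual model: shapes, the softplus discretization, and all weight names (`W_dt`, `W_B`, `W_C`) are assumptions, and real Mamba uses a hardware-aware parallel scan rather than a Python loop.

```python
import numpy as np

rng = np.random.default_rng(0)

def selective_ssm_scan(x, A, W_dt, W_B, W_C):
    """Sequential scan of a selective (input-dependent) SSM, Mamba-style.

    Illustrative sketch only. Shapes:
      x    : (L, D)  input sequence
      A    : (D, N)  negative state-decay parameters
      W_dt : (D, D)  projection for the input-dependent step size
      W_B  : (D, N)  projection making the input matrix B_t depend on x_t
      W_C  : (D, N)  projection making the output matrix C_t depend on x_t
    Cost is O(L * D * N): linear in sequence length L, unlike the
    O(L^2) pairwise interactions of self-attention.
    """
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                    # one N-dim state per channel
    y = np.empty((L, D))
    for t in range(L):
        dt = np.log1p(np.exp(x[t] @ W_dt))  # softplus -> positive step, (D,)
        B_t = x[t] @ W_B                    # input-dependent, (N,)
        C_t = x[t] @ W_C                    # input-dependent, (N,)
        A_bar = np.exp(dt[:, None] * A)     # discretized decay in (0, 1], (D, N)
        h = A_bar * h + (dt[:, None] * B_t[None, :]) * x[t][:, None]
        y[t] = h @ C_t                      # per-channel readout, (D,)
    return y

# Toy dimensions: 16-step sequence, 4 channels, 8 SSM states per channel
L, D, N = 16, 4, 8
x = rng.standard_normal((L, D))
A = -np.abs(rng.standard_normal((D, N)))    # negative => stable decay
W_dt = 0.1 * rng.standard_normal((D, D))
W_B = 0.1 * rng.standard_normal((D, N))
W_C = 0.1 * rng.standard_normal((D, N))
y = selective_ssm_scan(x, A, W_dt, W_B, W_C)
```

Because `A_bar`, `B_t`, and `C_t` are recomputed from each input frame, the model can selectively retain or forget state content, which is what distinguishes Mamba from a fixed linear time-invariant SSM.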
Discrete Token-Based Enhancement in Neural Audio Codec Space
TokenSE operates in the discrete token space produced by the Encodec NAC backbone. The encoder compresses the input waveform to latent embeddings, which are quantized via residual vector quantization (RVQ) into token sequences. The Mamba-based model predicts clean codec tokens given degraded input embeddings. The decoder reconstructs the enhanced waveform from the predicted tokens. Uniquely, the framework jointly fine-tunes the encoder with the Mamba backbone, adapting token representations specifically for SE under degraded conditions, unlike previous works which typically freeze the encoder.
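To make the RVQ stage concrete, here is a minimal, self-contained sketch of residual vector quantization: each stage quantizes the residual left by the previous stage, so every frame yields one token index per codebook. The codebook sizes, dimensions, and function name are illustrative assumptions, not Encodec's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(1)

def rvq_encode(z, codebooks):
    """Residual vector quantization over latent frames.

    z         : (T, D) latent frames from a codec encoder
    codebooks : list of (K, D) arrays, one per quantizer stage
    Returns (tokens, z_hat): token indices (T, num_stages) and the
    quantized reconstruction (T, D).
    """
    residual = z.copy()
    z_hat = np.zeros_like(z)
    tokens = []
    for cb in codebooks:
        # nearest codebook entry per frame (squared Euclidean distance)
        d = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)  # (T, K)
        idx = d.argmin(axis=1)                                      # (T,)
        q = cb[idx]                                                 # (T, D)
        z_hat += q                  # accumulate the quantized estimate
        residual -= q               # next stage quantizes what is left
        tokens.append(idx)
    return np.stack(tokens, axis=1), z_hat

T, D, K, S = 10, 16, 32, 4          # frames, dim, codebook size, stages
codebooks = [rng.standard_normal((K, D)) for _ in range(S)]
z = rng.standard_normal((T, D))
tokens, z_hat = rvq_encode(z, codebooks)
```

In TokenSE's setting, the Mamba backbone predicts the clean versions of such token sequences from degraded encoder embeddings, and the codec decoder maps the predicted tokens back to a waveform.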
The optimization objective combines a cross-entropy loss over token indices with an ℓ2 loss over the codebook entries corresponding to the predicted tokens, aligning token-prediction accuracy with the fidelity of the reconstructed audio.
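A sketch of this combined objective, based only on the description above: cross-entropy on token indices plus an ℓ2 term between codebook embeddings. The weighting `lam` and the use of argmax to pick predicted tokens are assumptions for illustration; a trainable version would use a softmax-weighted (expected) embedding or a straight-through estimator, since argmax is non-differentiable.

```python
import numpy as np

def tokense_style_loss(logits, target_tokens, codebook, lam=1.0):
    """Illustrative combined loss: CE over token indices + l2 over
    codebook entries of the predicted tokens (lam is an assumed weight).

    logits        : (T, K) unnormalized scores over K codebook entries
    target_tokens : (T,)   clean token indices
    codebook      : (K, D) codebook embeddings
    """
    T = logits.shape[0]
    # numerically stable log-softmax, then cross-entropy at the targets
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(T), target_tokens].mean()
    # l2 between embeddings of predicted (argmax) and target tokens;
    # illustration only -- argmax is not differentiable
    pred_tokens = logits.argmax(axis=1)
    l2 = ((codebook[pred_tokens] - codebook[target_tokens]) ** 2).sum(1).mean()
    return ce + lam * l2

# Toy check: confident, correct logits should give a near-zero loss
T, K, D = 6, 8, 16
rng = np.random.default_rng(2)
codebook = rng.standard_normal((K, D))
targets = rng.integers(0, K, size=T)
logits = np.zeros((T, K))
logits[np.arange(T), targets] = 10.0
loss = tokense_style_loss(logits, targets, codebook)
```

When the predicted token matches the target, the ℓ2 term vanishes and only the (small) cross-entropy remains, so the two terms pull in the same direction: correct indices and faithful embeddings.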
Experimental Evaluation
Objective In-Domain and Out-of-Domain Performance
Comprehensive experiments are conducted on both in-domain (DNS Challenge) and out-of-domain (OOD; TIMIT + NOISEX-92 + REVERB) datasets. Performance is measured using DNSMOS P.835, a three-dimensional, non-intrusive perceptual metric suite.
Key numerical findings:
- In-domain: TokenSE (Mamba (Bi)) outperforms all baselines—including Log-MMSE, DEMUCS, FRCRN, SELM, and MaskSR—in terms of SIG (speech quality), BAK (background noise), and OVL (overall quality) across all test conditions, including reverberant and real recordings.
- For instance, on the DNS set with reverb, TokenSE achieves SIG = 3.63, BAK = 4.16, OVL = 3.39, compared to the best generative baseline MaskSR (SIG = 3.53, BAK = 4.07, OVL = 3.25) and the Transformer-based TokenSE variant (SIG = 3.58, BAK = 4.08, OVL = 3.31).
- OOD generalization: TokenSE consistently surpasses the Log-MMSE baseline for both noisy-only and reverberant + noisy conditions at all SNRs and reverberation times, with especially large margins in the more adverse scenarios (e.g., at 0 dB SNR and T60 = 0.7 s).
Subjective Evaluation with CI Users
Significant intelligibility improvements are observed in formal listening tests with six CI recipients:
- Under 0 dB SNR (noisy-only), TokenSE yields a 47.19-percentage-point gain in mean word recognition rate (WRR) over unprocessed speech, and a statistically significant improvement relative to Log-MMSE (p = 0.026).
- For reverberant + noisy conditions, TokenSE delivers WRR gains of 38.40 and 38.41 percentage points over unprocessed speech at T60 = 0.5 s and T60 = 0.7 s, respectively, both statistically significant.
- Across all scenarios, subjective mean opinion scores (MOS) for speech quality are maximized by TokenSE with statistical significance in most tested conditions.
Ablation Studies and Efficiency
Experiments confirm that fine-tuning the NAC encoder in TokenSE yields superior enhancement compared to encoder-freezing strategies, even when auxiliary features are used. TokenSE with Mamba (Bi) requires fewer GFLOPs than its Transformer-based counterparts, confirming the practical efficiency critical for deployment on CI and hearing-aid (HA) hardware.
Theoretical and Practical Implications
The development of TokenSE operationalizes several advances:
- Adaptation of SSM-based Mamba for generative SE in the discrete token space, achieving superior sequence modeling at lower computational cost.
- End-to-end, jointly-optimized compression and enhancement: By fine-tuning the encoder, token representations are specialized for SE, optimizing information flow for intelligibility restoration.
- Explicit targeting of the CI population: Objective and subjective results specifically validate gains for CI users, a demographic rarely considered in token-based SE frameworks.
The implications are twofold:
- Practically, TokenSE paves an efficient path for real-time, high-performance SE on CI and hearing aid hardware without the compute penalties of Transformer architectures.
- Theoretically, the approach opens avenues for future integration of more sophisticated generative modeling in token spaces (e.g., diffusion or flow-based methods), exploration of causal deployment, and tailored enhancement strategies preserving CI-perceivable cues.
Conclusion
TokenSE introduces a robust and computationally efficient discrete-token-based speech enhancement framework using Mamba state-space models, tailored for the unique demands of cochlear implant users. The system demonstrates substantial gains in both objective and listener-based outcomes under challenging real-world acoustic conditions, setting a new standard in CI-oriented speech enhancement research. Future investigations may extend these principles to other assistive hearing devices and further optimize discrete codec representations in end-to-end SE pipelines.