
FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec (2309.07405v2)

Published 14 Sep 2023 in cs.SD, cs.AI, and eess.AS

Abstract: This paper presents FunCodec, a fundamental neural speech codec toolkit, which is an extension of the open-source speech processing toolkit FunASR. FunCodec provides reproducible training recipes and inference scripts for the latest neural speech codec models, such as SoundStream and Encodec. Thanks to the unified design with FunASR, FunCodec can be easily integrated into downstream tasks, such as speech recognition. Along with FunCodec, pre-trained models are also provided, which can be used for academic or generalized purposes. Based on the toolkit, we further propose the frequency-domain codec models, FreqCodec, which can achieve comparable speech quality with much lower computation and parameter complexity. Experimental results show that, under the same compression ratio, FunCodec can achieve better reconstruction quality compared with other toolkits and released models. We also demonstrate that the pre-trained models are suitable for downstream tasks, including automatic speech recognition and personalized text-to-speech synthesis. This toolkit is publicly available at https://github.com/alibaba-damo-academy/FunCodec.

Citations (39)

Summary

  • The paper presents FunCodec, a unified and reproducible toolkit that connects neural speech codec models with downstream ASR and TTS applications.
  • It offers pre-trained models and introduces FreqCodec, a frequency-domain model that achieves efficient, high-quality speech reconstruction at low token rates.
  • Semantic-augmented residual vector quantization enhances speech quality, highlighting its potential for practical speech processing tasks.

Overview of FunCodec: An Open-Source Toolkit for Neural Speech Codec

The paper introduces FunCodec, a sophisticated open-source toolkit designed for neural speech codec applications. Serving as an extension of the FunASR toolkit, FunCodec aims to provide a reproducible, integrable framework for developing and evaluating neural speech codec models. The toolkit supports cutting-edge models like SoundStream and Encodec, offering both training recipes and inference scripts.

Key Contributions

  1. Unified Integration with FunASR: FunCodec's architecture allows seamless integration with downstream tasks such as automatic speech recognition (ASR) and personalized text-to-speech (TTS) synthesis. This facilitates broader applications in areas requiring speech-text modeling.
  2. Pre-Trained Models: FunCodec offers several pre-trained models, beneficial for academia and general use, released on platforms like Huggingface and ModelScope. These models are made available to allow researchers and practitioners to leverage them for baseline comparisons and direct application in various tasks.
  3. Frequency-Domain Model, FreqCodec: A novel contribution is the FreqCodec, a frequency-domain codec model that achieves comparable speech quality with reduced computational and parameter complexity. This innovation highlights potential advancements in speech codec efficiency.
  4. Semantic-Augmented Residual Vector Quantization: The paper explores incorporating semantic information, such as phoneme labels, into codec models. Enhancements in quantized speech quality suggest a promising direction in reducing token rates while maintaining semantic integrity.
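Residual vector quantization (RVQ), the quantizer family used by SoundStream, Encodec, and FunCodec, can be pictured as a cascade in which each codebook quantizes the residual left over by the previous stage. A minimal NumPy sketch of the idea follows; the function name, codebook shapes, and toy data are illustrative assumptions, not FunCodec's actual API:

```python
import numpy as np

def rvq_quantize(frames, codebooks):
    """Residual vector quantization (illustrative sketch, not FunCodec's API):
    each stage picks the nearest codeword to the current residual, then the
    next stage quantizes what remains."""
    residual = frames.copy()
    recon = np.zeros_like(frames)
    indices = []
    for cb in codebooks:                       # cb: (codebook_size, dim)
        # squared distance from every residual frame to every codeword
        d = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = d.argmin(axis=1)                 # nearest codeword per frame
        q = cb[idx]
        recon += q                             # accumulate reconstruction
        residual -= q                          # pass the leftover downstream
        indices.append(idx)
    return np.stack(indices, axis=0), recon    # indices: (n_stages, n_frames)

rng = np.random.default_rng(0)
frames = rng.normal(size=(10, 8))                          # 10 frames, 8-dim
codebooks = [rng.normal(size=(16, 8)) for _ in range(4)]   # 4 toy stages
idx, recon = rvq_quantize(frames, codebooks)
```

Because later stages only refine the residual, dropping trailing codebooks at inference time yields a coarser but still decodable stream, which is what makes variable token rates possible.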

Experimental Insights

The experiments were conducted under both academic and generalized settings. The primary dataset for academic purposes was the LibriTTS corpus, while a large-scale in-house dataset was used for generalized training. Evaluation relies on the Virtual Speech Quality Objective Listener (ViSQOL) score, which measures reconstruction quality across a range of token rates. Notably, FunCodec outperformed existing models such as SoundStream and Encodec, particularly at lower token rates.
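The token rate at which these comparisons are made follows directly from the frame rate and the number of active quantizers. A short worked example makes the arithmetic concrete; the specific sample rate, hop size, and codebook sizes below are illustrative assumptions, not the paper's exact configurations:

```python
import math

def token_rate(sample_rate, hop_size, n_quantizers):
    """Tokens per second: one codebook index per quantizer per frame."""
    frame_rate = sample_rate / hop_size
    return frame_rate * n_quantizers

def bitrate(sample_rate, hop_size, n_quantizers, codebook_size):
    """Bits per second: each token carries log2(codebook_size) bits."""
    return token_rate(sample_rate, hop_size, n_quantizers) * math.log2(codebook_size)

# Hypothetical setup: 16 kHz audio, 320-sample hop (50 frames/s),
# 8 quantizer stages, 1024-entry codebooks.
print(token_rate(16000, 320, 8))     # 400 tokens/s
print(bitrate(16000, 320, 8, 1024))  # 4000 bits/s (4 kbps)
```

Halving the number of active quantizers halves both the token rate and the bitrate, which is why reconstruction quality at low token rates is the discriminating regime in these evaluations.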

Semantic Augmentation and Downstream Tasks

Semantic augmentation significantly improved speech quality under low token rate conditions, with the residual approach proving most effective. Furthermore, when applied to downstream tasks like ASR, the codec-encoded tokens preserved substantial speech content, showcasing the toolkit's capability to maintain useful information through compression processes.
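The residual variant of semantic augmentation can be pictured as subtracting a semantic embedding from the frame before acoustic quantization, so the quantizer only spends its bit budget on what the semantics do not already explain. The toy sketch below uses a scalar rounding quantizer purely for illustration; the function names and the way the semantic embedding is obtained are assumptions, not the paper's implementation:

```python
import numpy as np

def encode(frames, semantic_emb, step=0.5):
    """Quantize only the acoustic residual left after removing the
    semantic embedding (toy scalar quantizer, illustrative only)."""
    return np.round((frames - semantic_emb) / step).astype(int)

def decode(codes, semantic_emb, step=0.5):
    """Add the semantic embedding back to the dequantized residual."""
    return semantic_emb + codes * step

rng = np.random.default_rng(1)
sem = rng.normal(size=(5, 4))                   # semantic prediction per frame
frames = sem + 0.1 * rng.normal(size=(5, 4))    # frames close to that prediction
codes = encode(frames, sem)
recon = decode(codes, sem)
```

When the semantic embedding already predicts most of the frame, the residual codes are small (often zero), which is consistent with the paper's observation that the residual approach helps most at low token rates.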

Implications and Future Directions

FunCodec represents a meaningful progression in open-source neural speech codec development, appealing directly to the speech processing research community. By addressing challenges in computational efficiency and integration with text modeling, it lays groundwork for future endeavors targeting broader AI applications. Potential advancements may include exploring adaptive bitrate models or more nuanced semantic incorporation strategies, fostering further improvements in neural codec technology.

Overall, FunCodec not only serves as a practical tool for researchers but also contributes conceptually to the field by proposing new methodologies for open challenges in neural speech coding.