- The paper introduces SD-Codec, which integrates domain-specific quantizers to disentangle and encode distinct audio sources.
- It jointly learns audio coding and separation, achieving about a 1 dB SI-SDR improvement over conventional models in single-source resynthesis.
- Experimental results on extensive mixed audio datasets demonstrate competitive separation quality and enhanced resynthesis fidelity.
Learning Source Disentanglement in Neural Audio Codec
The paper presents the Source-Disentangled Neural Audio Codec (SD-Codec), an approach that augments neural audio coding with source separation. Unlike traditional neural audio codecs, which are trained on undifferentiated audio data, SD-Codec disentangles sources inside the codec architecture, with dedicated handling for speech, music, and environmental sound effects. In doing so, it addresses both the resynthesis-quality and controllability limitations of conventional models.
Background and Motivation
Neural audio codecs (NACs) have revolutionized audio compression by transforming continuous audio signals into discrete tokens, allowing for high-fidelity audio reconstruction at lower bit rates. Previous models, leveraging techniques like residual vector quantization (RVQ), have shown significant improvements in fidelity and bitrate efficiency. However, these models often blend audio from various domains into a single latent space, neglecting the intrinsic differences between sound sources such as speech, music, and ambient sounds. This unified treatment complicates data modeling and limits the controllability and interpretability of the latent features.
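As background, a residual vector quantizer discretizes each latent frame in stages, with every stage quantizing the residual left over by the previous one. The sketch below illustrates the idea; the class name and hyperparameters are illustrative placeholders, not the quantizer used in the paper.

```python
import torch
import torch.nn as nn

class ResidualVectorQuantizer(nn.Module):
    """Each stage quantizes the residual left over by the previous stage."""

    def __init__(self, num_stages: int = 8, codebook_size: int = 1024, dim: int = 128):
        super().__init__()
        self.codebooks = nn.ModuleList(
            [nn.Embedding(codebook_size, dim) for _ in range(num_stages)]
        )

    def forward(self, z: torch.Tensor):
        # z: (batch, frames, dim) continuous encoder output
        residual = z
        quantized = torch.zeros_like(z)
        codes = []
        for codebook in self.codebooks:
            # squared distance from the current residual to every codeword
            dists = (residual.unsqueeze(-2) - codebook.weight).pow(2).sum(-1)
            idx = dists.argmin(dim=-1)            # (batch, frames)
            q = codebook(idx)                     # nearest codeword per frame
            quantized = quantized + q
            residual = residual - q
            codes.append(idx)
        # straight-through estimator so gradients flow back to the encoder
        quantized = z + (quantized - z).detach()
        return quantized, torch.stack(codes, dim=-1)
```

Each additional stage refines the approximation, which is why RVQ-based codecs can trade bitrate against fidelity by simply dropping or keeping later stages.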
Proposed Approach: SD-Codec
SD-Codec integrates source separation directly into the neural audio coding process. The architecture comprises domain-specific quantizers that encode latent features of speech, music, and sound effects, which are then decoded either individually or as a mixture. Key components of SD-Codec include (a code sketch of this layout follows the list):
- Multiple Domain-Specific RVQs: Each source type (speech, music, sound effects) is encoded using a distinct RVQ, enabling the disentanglement of latent features corresponding to different audio domains.
- Shared Codebooks: To limit parameter cost without sacrificing performance, SD-Codec shares the last few RVQ layers among the different sources, reflecting the observation that the shallow layers encode source-specific features while the deeper layers capture local acoustic details common across domains.
- Joint Learning of Coding and Separation: The model is trained to reconstruct and separate audio sources simultaneously, so it handles mixed audio inputs directly.
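A minimal sketch of how such a design can be wired is shown below, assuming a single shared encoder and decoder with one RVQ per domain; all module and variable names are illustrative and do not reflect the authors' implementation.

```python
import torch
import torch.nn as nn

class SDCodecSketch(nn.Module):
    """Shared encoder/decoder with one residual vector quantizer per source domain."""

    DOMAINS = ("speech", "music", "sfx")

    def __init__(self, encoder: nn.Module, decoder: nn.Module, make_rvq):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        # one RVQ per domain; in SD-Codec the deeper quantizer layers can share codebooks
        self.rvq = nn.ModuleDict({d: make_rvq() for d in self.DOMAINS})

    def forward(self, mixture: torch.Tensor):
        # mixture: (batch, 1, samples) waveform containing speech + music + sound effects
        z = self.encoder(mixture)  # shared latent, e.g. (batch, frames, dim)
        # each RVQ is assumed to return (quantized_latent, codes), as in the sketch above
        quantized = {d: self.rvq[d](z)[0] for d in self.DOMAINS}
        # decode each disentangled latent into its own source ...
        sources = {d: self.decoder(q) for d, q in quantized.items()}
        # ... and decode the sum of the latents to resynthesize the mixture
        remix = self.decoder(sum(quantized.values()))
        return sources, remix
```

Training would then combine reconstruction losses on each decoded source against its ground-truth stem with a loss on the remixed output, which is what couples coding and separation in a single objective.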
Experimental Results
The authors trained SD-Codec on a large-scale dataset covering over 6,000 hours of mixed audio, drawn from distinct speech, music, and sound-effect corpora. Evaluation was conducted on the Divide and Remaster (DnR) dataset, a zero-shot scenario with unseen data mixtures.
Key findings from the experiments include:
- Audio Resynthesis: SD-Codec reconstructs mixed audio with higher SI-SDR than, and ViSQOL scores comparable to, state-of-the-art models such as DAC. For single-source resynthesis, SD-Codec outperforms DAC by approximately 1 dB SI-SDR.
- Source Separation: Although designed primarily as a codec, SD-Codec achieves competitive results against dedicated separation models such as TDANet. The domain-specific RVQs provide effective source disentanglement, reflected in comparable SI-SDR and ViSQOL values for the separated audio tracks.
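Since both resynthesis and separation quality are reported in SI-SDR, a short sketch of the standard scale-invariant SDR computation may help in interpreting these numbers (this is the common textbook definition, not code from the paper):

```python
import torch

def si_sdr(estimate: torch.Tensor, reference: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SDR in dB for (..., samples) tensors, zero-mean convention."""
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    reference = reference - reference.mean(dim=-1, keepdim=True)
    # project the estimate onto the reference to obtain the optimally scaled target
    scale = (estimate * reference).sum(dim=-1, keepdim=True) / (
        reference.pow(2).sum(dim=-1, keepdim=True) + eps
    )
    target = scale * reference
    noise = estimate - target
    ratio = target.pow(2).sum(dim=-1) / (noise.pow(2).sum(dim=-1) + eps)
    return 10.0 * torch.log10(ratio + eps)
```

Under this definition, a 1 dB gain corresponds to roughly a 26% increase in the ratio of target energy to error energy.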
Implications and Future Directions
The introduction of SD-Codec marks a significant advance in the neural audio codec domain, offering both practical and theoretical contributions. Practically, SD-Codec's ability to disentangle and separately reconstruct different audio sources enhances applications in audio compression, streaming, and broadcasting where fidelity and controllability are paramount. Theoretically, it paves the way for more interpretable latent representations in neural audio models.
Future research directions include:
- Optimizing Codebook Sharing: Further exploration of optimal configurations for shared and non-shared quantization layers could balance the trade-off between computational efficiency and separation performance.
- Integration with Generative Models: Leveraging SD-Codec's disentangled latent space in conjunction with generative models like diffusion or LLMs could lead to more controllable and high-fidelity audio synthesis.
- Real-Time Applications: Adapting SD-Codec for real-time processing will require addressing computational constraints, hardware acceleration, and low-latency requirements.
In conclusion, SD-Codec offers a promising approach for enhancing neural audio codecs by incorporating source separation, achieving high-quality audio resynthesis, and setting the stage for more controllable and interpretable audio generation models. The research presented in this paper addresses fundamental limitations of existing models and opens up opportunities for future innovations in neural audio processing.