Efficient Speech Coding with Cross-Scale Residual Vector Quantized Transformers
The paper "ESC: Efficient Speech Coding with Cross-Scale Residual Vector Quantized Transformers" introduces an innovative approach in the field of neural audio codecs, aiming to balance computational complexity and audio quality through a novel structural design. The proposed method, termed Efficient Speech Codec (ESC), leverages advances in vector quantization and transformer architectures to redefine state-of-the-art performance in speech coding tasks.
Key Contributions
- Cross-Scale Residual Vector Quantization (CS-RVQ): The paper introduces the CS-RVQ framework, which transmits bitstreams across multiple resolution scales rather than at a single scale. The encoder produces multi-scale features, and the decoder reconstructs them coarse-to-fine, quantizing only the residual information at each scale; this integrates cross-scale learning without additional fusion networks (a minimal sketch appears after this list).
- Transformer-Based Architecture: Departing from the convolutional backbones prevalent in existing audio codecs, the system is built from Swin Transformer Blocks (STBs). Their hierarchical window attention captures redundancy in speech signals more effectively, modeling both local and global spectral structure (see the windowed-attention sketch after this list).
- Enhanced Training Paradigm: The paper proposes a two-stage training recipe that includes a pre-training stage to counteract codebook collapse, a common failure mode in VQ networks. By temporarily bypassing the vector quantizers, pre-training lets the encoder and decoder learn stable representations before codebook learning begins (illustrated by the toy training loop below).
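To make the CS-RVQ idea concrete, here is a minimal, self-contained sketch of the cross-scale residual quantization loop. It is an interpretation of the description above, not the authors' implementation: the nearest-neighbour quantizer, the toy linear decoder blocks, and all names and shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with straight-through gradients."""
    def __init__(self, codebook_size: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, x):                                  # x: (B, T, D)
        flat = x.reshape(-1, x.shape[-1])                  # (B*T, D)
        idx = torch.cdist(flat, self.codebook.weight).argmin(-1)
        idx = idx.view(x.shape[:-1])                       # (B, T)
        q = self.codebook(idx)                             # (B, T, D)
        # Straight-through estimator: gradients flow as if q == x.
        return x + (q - x).detach(), idx

class CrossScaleRVQ(nn.Module):
    """Quantize encoder-decoder residuals scale by scale, coarse to fine."""
    def __init__(self, num_scales=3, dim=64, codebook_size=1024):
        super().__init__()
        self.quantizers = nn.ModuleList(
            VectorQuantizer(codebook_size, dim) for _ in range(num_scales))
        # Toy per-scale decoder blocks standing in for the Swin-based decoder.
        self.blocks = nn.ModuleList(
            nn.Linear(dim, dim) for _ in range(num_scales))

    def forward(self, enc_feats):   # enc_feats: coarsest-first list of (B, T_i, D)
        codes, dec = [], torch.zeros_like(enc_feats[0])
        for feat, vq, block in zip(enc_feats, self.quantizers, self.blocks):
            if dec.shape[1] != feat.shape[1]:
                # Upsample the running decoder state to the finer time scale.
                dec = F.interpolate(dec.transpose(1, 2), size=feat.shape[1],
                                    mode="nearest").transpose(1, 2)
            residual = feat - dec          # what the decoder is still missing
            q, idx = vq(residual)          # only this residual is transmitted
            codes.append(idx)
            dec = block(dec + q)           # refine, then continue to next scale
        return dec, codes

feats = [torch.randn(2, t, 64) for t in (25, 50, 100)]    # coarse -> fine
recon, codes = CrossScaleRVQ()(feats)
print(recon.shape, [c.shape for c in codes])
```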
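The windowed attention behind STBs can be sketched as follows. For simplicity this version attends over 1-D windows along the time axis, whereas the paper's STBs operate on 2-D time-frequency windows; the window size, shift, and use of nn.MultiheadAttention are illustrative assumptions.

```python
import torch
import torch.nn as nn

def window_attention(x, attn, window: int = 8, shift: int = 0):
    # x: (batch, time, dim). Alternating plain and shifted windows between
    # successive blocks lets information cross window boundaries, giving
    # global context on top of cheap local attention.
    if shift:
        x = torch.roll(x, -shift, dims=1)
    B, T, D = x.shape
    w = x.reshape(B * (T // window), window, D)   # partition into local windows
    out, _ = attn(w, w, w)                        # attention inside each window
    out = out.reshape(B, T, D)
    if shift:
        out = torch.roll(out, shift, dims=1)
    return out

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
x = torch.randn(2, 64, 64)
y = window_attention(x, attn, window=8, shift=0)   # local mixing
y = window_attention(y, attn, window=8, shift=4)   # shifted-window mixing
print(y.shape)
```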
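The two-stage recipe can be illustrated with a toy training loop: quantization is bypassed during a warm-up phase so the autoencoder first learns a clean reconstruction, then the quantizer is switched on. The `use_vq` flag, the plain L2 loss, and the schedule are illustrative assumptions, not the paper's exact objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyCodec(nn.Module):
    def __init__(self, dim=64, codebook_size=1024):
        super().__init__()
        self.enc = nn.Linear(dim, dim)
        self.dec = nn.Linear(dim, dim)
        self.codebook = nn.Embedding(codebook_size, dim)
        self.use_vq = False                      # off during pre-training

    def forward(self, x):
        z = self.enc(x)
        if self.use_vq:
            # Nearest-neighbour lookup with straight-through gradients.
            d = torch.cdist(z.reshape(-1, z.shape[-1]), self.codebook.weight)
            q = self.codebook(d.argmin(-1)).view_as(z)
            z = z + (q - z).detach()
        return self.dec(z)

model = ToyCodec()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
for step in range(200):
    x = torch.randn(8, 64)        # stand-in for spectrogram feature frames
    model.use_vq = step >= 100    # stage 1: plain autoencoder; stage 2: VQ on
    loss = F.mse_loss(model(x), x)
    opt.zero_grad()
    loss.backward()
    opt.step()
```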
Experimental Evaluation and Results
ESC was trained and evaluated on datasets including the DNS Challenge corpus, LibriSpeech, and AIShell. Compared with the Descript Audio Codec (DAC) and its smaller configured variants, ESC delivers competitive reconstruction quality across objective metrics: PESQ, SI-SDR, and Mel distance. While it does not surpass the full-size DAC on every metric, ESC consistently outperforms the smaller DAC variants, both with and without adversarial losses.
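For reference, SI-SDR (scale-invariant signal-to-distortion ratio), one of the reported metrics, scores a reconstruction after optimally scaling the reference. The sketch below follows the standard formulation of the metric, not code from the paper.

```python
import torch

def si_sdr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Zero-mean both signals so the measure ignores DC offset.
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference (optimal scaling).
    alpha = (est * ref).sum(-1, keepdim=True) / (ref.pow(2).sum(-1, keepdim=True) + eps)
    target = alpha * ref
    noise = est - target
    return 10 * torch.log10(target.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps))

# Unrelated noise scores strongly negative; a perfect copy scores very high.
print(si_sdr(torch.randn(16000), torch.randn(16000)))
```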
Further, ESC shows strong efficiency in both model complexity and inference speed, especially on CPUs, where it outpaces DAC by significant margins in both encoding and decoding. Its lightweight configuration of transformer blocks offers a practical answer to the computational cost of larger contemporary codecs (a simple way to measure this is sketched below).
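A common way to substantiate such speed claims is the real-time factor (RTF): processing time divided by audio duration, with values below 1.0 meaning faster than real time. Only the timing harness below is the point; the convolutional stand-in for the codec is hypothetical.

```python
import time
import torch

def rtf(fn, x, audio_seconds: float, runs: int = 20) -> float:
    """Average wall-clock time per call divided by audio duration."""
    with torch.no_grad():
        fn(x)                                    # warm-up pass
        t0 = time.perf_counter()
        for _ in range(runs):
            fn(x)
        elapsed = (time.perf_counter() - t0) / runs
    return elapsed / audio_seconds               # < 1.0: faster than real time

toy_encoder = torch.nn.Conv1d(1, 1, kernel_size=9, padding=4)  # hypothetical stand-in
x = torch.randn(1, 1, 16000)                     # 1 second of 16 kHz audio
print(f"encode RTF: {rtf(toy_encoder, x, 1.0):.4f}")
```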
Theoretical and Practical Implications
The paper's methodology sets a precedent for neural codecs built on transformer architectures. Cross-scale vector quantization departs from traditional single-scale quantization, offering a versatile way to trade quality against computational efficiency. This work paves the way for more scalable, parameter-efficient models, potentially applicable to general audio beyond speech.
Future Directions
One avenue for further research is adapting ESC to general audio, which would likely require architectural adjustments or stronger learning strategies to handle greater data variability. Another is integrating end-to-end adversarial training to further refine perceptual quality while preserving resource efficiency, which would mean weighing the added training complexity against the gains in audio fidelity.
In summary, ESC is a meaningful advance in speech codec technology, pairing low complexity with high-fidelity output. Its combination of cross-scale quantization and transformer backbones makes a compelling case for rethinking the foundational architectures of neural audio compression, and warrants further empirical study of its broader applicability and refinement.