A Streamable Neural Audio Codec with Residual Scalar-Vector Quantization for Real-Time Communication (2504.06561v1)

Published 9 Apr 2025 in cs.SD

Abstract: This paper proposes StreamCodec, a streamable neural audio codec designed for real-time communication. StreamCodec adopts a fully causal, symmetric encoder-decoder structure and operates in the modified discrete cosine transform (MDCT) domain, aiming for low-latency inference and real-time efficient generation. To improve codebook utilization efficiency and compensate for the audio quality loss caused by structural causality, StreamCodec introduces a novel residual scalar-vector quantizer (RSVQ). The RSVQ sequentially connects scalar quantizers and improved vector quantizers in a residual manner, constructing coarse audio contours and refining acoustic details, respectively. Experimental results confirm that the proposed StreamCodec achieves decoded audio quality comparable to advanced non-streamable neural audio codecs. Specifically, on the 16 kHz LibriTTS dataset, StreamCodec attains a ViSQOL score of 4.30 at 1.5 kbps. It has a fixed latency of only 20 ms and achieves a generation speed nearly 20 times real-time on a CPU, with a lightweight model size of just 7M parameters, making it highly suitable for real-time communication applications.

Authors (4)

Xiao-Hang Jiang
Yang Ai
Rui-Chen Zheng
Zhen-Hua Ling

Summary

The paper presents StreamCodec, a neural audio codec built for real-time communication. It combines a fully causal, symmetric encoder-decoder architecture with processing in the modified discrete cosine transform (MDCT) domain, targeting low-latency inference and efficient real-time generation. Its main departure from existing codecs is the residual scalar-vector quantizer (RSVQ), a hybrid scheme that chains scalar quantizers and improved vector quantizers in a residual manner: the scalar stages build a coarse audio contour, and the vector stages refine acoustic details. The design is intended to improve codebook utilization and to offset the quality loss that a causal structure would otherwise introduce.
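To make "operating in the MDCT domain" concrete, the following is a minimal sketch of the textbook MDCT and its inverse, not the paper's implementation. The frame size, the rectangular (no) window, and the function names `mdct`, `imdct`, and `analysis_synthesis` are assumptions chosen for illustration; a production codec would typically use a smooth window and an FFT-based transform.

```python
import math

def mdct(frame):
    """Direct O(N^2) MDCT: 2N time samples -> N coefficients."""
    n_half = len(frame) // 2
    return [
        sum(frame[n] * math.cos(math.pi / n_half * (n + 0.5 + n_half / 2) * (k + 0.5))
            for n in range(2 * n_half))
        for k in range(n_half)
    ]

def imdct(coeffs):
    """Inverse MDCT: N coefficients -> 2N time-aliased samples (1/N scaling)."""
    n_half = len(coeffs)
    return [
        sum(coeffs[k] * math.cos(math.pi / n_half * (n + 0.5 + n_half / 2) * (k + 0.5))
            for k in range(n_half)) / n_half
        for n in range(2 * n_half)
    ]

def analysis_synthesis(signal, n_half):
    """Transform 50%-overlapping frames and overlap-add the inverses.

    Each frame only needs the current hop and the previous one, so the
    transform itself adds no look-ahead beyond one frame.
    """
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - 2 * n_half + 1, n_half):
        rec = imdct(mdct(signal[start:start + 2 * n_half]))
        for i, value in enumerate(rec):
            out[start + i] += value
    return out
```

Each inverse frame contains time-domain aliasing, but the aliasing terms of adjacent frames cancel when they are overlap-added, so every sample covered by two frames is reconstructed. This is why N coefficients per hop of N samples suffice, which keeps the representation critically sampled.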

Key Findings and Claims

The paper reports that StreamCodec, although streamable, achieves decoded audio quality comparable to advanced non-streamable neural audio codecs. On the 16 kHz LibriTTS dataset, it attains a ViSQOL score of 4.30 at 1.5 kbps. It has a fixed latency of 20 ms, generates audio nearly 20 times faster than real time on a CPU, and uses a lightweight model of 7M parameters, which makes it well suited to real-time applications.
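The reported figures can be put in perspective with some back-of-the-envelope arithmetic. The sketch below uses only the numbers quoted above, plus one assumption that is not from the paper: a 16-bit PCM reference for the uncompressed bitrate. The speed-up is reported as "nearly" 20 times, so the derived compute time is approximate.

```python
SAMPLE_RATE_HZ = 16_000   # from the summary
BITRATE_BPS = 1_500       # 1.5 kbps, from the summary
LATENCY_MS = 20           # fixed latency, from the summary
SPEEDUP = 20              # "nearly 20 times real-time", so approximate
PCM_BITS_PER_SAMPLE = 16  # assumption: 16-bit PCM as the uncompressed reference

# Audio samples and coded bits that correspond to 20 ms of signal.
samples_per_window = SAMPLE_RATE_HZ * LATENCY_MS // 1000   # 320
bits_per_window = BITRATE_BPS * LATENCY_MS / 1000           # 30.0

# Real-time factor: compute time divided by audio duration.
real_time_factor = 1 / SPEEDUP                              # 0.05
compute_ms_per_audio_second = 1000 / SPEEDUP                # 50.0

# Compression relative to the assumed 16-bit PCM reference.
pcm_bitrate_bps = SAMPLE_RATE_HZ * PCM_BITS_PER_SAMPLE      # 256000
compression_ratio = pcm_bitrate_bps / BITRATE_BPS

print(samples_per_window, bits_per_window, compute_ms_per_audio_second)
print(round(compression_ratio, 1))
```

In other words, roughly 30 bits describe every 20 ms of audio, and one second of audio costs on the order of 50 ms of CPU time.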

Technical Contributions

The main technical contribution is the RSVQ design. Traditional residual vector quantization (RVQ) often suffers from codebook collapse, in which only a fraction of the codevectors are ever selected. RSVQ instead uses scalar quantization for a coarse representation of the audio, followed by improved vector quantizers that encode fine detail from the remaining residual. This raises the codebook utilization rate, so fewer codevectors are wasted and coding quality improves. The reported experiments show complete codebook utilization across quantizers, which supports the effectiveness of RSVQ.

Practical and Theoretical Implications

From a theoretical perspective, the paper advances the understanding of causal structures in neural codecs by showing that a fully causal design can reach quality comparable to non-streamable codecs. This is promising for applications that require real-time operation, such as live audio streaming, telecommunication, and virtual reality. Practically, StreamCodec's small model size, fast CPU inference, and low latency make it a viable choice for systems with limited hardware or for large-scale deployment.

Future Directions

The results for StreamCodec and its RSVQ component suggest several avenues for future research. Pursuing ultra-low bitrates or extending RSVQ to other quantization configurations could compress audio further while limiting the loss of fidelity. Integrating RSVQ with other neural network architectures or quantization methods could also yield new efficiencies or improve generalization to varied audio conditions.

In summary, the paper improves neural audio coding by combining a fully causal structure with efficient quantization. StreamCodec is a strong candidate for real-time codec applications and provides a foundation for further work on low-latency audio processing.
