- The paper introduces neural synthesis to boost low bit rate Opus codec quality using decoded parameters.
- It compares WaveNet and LPCNet, showing LPCNet significantly improves quality at 6 kb/s with lower computational cost.
- Results indicate that neural post-processing can enhance legacy codecs while maintaining backward compatibility for real-time applications.
Improving Opus Low Bit Rate Quality with Neural Speech Synthesis
In the paper "Improving Opus Low Bit Rate Quality with Neural Speech Synthesis," the authors explore strategies for enhancing the Opus audio codec when it operates at low bit rates. They propose using neural generative models to address the quality degradation that waveform-matching coders like Opus suffer once the bit rate falls below 10 kb/s.
Background and Motivation
Opus, widely adopted in applications such as Zoom and WebRTC, combines linear predictive coding (SILK) and transform coding (CELT). As a waveform-matching coder, however, it needs relatively high bit rates for satisfactory quality: below roughly 6 kb/s, quality deteriorates sharply, a regime where parametric coders have traditionally fared better. The authors pursue a backward-compatible approach, applying neural speech synthesis driven by the decoded parameters to raise audio quality at these low rates.
Methods and Models
Two generative models are evaluated: WaveNet and LPCNet. WaveNet, known for its high-quality speech synthesis, is impractical for real-time applications due to its computational complexity and high latency. Conversely, LPCNet offers a feasible alternative with low complexity, suitable for mobile devices.
LPCNet combines a recurrent neural network with explicit linear prediction, so the network only has to model the excitation (residual) signal; this sharply reduces complexity while maintaining good synthesis quality. Its split into a frame rate network (processing conditioning features) and a sample rate network (generating individual samples) enables real-time synthesis even on standard hardware.
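The division of labor described above, cheap linear prediction carrying the waveform shape while the network models only the excitation, can be illustrated with a toy synthesis loop. This is a minimal sketch, not the paper's implementation: the `sample_excitation` callable stands in for LPCNet's GRU-based sample rate network.

```python
import numpy as np

def lpc_predict(history, lpc_coeffs):
    """Linear prediction: weighted sum of the M most recent samples."""
    M = len(lpc_coeffs)
    return np.dot(lpc_coeffs, history[-M:][::-1])

def synthesize(num_samples, lpc_coeffs, sample_excitation):
    """Toy LPCNet-style loop: each output sample is the linear
    prediction plus a network-generated excitation sample."""
    M = len(lpc_coeffs)
    signal = np.zeros(num_samples + M)  # zero-padded initial state
    for t in range(M, num_samples + M):
        pred = lpc_predict(signal[:t], lpc_coeffs)
        # In LPCNet a small neural network produces the excitation
        # conditioned on pred and past samples; here it is a stand-in.
        e = sample_excitation(pred, signal[t - 1])
        signal[t] = pred + e
    return signal[M:]
```

With a zero excitation and zero initial state the loop outputs silence, which makes the role of the excitation explicit: all of the signal's novelty enters through the network's output.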
WaveNet, although investigated for its quality, serves primarily as an informal upper bound due to its computational demands.
Evaluation and Results
LPCNet and WaveNet models were conditioned on parameters extracted from Opus bit streams at 6 kb/s. A subjective listening test, utilizing a MUSHRA-like methodology, revealed:
- LPCNet significantly outperforms the baseline Opus decoder at 6 kb/s.
- WaveNet's synthesis quality rivals that of Opus at 9 kb/s, validating the potential of neural models even at minimal bit rates.
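Conditioning on decoded bit stream parameters, as in the evaluation above, means turning codec-side quantities (short-term predictor, pitch, energy) into a per-frame feature vector for the synthesis network. The sketch below is illustrative only; the names, layout, and envelope representation are assumptions, not the paper's exact feature set.

```python
import numpy as np

def conditioning_features(lpc, pitch_period, pitch_gain, frame_energy):
    """Assemble a per-frame conditioning vector from decoded codec
    parameters (hypothetical layout for illustration)."""
    # Summarize the short-term predictor as a coarse spectral envelope:
    # log magnitude of the LPC synthesis filter 1/A(z) on a frequency grid.
    w = np.linspace(0, np.pi, 18, endpoint=False)
    coeffs = np.concatenate(([1.0], -np.asarray(lpc)))  # A(z) coefficients
    A = np.array([np.sum(coeffs * np.exp(-1j * wk * np.arange(len(coeffs))))
                  for wk in w])
    envelope = -np.log(np.abs(A) + 1e-9)
    # Append pitch and energy parameters (log-compressed where appropriate).
    extras = np.array([np.log(pitch_period), pitch_gain,
                       np.log(frame_energy + 1e-9)])
    return np.concatenate((envelope, extras))
```

Because every input already exists in the decoder, such a vector can be computed from a standard bit stream without any encoder-side changes, which is what makes the approach backward compatible.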
Opus at 9 kb/s was included as a higher-rate reference, putting the quality gains from neural synthesis in context.
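MUSHRA-style tests are typically summarized as a mean score per condition with a confidence interval across listeners. A minimal sketch of that aggregation, using a normal approximation and hypothetical listener scores rather than the paper's data:

```python
import numpy as np

def mushra_summary(scores):
    """Mean and 95% confidence interval (normal approximation)
    for one condition's MUSHRA scores (0-100 scale)."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    half = 1.96 * scores.std(ddof=1) / np.sqrt(len(scores))
    return mean, (mean - half, mean + half)
```

Non-overlapping confidence intervals between two conditions (e.g. LPCNet versus the baseline decoder at the same rate) are the usual basis for calling a difference significant in such tests.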
Discussion and Implications
The paper demonstrates that neural speech synthesis can substantially augment the quality of low bit rate speech coders without disrupting existing bit stream compatibility. Specifically, LPCNet provides a balance of quality and feasibility, making it a promising candidate for deployment in mobile environments.
The implications extend beyond Opus, suggesting potential enhancements for other standard coders like AMR-WB, ultimately prolonging their lifespan without necessitating a complete overhaul or new adoption cycles.
Future Directions
Future work could explore temporal features directly from decoded signals, potentially offering further enhancements in synthesis quality. Continuing advancements in neural vocoding may facilitate more efficient and widespread adoption across various audio compression standards.
By presenting a viable solution for low bit rate quality enhancement, this paper contributes significantly to the field of audio coding, offering practical insights into the integration of neural synthesis in existing systems.