- The paper introduces neural synthesis to boost low bit rate Opus codec quality using decoded parameters.
- It compares WaveNet and LPCNet, showing LPCNet significantly improves quality at 6 kb/s with lower computational cost.
- Results indicate that neural post-processing can enhance legacy codecs while maintaining backward compatibility for real-time applications.
Improving Opus Low Bit Rate Quality with Neural Speech Synthesis
In the paper "Improving Opus Low Bit Rate Quality with Neural Speech Synthesis," the authors explore strategies for enhancing the Opus audio codec when it operates at low bit rates. They propose using neural generative models to address the quality degradation that waveform-matching coders like Opus suffer once the bit rate falls below 10 kb/s.
Background and Motivation
Opus, widely adopted in applications such as Zoom and WebRTC, combines linear predictive coding (SILK) and transform coding (CELT). As a waveform-matching coder, however, it needs relatively high bit rates for satisfactory quality: below roughly 6 kb/s, quality deteriorates sharply, a regime where parametric coders have traditionally fared better. The authors pursue a backward-compatible approach, applying neural speech synthesis driven by the decoded parameters to raise audio quality at these low rates.
Methods and Models
Two generative models are evaluated: WaveNet and LPCNet. WaveNet, known for its high-quality speech synthesis, is impractical for real-time applications due to its computational complexity and high latency. Conversely, LPCNet offers a feasible alternative with low complexity, suitable for mobile devices.
LPCNet combines a recurrent neural network with explicit linear prediction, so the network only has to model the excitation (residual) signal; this sharply reduces complexity while maintaining good synthesis quality. Its split into a frame rate network (processing conditioning features) and a sample rate network (generating individual samples) enables real-time synthesis even on standard hardware.
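The division of labor described above, cheap linear prediction carrying the waveform shape while the network models only the excitation, can be illustrated with a toy synthesis loop. This is a minimal sketch, not the paper's implementation: the `sample_excitation` callable stands in for LPCNet's GRU-based sample rate network.

```python
import numpy as np

def lpc_predict(history, lpc_coeffs):
    """Linear prediction: weighted sum of the M most recent samples."""
    M = len(lpc_coeffs)
    return np.dot(lpc_coeffs, history[-M:][::-1])

def synthesize(num_samples, lpc_coeffs, sample_excitation):
    """Toy LPCNet-style loop: each output sample is the linear
    prediction plus a network-generated excitation sample."""
    M = len(lpc_coeffs)
    signal = np.zeros(num_samples + M)  # zero-padded initial state
    for t in range(M, num_samples + M):
        pred = lpc_predict(signal[:t], lpc_coeffs)
        # In LPCNet a small neural network produces the excitation
        # conditioned on pred and past samples; here it is a stand-in.
        e = sample_excitation(pred, signal[t - 1])
        signal[t] = pred + e
    return signal[M:]
```

With a zero excitation and zero initial state the loop outputs silence, which makes the role of the excitation explicit: all of the signal's novelty enters through the network's output.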
WaveNet, although investigated for its quality, serves primarily as an informal upper bound due to its computational demands.
Evaluation and Results
LPCNet and WaveNet models were conditioned on parameters extracted from Opus bit streams at 6 kb/s. A subjective listening test, utilizing a MUSHRA-like methodology, revealed:
- LPCNet significantly outperforms the baseline Opus decoder at 6 kb/s.
- WaveNet's synthesis quality rivals that of Opus at 9 kb/s, validating the potential of neural models even at minimal bit rates.
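Conditioning on decoded bit stream parameters, as in the evaluation above, means turning codec-side quantities (short-term predictor, pitch, energy) into a per-frame feature vector for the synthesis network. The sketch below is illustrative only; the names, layout, and envelope representation are assumptions, not the paper's exact feature set.

```python
import numpy as np

def conditioning_features(lpc, pitch_period, pitch_gain, frame_energy):
    """Assemble a per-frame conditioning vector from decoded codec
    parameters (hypothetical layout for illustration)."""
    # Summarize the short-term predictor as a coarse spectral envelope:
    # log magnitude of the LPC synthesis filter 1/A(z) on a frequency grid.
    w = np.linspace(0, np.pi, 18, endpoint=False)
    coeffs = np.concatenate(([1.0], -np.asarray(lpc)))  # A(z) coefficients
    A = np.array([np.sum(coeffs * np.exp(-1j * wk * np.arange(len(coeffs))))
                  for wk in w])
    envelope = -np.log(np.abs(A) + 1e-9)
    # Append pitch and energy parameters (log-compressed where appropriate).
    extras = np.array([np.log(pitch_period), pitch_gain,
                       np.log(frame_energy + 1e-9)])
    return np.concatenate((envelope, extras))
```

Because every input already exists in the decoder, such a vector can be computed from a standard bit stream without any encoder-side changes, which is what makes the approach backward compatible.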
Opus at 9 kb/s was included as a higher-rate reference, putting the quality gains from neural synthesis in context.
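MUSHRA-style tests are typically summarized as a mean score per condition with a confidence interval across listeners. A minimal sketch of that aggregation, using a normal approximation and hypothetical listener scores rather than the paper's data:

```python
import numpy as np

def mushra_summary(scores):
    """Mean and 95% confidence interval (normal approximation)
    for one condition's MUSHRA scores (0-100 scale)."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    half = 1.96 * scores.std(ddof=1) / np.sqrt(len(scores))
    return mean, (mean - half, mean + half)
```

Non-overlapping confidence intervals between two conditions (e.g. LPCNet versus the baseline decoder at the same rate) are the usual basis for calling a difference significant in such tests.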
Discussion and Implications
The paper demonstrates that neural speech synthesis can substantially augment the quality of low bit rate speech coders without disrupting existing bit stream compatibility. Specifically, LPCNet provides a balance of quality and feasibility, making it a promising candidate for deployment in mobile environments.
The implications extend beyond Opus, suggesting potential enhancements for other standard coders like AMR-WB, ultimately prolonging their lifespan without necessitating a complete overhaul or new adoption cycles.
Future Directions
Future work could explore temporal features directly from decoded signals, potentially offering further enhancements in synthesis quality. Continuing advancements in neural vocoding may facilitate more efficient and widespread adoption across various audio compression standards.
By presenting a viable solution for low bit rate quality enhancement, this paper contributes significantly to the field of audio coding, offering practical insights into the integration of neural synthesis in existing systems.