MusicHiFi: A New Frontier in High-Fidelity Stereo Vocoding
Introduction
Generating high-quality audio from compact representations remains a challenge in music generation and audio processing. Existing methods often produce monophonic audio at low resolution, which limits where they can be applied. MusicHiFi, an efficient high-fidelity stereophonic vocoder, addresses this gap. Using a cascade of three generative adversarial networks (GANs), it converts low-resolution mel-spectrograms into audio, extends that audio to a higher resolution, and upmixes the result to stereo. Compared with previous methods, the authors report faster inference, comparable or better audio quality, and finer control over spatialization.
Methodology
MusicHiFi uses the same design across its three stages: vocoding, bandwidth extension (BWE), and mono-to-stereo upmixing (M2S). Each stage uses a GAN-based generator and discriminator architecture, adapted to the requirements of its task.
- Vocoding (MusicHiFi-V): Converts low-resolution mel-spectrograms into audio waveforms using the shared GAN-based generator and discriminator design.
- Bandwidth Extension (MusicHiFi-BWE): Transforms low-resolution audio to high-resolution outputs. Incorporates a residual connection and an upsampling step, allowing the module to focus on generating high-frequency content effectively.
- Mono-to-Stereo Upmixing (MusicHiFi-M2S): Utilizes mid-side encoding to produce stereo audio from mono inputs. This approach not only preserves the original monophonic content but also facilitates superior control over the spatial width of the audio.
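The residual connection in the BWE stage and the mid-side encoding in the M2S stage can be illustrated with a small sketch. This is not the authors' implementation: the generators are replaced by placeholders (`zero_generator` and a hard-coded `predicted_side`), upsampling is done by naive sample repetition, and the function names, the `width` parameter, and the convention mid = (L + R) / 2, side = (L - R) / 2 are all assumptions made for illustration.

```python
# Illustrative sketch only. The generators here are stand-ins, not MusicHiFi's
# networks, and every name below is hypothetical.

def upsample_repeat(signal, factor):
    """Naive upsampling by sample repetition (a real system would use a proper resampler)."""
    return [value for value in signal for _ in range(factor)]

def bwe_with_residual(low_res, factor, generator):
    """Upsample the input, then add the generator output as a residual.

    Because the upsampled input is passed through unchanged, the generator
    only has to supply what is missing (ideally the high-frequency content).
    """
    base = upsample_repeat(low_res, factor)
    residual = generator(base)
    return [b + r for b, r in zip(base, residual)]

def ms_decode(mid, side, width=1.0):
    """Turn mid/side channels into left/right, scaling the side channel by `width`."""
    left = [m + width * s for m, s in zip(mid, side)]
    right = [m - width * s for m, s in zip(mid, side)]
    return left, right

def downmix(left, right):
    """Average left and right back to mono (the mid channel)."""
    return [(l + r) / 2 for l, r in zip(left, right)]

def zero_generator(signal):
    """Placeholder generator that predicts no residual at all."""
    return [0.0] * len(signal)

# Stage 2 (BWE): mono low-resolution audio -> mono audio at twice the sample count.
mono_low = [0.5, -0.25, 0.125]
mono_high = bwe_with_residual(mono_low, 2, zero_generator)

# Stage 3 (M2S): the mono signal is used as the mid channel; a generator would
# predict the side channel. Here the side channel is a hard-coded placeholder.
predicted_side = [0.25, -0.25, 0.125, 0.0, -0.125, 0.5]
left, right = ms_decode(mono_high, predicted_side, width=0.5)

# Downmixing the stereo result recovers the mono input, whatever the width.
recovered = downmix(left, right)
```

Under these assumptions, the side channel cancels when left and right are averaged, so the mono content is preserved for any `width`. Setting `width` to 0 collapses the output to dual mono, and larger values widen the stereo image.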
Experiment and Results
MusicHiFi was evaluated against standard benchmarks and baselines. For vocoding, it outperformed the baselines on Mel-D, STFT-D, and ViSQOL, and matched them on SI-SDR while running significantly faster. The BWE module performed on par with or better than Aero and significantly outperformed AudioSR. Notably, MusicHiFi was reported to be hundreds of times faster than the baseline models. The M2S module outperformed conventional DSP-based decorrelation methods in objective assessments, which supports the method's efficiency and effectiveness in creating high-quality stereo audio.
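For readers unfamiliar with SI-SDR (scale-invariant signal-to-distortion ratio), the following is a minimal sketch of its standard definition. It is not the evaluation code used for MusicHiFi, and the function name and toy signals are assumptions. The estimate is projected onto the reference, and the energy of that projection is compared with the energy of what remains.

```python
import math

def si_sdr(reference, estimate):
    """Scale-invariant signal-to-distortion ratio in dB (higher is better)."""
    dot = sum(r * e for r, e in zip(reference, estimate))
    ref_energy = sum(r * r for r in reference)
    # Optimal scaling of the reference so that it best matches the estimate.
    alpha = dot / ref_energy
    target = [alpha * r for r in reference]
    noise = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    if noise_energy == 0.0:
        return math.inf  # the estimate is an exactly scaled copy of the reference
    return 10.0 * math.log10(target_energy / noise_energy)

# Toy signals, chosen only to make the arithmetic easy to follow.
reference = [1.0, 0.0, -1.0, 0.0]
estimate = [1.0, 0.1, -1.0, -0.1]

score = si_sdr(reference, estimate)
# Rescaling the estimate leaves the score unchanged, up to rounding.
score_scaled = si_sdr(reference, [2.0 * e for e in estimate])
```

Because the metric ignores overall gain, it measures how well the waveform shape is preserved rather than how loud the output is.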
Implications and Future Directions
MusicHiFi offers an efficient, high-quality approach to stereo vocoding for audio and music generation. Its design targets three key challenges in the field: generation speed, audio quality, and spatialization control. It has several potential applications: it can be integrated with mel-spectrogram-based music generators, used to enhance the fidelity of low-resolution recordings, or used to spatialize monophonic music. The unified GAN-based architecture may also serve as a framework for future work in audio processing and generative modeling.
Conclusion
MusicHiFi opens new avenues for generating high-fidelity stereophonic audio. Using a cascaded GAN approach, it efficiently transforms low-resolution mel-spectrograms into high-quality stereophonic audio. In the reported evaluations, it compares favorably with existing methods in audio quality, spatialization, and inference speed. These results point to immediate applications and set the stage for future work in audio and music generation.