MusicHiFi-BWE: Neural Audio Bandwidth Extension
- MusicHiFi-BWE is a high-fidelity neural audio bandwidth extension system that upscales 22.05 kHz audio to 44.1 kHz by reconstructing missing high-frequency content.
- It employs a BigVGAN-style generator with residual skip connections and multi-discriminator architectures to deliver artifact-free audio while preserving low-band fidelity.
- Evaluations with Mel-Spectral Distance, STFT Distance, and ViSQOL show quality competitive with prior methods at inference speeds of roughly 1,600× real time.
MusicHiFi-BWE is a high-fidelity neural audio bandwidth extension (BWE) system designed as the central module in the MusicHiFi pipeline for efficient music generation and coding. It reconstructs the high-frequency band absent from bandwidth-limited inputs, delivering full-band, artifact-free audio at high computational efficiency, and addresses the limitations of both conventional parametric spectral band replication (SBR) and blind neural BWE methods (Zhu et al., 2024; Choi et al., 7 Jun 2025).
1. System Role and Conceptual Pipeline
MusicHiFi-BWE operates as the upsampling and bandwidth extension stage within the three-step MusicHiFi cascade:
- MusicHiFi-V: Converts a mel-spectrogram to mono audio at 22.05 kHz.
- MusicHiFi-BWE: Upsamples mono 22.05 kHz audio to mono 44.1 kHz by restoring the missing band (11–22 kHz).
- MusicHiFi-M2S: Performs mono-to-stereo upmixing at 44.1 kHz (Zhu et al., 2024).
In MusicHiFi-BWE, the system is presented with a mono waveform sampled at 22.05 kHz and tasked with synthesizing the missing high-frequency content above 11 kHz. The objective is twofold: (i) to exactly preserve the low-band content (<11 kHz) and (ii) to generate a plausible, artifact-free high-band component resulting in a natural-sounding 44.1 kHz waveform.
The operational flow is as follows:
- The 22.05 kHz input waveform is transformed back into a 128-band log-mel-spectrogram (with twice the temporal resolution to match the 44.1 kHz target).
- A BigVGAN-style generator predicts a waveform residual corresponding to the high-frequency content.
- A fixed sinc-interpolation (band-limited upsampling) reconstructs the low-frequency band, to which the neural residual is added, yielding the full bandwidth output.
2. Generator and Discriminator Architectures
Generator Details
The generator architecture is a deep convolutional neural network, closely following the BigVGAN design (Zhu et al., 2024):
- Input: 128-band log-mel spectrogram, computed from 22.05 kHz audio using a 512-sample Hann window and a 128-sample hop size.
- Initial Projection: Either a 1×1 or 7×1 convolution projects the mel bins into a high-dimensional latent (C=2048).
- Upsampling Stack: A series of five transposed 1D convolutional layers (with increasing strides and decreasing kernel sizes), each followed by AMP (“anti-aliased multi-periodicity”) blocks with Snake activations.
- Output: The final convolution reduces the hidden channels to a single waveform channel. The generator outputs only the residual component $r$.
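The relationship between the mel hop size and the generator's overall upsampling factor can be checked with a little arithmetic. The sketch below assumes the only rate change between mel frames and output samples happens inside the upsampling stack; the function name is hypothetical, and the individual layer strides are not given in the text, so only their product is derived.

```python
from fractions import Fraction


def output_samples_per_frame(input_sr: int, hop: int, output_sr: int) -> Fraction:
    """Number of output-rate samples covered by one mel frame.

    One frame advances `hop` samples at `input_sr`, i.e. hop / input_sr
    seconds, which corresponds to output_sr * hop / input_sr output samples.
    """
    return Fraction(output_sr * hop, input_sr)


# Values stated in the text: 22.05 kHz input, 128-sample hop, 44.1 kHz output.
factor = output_samples_per_frame(22050, 128, 44100)
print(factor)  # total upsampling the five transposed convolutions must provide
```

Under these assumptions the strides of the five layers would have to multiply to this factor for the generator output to line up sample-for-sample with the 44.1 kHz target.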
The output waveform construction uses a key residual formulation:
$$\hat{y} = \mathrm{Up}(x) + r,$$
where $x$ is the original 22.05 kHz input, $\mathrm{Up}(\cdot)$ performs fixed band-limited upsampling to 44.1 kHz, and $r$ is the predicted high-band residual.
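The residual construction can be illustrated with a minimal, dependency-free sketch. It assumes a Hann-windowed sinc interpolator with a hypothetical `half_width` of 16 taps per side as the fixed upsampler; the actual filter used in MusicHiFi is not specified here, and the residual is supplied by hand rather than by a trained generator.

```python
import math


def sinc(t: float) -> float:
    """Normalized sinc, sin(pi t) / (pi t), with sinc(0) = 1."""
    return 1.0 if t == 0.0 else math.sin(math.pi * t) / (math.pi * t)


def upsample_2x(x, half_width=16):
    """Band-limited 2x upsampling by Hann-windowed sinc interpolation.

    Output sample n sits at input time n / 2. The kernel vanishes at every
    nonzero integer offset, so even output samples reproduce the input up to
    floating-point rounding. Samples outside the input are treated as zero.
    """
    out = []
    for n in range(2 * len(x)):
        t = n / 2.0                      # position on the input time axis
        centre = math.floor(t)
        acc = 0.0
        for k in range(centre - half_width + 1, centre + half_width + 1):
            if 0 <= k < len(x):
                d = t - k
                window = 0.5 * (1.0 + math.cos(math.pi * d / half_width))
                acc += x[k] * sinc(d) * window
        out.append(acc)
    return out


def full_band_output(x, residual):
    """Add a high-band residual to the fixed upsampling of the input."""
    low = upsample_2x(x)
    if len(residual) != len(low):
        raise ValueError("residual must have twice as many samples as the input")
    return [u + r for u, r in zip(low, residual)]


# With a zero residual, taking every second output sample returns the input.
x = [math.sin(2 * math.pi * 0.05 * k) for k in range(64)]
y = full_band_output(x, [0.0] * (2 * len(x)))
max_err = max(abs(y[2 * k] - x[k]) for k in range(len(x)))
print(len(y), max_err)
```

Because the low band is produced by a fixed filter, any error in the learned residual leaves the upsampled input itself untouched, which is the property the skip connection is meant to provide.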
Discriminator Details
The discriminator stack comprises two primary families:
- Multi-Period Discriminators (MPD): Operating in the time domain, these enforce signal periodicity and artifact avoidance.
- Multi-Band, Multi-Resolution Spectrogram Discriminators (MMSD): Applied on complex STFT representations across various FFT and hop sizes, these ensure full-band spectral realism.
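The core operation of a multi-period discriminator, as introduced in HiFi-GAN, is to fold the 1D waveform into a 2D grid so that samples one period apart line up in a column. The sketch below shows only that folding step; the function name is hypothetical, zero padding is used for simplicity, and the periods used in MusicHiFi are not stated in the text.

```python
def fold_by_period(wave, period):
    """Fold a 1D waveform into rows of `period` samples.

    The tail is zero-padded to a multiple of `period`. Column j of the result
    then holds samples j, j + period, j + 2 * period, ..., which is the axis a
    period sub-discriminator convolves along.
    """
    if period <= 0:
        raise ValueError("period must be positive")
    pad = (-len(wave)) % period
    padded = list(wave) + [0.0] * pad
    return [padded[i:i + period] for i in range(0, len(padded), period)]


grid = fold_by_period([float(i) for i in range(10)], period=3)
column0 = [row[0] for row in grid]   # samples spaced exactly one period apart
print(grid)
print(column0)
```

Using several sub-discriminators with different periods lets each one inspect a different set of equally spaced samples, so periodic artifacts at one spacing are less likely to go unnoticed.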
Both generator and discriminator architectures are shared across all stages of MusicHiFi, with only minimal modifications for BWE (notably, the addition of the residual skip connection) (Zhu et al., 2024).
3. Inductive Biases and Downsampling-Compatibility
A distinctive design objective is near downsampling-compatibility: when a native 44.1 kHz signal is downsampled to 22.05 kHz and passed through BWE, the recovered result closely approximates the original. This is principally achieved through:
- Residual Design: The generator need only synthesize high-band content. Fixed sinc upsampling gives perfect low-band preservation by definition.
- Loss Targeting: The discriminators apply adversarial pressure to the whole band, but because the skip connection already makes the low frequencies exact, learning capacity can be focused on modeling the high frequencies.
- Empirical Robustness: Removing the residual skip or attempting to generate the entire waveform directly leads to persistent low-frequency artifacts or loss of high-frequency fidelity (Zhu et al., 2024).
This residual, skip-based approach is critical for robust, artifact-free BWE and fast convergence.
4. Training Objectives and Loss Functions
MusicHiFi-BWE utilizes a multi-term training criterion following the DAC approach:
- Adversarial Loss: Least-squares GAN objective applied per subdiscriminator.
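- Feature Matching Loss: $L_1$ distance over hidden layer activations between real and generated audio, promoting stability and perceptual realism.
- Multi-Resolution Mel-Spectrogram Loss: $L_1$ distance in log-mel space at multiple STFT scales.
- Loss Weights: Scalar weights $\lambda_{\mathrm{FM}}$ (feature matching) and $\lambda_{\mathrm{rec}}$ (reconstruction), set empirically (Zhu et al., 2024).
Letting $G$ be the generator and $D_k$ the discriminators, the generator objective is
$$\mathcal{L}_G = \sum_k \left[ \mathcal{L}_{\mathrm{adv}}(G; D_k) + \lambda_{\mathrm{FM}}\, \mathcal{L}_{\mathrm{FM}}(G; D_k) \right] + \lambda_{\mathrm{rec}}\, \mathcal{L}_{\mathrm{rec}}(G).$$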
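The way these terms combine can be sketched in plain Python on toy numbers. This is an illustration under assumptions, not the training code: the feature-matching distance is taken to be a mean absolute difference, all function names are hypothetical, and the weights passed in the example are placeholders rather than the values used in the paper.

```python
def lsgan_discriminator_loss(real_scores, fake_scores):
    """Least-squares GAN loss for one sub-discriminator (real -> 1, fake -> 0)."""
    real = sum((1.0 - s) ** 2 for s in real_scores) / len(real_scores)
    fake = sum(s ** 2 for s in fake_scores) / len(fake_scores)
    return real + fake


def lsgan_generator_loss(fake_scores):
    """Least-squares GAN loss for the generator (fake scores pushed toward 1)."""
    return sum((1.0 - s) ** 2 for s in fake_scores) / len(fake_scores)


def feature_matching_loss(real_feats, fake_feats):
    """Mean absolute difference between activations, averaged over layers."""
    per_layer = []
    for real, fake in zip(real_feats, fake_feats):
        per_layer.append(sum(abs(a - b) for a, b in zip(real, fake)) / len(real))
    return sum(per_layer) / len(per_layer)


def generator_objective(adv_losses, fm_losses, rec_loss, lambda_fm, lambda_rec):
    """Sum per-discriminator terms, then add the weighted reconstruction term."""
    return sum(adv_losses) + lambda_fm * sum(fm_losses) + lambda_rec * rec_loss


# Two sub-discriminators; placeholder weights chosen only for the example.
total = generator_objective(
    adv_losses=[0.5, 0.5],
    fm_losses=[1.0, 1.0],
    rec_loss=2.0,
    lambda_fm=2.0,
    lambda_rec=10.0,
)
print(total)
```

Keeping the adversarial and feature-matching terms per sub-discriminator makes it easy to add or remove discriminator families without touching the reconstruction term.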
Training uses the Adam optimizer with a batch size of 45, random 16,384-sample waveform crops (roughly 0.75 s at 22.05 kHz), and 500,000 total steps.
5. Objective and Subjective Evaluation
MusicHiFi-BWE is benchmarked against spectral and GAN-based BWE methods using:
- Mel-Spectral Distance (Mel-D): Lower is better.
- STFT Distance (STFT-D): Lower is better.
- ViSQOL: Perceptual quality metric (higher is better).
- Real-Time Factor (RTF): How many times faster than real time inference runs.
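The real-time factor as defined here (audio duration divided by processing time, so larger is faster) can be computed as in the sketch below. The helper names are hypothetical and the timing routine is a generic wall-clock measurement, not the benchmarking protocol of the paper; note that some publications report the inverse ratio.

```python
import time


def real_time_factor(audio_seconds, processing_seconds):
    """How many seconds of audio are produced per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing time must be positive")
    return audio_seconds / processing_seconds


def measure_rtf(process, samples, sample_rate):
    """Time one call of `process` on `samples` and return the resulting RTF."""
    start = time.perf_counter()
    process(samples)
    # Guard against a zero reading on coarse timers.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return real_time_factor(len(samples) / sample_rate, elapsed)


# 0.0006 s of compute per second of audio corresponds to roughly 1,667x.
print(real_time_factor(1.0, 0.0006))
# Timing a stand-in workload on one second of 44.1 kHz samples.
print(measure_rtf(lambda s: sum(s), [0.0] * 44100, 44100))
```

For GPU inference, a fair measurement would also need warm-up runs and device synchronization before reading the clock, which this sketch leaves out.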
On DSD100 accompaniment and FMA-small music test sets (Zhu et al., 2024):
| Dataset | Method | Mel-D ↓ | STFT-D ↓ | ViSQOL ↑ | RTF |
|---|---|---|---|---|---|
| DSD100 | Aero | 0.51 | 0.12 | 4.18 | 19× |
| DSD100 | AudioSR | 1.23 | 0.51 | 3.54 | 4× |
| DSD100 | MusicHiFi-BWE | 0.55 | 0.11 | 4.14 | 1,639× |
| FMA-small | Aero | 0.89 | 0.24 | 4.12 | 19× |
| FMA-small | AudioSR | 1.68 | 0.68 | 3.25 | 4× |
| FMA-small | MusicHiFi-BWE | 1.01 | 0.26 | 4.08 | 1,613× |
MUSHRA-style listening tests (20 listeners, 6 stimuli) found MusicHiFi-BWE to be perceptually indistinguishable from the Aero approach and preferred to AudioSR. Combined with the RTF figures above, this indicates competitive audio fidelity at far higher inference speed.
6. Implementation Characteristics and Efficiency
Key implementation features include:
- Parameter Count: Approximately 46 million parameters (shared across the entire cascade).
- Inference Performance: On an NVIDIA A100 GPU, the BWE stage processes audio at ≈1,600× real time, corresponding to approximately 0.0006 s per second of audio (Zhu et al., 2024).
- Data Regime: Trained on 1,800 hours of licensed instrumental music (mono for BWE), randomly cropped and processed.
- Preprocessing: Stereo channels averaged to mono; downsampled to 22.05 kHz before BWE input.
- No explicit data augmentation beyond downsampling is employed.
- Robustness: The skip connection provides stability and insulates the low-band from generator errors, eliminating the failure modes of naive architectures.
7. Comparative Analysis and Impact
In contrast to classic SBR and blind DNN-based BWE (Choi et al., 7 Jun 2025):
- SBR methods transmit only low-rate side information (envelopes, noise flags) and use rule-based band copying and noise filling, which fails to model complex audio timbres, especially in full-band musical material.
- Blind DNN BWE attempts high-band prediction solely from low-band input and lacks explicit encoder-side access to the missing band, resulting in substantial performance drop for difficult signals and uncorrelated spectral content.
- MusicHiFi-BWE bypasses both limitations by exploiting a carefully crafted GAN-based residual architecture, strong inductive biases, and a training/inference procedure explicitly aligned with both perceptual and practical real-time requirements.
A plausible implication is that the MusicHiFi-BWE architecture, particularly the residual skip paradigm with universal generator/discriminator sharing, may motivate further unification of bandwidth extension, vocoding, and source separation pipelines under a common adversarial neural backbone.
References
- (Zhu et al., 2024) MusicHiFi: Fast High-Fidelity Stereo Vocoding
- (Choi et al., 7 Jun 2025) Neural Spectral Band Generation for Audio Coding