Harmonic-Plus-Noise Model Overview

Updated 23 February 2026

The harmonic-plus-noise model is a method that decomposes audio signals into periodic (harmonic) and aperiodic (noise) components to achieve interpretable and efficient synthesis.
It employs techniques such as spectral peak analysis and differentiable DSP to estimate parameters like fundamental frequency, amplitude, and spectral envelopes.
The model underpins applications in bandwidth extension, vocoding, and expressive synthesis, demonstrating real-time performance with reduced computational demands.

The harmonic-plus-noise model (HPN/HNM) is a structured approach for modeling, synthesizing, and transforming audio signals—especially speech, singing, and musical instrument sounds—by decomposing a waveform into additive harmonic (periodic) and noise (aperiodic) components. This paradigm underpins many modern neural and classical signal processing systems in bandwidth extension, source-filter vocoding, time-series decomposition, and expressive synthesis. The model’s core principle is to exploit the distinct physical and perceptual characteristics of the harmonic and noise parts, enabling more interpretable and efficient estimation, generation, and control compared to black-box models.

1. Mathematical Foundations and Signal Decomposition

At the heart of the harmonic-plus-noise model is the expression of an observed signal $s(t)$ (or $s[n]$ in discrete time) as the sum of two distinct terms: $s(t) = h(t) + n(t)$, where $h(t)$ is the harmonic component and $n(t)$ is the noise component (Grumiaux et al., 2023, Wang et al., 2015). The harmonic component is modeled as a sum of time-varying sinusoids whose frequencies track integer multiples of the fundamental frequency $f_0(t)$: $h(t) = \sum_{k=1}^{K(t)} A_k(t) \cos\big(2\pi f_k(t)\, t + \phi_k(t)\big)$ with $f_k(t) = k f_0(t)$, $A_k(t)$ the instantaneous amplitude, and $\phi_k(t)$ the phase of the $k$th harmonic. The upper limit $K(t)$, or cutoff, is typically set so that all harmonic energy below the instantaneous maximum voiced frequency (MVF) is included.
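The harmonic sum above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation of any cited system: the function name `synth_harmonics` and its arguments are hypothetical, the harmonic phases $\phi_k$ are assumed to be zero, and the phase is obtained by accumulating the instantaneous frequency (phase integration), which is the usual way to keep a time-varying $f_0$ free of phase jumps.

```python
import numpy as np

def synth_harmonics(f0, amps, sr, mvf=None):
    """Sum-of-sinusoids harmonic component (illustrative sketch).

    f0   : (N,) fundamental frequency in Hz per sample (0 where unvoiced)
    amps : (N, K) per-sample amplitude of harmonics k = 1..K
    sr   : sample rate in Hz
    mvf  : optional maximum voiced frequency in Hz (defaults to Nyquist)
    """
    f0 = np.asarray(f0, dtype=float)
    amps = np.asarray(amps, dtype=float)
    n_harm = amps.shape[1]
    limit = sr / 2.0 if mvf is None else min(mvf, sr / 2.0)
    k = np.arange(1, n_harm + 1)
    f_k = f0[:, None] * k[None, :]              # f_k(t) = k * f0(t), shape (N, K)
    # Phase integration: running sum of instantaneous frequency (phi_k assumed 0)
    phase = 2.0 * np.pi * np.cumsum(f_k, axis=0) / sr
    # Keep only voiced samples and harmonics strictly below the cutoff
    mask = (f_k < limit) & (f0[:, None] > 0)
    return np.sum(amps * mask * np.cos(phase), axis=1)

# Usage: one second of a 200 Hz tone with 1/k amplitudes, cut off at 4 kHz
sr = 16000
f0 = np.full(sr, 200.0)
amps = np.tile(1.0 / np.arange(1, 61), (sr, 1))
h = synth_harmonics(f0, amps, sr, mvf=4000.0)
```

Masking by frequency rather than by a fixed harmonic count lets the number of active harmonics follow $f_0(t)$ and the MVF from sample to sample.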

Above the MVF, the spectrum is modeled as aperiodic, capturing fricatives, breathiness, or instrument noise via: $n(t) = \int g(t, \tau)\, w(t - \tau)\, d\tau$ where $w(t)$ is white noise and $g(t, \tau)$ is a time-varying filter whose frequency response matches the observed spectral envelope above the cutoff.
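A rough sketch of the filtered-noise component follows. It shapes white noise frame by frame in the frequency domain and overlap-adds the result. The function name `synth_noise`, the Hann overlap-add scheme, and the brick-wall envelope in the usage lines are all assumptions made for illustration; multiplying spectra per frame is a circular-convolution shortcut rather than a true time-varying linear filter.

```python
import numpy as np

def synth_noise(mags, frame_len, seed=0):
    """Filtered-noise component via frame-wise spectral shaping (sketch).

    mags      : (T, F) non-negative filter magnitudes per frame,
                with F = frame_len // 2 + 1
    frame_len : even number of samples per frame (hop = frame_len // 2)
    """
    if frame_len % 2 != 0:
        raise ValueError("frame_len must be even")
    rng = np.random.default_rng(seed)
    mags = np.asarray(mags, dtype=float)
    n_frames = mags.shape[0]
    hop = frame_len // 2
    window = np.hanning(frame_len + 1)[:-1]      # periodic Hann window
    out = np.zeros(hop * (n_frames + 1))
    for t in range(n_frames):
        white = rng.uniform(-1.0, 1.0, frame_len)        # white noise frame
        shaped = np.fft.irfft(np.fft.rfft(white) * mags[t], n=frame_len)
        out[t * hop : t * hop + frame_len] += window * shaped
    return out

# Usage: noise confined to the band above a 4 kHz cutoff
sr, frame_len = 16000, 512
freqs = np.fft.rfftfreq(frame_len, 1.0 / sr)
mags = np.tile((freqs >= 4000.0).astype(float), (100, 1))
noise = synth_noise(mags, frame_len)
```

Because `mags` has one row per frame, the envelope (and hence the effective cutoff) can change over time simply by varying the rows.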

In data-adaptive or operator-theoretic variants (such as DAH decomposition), the model takes the form (Chekroun et al., 2017): $s(t) = \sum_{j} a_j(t)\, e_j + \varepsilon(t)$ where the $e_j$ are spectral operator eigenmodes and $a_j(t)$ are time-dependent amplitudes, with orthogonal noise $\varepsilon(t)$ capturing all non-harmonic residual.

2. Parameter Estimation and Machine Learning Architectures

Parameter estimation proceeds framewise and entails the extraction or prediction of:

The fundamental frequency contour $f_0(t)$ (e.g., via CREPE, autocorrelation, or specialized neural F0 estimators)
Harmonic amplitudes $A_k(t)$ and phases $\phi_k(t)$, often via spectral peak analysis or deep networks
The noise-band envelope, typically via cepstral analysis or network-predicted spectral magnitudes
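As an illustration of the first item, the sketch below estimates the fundamental frequency of a single frame by picking the autocorrelation peak inside a plausible lag range. The function name `estimate_f0_autocorr` and the default search range of 60 to 500 Hz are assumptions; practical estimators add voicing decisions, peak interpolation, and octave-error handling that are omitted here.

```python
import numpy as np

def estimate_f0_autocorr(frame, sr, fmin=60.0, fmax=500.0):
    """Single-frame F0 estimate from the autocorrelation peak (sketch)."""
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()
    n = len(frame)
    # Linear autocorrelation via FFT; zero-padding avoids circular wrap-around
    spec = np.fft.rfft(frame, 2 * n)
    acf = np.fft.irfft(spec * np.conj(spec))[:n]
    lag_min = int(sr / fmax)                     # shortest period considered
    lag_max = min(int(sr / fmin), n - 1)         # longest period considered
    lag = lag_min + int(np.argmax(acf[lag_min : lag_max + 1]))
    return sr / lag

# Usage: a 64 ms frame containing a 220 Hz tone and its second harmonic
sr = 16000
t = np.arange(1024) / sr
frame = np.sin(2 * np.pi * 220.0 * t) + 0.5 * np.sin(2 * np.pi * 440.0 * t)
f0_hat = estimate_f0_autocorr(frame, sr)
```

The estimate is quantized to integer lags, so at 16 kHz the resolution near 220 Hz is roughly 3 Hz; interpolating around the peak would refine it.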

Modern neural variants embed these estimators in learnable architectures. For instance, the differentiable DSP (DDSP) approach uses a small neural network (encoder–decoder) to infer $f_0(t)$, $A_k(t)$, and $|G(t, f)|$ (noise filter coefficients per frequency) (Grumiaux et al., 2023). The harmonic-plus-noise source-filter models for vocoding (e.g., h-NSF, HiFTNet, HN-uSFGAN) use dilated-convolution filter stacks and explicitly merge separately synthesized periodic and aperiodic excitations (1908.10256, Li et al., 2023, Yoneyama et al., 2022).

In operator-based models, the harmonic and noise subspaces are recovered via eigen-decomposition of data-derived covariance operators, with the projections providing time-varying coefficients for subsequent stochastic modeling (Chekroun et al., 2017).
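The general idea of recovering an oscillatory subspace from the eigen-decomposition of a data-derived covariance can be illustrated with singular spectrum analysis (SSA). SSA is a simpler, related technique used here as a stand-in; it is not the DAH method of Chekroun et al. The function name `ssa_split` and its parameters are hypothetical.

```python
import numpy as np

def ssa_split(x, window, n_modes):
    """Split a series into leading-mode and residual parts with basic SSA."""
    x = np.asarray(x, dtype=float)
    n_rows = len(x) - window + 1
    # Trajectory (lag-embedded) matrix: row i holds x[i : i + window]
    traj = np.stack([x[i : i + window] for i in range(n_rows)])
    cov = traj.T @ traj / n_rows                 # lag-covariance matrix
    evals, evecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    modes = evecs[:, ::-1][:, :n_modes]          # leading eigenmodes
    coeffs = traj @ modes                        # time-dependent amplitudes
    approx = coeffs @ modes.T                    # low-rank trajectory matrix
    # Diagonal averaging maps the low-rank matrix back to a 1-D series
    recon = np.zeros(len(x))
    counts = np.zeros(len(x))
    for j in range(window):
        recon[j : j + n_rows] += approx[:, j]
        counts[j : j + n_rows] += 1
    harmonic = recon / counts
    return harmonic, x - harmonic

# Usage: a sinusoid of period 50 samples buried in white noise
rng = np.random.default_rng(0)
t = np.arange(2000)
clean = np.sin(2 * np.pi * t / 50.0)
x = clean + 0.5 * rng.standard_normal(t.size)
harmonic, residual = ssa_split(x, window=100, n_modes=2)
```

A pure sinusoid occupies a pair of eigenmodes (a sine-like and a cosine-like vector), which is why two modes are kept in the usage example.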

3. Adaptive and Trainable Band Separation

A critical modeling question is how to delineate the harmonic (periodic) and noise (aperiodic) bands. Early HNM systems use a fixed MVF; contemporary models make this boundary adaptive and learnable. For instance, h-NSF (1908.10256) replaces fixed filters with windowed-sinc (FIR) filters whose cutoff frequency $f_c(t)$ is predicted frame-by-frame by a neural network. This enables time-varying harmonic/noise separation, better matching the underlying physical/nonlinear properties of voice and instruments.
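To make the windowed-sinc idea concrete, the sketch below builds a complementary low-pass/high-pass FIR pair from a single cutoff value. In a trainable system the cutoff would come from a network, one value per frame; here it is simply a list of numbers. The function name `windowed_sinc_pair`, the Hamming window, and the tap count are assumptions, and applying the low-pass to the periodic branch and the high-pass to the aperiodic branch is one natural reading rather than a confirmed detail of h-NSF.

```python
import numpy as np

def windowed_sinc_pair(cutoff_hz, sr, n_taps=31):
    """Complementary low-pass / high-pass FIR filters for one frame (sketch)."""
    if n_taps % 2 == 0:
        raise ValueError("n_taps must be odd so the filter has a centre tap")
    m = np.arange(n_taps) - (n_taps - 1) / 2     # tap index centred on zero
    fc = cutoff_hz / sr                          # cutoff in cycles per sample
    lowpass = 2.0 * fc * np.sinc(2.0 * fc * m) * np.hamming(n_taps)
    lowpass /= lowpass.sum()                     # unit gain at DC
    highpass = -lowpass
    highpass[(n_taps - 1) // 2] += 1.0           # spectral inversion: delta - lowpass
    return lowpass, highpass

# Usage: one filter pair per frame for a (hypothetical) predicted cutoff track
cutoffs_hz = [2000.0, 4000.0, 6000.0]
pairs = [windowed_sinc_pair(fc, 16000) for fc in cutoffs_hz]
```

Because the two filters sum to a unit impulse, the band split is lossless for any cutoff, which keeps the boundary free to move without changing overall gain.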

The table below summarizes band separation strategies in several representative models:

Model	Band Separation Method	Adaptivity
DDSP-HNM (Grumiaux et al., 2023)	Mean local energy / fixed harmonic count	Implicit (static)
h-NSF (1908.10256)	Trainable FIR (windowed-sinc)	Explicit, learned
HN-uSFGAN (Yoneyama et al., 2022)	Blending latent harmonic/noise features	Implicit, learnable
Classical HNM (Wang et al., 2015)	Spectral peak analysis; fixed MVF	Explicit, threshold

A plausible implication is that learned, time-varying separation can yield higher perceived quality and robustness than a fixed boundary across diverse voice and music signals.

4. Synthesis and Differentiable Implementation

Synthesis in HPN/HNM is performed by summing the sinusoidal (harmonic) components (as above) and filtering white noise through the time-varying noise filter for each frame or sample. In classical implementations, phase continuity and amplitude interpolation are crucial for signal naturalness (Wang et al., 2015).

Modern frameworks (e.g., DDSP (Grumiaux et al., 2023), h-NSF (1908.10256), HiFTNet (Li et al., 2023), HN-uSFGAN (Yoneyama et al., 2022)) implement all stages of analysis, synthesis, and parameter prediction in differentiable computation graphs (e.g., PyTorch or TensorFlow). This enables end-to-end gradient-based learning with objective losses formulated in the time or spectral domain. For instance, DDSP-based HNM uses a multi-scale spectral loss focused on reconstructing only the missing high-frequency bands for bandwidth extension: $\mathcal{L} = \sum_{i} \big( \lVert |S_i| - |\hat{S}_i| \rVert_1 + \alpha \lVert \log |S_i| - \log |\hat{S}_i| \rVert_1 \big)$, where $S_i$ and $\hat{S}_i$ are the target and synthesized spectrograms at STFT resolution $i$ and $\alpha$ is a weight, applied to frequency bins above the observed band (Grumiaux et al., 2023).
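A band-restricted multi-scale spectral loss of the kind described above might look as follows. This NumPy sketch only evaluates the loss; a trainable version would be written in an autodiff framework. The function names, the FFT sizes, the use of a mean instead of a sum, and the weight `alpha` on the log-magnitude term are assumptions in the style of the original DDSP loss, not details taken from the cited paper.

```python
import numpy as np

def stft_mag(x, n_fft):
    """Magnitude STFT with a Hann window and 75% overlap."""
    hop = n_fft // 4
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def multiscale_spectral_loss(target, estimate, sr, f_min=0.0,
                             fft_sizes=(2048, 1024, 512, 256),
                             alpha=1.0, eps=1e-7):
    """L1 distance on linear and log magnitudes, over bins at or above f_min."""
    total = 0.0
    for n_fft in fft_sizes:
        keep = np.fft.rfftfreq(n_fft, 1.0 / sr) >= f_min   # band restriction
        s = stft_mag(target, n_fft)[:, keep]
        s_hat = stft_mag(estimate, n_fft)[:, keep]
        total += np.mean(np.abs(s - s_hat))
        total += alpha * np.mean(np.abs(np.log(s + eps) - np.log(s_hat + eps)))
    return total

# Usage: compare a signal with itself and with an attenuated copy, above 4 kHz
rng = np.random.default_rng(0)
sr = 16000
target = rng.standard_normal(8192)
loss_same = multiscale_spectral_loss(target, target, sr, f_min=4000.0)
loss_half = multiscale_spectral_loss(target, 0.5 * target, sr, f_min=4000.0)
```

Using several FFT sizes trades time resolution against frequency resolution, so errors in both transients and fine harmonic structure contribute to the total.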

All synthesis operations are constructed to be fully differentiable: sinusoidal phase integration, iDFT filter generation, convolution with white noise, and framewise parameter merges all permit backpropagation. This design results in efficient models with orders-of-magnitude fewer parameters and drastically reduced inference times compared to purely data-driven deep architectures.

5. Applications in Bandwidth Extension, Vocoding, and Expressive Synthesis

The harmonic-plus-noise model is a foundational building block for a variety of audio signal processing and machine learning tasks:

Bandwidth Extension: The DDSP-based HNM reconstructs missing high frequencies in music signals, matching or outperforming deep ResNet baselines on metrics such as LSD and MUSHRA with roughly 4k parameters and faster-than-real-time CPU execution (Grumiaux et al., 2023).
Neural Vocoders: Source-filter HNMs underpin modern TTS vocoders (h-NSF (1908.10256), HiFTNet (Li et al., 2023), HN-uSFGAN (Yoneyama et al., 2022)) by decoupling harmonic (pitch-synchronous) and noise (aperiodic) excitations. Objective metrics (MCD, V/UV error) and MOS listening tests demonstrate that these models attain or surpass WaveNet- and GAN-based models while offering greater interpretability and speed.
Expressive Synthesis/SVS: In Mandarin singing voice synthesis, HNM enables fine-grained, linguo-musical control by mapping extracted expression controls (pitch curve, amplitude envelope, spectral tilt, timing) directly onto the harmonic and noise parameters, achieving high perceptual naturalness and singer timbre emulation (Wang et al., 2015).
Time-Series Decomposition and Forecasting: The DAH decomposition recasts high-dimensional, multivariate time series into interpretable sums of harmonic modes plus noise, with time-evolving coefficients forecasted using multilayer Stuart–Landau SDEs, enabling data-adaptive low-dimensional modeling of complex systems (Chekroun et al., 2017).

6. Empirical Benchmarks and Objective Outcomes

Objective and subjective evaluations across diverse domains confirm the utility of the harmonic-plus-noise paradigm:

In bandwidth extension, DDSP-HNM matches or outperforms a deep ResNet (55M parameters) using only ~4k parameters, with about 1 hour of training vs. 19 hours for the ResNet and CPU inference taking about 9% of real time vs. 48% (Grumiaux et al., 2023).
In neural vocoding, HN-uSFGAN reduces mel-cepstral distortion from 3.09 dB (uSFGAN) to 2.82 dB, achieves V/UV error of 9% (vs. 12%), and improves MOS from 3.66 to 3.79 (Yoneyama et al., 2022). HiFTNet achieves MCD 2.57 dB vs. 2.82–2.93 dB (iSTFTNet, HiFi-GAN), and approaches ground-truth CMOS (Li et al., 2023).
Joint differentiable implementation enables high-throughput inference: e.g., h-NSF generates 1 s waveform in ≈3 ms (≈300k samples/sec) (1908.10256). HiFTNet achieves real-time factor 0.0057 on RTX3090Ti (≈4× faster than BigVGAN) (Li et al., 2023).
In expressive SVS, incorporating extracted expression controls into HNM achieves significant gains in perceived naturalness, clarity, and expressiveness relative to baseline speech synthesis or MIDI-only control (Wang et al., 2015).
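For readers unfamiliar with the LSD (log-spectral distance) metric used in bandwidth extension evaluation, one common formulation is sketched below: the root-mean-square difference of log10 power spectra across frequency, averaged over frames. Definitions vary between papers (log base, power vs. magnitude, framing), so the function name `log_spectral_distance` and its STFT settings are assumptions and not necessarily those behind the figures reported above.

```python
import numpy as np

def log_spectral_distance(ref, est, n_fft=2048, hop=512, eps=1e-10):
    """Mean over frames of the RMS log10-power difference across frequency."""
    window = np.hanning(n_fft)
    n_frames = 1 + (min(len(ref), len(est)) - n_fft) // hop

    def log_power(x):
        frames = np.stack([x[i * hop : i * hop + n_fft] * window
                           for i in range(n_frames)])
        return np.log10(np.abs(np.fft.rfft(frames, axis=1)) ** 2 + eps)

    diff = log_power(ref) - log_power(est)
    return float(np.mean(np.sqrt(np.mean(diff ** 2, axis=1))))

# Usage: identical signals give zero; a 10x amplitude drop is a 100x power
# drop, i.e. a constant offset of about 2 in log10 units
rng = np.random.default_rng(0)
ref = rng.standard_normal(16000)
lsd_same = log_spectral_distance(ref, ref)
lsd_scaled = log_spectral_distance(ref, 0.1 * ref)
```

Lower values indicate a closer spectral match; for bandwidth extension the comparison is often restricted to the reconstructed high band.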

7. Variants, Extensions, and Broader Impact

Several axes of innovation and extension are observed:

Neural Parameterization: Direct mapping of acoustic features to sinusoid and noise parameters using neural networks, including recurrent, convolutional, and hybrid source-filter GAN designs (e.g., DDSP, h-NSF, HN-uSFGAN).
Stochastic and Operator-Theoretic Decomposition: Extension of HPN/HNM to high-dimensional and stochastic systems via spectral operator analysis, yielding data-adaptive basis functions and explicit separation of oscillatory and residual noise subspaces (Chekroun et al., 2017).
End-to-End Differentiability: Full gradient flow across all synthesis and parameter estimation steps, driving the success of DDSP and contemporary neural HNM models.
Adaptive Time-Frequency Filtering: Learnable, framewise band-splitters that enable dynamic separation of periodic and aperiodic components (e.g., windowed-sinc branch merging in h-NSF).
Control and Expressiveness: Direct manipulability of timbre, pitch, and dynamics at the level of individual HNM parameters empowers voice conversion, expressive synthesis, and fine-grained audio manipulation.
Scalability and Efficiency: The hybrid HNM approach achieves state-of-the-art fidelity while reducing parameter count and computational requirements by one to two orders of magnitude compared to monolithic deep learning architectures.

A plausible implication is that, as compute and data scale further, hybrid HPN architectures integrating physical priors with neural parameterization will remain central in audio synthesis, time-series modeling, and bandwidth compression, providing uniquely interpretable and controllable representations (Grumiaux et al., 2023, Li et al., 2023, Yoneyama et al., 2022, 1908.10256, Wang et al., 2015, Chekroun et al., 2017).