
Time-Frequency Audio Inpainting

Updated 27 January 2026
  • Time-frequency audio inpainting is the process of reconstructing missing or degraded spectrogram coefficients from STFT representations to restore audio quality.
  • Techniques range from convex structured sparsity and probabilistic NMF models to deep learning methods including U-Net and diffusion-based generative frameworks.
  • These approaches enhance applications in music editing, speech enhancement, and archival restoration by ensuring phase coherence, amplitude consistency, and perceptual quality.

Time-frequency audio inpainting is the problem of reconstructing missing, corrupted, or degraded coefficients in a time-frequency (TF) representation of an audio signal, such as the short-time Fourier transform (STFT) spectrogram. This problem arises in applications including music editing, bandwidth extension, transmission-loss concealment, speech enhancement, and generative restoration of archival materials. It is distinguished from time-domain inpainting by its focus on reconstructing absent columns or blocks of the complex-valued or magnitude spectrogram, under physical or perceptual constraints imposed by overlap, phase coherence, and structured audio content.

1. Problem Formulations and Mathematical Models

Let x ∈ ℝ^L be a discrete-time signal and X = G_g x ∈ ℂ^(M×N) its STFT (or another invertible time-frequency transform), with window g, hop size a, M frequency bins, and N frames. A binary mask M ∈ {0,1}^(M×N) specifies which time-frequency coefficients are observed (1) or missing (0). The core inpainting objective is to estimate the (complex or magnitude) spectrogram X̂ satisfying

M ⊙ X̂ = M ⊙ X_obs

and, ideally, that X̂ is consistent with a physically realizable audio signal (i.e., X̂ = G_g x̂ for some x̂), while optimizing an appropriate prior or regularizer.
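As a concrete sketch, the masking constraint and one consistency projection can be written with a toy NumPy STFT. The Hann window, hop size, and random test signal below are illustrative choices, not taken from the cited papers:

```python
import numpy as np

def stft(x, win, hop):
    """Analysis operator G_g: frame, window, and FFT (bins x frames)."""
    M = len(win)
    n_frames = 1 + (len(x) - M) // hop
    frames = np.stack([x[i * hop:i * hop + M] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T

def istft(X, win, hop, length):
    """Least-squares inverse via windowed overlap-add."""
    frames = np.fft.irfft(X.T, n=len(win), axis=1) * win
    x = np.zeros(length)
    norm = np.zeros(length)
    for i, f in enumerate(frames):
        x[i * hop:i * hop + len(win)] += f
        norm[i * hop:i * hop + len(win)] += win ** 2
    return x / np.maximum(norm, 1e-12)

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
win, hop = np.hanning(256), 128
X = stft(x, win, hop)

# Binary mask: frame 3 is entirely missing.
mask = np.ones(X.shape, dtype=bool)
mask[:, 3] = False
X_hat = np.where(mask, X, 0.0)          # observed data, zero-filled gap

# One consistency projection: resynthesize, re-analyze, re-impose observed bins.
x_hat = istft(X_hat, win, hop, len(x))
X_hat = np.where(mask, X, stft(x_hat, win, hop))
```

Iterating the last two lines alternately enforces signal consistency and the observed-data constraint; practical methods add a prior (sparsity, NMF, a learned model) on top of this projection.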

Two principal convex formulations dominate the structured-sparsity literature:

  • Synthesis model: minimize S(z) (e.g., the ℓ1 norm) over coefficients z, subject to A*z ∈ Γ, where A* is the synthesis operator and Γ is the set of signals agreeing with X_obs at the observed TF coefficients.
  • Analysis model: minimize S(Ax) subject to x ∈ Γ, where A is the analysis operator.

Extensions model quantized, clipped, or coarsely coded coefficients via box constraints in the time and/or transform domains (Mokrý et al., 2020). Nonconvex penalties (e.g., reweighted ℓ1, ℓp) and structured sparsity models are increasingly used.

Non-probabilistic methods further regularize via phase-aware total variation (iPCTV (Balušík et al., 26 Jan 2026)) or phase feature engineering. Probabilistic models jointly fit a generative model for the spectrogram—such as Gaussian with variances factored by nonnegative matrix factorization (NMF)—under observed-masked likelihoods, solved by EM or alternating minimization (Mokrý et al., 2022). Deep priors, classical AR models, and diffusion-based generative approaches have also been directly formulated in the time-frequency domain (Chang et al., 2019, Moliner et al., 2023, Kong et al., 20 Jan 2025).

2. Algorithmic Families for Time-Frequency Inpainting

Convex Structured Sparsity and Proximal Splitting

Standard convex frameworks combine a redundant analysis or synthesis transform (e.g., Gabor, STFT, ERBlet, wavelet) with a sparsity-inducing penalty (weighted or unweighted ℓ1), and enforce data consistency on the observed coefficients. Algorithmic solutions include:

  • Douglas–Rachford or ADMM: for synthesis models, alternates soft-thresholding of coefficients with projection onto the feasible set in the time/frequency domain (Mokrý et al., 2020, Mokrý et al., 2020).
  • Chambolle–Pock or FISTA: for analysis models, combines soft-thresholding of the transform coefficients with projection onto feasible signal or spectrogram sets.
  • Approximal operators: simplified proximal mappings that avoid costly subspace constraints, offering nearly equivalent quality at reduced complexity (Mokrý et al., 2020).
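The soft-thresholding-plus-projection structure shared by these schemes can be illustrated with plain ISTA on a toy synthesis model, using an orthonormal DFT in place of a redundant Gabor frame; the test signal, gap position, threshold, and iteration count are arbitrary illustrative choices:

```python
import numpy as np

def soft(z, tau):
    """Complex soft-thresholding: the proximal mapping of tau * ||.||_1."""
    mag = np.abs(z)
    return z * np.maximum(1.0 - tau / np.maximum(mag, 1e-12), 0.0)

N = 256
n = np.arange(N)
x = np.cos(2 * np.pi * 10 * n / N) + 0.5 * np.cos(2 * np.pi * 25 * n / N)

mask = np.ones(N, dtype=bool)
mask[100:120] = False                     # 20-sample gap
x_obs = np.where(mask, x, 0.0)

# ISTA on: min  tau*||z||_1 + 0.5*||mask*(x_obs - A* z)||^2,
# with A* = orthonormal inverse DFT (synthesis), A = forward DFT (analysis).
z = np.zeros(N, dtype=complex)
tau = 0.02
for _ in range(1000):
    r = mask * (x_obs - np.fft.ifft(z, norm="ortho").real)
    z = soft(z + np.fft.fft(r, norm="ortho"), tau)

x_hat = np.fft.ifft(z, norm="ortho").real
```

Douglas–Rachford and Chambolle–Pock replace this simple forward-backward split with different operator splittings, but the two ingredients (a sparsity prox and a data-consistency projection or fidelity term) are the same.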

Enhancement Techniques

  • Atom/TF weighting: Weighting the ℓ1-penalty according to each atom's overlap with the missing region to compensate for amplitude bias (Mokrý et al., 2020).
  • Post-hoc time-domain compensation: Multiplying restored gap regions by empirically fitted spline curves to counteract residual energy loss (Mokrý et al., 2020).
  • Gradual gap filling: Stepwise inpainting with “frozen” boundaries guides structure propagation, benefiting synthesis models in long gaps.
  • Phase-aware TV regularization: Penalizes variation in phase-rectified spectrograms, aligned to instantaneous frequency, to maintain sinusoidal continuity (Balušík et al., 26 Jan 2026).

Probabilistic Models and NMF-based Inpainting

Probabilistic inpainting models the observed TF-coefficients as outputs of a generative process, such as Gaussian-distributed spectrograms with variances structured via NMF (Mokrý et al., 2022). EM or alternating minimization cycles between estimating the spectro-temporal factors (W, H), updating missing samples (Wiener filtering), and resynthesizing the signal via overlap-add. These methods deliver state-of-the-art performance for short to mid-duration gaps, with explicit control over prior modeling.

3. Deep Learning and Generative Methods

Convolutional and U-Net Architectures

Deep convolutional networks—adapted from image inpainting, GANs, and encoder-decoder structures—have been widely adopted for TF-inpainting:

  • Gated-convolutional U-Nets: Process 2D spectrogram or magnitude maps, with masking information concatenated at each layer for spatially-aware restoration (Chang et al., 2019, Kegler et al., 2019).
  • Residual and stacked autoencoders: Employed for reconstructing spectrograms of low-quality or compressed audio, often with both amplitude and phase channels (Deshpande et al., 2021).

Loss functions combine per-pixel losses (MSE/MAE), deep perceptual feature losses (e.g., via speechVGG), logarithmic or structural similarity losses (SSIM), and, in some setups, adversarial or perceptual components.
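A minimal sketch of such a composite objective, combining per-pixel MSE on magnitudes with MAE on log-magnitudes; the weights and log floor are illustrative choices, not values from any cited paper:

```python
import numpy as np

def composite_loss(S_hat, S_ref, w_mse=1.0, w_log=0.5, floor=1e-5):
    """Per-pixel MSE on magnitudes plus MAE on log-magnitudes."""
    mse = np.mean((S_hat - S_ref) ** 2)
    log_mae = np.mean(np.abs(np.log10(np.maximum(S_hat, floor))
                             - np.log10(np.maximum(S_ref, floor))))
    return w_mse * mse + w_log * log_mae
```

In a full training setup, deep perceptual feature terms (e.g., speechVGG activations), SSIM, and adversarial losses would be added on top of these per-pixel terms.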

Diffusion and Schrödinger Bridge Frameworks

Score-based generative models and Schrödinger Bridge frameworks have recently enabled state-of-the-art time-frequency audio inpainting, especially for long and arbitrary gaps (Moliner et al., 2023, Kong et al., 20 Jan 2025). Key elements:

  • Zero-shot inpainting: Unconditionally trained diffusion models can be posterior-conditioned with data consistency at observed TF-bins during iterative sampling (Moliner et al., 2023).
  • Hybrid representations: Combining magnitude/phase decomposition with amplitude compression can stabilize training and reduce overshoot artifacts (Kong et al., 20 Jan 2025).
  • Multi-window "MultiDiffusion": Efficiently covers long recordings via block-wise processing and blending (Kong et al., 20 Jan 2025).
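The zero-shot data-consistency loop can be sketched as follows. The reverse step here is a placeholder stand-in for a trained score model (a real sampler would use the model's noise schedule and update rule); only the replacement-style conditioning structure is the point:

```python
import numpy as np

def inpaint_by_replacement(x_obs, mask, reverse_step, sigmas, rng):
    """Zero-shot inpainting with an unconditional diffusion sampler:
    after each reverse step, re-impose the observed coefficients,
    noised to the current noise level ('replacement' conditioning)."""
    x = sigmas[0] * rng.standard_normal(x_obs.shape)
    for t in range(len(sigmas)):
        x = reverse_step(x, t)            # one unconditional reverse step
        noisy_obs = x_obs + sigmas[t] * rng.standard_normal(x_obs.shape)
        x = np.where(mask, noisy_obs, x)  # data consistency at observed bins
    return np.where(mask, x_obs, x)       # final: exact on observed bins

# Toy stand-in for a trained model: shrink toward zero (NOT a real denoiser).
sigmas = np.geomspace(1.0, 1e-3, 30)
def toy_step(x, t):
    return 0.9 * x

rng = np.random.default_rng(0)
x_obs = rng.standard_normal((8, 8))       # stand-in "spectrogram"
mask = np.ones((8, 8), dtype=bool)
mask[:, 3:5] = False                      # two missing frames
x_hat = inpaint_by_replacement(x_obs, mask, toy_step, sigmas, rng)
```

More sophisticated posterior-guidance schemes add a gradient of the data-fit term at each step instead of hard replacement, which reduces boundary artifacts at the gap edges.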

Performance metrics include log-spectral distance (LSD), objective difference grade (ODG), Fréchet audio distance (FAD), and human listening tests (MUSHRA, MOS).

Neural Deep Priors

Untrained deep neural networks, particularly CNN "deep priors," can be driven by direct optimization to match observed spectrogram regions, with early stopping for regularization. While offering flexibility, their performance and convergence can be variable and computationally expensive (Mokrý et al., 2024, Kegler et al., 2019).
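The deep-prior recipe can be caricatured with an untrained, over-parameterized model fitted only to the observed bins, with early stopping as the sole regularizer. Real deep priors use CNNs; the random-feature linear model, sizes, learning rate, and step count below are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
F, T = 32, 20
X = np.outer(np.sin(np.arange(F) / 3.0), np.cos(np.arange(T) / 2.0))
mask = rng.uniform(size=(F, T)) > 0.3     # ~70% of bins observed

# Untrained "prior": fixed random features B, trainable weights theta.
B = rng.standard_normal((F * T, 4 * F * T)) / np.sqrt(F * T)
theta = np.zeros(4 * F * T)
m = mask.ravel()
y = X.ravel()

lr = 0.1
for step in range(200):                   # early stopping: few iterations
    pred = B @ theta
    grad = B.T @ (m * (pred - y))         # gradient of masked MSE only
    theta -= lr * grad

X_hat = (B @ theta).reshape(F, T)         # model output fills the gaps
```

The loss is evaluated only at observed bins, so whatever the parameterization "prefers" to emit at the masked bins acts as the implicit prior; stopping early prevents the over-parameterized model from also fitting noise.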

4. Classical Methods: AR Modeling and Graph-based Exemplar Substitution

  • Janssen-TF: An extension of the time-domain autoregressive (AR) inpainting method to spectrograms, combining time-domain AR priors with exact spectrogram bin constraints, solved via ADMM. Outperforms untrained deep priors in both objective (SNR, ODG) and subjective metrics for short to moderate column gaps (Mokrý et al., 2024).
  • Similarity-graph exemplars: For long or structured gaps, similarity graphs on patch-wise TF features (persistent spectral similarity, instantaneous frequency) are used to identify and splice in a substitute segment from elsewhere in the audio, with STFT-domain cross-fading for smooth transitions. This non-parametric approach is particularly effective for musical signals with recurring patterns (Perraudin et al., 2016).
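The exemplar idea can be sketched on a magnitude spectrogram: match the context just before the gap against the rest of the recording, copy the best continuation, and crossfade. Patch length, the Euclidean similarity, and the linear fade are simplified stand-ins for the graph-based features and STFT-domain crossfade of the cited method:

```python
import numpy as np

def exemplar_fill(S, gap_start, gap_len, ctx=4, fade=2):
    """Fill S[:, gap_start:gap_start+gap_len] by copying the segment that
    follows the most similar context elsewhere, with a linear crossfade."""
    F, T = S.shape
    query = S[:, gap_start - ctx:gap_start]
    best, best_d = None, np.inf
    for t in range(ctx, T - gap_len):
        # Skip candidates whose context or patch overlaps the gap itself.
        if t - ctx < gap_start + gap_len and t + gap_len > gap_start:
            continue
        d = np.linalg.norm(S[:, t - ctx:t] - query)
        if d < best_d:
            best, best_d = t, d
    patch = S[:, best:best + gap_len].copy()
    out = S.copy()
    out[:, gap_start:gap_start + gap_len] = patch
    # Linear crossfade over `fade` columns at the left boundary.
    for k in range(fade):
        a = (k + 1) / (fade + 1)
        out[:, gap_start + k] = a * patch[:, k] + (1 - a) * S[:, gap_start - 1]
    return out

# Strongly repetitive "music": a pattern repeating every 8 frames.
rng = np.random.default_rng(2)
F, T, P = 16, 64, 8
pattern = np.abs(rng.standard_normal((F, P))) + 0.1
S_true = pattern[:, np.arange(T) % P]
S = S_true.copy()
S[:, 30:34] = 0.0                         # 4-column gap
S_filled = exemplar_fill(S, gap_start=30, gap_len=4)
```

Because the toy signal repeats exactly, the matched exemplar continues the pattern through the gap; for real music the graph features (persistent spectral similarity, instantaneous frequency) make the match robust to tempo and level variation.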

5. Evaluation, Benchmarking, and Practical Trade-offs

Quantitative and Subjective Measures

  • Metrics: Signal-to-noise ratio (SNR), log-spectral distance (LSD), perceptual objective difference grade (ODG via PEMO-Q), PESQ, STOI, and human mean opinion score (MOS, MUSHRA-style listening).
  • Benchmark results: Deep long audio inpainting approaches yield strong performance for gaps >200 ms (ML1 for SC09 down to 0.01086), with time-domain models frequently outperforming TF-based networks on reconstruction error but at different computational costs (Chang et al., 2019).
  • Ablation: Mask size vs. receptive field, loss function selection, and representation choice (raw vs. magnitude-phase) reveal critical nonlinearities in inpainting quality, e.g., networks must span at least the full masked region in effective receptive field (Chang et al., 2019, Kong et al., 20 Jan 2025).
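For reference, one common definition of log-spectral distance, computed per frame and averaged, can be written as below; the dB scaling and log floor are implementation choices that vary across papers:

```python
import numpy as np

def lsd(S_hat, S_ref, floor=1e-8):
    """Log-spectral distance (dB) between magnitude spectrograms of shape
    (bins, frames): RMS log-spectral error per frame, averaged over frames."""
    d = (20 * np.log10(np.maximum(S_hat, floor))
         - 20 * np.log10(np.maximum(S_ref, floor)))
    return np.mean(np.sqrt(np.mean(d ** 2, axis=0)))
```

ODG, PESQ, and STOI, by contrast, require full psychoacoustic or intelligibility models and are typically computed with reference implementations rather than re-derived per paper.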

Trade-offs and Complexity

  • Classical convex/proximal algorithms offer transparent convergence guarantees and flexible regularization, with the approximal-operator variants achieving nearly optimal SNR at a fraction (1/10–1/100) of the run time of the exact proximal mapping (Mokrý et al., 2020).
  • Graph-based substitution is effective for long gaps but depends on available self-similarity and is computationally intensive (Perraudin et al., 2016).
  • Model-based (NMF, AR) and phase-aware priors provide interpretable parameterizations and efficient inference; phase-aware regularization notably reduces energy loss and improves ODG and SNR with order-of-magnitude speed improvements (Balušík et al., 26 Jan 2026).
  • Deep and generative methods dominate on large-scale, long-gap, and perceptual benchmarks but can entail high memory and compute requirements, with practical inference speeds ranging from ~2 to 100+ clips/sec on modern hardware (Chang et al., 2019, Kong et al., 20 Jan 2025).

6. Application Scenarios and Extensions

Time-frequency audio inpainting is central to:

  • Music editing and restoration: Filling in lost regions in digitized or archival recordings, particularly for music with strong repetition or structure (Perraudin et al., 2016, Kong et al., 20 Jan 2025).
  • Speech enhancement: Restoring speech corrupted by transient noise or missing channels, with deep feature losses attuned to intelligibility (Kegler et al., 2019).
  • Codec and bandwidth extension: Reconstructing missing high-frequency components, e.g., in low-bitrate compressed or transmitted audio (Deshpande et al., 2021, Kong et al., 20 Jan 2025).
  • General audio error concealment: Across telecommunication, streaming, and live performance domains.

Potential extensions include multi-domain or multi-modal inpainting, psychoacoustically weighted or perceptual transform weighting, hybrid systems combining time- and TF-domain data, and joint modeling of magnitude and phase via complex-valued or rotation-invariant neural layers.

Recent literature demonstrates that:

  • Phase-aware optimization (U-PHAIN-TF (Balušík et al., 26 Jan 2026)) surpasses both autoregressive and deep-prior methods in objective and subjective restoration for TF inpainting in instrument/source mixtures, while being 10–100× faster than deep priors.
  • Janssen-TF (autoregression in TF) consistently outperforms deep-prior neural networks in both SNR and ODG as well as subjective preference for column gaps up to ~75 ms (Mokrý et al., 2024).
  • Schrödinger Bridge and diffusion models (e.g., A2SB) enable state-of-the-art musical restoration for long gaps (up to 1.6 s), with human MOS scores ≥4 and strong note transcription F1 (Kong et al., 20 Jan 2025).
  • Score-based and unconditional diffusion models surpass conventional sparsity and AR methods on gaps up to 300 ms, supporting arbitrary gap sizes and genres with appropriate conditioning (Moliner et al., 2023).
  • Efficient autoencoder-based models enable real-time or low-latency inpainting even on long input sequences (e.g., 7.74 ms per block, SNR ≈21 dB) with quantization for edge devices (Deshpande et al., 2021).

In summary, time-frequency audio inpainting encompasses a spectrum of algorithms—classical, convex, probabilistic, and generative—each with trade-offs in complexity, interpretability, and performance. Advances in neural generative modeling and phase-aware regularization have recently pushed the state-of-the-art significantly, particularly in perceptual restoration quality, runtime efficiency, and generalization to longer gaps or more complex signal types.
