Mix2Morph: Text-to-Audio Sound Morphing
- Mix2Morph is a text-to-audio diffusion model that achieves sound morphing by infusing a secondary source's timbral and textural attributes into a primary source's structural framework.
- It leverages a pretrained latent diffusion transformer with a VAE encoder, diffusion U-Net, and high-timestep surrogate mixes to generate perceptually coherent audio outputs.
- Empirical evaluations using objective metrics and subjective tests demonstrate that Mix2Morph outperforms conventional methods, setting a new state-of-the-art for controllable, concept-driven sound morphing.
Mix2Morph is a text-to-audio diffusion model designed to achieve sound morphing—specifically, sound infusion—without reliance on a dedicated morph dataset. By fine-tuning on “noisy surrogate mixes” injected only at later diffusion timesteps, the model generates stable, perceptually coherent audio morphs that integrate the salient qualities of two distinct source sounds. The dominant source supplies the structural and temporal characteristics, while the secondary source is “infused” for timbral and textural enrichment. Empirical results demonstrate that Mix2Morph sets a new state of the art for controllable, concept-driven sound morphing across a diverse set of audio categories (Chu et al., 28 Jan 2026).
1. Architectural Foundations and Model Specification
Mix2Morph is built on a pretrained latent-diffusion transformer backbone for single-sound text-to-audio generation, consistent with prior work on large-scale text-conditioned audio synthesis (e.g., [Evans et al. 2024], [Garcia et al. 2025]). The model stack comprises:
- A VAE encoder that maps raw 48 kHz stereo waveforms to a 256-dimensional latent sequence at 40 Hz. Reconstructions pass through a VAE decoder.
- A diffusion U-Net that operates on latent representations, conditioned via transformer cross-attention on text embeddings encoding semantic prompts.
- The diffusion process is modeled both in continuous-time SDE form,

  $$dz = f(z, t)\,dt + g(t)\,dw,$$

  and as discrete DDPM-style steps:

  $$q(z_t \mid z_{t-1}) = \mathcal{N}\big(z_t;\ \sqrt{1-\beta_t}\,z_{t-1},\ \beta_t I\big).$$

  The reverse (denoising) process is parameterized as

  $$p_\theta(z_{t-1} \mid z_t, c) = \mathcal{N}\big(z_{t-1};\ \mu_\theta(z_t, t, c),\ \Sigma_\theta(z_t, t)\big),$$

  where $c$ denotes the text conditioning.
- The model is optimized by minimizing the standard L2 denoising objective on latents:

  $$\mathcal{L} = \mathbb{E}_{z_0,\,\epsilon,\,t}\big[\lVert \epsilon - \epsilon_\theta(z_t, t, c) \rVert_2^2\big].$$

- Fine-tuning is performed for 50,000 steps exclusively on surrogate mixes at high diffusion timesteps ($t \ge \tau$, for a fixed threshold $\tau$), preserving single-sound reconstruction learning for low $t$.
A set of four “augmentation modes” (RMS, spectral, both, none) is introduced, with each mode encoded by distinct prompt tokens to cue the model regarding the expected sound fusion behavior.
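The timestep-gated fine-tuning recipe above can be sketched as follows. The schedule, the number of steps `T`, and the threshold `TAU` are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed values for illustration -- the paper's actual schedule
# and timestep threshold are not reproduced here.
T = 1000                              # discrete diffusion steps (assumed)
TAU = 700                             # surrogate mixes used only for t >= TAU (assumed)
betas = np.linspace(1e-4, 0.02, T)    # assumed linear beta schedule
alpha_bar = np.cumprod(1.0 - betas)

def noise_latent(z0, t):
    """Forward DDPM step: z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps."""
    eps = rng.standard_normal(z0.shape)
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return zt, eps

def fine_tune_target(t, surrogate_z0, single_z0):
    """Timestep-gated target selection: surrogate-mix latents at high t,
    plain single-sound latents at low t."""
    return surrogate_z0 if t >= TAU else single_z0

def l2_denoising_loss(eps_pred, eps_true):
    """Standard L2 denoising objective on latents."""
    return float(np.mean((eps_pred - eps_true) ** 2))
```

In this sketch, only the choice of clean latent changes with the timestep; the noising and the loss are the standard DDPM ones, which is what lets the low-timestep regime keep the pretrained single-sound reconstruction behavior intact.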
2. Noisy Surrogate Mixes and Fine-tuning Protocol
The Mix2Morph training strategy circumvents the need for explicit morph datasets through the construction of “noisy surrogate mixes”—artificial blend signals formed from random source pairs. The process is as follows:
- Given primary waveform $x_p$ and secondary waveform $x_s$ (normalized to equal power), a raw mix is formed via additive combination at 0 dB SNR: $m = x_p + x_s$.
- For “RMS-only” augmentation, the mix is temporally anchored: compute short-window RMS envelopes $e_p(t)$ of the primary and $e_m(t)$ of the mix, then scale the mix to match the primary's envelope: $\tilde{m}(t) = m(t)\,\frac{e_p(t)}{e_m(t)}$.
- For “spectral-only” augmentation, the two sources' frequency magnitudes are averaged to obtain a target spectrum $\bar{S}(f)$, and per-source gain masks are computed to spectrally align each source to the target before re-synthesizing and summing in the time domain.
- Each surrogate mix is paired with a natural language prompt representing the intended infusion style (e.g., “behavior of X with textures from X and Y”).
- Fine-tuning is timestep-gated: high timesteps ($t \ge \tau$) use surrogates; low timesteps ($t < \tau$) revert to single-sound noising/reconstruction. In the final protocol, three augmentation modes are used in equal proportion (RMS-only, spectral-only, both).
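The mix construction and RMS-only anchoring steps above can be sketched in a few lines. The window length and the epsilon guards are illustrative assumptions; envelope names follow the description (primary envelope, mix envelope):

```python
import numpy as np

def equal_power(x, eps=1e-8):
    """Normalize a waveform to unit RMS power."""
    return x / (np.sqrt(np.mean(x ** 2)) + eps)

def rms_envelope(x, win=1024):
    """Short-window RMS envelope via a moving average of x^2.
    The window length is an assumed value for illustration."""
    kernel = np.ones(win) / win
    return np.sqrt(np.convolve(x ** 2, kernel, mode="same") + 1e-12)

def surrogate_mix_rms(primary, secondary, win=1024):
    """0 dB SNR additive mix of equal-power sources, then temporally
    anchored to the primary's RMS envelope ('RMS-only' augmentation)."""
    xp, xs = equal_power(primary), equal_power(secondary)
    mix = xp + xs                                  # equal power -> 0 dB SNR
    gain = rms_envelope(xp, win) / rms_envelope(mix, win)
    return mix * gain                              # match primary's envelope
```

Because both sources are normalized to unit power before summation, the additive mix is at 0 dB SNR by construction; the per-sample gain then re-imposes the primary's temporal dynamics on the blend.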
This “no-waste” regime ensures efficient utilization of the model's capacity: high-timestep training teaches global morphing, while lower-timestep learning preserves local detail restoration (Chu et al., 28 Jan 2026).
3. Mechanism and Semantics of Sound Infusion
Mix2Morph targets sound infusions—a subclass of static, asymmetric morphs where the primary source determines high-level timing and structure, and the secondary source provides timbral/textural attributes. The conditioning prompt encodes both the dual-source semantics and the specific augmentation mode. During inference, the absence of any explicit blend coefficient means that the degree of secondary source infusion is regulated indirectly through text and the learned denoising trajectory. The model architecture and surrogate mix injection at high timesteps endow Mix2Morph with the ability to synthesize coherent hybrid audio, preserving global timbral imprints from both sources.
No scalar “blend weight” is exposed to users; infusion is emergent from learned training dynamics and prompt conditioning.
4. Data Regime and Experimental Configuration
Pretraining utilizes several hundred hours of proprietary sound effects (SFX) and public general audio (CC-licensed), all resampled to 48 kHz. Fine-tuning draws random 8 s segments from the SFX pool, assembling surrogate mixes with equalized power and the prescribed augmentation sequence. All training segments are normalized prior to mixing/augmentation. The fine-tuning phase covers 50,000 steps, distributed as follows:
- Surrogate mixes at high timesteps ($t \ge \tau$)
- Single-sound objective at low timesteps ($t < \tau$)
- Three augmentation branches at 33% each; the “none” branch is omitted in the final protocol due to observed performance degradation.
Training is performed on eight A100 GPUs. Standard batch size and learning rate details are not specified. Evaluation outputs are 3 s in duration.
5. Evaluation Methodology and Empirical Results
Mix2Morph is assessed using both objective metrics and subjective listening studies.
Objective metrics include:
- Latent Compressibility Score (LCS): measures how much of the variance in the output's DAC latents is captured by their leading principal components; LCS correlates strongly with human morph perception.
- Correspondence (CORR): FLAM audio–text cosine similarities quantify the joint semantic presence of both sources.
- Intermediateness (INT): quantifies whether the output sits perceptually between the two sources rather than collapsing toward either one.
- Directionality (DIR): preference for the intended vs. reversed prompt, computed via a temperature-scaled softmax over prompt similarities and mapped to a signed score (0 indicates no directional preference).
- Fréchet Audio Distance (FAD): Lower values indicate higher perceptual fidelity.
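The metrics above can be illustrated with minimal sketches. The exact formulas are not reproduced here, so the number of principal components, the softmax temperature, and the treatment of FLAM embeddings (modeled as plain vectors) are all assumptions:

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def lcs(latents, k=8):
    """Hypothetical LCS sketch: fraction of variance in a latent
    sequence (frames x dims) captured by its k leading principal
    components. k is an assumed value."""
    z = latents - latents.mean(axis=0)
    s = np.linalg.svd(z, compute_uv=False)
    var = s ** 2
    return float(var[:k].sum() / var.sum())

def corr(audio_emb, text_emb_a, text_emb_b):
    """Correspondence sketch: mean audio-text cosine similarity
    against both source prompts (aggregation is an assumption)."""
    return 0.5 * (cos_sim(audio_emb, text_emb_a) + cos_sim(audio_emb, text_emb_b))

def directionality(audio_emb, intended_emb, reversed_emb, temp=0.1):
    """DIR sketch: temperature-scaled softmax preference for the
    intended prompt, mapped to a signed score in [-1, 1]."""
    s = np.array([cos_sim(audio_emb, intended_emb),
                  cos_sim(audio_emb, reversed_emb)]) / temp
    p = np.exp(s - s.max())
    p /= p.sum()
    return float(2.0 * p[0] - 1.0)
```

Under this mapping, an output equally similar to the intended and reversed prompts scores exactly 0, matching the simple-mix DIR of 0.00 in the table.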
| System | LCS | CORR | INT | DIR | FAD | MOS | Morph Rate |
|---|---|---|---|---|---|---|---|
| Mix2Morph | 0.150 | 0.725 | 0.648 | 0.436 | 1.220 | 3.52 | 77% |
| Simple Mix | — | 0.758 | — | 0.00 | — | 3.13 | 64% |
| LGrS | — | — | — | — | — | 2.09 | 71% |
| MorphFader | — | — | — | — | — | 1.73 | 35% |
- Simple mixing achieves high CORR yet zero DIR due to lack of directed infusion.
- LGrS and SoundMorpher achieve higher LCS, but with diminished semantic correspondence and intermediateness, reflecting reduced perceptual coherence.
Subjective testing:
Twenty-five listeners rated clips from four systems (Mix2Morph, simple mix, LGrS, MorphFader) against 20 prompt styles. Mix2Morph achieved the highest overall mean opinion score (MOS = 3.52 overall, 4.00 on pure morphs) and the highest “morph rate” (77%). ANOVA confirmed statistical significance.
6. Analysis, Ablation Findings, and Limitations
Qualitative ablations demonstrate that allocating surrogate mixes exclusively to high diffusion timesteps best preserves both structure and timbral attributes. Three-way augmentation (RMS, spectral, both) produces the most naturally fused outputs, while adding an unaugmented (raw mix) branch degrades results. Mix2Morph robustly supports a wide variety of sound classes, including impacts, textures, human voice, and SFX, reliably producing perceptually blended midpoints where alternatives either collapse to a single source or superimpose two signals without true morphing.
Documented limitations include occasional collapse to single-concept outputs for cross-class blends with substantial sparsity/density mismatch. No explicit “blend weight” control is available at inference—infusion strength cannot be continuously modulated by the user. Suggested future research includes developing audio-to-audio conditioning for user control, enabling time-varying morph schedules, and integrating adjustable infusion strengths.
7. Significance and Future Directions
Mix2Morph demonstrates that high-fidelity, perceptually coherent sound morphing is achievable in a text-to-audio diffusion model by replacing expensive dedicated morph datasets with inexpensive noisy surrogate mixes and restricting surrogate targets to high diffusion timesteps. This strategy leverages pretrained single-sound weights for fine structure, whilst enabling robust, emergent morphing through text-prompt and augmentation regime. Objective and subjective tests establish Mix2Morph's superiority over direct mixing and prior morph baselines, positioning it as an advanced tool for concept-driven, semantic audio generation and sound design (Chu et al., 28 Jan 2026). Future explorations in continuous control and dynamic morphing schedules are anticipated to expand its utility and flexibility.