Mix2Morph: Text-to-Audio Sound Morphing
- Mix2Morph is a text-to-audio diffusion model that achieves sound morphing by infusing a secondary source's timbral and textural attributes into a primary source's structural framework.
- It leverages a pretrained latent diffusion transformer with a VAE encoder, diffusion U-Net, and high-timestep surrogate mixes to generate perceptually coherent audio outputs.
- Empirical evaluations using objective metrics and subjective tests demonstrate that Mix2Morph outperforms conventional methods, setting a new state-of-the-art for controllable, concept-driven sound morphing.
Mix2Morph is a text-to-audio diffusion model designed to achieve sound morphing—specifically, sound infusion—without reliance on a dedicated morph dataset. By fine-tuning on “noisy surrogate mixes” injected only at later diffusion timesteps, the model generates stable, perceptually coherent audio morphs that integrate the salient qualities of two distinct source sounds. The dominant source supplies the structural and temporal characteristics, while the secondary source is “infused” for timbral and textural enrichment. Empirical results demonstrate that Mix2Morph sets a new state of the art for controllable, concept-driven sound morphing across a diverse set of audio categories (Chu et al., 28 Jan 2026).
1. Architectural Foundations and Model Specification
Mix2Morph is built on a pretrained latent-diffusion transformer backbone for single-sound text-to-audio generation, consistent with prior work on large-scale text-conditioned audio synthesis (e.g., [Evans et al. 2024], [Garcia et al. 2025]). The model stack comprises:
- A VAE encoder that maps raw 48 kHz stereo waveforms to a 256-dimensional latent sequence at 40 Hz. Reconstructions pass through a VAE decoder.
- A diffusion U-Net that operates on latent representations, conditioned via transformer cross-attention on text embeddings encoding semantic prompts.
- The diffusion process is modeled both in continuous-time SDE form,

  $$dz = f(z, t)\,dt + g(t)\,dw,$$

  and as discrete DDPM-style steps:

  $$q(z_t \mid z_{t-1}) = \mathcal{N}\big(z_t;\ \sqrt{1-\beta_t}\,z_{t-1},\ \beta_t I\big).$$

  The reverse (denoising) process is parameterized as

  $$p_\theta(z_{t-1} \mid z_t, c) = \mathcal{N}\big(z_{t-1};\ \mu_\theta(z_t, t, c),\ \Sigma_\theta(z_t, t)\big),$$

  where $c$ denotes the text conditioning.
- The model is optimized by minimizing the standard L2 denoising objective on latents:

  $$\mathcal{L} = \mathbb{E}_{z_0,\,\epsilon,\,t}\big[\lVert \epsilon - \epsilon_\theta(z_t, t, c) \rVert_2^2\big].$$

- Fine-tuning is performed for 50,000 steps exclusively on surrogate mixes at high diffusion timesteps ($t \ge \tau$, for a fixed threshold $\tau$), preserving single-sound reconstruction learning for low $t$.
A set of four “augmentation modes” (RMS, spectral, both, none) is introduced, with each mode encoded by distinct prompt tokens to cue the model regarding the expected sound fusion behavior.
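The timestep-gated fine-tuning recipe above can be sketched as follows. The schedule, the number of steps `T`, and the threshold `TAU` are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed values for illustration -- the paper's actual schedule
# and timestep threshold are not reproduced here.
T = 1000                              # discrete diffusion steps (assumed)
TAU = 700                             # surrogate mixes used only for t >= TAU (assumed)
betas = np.linspace(1e-4, 0.02, T)    # assumed linear beta schedule
alpha_bar = np.cumprod(1.0 - betas)

def noise_latent(z0, t):
    """Forward DDPM step: z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps."""
    eps = rng.standard_normal(z0.shape)
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return zt, eps

def fine_tune_target(t, surrogate_z0, single_z0):
    """Timestep-gated target selection: surrogate-mix latents at high t,
    plain single-sound latents at low t."""
    return surrogate_z0 if t >= TAU else single_z0

def l2_denoising_loss(eps_pred, eps_true):
    """Standard L2 denoising objective on latents."""
    return float(np.mean((eps_pred - eps_true) ** 2))
```

In this sketch, only the choice of clean latent changes with the timestep; the noising and the loss are the standard DDPM ones, which is what lets the low-timestep regime keep the pretrained single-sound reconstruction behavior intact.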
2. Noisy Surrogate Mixes and Fine-tuning Protocol
The Mix2Morph training strategy circumvents the need for explicit morph datasets through the construction of “noisy surrogate mixes”—artificial blend signals formed from random source pairs. The process is as follows:
- Given primary waveform $x_p$ and secondary waveform $x_s$ (normalized to equal power), a raw mix is formed via additive combination at 0 dB SNR: $m = x_p + x_s$.
- For “RMS-only” augmentation, the mix is temporally anchored: compute short-window RMS envelopes $e_p(t)$ of the primary and $e_m(t)$ of the mix, then scale the mix to match the primary's envelope: $\tilde{m}(t) = m(t)\,\frac{e_p(t)}{e_m(t)}$.
- For “spectral-only” augmentation, the two sources' frequency magnitudes are averaged to obtain a target spectrum $\bar{S}(f)$, and per-source gain masks are computed to spectrally align each source to the target before re-synthesizing and summing in the time domain.
- Each surrogate mix is paired with a natural language prompt representing the intended infusion style (e.g., “behavior of X with textures from X and Y”).
- Fine-tuning is timestep-gated: high timesteps ($t \ge \tau$) use surrogates; low timesteps ($t < \tau$) revert to single-sound noising/reconstruction. In the final protocol, three augmentation modes are used in equal proportion (RMS-only, spectral-only, both).
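The mix construction and RMS-only anchoring steps above can be sketched in a few lines. The window length and the epsilon guards are illustrative assumptions; envelope names follow the description (primary envelope, mix envelope):

```python
import numpy as np

def equal_power(x, eps=1e-8):
    """Normalize a waveform to unit RMS power."""
    return x / (np.sqrt(np.mean(x ** 2)) + eps)

def rms_envelope(x, win=1024):
    """Short-window RMS envelope via a moving average of x^2.
    The window length is an assumed value for illustration."""
    kernel = np.ones(win) / win
    return np.sqrt(np.convolve(x ** 2, kernel, mode="same") + 1e-12)

def surrogate_mix_rms(primary, secondary, win=1024):
    """0 dB SNR additive mix of equal-power sources, then temporally
    anchored to the primary's RMS envelope ('RMS-only' augmentation)."""
    xp, xs = equal_power(primary), equal_power(secondary)
    mix = xp + xs                                  # equal power -> 0 dB SNR
    gain = rms_envelope(xp, win) / rms_envelope(mix, win)
    return mix * gain                              # match primary's envelope
```

Because both sources are normalized to unit power before summation, the additive mix is at 0 dB SNR by construction; the per-sample gain then re-imposes the primary's temporal dynamics on the blend.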
This “no-waste” regime ensures efficient utilization of the model's capacity: high-timestep training teaches global morphing, while lower-timestep learning preserves local detail restoration (Chu et al., 28 Jan 2026).
3. Mechanism and Semantics of Sound Infusion
Mix2Morph targets sound infusions—a subclass of static, asymmetric morphs where the primary source determines high-level timing and structure, and the secondary source provides timbral/textural attributes. The conditioning prompt encodes both the dual-source semantics and the specific augmentation mode. During inference, the absence of any explicit blend coefficient means that the degree of secondary source infusion is regulated indirectly through text and the learned denoising trajectory. The model architecture and surrogate mix injection at high timesteps endow Mix2Morph with the ability to synthesize coherent hybrid audio, preserving global timbral imprints from both sources.
No scalar “blend weight” is exposed to users; infusion is emergent from learned training dynamics and prompt conditioning.
4. Data Regime and Experimental Configuration
Pretraining utilizes several hundred hours of proprietary sound effects (SFX) and public general audio (CC-licensed), all resampled to 48 kHz. Fine-tuning draws random 8 s segments from the SFX pool, assembling surrogate mixes with equalized power and the prescribed augmentation sequence. All training segments are normalized prior to mixing/augmentation. The fine-tuning phase covers 50,000 steps, distributed as follows:
- Surrogate mixes at high timesteps ($t \ge \tau$)
- Single-sound objective at low timesteps ($t < \tau$)
- Three augmentation branches at 33% each; the “none” branch is omitted in the final protocol due to observed performance degradation.
Training is performed on eight A100 GPUs. Standard batch size and learning rate details are not specified. Evaluation outputs are 3 s in duration.
5. Evaluation Methodology and Empirical Results
Mix2Morph is assessed using both objective metrics and subjective listening studies.
Objective metrics include:
- Latent Compressibility Score (LCS): measures how much of the variance in the output's DAC latents is captured by their leading principal components; LCS correlates strongly with human morph perception.
- Correspondence (CORR): FLAM audio–text cosine similarities quantify the joint semantic presence of both sources.
- Intermediateness (INT): quantifies whether the output sits perceptually between the two sources rather than collapsing toward either one.
- Directionality (DIR): preference for the intended vs. reversed prompt, computed via a temperature-scaled softmax over prompt similarities and mapped to a signed score (0 indicates no directional preference).
- Fréchet Audio Distance (FAD): Lower values indicate higher perceptual fidelity.
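The metrics above can be illustrated with minimal sketches. The exact formulas are not reproduced here, so the number of principal components, the softmax temperature, and the treatment of FLAM embeddings (modeled as plain vectors) are all assumptions:

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def lcs(latents, k=8):
    """Hypothetical LCS sketch: fraction of variance in a latent
    sequence (frames x dims) captured by its k leading principal
    components. k is an assumed value."""
    z = latents - latents.mean(axis=0)
    s = np.linalg.svd(z, compute_uv=False)
    var = s ** 2
    return float(var[:k].sum() / var.sum())

def corr(audio_emb, text_emb_a, text_emb_b):
    """Correspondence sketch: mean audio-text cosine similarity
    against both source prompts (aggregation is an assumption)."""
    return 0.5 * (cos_sim(audio_emb, text_emb_a) + cos_sim(audio_emb, text_emb_b))

def directionality(audio_emb, intended_emb, reversed_emb, temp=0.1):
    """DIR sketch: temperature-scaled softmax preference for the
    intended prompt, mapped to a signed score in [-1, 1]."""
    s = np.array([cos_sim(audio_emb, intended_emb),
                  cos_sim(audio_emb, reversed_emb)]) / temp
    p = np.exp(s - s.max())
    p /= p.sum()
    return float(2.0 * p[0] - 1.0)
```

Under this mapping, an output equally similar to the intended and reversed prompts scores exactly 0, matching the simple-mix DIR of 0.00 in the table.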
| System | LCS | CORR | INT | DIR | FAD | MOS | Morph Rate |
|---|---|---|---|---|---|---|---|
| Mix2Morph | 0.150 | 0.725 | 0.648 | 0.436 | 1.220 | 3.52 | 77% |
| Simple Mix | — | 0.758 | — | 0.00 | — | 3.13 | 64% |
| LGrS | — | — | — | — | — | 2.09 | 71% |
| MorphFader | — | — | — | — | — | 1.73 | 35% |
- Simple mixing achieves high CORR yet zero DIR due to lack of directed infusion.
- LGrS and SoundMorpher achieve higher LCS, but with diminished semantic correspondence and intermediateness, reflecting reduced perceptual coherence.
Subjective testing:
Twenty-five listeners rated clips from four systems (Mix2Morph, simple mix, LGrS, MorphFader) against 20 prompt styles. Mix2Morph achieved the highest overall mean opinion score (MOS = 3.52 overall, 4.00 on pure morphs) and the highest “morph rate” (77%). ANOVA confirmed statistical significance.
6. Analysis, Ablation Findings, and Limitations
Qualitative ablations demonstrate that allocating surrogate mixes exclusively to high diffusion timesteps best preserves both structure and timbral attributes. Three-way augmentation (RMS, spectral, both) produces the most naturally fused outputs, while adding an unaugmented (raw mix) branch degrades results. Mix2Morph robustly supports a wide variety of sound classes, including impacts, textures, human voice, and SFX, reliably producing perceptually blended midpoints where alternatives either collapse to a single source or superimpose two signals without true morphing.
Documented limitations include occasional collapse to single-concept outputs for cross-class blends with substantial sparsity/density mismatch. No explicit “blend weight” control is available at inference—infusion strength cannot be continuously modulated by the user. Suggested future research includes developing audio-to-audio conditioning for user control, enabling time-varying morph schedules, and integrating adjustable infusion strengths.
7. Significance and Future Directions
Mix2Morph demonstrates that high-fidelity, perceptually coherent sound morphing is achievable in a text-to-audio diffusion model by replacing expensive dedicated morph datasets with inexpensive noisy surrogate mixes and restricting surrogate targets to high diffusion timesteps. This strategy leverages pretrained single-sound weights for fine structure, whilst enabling robust, emergent morphing through text-prompt and augmentation regime. Objective and subjective tests establish Mix2Morph's superiority over direct mixing and prior morph baselines, positioning it as an advanced tool for concept-driven, semantic audio generation and sound design (Chu et al., 28 Jan 2026). Future explorations in continuous control and dynamic morphing schedules are anticipated to expand its utility and flexibility.