Combined conditioning of text and vocal imitation at inference
Determine how the PromptSep latent diffusion audio source separation model behaves when it is conditioned at inference time on both a textual description of the target sounds and a vocal imitation recording, rather than on a single modality alone.
References
During training, the model is always conditioned on either text or vocal imitation, but not both simultaneously. While both conditions can be provided at inference time, their combined effect is not explored in this work and is left for future investigation.
— PromptSep: Generative Audio Separation via Multimodal Prompting
(2511.04623 - Wen et al., 6 Nov 2025) in Section 3.1 (Training Datasets), Training Specification
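To make the gap concrete, the following is a minimal sketch of the "either text or vocal imitation, but not both" training regime described in the excerpt above. It is not the authors' code: the names `ConditionMask`, `sample_training_mask`, and `seen_in_training`, and the 50/50 split `p_text`, are hypothetical and chosen only for illustration. The sketch shows that the both-modalities combination never occurs among training examples, which is why its effect at inference is an open question.

```python
import random
from typing import NamedTuple


class ConditionMask(NamedTuple):
    """Which conditioning inputs are active for one example (hypothetical type)."""
    use_text: bool
    use_imitation: bool


def sample_training_mask(rng: random.Random, p_text: float = 0.5) -> ConditionMask:
    """Pick exactly one modality per training example.

    p_text is an assumed value; the excerpt does not state the actual split.
    """
    use_text = rng.random() < p_text
    return ConditionMask(use_text=use_text, use_imitation=not use_text)


def seen_in_training(mask: ConditionMask) -> bool:
    """True if sample_training_mask can produce this combination."""
    return mask.use_text != mask.use_imitation


rng = random.Random(0)
training_masks = {sample_training_mask(rng) for _ in range(1000)}

# Inference-time setting in question: both conditions supplied at once.
both = ConditionMask(use_text=True, use_imitation=True)
print("both seen during training:", both in training_masks)
```

Under these assumptions, the combined setting lies outside the set of conditioning patterns the model was trained on, so any evaluation of it measures out-of-distribution behavior.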