Combined conditioning of text and vocal imitation at inference

Determine how the PromptSep latent diffusion audio source separation model behaves when it is conditioned at inference time on both a textual description of the target sounds and a vocal imitation recording, compared with conditioning on either modality alone.

Background

PromptSep is designed to perform open-vocabulary audio source separation using multimodal conditioning, supporting both text prompts and vocal imitations. During training, the model is conditioned on either text or vocal imitation, but not both at the same time.
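
The difference between the training regime and the unexplored inference regime can be sketched as follows. This is a minimal illustration, not PromptSep's implementation: the function names, the embedding shapes, and the choice to zero-fill the inactive modality are all assumptions made for the example, since the source does not say how an absent condition is represented.

```python
import random

import numpy as np

# Per the source, each training example uses exactly one modality.
TRAIN_MODES = ("text", "imitation")
# "both" is the inference-only combination whose effect is unexplored.
ALL_MODES = TRAIN_MODES + ("both",)


def select_conditions(text_emb, imit_emb, mode):
    """Return (text_cond, imit_cond) for the requested conditioning mode.

    Hypothetical helper: the inactive modality is replaced by zeros,
    which is an assumption for illustration only.
    """
    if mode not in ALL_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    text_cond = text_emb if mode in ("text", "both") else np.zeros_like(text_emb)
    imit_cond = imit_emb if mode in ("imitation", "both") else np.zeros_like(imit_emb)
    return text_cond, imit_cond


def sample_training_mode(rng):
    """Pick one modality per training example; never 'both'."""
    return rng.choice(TRAIN_MODES)
```

Under these assumptions, the "both" branch produces an input combination that the model never encountered during training, which is why its behavior cannot be inferred from the single-modality results.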

The paper explicitly notes that, although both modalities can be provided together at inference, the effect of joint text and vocal imitation conditioning has not been explored. How joint conditioning changes separation performance and behavior relative to single-modality conditioning therefore remains an open question.

References

During training, the model is always conditioned on either text or vocal imitation, but not both simultaneously. While both conditions can be provided at inference time, their combined effect is not explored in this work and is left for future investigation.

PromptSep: Generative Audio Separation via Multimodal Prompting (2511.04623 - Wen et al., 6 Nov 2025) in Section 3.1 (Training Datasets), Training Specification