
Zero-shot audio-to-image generation

Determine whether zero-shot audio-to-image generation—synthesizing semantically aligned images directly from audio inputs without per-class supervision—can be achieved under the constraints of limited training information, and establish effective approaches that enable accurate alignment in this setting.


Background

The paper introduces CatchPhrase, a framework for audio-to-image generation that mitigates cross-modal misalignment by generating enriched prompts from LLMs and audio captioning models, followed by multimodal-aware filtering and retrieval. Although the method improves alignment in supervised settings built on audio classification datasets, the authors note that zero-shot scenarios remain unaddressed.
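The filtering step described above can be illustrated with a minimal sketch: score candidate text prompts against an audio clip in a shared embedding space and keep the most similar ones. This is not the authors' implementation; the function name, the toy 4-d embeddings, and the choice of cosine similarity are assumptions (real systems would obtain embeddings from a pretrained audio-text encoder such as CLAP).

```python
import numpy as np

def filter_prompts(audio_emb, prompt_embs, prompts, top_k=2):
    """Keep the top-k candidate prompts whose embeddings have the
    highest cosine similarity to the audio embedding.
    (Illustrative sketch, not the CatchPhrase implementation.)"""
    a = audio_emb / np.linalg.norm(audio_emb)
    p = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    scores = p @ a                      # cosine similarities
    order = np.argsort(-scores)[:top_k] # indices of best matches
    return [prompts[i] for i in order], scores[order]

# Toy example with made-up 4-d embeddings; a real pipeline would embed
# the audio clip and the LLM/captioner-generated prompts with a shared
# audio-text encoder.
audio = np.array([1.0, 0.0, 0.0, 0.5])
cands = ["a dog barking in a yard", "rain on a window", "a cat meowing"]
embs = np.array([[0.9, 0.1, 0.0, 0.4],
                 [0.0, 1.0, 0.2, 0.0],
                 [0.3, 0.2, 0.8, 0.1]])
kept, sims = filter_prompts(audio, embs, cands, top_k=2)
```

In this toy run the barking-dog prompt scores highest, so it survives the filter; the retained prompts would then condition the image generator.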

Zero-shot audio-to-image generation would require producing images for classes or instances not seen during training, relying on limited or weak supervision. The authors explicitly state that this remains open due to the limited information available in current training data, highlighting a need for methods that can generalize without per-class supervision while maintaining semantic alignment between audio and generated images.

References

First, due to the limited information available in the training data, zero-shot audio-to-image generation remains an open and challenging problem that has yet to be thoroughly explored.

Oh et al., "CatchPhrase: EXPrompt-Guided Encoder Adaptation for Audio-to-Image Generation," arXiv:2507.18750, 24 Jul 2025, Supplementary Materials, Section I (Limitation).