Zero-shot audio-to-image generation
Determine whether zero-shot audio-to-image generation (synthesizing semantically aligned images directly from audio inputs without per-class supervision) can be achieved when the training data provides only limited information, and establish approaches that yield accurate audio-image alignment in this setting.
References
First, due to the limited information available in the training data, zero-shot audio-to-image generation remains an open and challenging problem that has yet to be thoroughly explored.
— CatchPhrase: EXPrompt-Guided Encoder Adaptation for Audio-to-Image Generation
(2507.18750 - Oh et al., 24 Jul 2025) in Supplementary Materials, Section I (Limitation)