This paper explores the creation of "images that sound," which are 2D representations (specifically, spectrograms) that are designed to be simultaneously visually meaningful as images and acoustically meaningful when played as sounds. The core idea is to sample from the intersection of the probability distributions of natural images and natural spectrograms.
The authors propose a simple, zero-shot method leveraging pre-trained text-to-image and text-to-spectrogram diffusion models that operate within a shared latent space. The key insight is that the score functions (or noise estimates) from different diffusion models can be combined to sample from the product of their respective data distributions.
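In score-based terms, this composition rests on a standard identity: if $p_v$ and $p_a$ denote the image and spectrogram distributions over the shared latent $\mathbf{z}$, then
\begin{align*}
\nabla_{\mathbf{z}} \log \big( p_v(\mathbf{z})\, p_a(\mathbf{z}) \big) = \nabla_{\mathbf{z}} \log p_v(\mathbf{z}) + \nabla_{\mathbf{z}} \log p_a(\mathbf{z}),
\end{align*}
so combining the two models' noise estimates, which approximate these scores up to a scale factor, informally steers sampling toward the product of the two distributions.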
Implementation Details:
- Model Selection: The method requires two diffusion models trained on different modalities but sharing the same latent space. The authors use Stable Diffusion v1.5 [rombach2022high] for image generation and Auffusion [xue2024auffusion] for audio generation. Auffusion is a version of Stable Diffusion v1.5 fine-tuned on log-mel spectrograms, ensuring compatibility in the latent space. A pre-trained VAE encoder $\mathcal{E}$ and decoder $\mathcal{D}$ are used to convert between pixel/spectrogram space and the shared latent space.
- Multimodal Denoising: The generation process starts with a noisy latent variable $\mathbf{z}_T$. At each denoising step $t$, noise estimates are computed from both the visual diffusion model $\boldsymbol{\epsilon}_{\phi,v}$ using the image text prompt $y_v$ and the audio diffusion model $\boldsymbol{\epsilon}_{\phi,a}$ using the audio text prompt $y_a$. Classifier-Free Guidance (CFG) [ho2022classifier] is applied to both estimates with guidance scales $\gamma_v$ and $\gamma_a$:
\begin{align*}
\boldsymbol{\epsilon}_{v}^{(t)} &= \boldsymbol{\epsilon}_{\phi,v}(\mathbf{z}_t; \varnothing, t) + \gamma_v \left(\boldsymbol{\epsilon}_{\phi,v}(\mathbf{z}_t; y_v, t) - \boldsymbol{\epsilon}_{\phi,v}(\mathbf{z}_t; \varnothing, t)\right) \\
\boldsymbol{\epsilon}_{a}^{(t)} &= \boldsymbol{\epsilon}_{\phi,a}(\mathbf{z}_t; \varnothing, t) + \gamma_a \left(\boldsymbol{\epsilon}_{\phi,a}(\mathbf{z}_t; y_a, t) - \boldsymbol{\epsilon}_{\phi,a}(\mathbf{z}_t; \varnothing, t)\right)
\end{align*}
These estimates are then combined using a weighted average to obtain a multimodal noise estimate $\tilde{\boldsymbol{\epsilon}}^{(t)}$:
\begin{align*}
\tilde{\boldsymbol{\epsilon}}^{(t)} = \lambda_a^{(t)} \boldsymbol{\epsilon}_a^{(t)} + \lambda_v^{(t)} \boldsymbol{\epsilon}_v^{(t)},
\end{align*}
where $\lambda_a^{(t)}$ and $\lambda_v^{(t)}$ are time-dependent weights. This combined estimate is used in the DDIM [song2020denoising] reverse process to obtain the next latent $\mathbf{z}_{t-1}$.
- Warm-Starting: The authors found it beneficial to warm-start the denoising process by initially giving more weight to one modality's noise estimate. This is controlled by defining the weights $\lambda_a^{(t)}$ and $\lambda_v^{(t)}$ via Heaviside step functions of the timestep, whose thresholds determine the duration of warm-starting for the audio and visual models, respectively. An audio-first warm-up provided the best balance in their experiments (a minimal weight schedule is sketched after this list).
- Decoding and Vocoding: After the iterative denoising process yields a clean latent $\mathbf{z}_0$, it is decoded back to a spectrogram $\hat{\mathbf{x}} = \mathcal{D}(\mathbf{z}_0)$. This spectrogram is typically grayscale. To convert it into an audible waveform, a pre-trained vocoder (such as HiFi-GAN [kong2020hifi]) or the Griffin-Lim algorithm [griffin1984signal] is used; the authors use HiFi-GAN for their main experiments.
- Colorization: Optionally, the grayscale spectrogram image can be colorized to make it more visually appealing. Since these spectrograms are out-of-distribution for standard colorization models, the authors use a projection-based method inspired by Factorized Diffusion [geng2024factorized]: a color image diffusion model is sampled while the grayscale component of the intermediate noisy image is constrained to match the generated spectrogram at each step (see the sketch after this list).
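Two of the components above are small enough to sketch directly: the Heaviside-style warm-start weights and the grayscale projection used for colorization. The function names, the warm-up fraction, the equal 0.5/0.5 post-warm-up weights, and the channel-mean definition of grayscale below are illustrative assumptions rather than the paper's exact settings.

```python
def warmstart_weights(step, num_steps, warmup_frac=0.2, warmup_modality="audio"):
    # Heaviside-style warm-start: during the first warmup_frac of the sampling
    # steps, only one modality's noise estimate is used; afterwards the two
    # estimates are averaged equally. The fraction and the 0.5/0.5 weights are
    # assumptions for illustration, not the paper's values.
    if step < int(warmup_frac * num_steps):
        return (1.0, 0.0) if warmup_modality == "audio" else (0.0, 1.0)
    return 0.5, 0.5  # (lambda_a_t, lambda_v_t)


def project_grayscale(x_color, target_gray):
    # Projection step for colorization (in the spirit of Factorized Diffusion):
    # replace the grayscale component of the intermediate color image with the
    # generated spectrogram, keeping only the chromatic residual. Grayscale is
    # approximated here as the per-pixel channel mean.
    gray = x_color.mean(dim=1, keepdim=True)
    return x_color - gray + target_gray
```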
Practical Applications and Results:
The primary application is a novel form of multimodal art and creative expression, allowing artists to compose images and sounds onto a single canvas representation.
- Qualitative Results: The generated spectrograms exhibit visual patterns corresponding to the image prompt while producing sounds related to the audio prompt (Figure 1, 2, 4). Interesting emergent effects are observed where visual elements align with acoustic features (e.g., castle towers aligning with bell onsets).
- Quantitative Evaluation: Using CLIP [radford2021learning] for image alignment and CLAP [wu2023large] for audio alignment, the proposed method (Ours) outperformed baseline approaches (SDS, Imprint) on 100 random prompt pairs (Table 1). Compared to generating images or spectrograms alone, the method achieves a trade-off, performing better than cross-modal baselines but not reaching the single-modality performance (as expected, since it's a harder joint generation task).
- Human Studies: In 2AFC human evaluations on 7 hand-selected prompt pairs, participants preferred the proposed method's results over baselines in terms of audio quality, visual quality, and audio-visual alignment in the majority of cases (Table 2).
- Computational Efficiency: The proposed method is significantly faster than the SDS-based baseline, taking on the order of seconds per sample rather than hours (on an NVIDIA L40S GPU).
- Vocoder Verification: A cycle consistency check (re-encoding the vocoder output back to a spectrogram) showed that the method generates actual spectrograms that look like images, rather than simply adversarial inputs to the vocoder (Figure 5).
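A minimal version of this check could re-encode the vocoded waveform with a mel-spectrogram transform and compare it against the generated spectrogram. The STFT/mel parameters and tensor shapes below are assumptions; in practice they would need to match the audio model's preprocessing.

```python
import torch
import torch.nn.functional as F
import torchaudio


def cycle_consistency_error(spec_generated, waveform, sample_rate=16000):
    # Re-encode the vocoder output into a log-mel spectrogram and measure how far
    # it is from the spectrogram produced by the diffusion model. The n_fft,
    # hop_length, and n_mels values are placeholders, not Auffusion's settings.
    if waveform.dim() == 1:
        waveform = waveform[None]  # (1, num_samples)
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_fft=1024, hop_length=256, n_mels=256
    )(waveform)
    log_mel = torch.log(mel.clamp(min=1e-5))

    # Resize and normalize both spectrograms to a common scale before comparing.
    log_mel = F.interpolate(
        log_mel[None], size=spec_generated.shape[-2:], mode="bilinear"
    )[0]

    def normalize(x):
        return (x - x.min()) / (x.max() - x.min() + 1e-8)

    return F.l1_loss(normalize(log_mel), normalize(spec_generated))
```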
Implementation Considerations and Limitations:
- Shared Latent Space: The method relies on the existence of pre-trained diffusion models from different modalities that share a compatible latent space. This might limit applicability if such models are not available for desired modalities or resolutions.
- Prompt Selection: The success of the composition is highly dependent on the compatibility of the image and audio prompts. Not all combinations yield high-quality results in both modalities simultaneously. Prompts that encourage areas of "silence" visually (like dark or lithograph styles) can improve audio quality.
- Fidelity Trade-off: Achieving high fidelity in both the visual and audio domains simultaneously remains challenging. Often, there's a trade-off between how clear the image looks and how natural the sound is. This might be due to inherent differences between the distributions or limitations of the base models.
- Base Model Quality: The quality of the generated results is constrained by the capabilities of the underlying pre-trained image and audio diffusion models.
- Potential Negative Impacts: The method could be used for steganography, embedding hidden visual information within audio files, which raises concerns about potential misuse.
Overall, the paper presents a practical and effective method for generating unique multimodal content by creatively combining existing generative models, demonstrating that compositional generation techniques can be extended across different data modalities.
The implementation of the core multimodal denoising loop would involve:
```python
import torch


def multimodal_denoising_step(
    z_t, t, model_v, model_a,
    cond_v, cond_a, uncond_v, uncond_a,
    gamma_v, gamma_a, lambda_a_t, lambda_v_t,
    alpha_t, alpha_t_prev,
):
    # Unconditional and conditional noise estimates from the visual model.
    noise_pred_uncond_v = model_v(z_t, t, uncond_v)
    noise_pred_cond_v = model_v(z_t, t, cond_v)

    # Classifier-free-guided noise estimate for the visual modality.
    epsilon_v_t = noise_pred_uncond_v + gamma_v * (noise_pred_cond_v - noise_pred_uncond_v)

    # Unconditional and conditional noise estimates from the audio model.
    noise_pred_uncond_a = model_a(z_t, t, uncond_a)
    noise_pred_cond_a = model_a(z_t, t, cond_a)

    # Classifier-free-guided noise estimate for the audio modality.
    epsilon_a_t = noise_pred_uncond_a + gamma_a * (noise_pred_cond_a - noise_pred_uncond_a)

    # Weighted average of the two guided estimates (the multimodal noise estimate).
    epsilon_combined = lambda_a_t * epsilon_a_t + lambda_v_t * epsilon_v_t

    # Deterministic DDIM update using the combined noise estimate. alpha_t and
    # alpha_t_prev are the cumulative products (alpha-bar) from the diffusion
    # schedule at the current and previous timesteps.
    sqrt_alpha_t = torch.sqrt(alpha_t)
    sqrt_one_minus_alpha_t = torch.sqrt(1.0 - alpha_t)

    # Predict x_0 from z_t and the combined noise estimate.
    pred_x0 = (z_t - sqrt_one_minus_alpha_t * epsilon_combined) / sqrt_alpha_t

    # Recombine the prediction with the noise direction to get z_{t-1}
    # (sigma_t = 0, i.e. deterministic DDIM; add sigma_t * noise otherwise).
    direction_to_z_t = torch.sqrt(1.0 - alpha_t_prev) * epsilon_combined
    z_t_prev = torch.sqrt(alpha_t_prev) * pred_x0 + direction_to_z_t

    return z_t_prev
```
This sketch illustrates the core iterative process of combining noise estimates at each step within a standard diffusion sampling loop. The exact implementation details, including how the diffusion schedule (the cumulative products $\bar{\alpha}_t$) is obtained, would depend on the specific diffusion framework used (e.g., the Hugging Face Diffusers library).
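For completeness, here is a sketch of an outer DDIM sampling loop that would wrap the step function above, reusing the `warmstart_weights` helper sketched earlier. The model call signatures, the latent shape, the guidance scales, and the Diffusers-style VAE decoding (with Stable Diffusion's 0.18215 latent scaling factor) are assumptions for illustration, not the paper's exact settings.

```python
import torch


@torch.no_grad()
def sample_image_that_sounds(
    model_v, model_a, vae, timesteps, alphas_cumprod,
    cond_v, cond_a, uncond_v, uncond_a,
    gamma_v=7.5, gamma_a=7.5,
    latent_shape=(1, 4, 32, 128), device="cuda",
):
    # Start from Gaussian noise in the shared latent space. The latent shape is a
    # placeholder with a wide, spectrogram-like aspect ratio.
    z_t = torch.randn(latent_shape, device=device)
    num_steps = len(timesteps)

    for i, t in enumerate(timesteps):  # timesteps ordered from high to low noise
        lambda_a_t, lambda_v_t = warmstart_weights(i, num_steps)
        alpha_t = alphas_cumprod[t]
        alpha_t_prev = (
            alphas_cumprod[timesteps[i + 1]] if i + 1 < num_steps
            else torch.tensor(1.0, device=device)
        )
        z_t = multimodal_denoising_step(
            z_t, t, model_v, model_a,
            cond_v, cond_a, uncond_v, uncond_a,
            gamma_v, gamma_a, lambda_a_t, lambda_v_t,
            alpha_t, alpha_t_prev,
        )

    # Decode the clean latent to a (grayscale) spectrogram image; 0.18215 is the
    # Stable Diffusion latent scaling factor (Diffusers-style AutoencoderKL API).
    spectrogram = vae.decode(z_t / 0.18215).sample
    return spectrogram  # feed to a vocoder (e.g. HiFi-GAN) or Griffin-Lim for audio
```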