
Images that Sound: Composing Images and Sounds on a Single Canvas (2405.12221v2)

Published 20 May 2024 in cs.CV, cs.LG, cs.MM, cs.SD, and eess.AS

Abstract: Spectrograms are 2D representations of sound that look very different from the images found in our visual world. And natural images, when played as spectrograms, make unnatural sounds. In this paper, we show that it is possible to synthesize spectrograms that simultaneously look like natural images and sound like natural audio. We call these visual spectrograms images that sound. Our approach is simple and zero-shot, and it leverages pre-trained text-to-image and text-to-spectrogram diffusion models that operate in a shared latent space. During the reverse process, we denoise noisy latents with both the audio and image diffusion models in parallel, resulting in a sample that is likely under both models. Through quantitative evaluations and perceptual studies, we find that our method successfully generates spectrograms that align with a desired audio prompt while also taking the visual appearance of a desired image prompt. Please see our project page for video results: https://ificl.github.io/images-that-sound/

This paper explores the creation of "images that sound," which are 2D representations (specifically, spectrograms) that are designed to be simultaneously visually meaningful as images and acoustically meaningful when played as sounds. The core idea is to sample from the intersection of the probability distributions of natural images and natural spectrograms.

The authors propose a simple, zero-shot method leveraging pre-trained text-to-image and text-to-spectrogram diffusion models that operate within a shared latent space. The key insight is that the score functions (or noise estimates) from different diffusion models can be combined to sample from the product of their respective data distributions.
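
As a brief aside on why this composition is principled (standard product-of-experts reasoning, not specific to this paper): the scores of a product of densities simply add,

$\nabla_{\mathbf{z}} \log \big( p_v(\mathbf{z}) \, p_a(\mathbf{z}) \big) = \nabla_{\mathbf{z}} \log p_v(\mathbf{z}) + \nabla_{\mathbf{z}} \log p_a(\mathbf{z}),$

and a diffusion model's noise estimate is, up to a timestep-dependent scale, a negative score. Averaging the two models' noise estimates therefore steers sampling toward latents that are likely under both distributions; the weighted average used below corresponds to the geometrically weighted product $p_v^{\lambda_v} p_a^{\lambda_a}$.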

Implementation Details:

  1. Model Selection: The method requires two diffusion models trained on different modalities but sharing the same latent space. The authors use Stable Diffusion v1.5 [rombach2022high] for image generation and Auffusion [xue2024auffusion] for audio generation. Auffusion is a version of Stable Diffusion v1.5 fine-tuned on log-mel spectrograms, which ensures latent-space compatibility. A pre-trained VAE encoder ($\mathcal{E}$) and decoder ($\mathcal{D}$) convert between pixel/spectrogram space and the shared latent space.
  2. Multimodal Denoising: The generation process starts with a noisy latent variable $\mathbf{z}_T$. At each denoising step $t$, noise estimates are computed from both the visual diffusion model ($\boldsymbol{\epsilon}_{\phi,v}$) using the image text prompt ($y_v$) and the audio diffusion model ($\boldsymbol{\epsilon}_{\phi,a}$) using the audio text prompt ($y_a$). Classifier-Free Guidance (CFG) [ho2022classifier] is applied to both estimates with guidance scales $\gamma_v$ and $\gamma_a$:

    $\boldsymbol{\epsilon}_{v}^{(t)} = \boldsymbol{\epsilon}_{\phi,v}(\mathbf{z}_t; \varnothing, t) + \gamma_v \left( \boldsymbol{\epsilon}_{\phi,v}(\mathbf{z}_t; y_v, t) - \boldsymbol{\epsilon}_{\phi,v}(\mathbf{z}_t; \varnothing, t) \right)$

    $\boldsymbol{\epsilon}_{a}^{(t)} = \boldsymbol{\epsilon}_{\phi,a}(\mathbf{z}_t; \varnothing, t) + \gamma_a \left( \boldsymbol{\epsilon}_{\phi,a}(\mathbf{z}_t; y_a, t) - \boldsymbol{\epsilon}_{\phi,a}(\mathbf{z}_t; \varnothing, t) \right)$

    These estimates are then combined with a weighted average to obtain the multimodal noise estimate $\tilde{\boldsymbol{\epsilon}}^{(t)}$:

    $\tilde{\boldsymbol{\epsilon}}^{(t)} = \lambda_a^{(t)} \boldsymbol{\epsilon}_{a}^{(t)} + \lambda_v^{(t)} \boldsymbol{\epsilon}_{v}^{(t)}$

    where $\lambda_a^{(t)}$ and $\lambda_v^{(t)}$ are time-dependent weights. This combined estimate is used in the DDIM [song2020denoising] reverse process to obtain the next latent $\mathbf{z}_{t-1}$.

  3. Warm-Starting: The authors found it beneficial to warm-start the denoising process by initially giving more weight to one modality's noise estimate. This is controlled by defining $\lambda_a^{(t)} = w_a^{(t)} / (w_a^{(t)} + w_v^{(t)})$ and $\lambda_v^{(t)} = w_v^{(t)} / (w_a^{(t)} + w_v^{(t)})$, where $w_a^{(t)} = H(t_a T - t)$ and $w_v^{(t)} = H(t_v T - t)$ are Heaviside step functions. Here $t_a$ and $t_v$ determine the duration of warm-starting for the audio and visual models, respectively. An audio-first warm-up ($t_a = 1.0$, $t_v = 0.9$) provided the best balance in the authors' experiments (a small helper implementing this schedule is sketched after this list).
  4. Decoding and Vocoding: After the iterative denoising process yields a clean latent $\mathbf{z}_0$, it is decoded back to a spectrogram $\hat{\mathbf{x}} = \mathcal{D}(\mathbf{z}_0)$. This spectrogram is typically grayscale. To convert it into an audible waveform, a pre-trained vocoder (such as HiFi-GAN [kong2020hifi]) or the Griffin-Lim algorithm [griffin1984signal] is used; the authors use HiFi-GAN for their main experiments (a Griffin-Lim fallback is sketched after this list).
  5. Colorization: Optionally, the grayscale spectrogram image can be colorized to make it more visually appealing. Since these spectrograms are out-of-distribution for standard colorization models, the authors use a projection-based method inspired by Factorized Diffusion [geng2024factorized]. This involves sampling a color image diffusion model while constraining the grayscale version of the intermediate noisy image to match the generated spectrogram at each step.
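
For step 3, the warm-start schedule is simple enough to state directly in code. The sketch below is a minimal interpretation of the Heaviside weighting above, with `t` the current (descending) diffusion timestep and `T` the number of training timesteps; it is written so it can drive the sampling-loop sketch shown after the denoising code.

def warm_start_weights(t: float, T: int, t_a: float = 1.0, t_v: float = 0.9):
    """Return the mixing weights (lambda_a, lambda_v) at timestep t.

    w_m = H(t_m * T - t): a modality contributes only once the timestep t
    has fallen below t_m * T. With the audio-first warm-up (t_a=1.0,
    t_v=0.9), the first 10% of the reverse process is denoised by the
    audio model alone, after which both models are weighted equally.
    """
    w_a = 1.0 if t <= t_a * T else 0.0
    w_v = 1.0 if t <= t_v * T else 0.0
    total = w_a + w_v
    return w_a / total, w_v / total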
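
For step 4, a minimal vocoding sketch using the Griffin-Lim fallback via torchaudio is shown below. The mel/STFT parameters, the log scaling, and the random stand-in spectrogram are illustrative assumptions; in practice they must match the audio model's spectrogram preprocessing, and the authors' main results use a HiFi-GAN vocoder instead.

import torch
import torchaudio

n_mels, n_fft, hop_length, sample_rate = 256, 1024, 256, 16000

# Stand-in for the decoded grayscale spectrogram D(z_0), interpreted as a
# log-mel magnitude spectrogram of shape (n_mels, frames).
log_mel = torch.rand(n_mels, 1024)
mel_mag = torch.exp(log_mel)  # undo the (assumed) natural-log compression

# Map mel bins back to linear STFT bins, then recover phase with Griffin-Lim.
inverse_mel = torchaudio.transforms.InverseMelScale(
    n_stft=n_fft // 2 + 1, n_mels=n_mels, sample_rate=sample_rate
)
griffin_lim = torchaudio.transforms.GriffinLim(
    n_fft=n_fft, hop_length=hop_length, power=1.0
)

waveform = griffin_lim(inverse_mel(mel_mag))
torchaudio.save("images_that_sound.wav", waveform.unsqueeze(0), sample_rate)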

Practical Applications and Results:

The primary application is a novel form of multimodal art and creative expression, allowing artists to compose images and sounds onto a single canvas representation.

  • Qualitative Results: The generated spectrograms exhibit visual patterns corresponding to the image prompt while producing sounds related to the audio prompt (Figure 1, 2, 4). Interesting emergent effects are observed where visual elements align with acoustic features (e.g., castle towers aligning with bell onsets).
  • Quantitative Evaluation: Using CLIP [radford2021learning] for image alignment and CLAP [wu2023large] for audio alignment, the proposed method (Ours) outperformed the baseline approaches (SDS, Imprint) on 100 random prompt pairs (Table 1). Compared to generating images or spectrograms alone, the method trades off some single-modality quality: it performs better than the cross-modal baselines but does not reach single-modality performance, as expected for the harder joint generation task (a minimal CLIP-scoring sketch follows this list).
  • Human Studies: In 2AFC human evaluations on 7 hand-selected prompt pairs, participants preferred the proposed method's results over baselines in terms of audio quality, visual quality, and audio-visual alignment in the majority of cases (Table 2).
  • Computational Efficiency: The proposed method is significantly faster than the SDS-based baseline, taking seconds per sample compared to hours (on NVIDIA L40s).
  • Vocoder Verification: A cycle consistency check (re-encoding the vocoder output back to a spectrogram) showed that the method generates actual spectrograms that look like images, rather than simply adversarial inputs to the vocoder (Figure 5).
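
As a rough illustration of the image-alignment half of this evaluation, the sketch below scores a spectrogram image against its image prompt with an off-the-shelf CLIP checkpoint via HuggingFace Transformers. The file name, prompt, and checkpoint are placeholders, and the paper's exact scoring protocol (as well as the CLAP audio side) may differ.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("generated_spectrogram.png")  # hypothetical output image
prompt = "a lithograph of a castle"              # hypothetical image prompt

inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )

# Cosine similarity between the two embeddings, i.e. a CLIP alignment score.
score = torch.nn.functional.cosine_similarity(image_emb, text_emb).item()
print(f"CLIP image-text similarity: {score:.3f}")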

Implementation Considerations and Limitations:

  • Shared Latent Space: The method relies on the existence of pre-trained diffusion models from different modalities that share a compatible latent space. This might limit applicability if such models are not available for desired modalities or resolutions.
  • Prompt Selection: The success of the composition is highly dependent on the compatibility of the image and audio prompts. Not all combinations yield high-quality results in both modalities simultaneously. Prompts that encourage areas of "silence" visually (like dark or lithograph styles) can improve audio quality.
  • Fidelity Trade-off: Achieving high fidelity in both the visual and audio domains simultaneously remains challenging. Often, there's a trade-off between how clear the image looks and how natural the sound is. This might be due to inherent differences between the distributions or limitations of the base models.
  • Base Model Quality: The quality of the generated results is constrained by the capabilities of the underlying pre-trained image and audio diffusion models.
  • Potential Negative Impacts: The method could be used for steganography, embedding hidden visual information within audio files, which raises concerns about potential misuse.

Overall, the paper presents a practical and effective method for generating unique multimodal content by creatively combining existing generative models, demonstrating that compositional generation techniques can be extended across different data modalities.

The core multimodal denoising step could be implemented as follows:

import torch

def multimodal_denoising_step(
    z_t, t, model_v, model_a, cond_v, cond_a, uncond_v, uncond_a,
    gamma_v, gamma_a, lambda_a_t, lambda_v_t, alpha_bar_t, alpha_bar_t_prev,
):
    # Unconditional and conditional noise estimates from the visual model,
    # combined with classifier-free guidance (scale gamma_v).
    noise_pred_uncond_v = model_v(z_t, t, uncond_v)
    noise_pred_cond_v = model_v(z_t, t, cond_v)
    epsilon_v_t = noise_pred_uncond_v + gamma_v * (noise_pred_cond_v - noise_pred_uncond_v)

    # Same for the audio (spectrogram) model with guidance scale gamma_a.
    noise_pred_uncond_a = model_a(z_t, t, uncond_a)
    noise_pred_cond_a = model_a(z_t, t, cond_a)
    epsilon_a_t = noise_pred_uncond_a + gamma_a * (noise_pred_cond_a - noise_pred_uncond_a)

    # Combine the two CFG noise estimates with the time-dependent weights
    # lambda_a^(t) and lambda_v^(t), which sum to one.
    epsilon_combined = lambda_a_t * epsilon_a_t + lambda_v_t * epsilon_v_t

    # Deterministic DDIM update (sigma_t = 0) using the combined estimate.
    # alpha_bar_t and alpha_bar_t_prev are the cumulative alpha products
    # taken from the diffusion schedule.
    sqrt_alpha_bar_t = torch.sqrt(alpha_bar_t)
    sqrt_one_minus_alpha_bar_t = torch.sqrt(1.0 - alpha_bar_t)

    # Predict the clean latent x_0 from z_t and the combined noise.
    pred_x0 = (z_t - sqrt_one_minus_alpha_bar_t * epsilon_combined) / sqrt_alpha_bar_t

    # Move to the previous timestep.
    direction_to_z_t = torch.sqrt(1.0 - alpha_bar_t_prev) * epsilon_combined
    z_t_minus_1 = torch.sqrt(alpha_bar_t_prev) * pred_x0 + direction_to_z_t

    return z_t_minus_1


This sketch illustrates the core operation of combining the two CFG noise estimates at each step of a standard diffusion sampling loop. The exact details, including the diffusion schedule (the cumulative products $\bar{\alpha}_t$), depend on the diffusion framework used (e.g., the Diffusers library).
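
To make the loop structure concrete, here is a minimal sampling-loop sketch that drives `multimodal_denoising_step` with a Diffusers `DDIMScheduler` and the `warm_start_weights` helper sketched earlier. The dummy UNet stand-ins, latent shape, and guidance scales are placeholders for illustration, not the paper's exact configuration.

import torch
from diffusers import DDIMScheduler

# Dummy stand-ins so the loop runs end to end; in practice these would wrap
# the Stable Diffusion v1.5 and Auffusion UNets with their text embeddings.
model_v = model_a = lambda z, t, cond: torch.zeros_like(z)
cond_v = cond_a = uncond_v = uncond_a = None

scheduler = DDIMScheduler()              # default 1000-step training schedule
scheduler.set_timesteps(100)             # 100 DDIM sampling steps
alphas_cumprod = scheduler.alphas_cumprod
T = scheduler.config.num_train_timesteps

z_t = torch.randn(1, 4, 32, 128)         # latent shaped like a wide spectrogram canvas

for i, t in enumerate(scheduler.timesteps):
    # Audio-first warm-start weights (t_a=1.0, t_v=0.9).
    lambda_a_t, lambda_v_t = warm_start_weights(t.item(), T)

    # alpha-bar values for the current and previous sampling timesteps.
    t_prev = scheduler.timesteps[i + 1] if i + 1 < len(scheduler.timesteps) else None
    alpha_bar_t = alphas_cumprod[t]
    alpha_bar_t_prev = alphas_cumprod[t_prev] if t_prev is not None else torch.tensor(1.0)

    z_t = multimodal_denoising_step(
        z_t, t, model_v, model_a, cond_v, cond_a, uncond_v, uncond_a,
        gamma_v=7.5, gamma_a=7.5,        # guidance scales chosen for illustration
        lambda_a_t=lambda_a_t, lambda_v_t=lambda_v_t,
        alpha_bar_t=alpha_bar_t, alpha_bar_t_prev=alpha_bar_t_prev,
    )

# z_t now approximates the clean latent z_0, ready for VAE decoding and vocoding.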

References (112)
  1. Self-supervised learning of audio-visual objects from video. European Conference on Computer Vision (ECCV), 2020.
  2. Aphex Twin. Formula, 1994. audio track.
  3. R. Arandjelovic and A. Zisserman. Look, listen and learn. In Proceedings of the IEEE international conference on computer vision, pages 609–617, 2017.
  4. R. Arandjelovic and A. Zisserman. Objects that sound. In Proceedings of the European conference on computer vision (ECCV), pages 435–451, 2018.
  5. Labelling unlabelled videos from scratch with multi-modal self-supervision. Advances in Neural Information Processing Systems, 33:4660–4671, 2020.
  6. Lumiere: A space-time diffusion model for video generation. arXiv preprint arXiv:2401.12945, 2024.
  7. Multidiffusion: Fusing diffusion paths for controlled image generation. arXiv preprint arXiv:2302.08113, 2023.
  8. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.
  9. B. Buckle. Spectrogram art: A short history of musicians hiding visuals inside their tracks. Available from: https://mixmag.net/feature/spectrogram-art-music-aphex-twin, 2022. Mixmag article.
  10. Diffusion illusions: Hiding images in plain sight. arXiv preprint arXiv:2312.03817, 2023.
  11. Visual acoustic matching. In Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  12. Novel-view acoustic synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6409–6419, 2023.
  13. Wavmark: Watermarking for audio generation. arXiv preprint arXiv:2308.12770, 2023.
  14. Audio-visual synchronisation in the wild. arXiv preprint arXiv:2112.04432, 2021.
  15. Vggsound: A large-scale audio-visual dataset. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 721–725. IEEE, 2020.
  16. Be everywhere-hear everything (bee): Audio scene reconstruction by sparse audio-visual samples. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7853–7862, 2023.
  17. Real acoustic fields: An audio-visual room acoustics dataset and benchmark. In The IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR), 2024.
  18. Structure from silence: Learning scene structure from ambient sound. In 5th Annual Conference on Robot Learning, 2021.
  19. Sound localization from motion: Jointly learning sound direction and camera rotation. In International Conference on Computer Vision (ICCV), 2023.
  20. Ilvr: Conditioning method for denoising diffusion probabilistic models. arXiv preprint arXiv:2108.02938, 2021.
  21. Adverb: Visually guided audio dereverberation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7884–7896, 2023.
  22. Classical Music Reimagined. Fun with spectrograms! how to make an image using sound and music. Available from: https://www.youtube.com/watch?v=N2DQFfID6eY, 2017. Youtube video.
  23. DeepFloyd Lab at StabilityAI. DeepFloyd IF: a novel state-of-the-art open-source text-to-image model with a high degree of photorealism and language understanding. https://www.deepfloyd.ai/deepfloyd-if, 2023.
  24. P. Dhariwal and A. Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
  25. Conditional generation of audio from video via foley analogies. Computer Vision and Pattern Recognition (CVPR), 2023.
  26. Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and mcmc. In International Conference on Machine Learning, pages 8489–8510. PMLR, 2023.
  27. Compositional visual generation with energy based models. Advances in Neural Information Processing Systems, 33:6637–6647, 2020.
  28. Diffusion self-guidance for controllable image generation. Advances in Neural Information Processing Systems, 36:16222–16239, 2023.
  29. Fast timing-conditioned latent audio diffusion. arXiv preprint arXiv:2402.04825, 2024.
  30. Self-supervised video forensics by audio-visual anomaly detection. Computer Vision and Pattern Recognition (CVPR), 2023.
  31. S. Forsgren and H. Martiros. Riffusion - Stable diffusion for real-time music generation, 2022.
  32. Visualechoes: Spatial visual representation learning through echolocation. In European Conference on Computer Vision (ECCV), 2020.
  33. R. Gao and K. Grauman. 2.5d visual sound. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  34. R. Gao and K. Grauman. Visualvoice: Audio-visual speech separation with cross-modal consistency. In Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  35. Listen to look: Action recognition by previewing audio. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10457–10467, 2020.
  36. D. Geng and A. Owens. Motion guidance: Diffusion-based image editing with differentiable motion estimators. arXiv preprint arXiv:2401.18085, 2024.
  37. Factorized diffusion: Perceptual illusions by noise decomposition. arXiv:2404.11615, April 2024.
  38. Visual anagrams: Generating multi-view optical illusions with diffusion models. In CVPR, 2024.
  39. Text-to-audio generation using instruction tuned llm and latent diffusion model. arXiv preprint arXiv:2304.13731, 2023.
  40. Imagebind: One embedding space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15180–15190, 2023.
  41. Emu video: Factorizing text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709, 2023.
  42. Contrastive audio-visual masked autoencoder. arXiv preprint arXiv:2210.07839, 2022.
  43. D. Griffin and J. Lim. Signal estimation from modified short-time fourier transform. IEEE Transactions on acoustics, speech, and signal processing, 32(2):236–243, 1984.
  44. G. Gwardys and D. Grzywczak. Deep image features in music information retrieval. International Journal of Electronics and Telecommunications, 60:321–326, 2014.
  45. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
  46. Clipscore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718, 2021.
  47. G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural computation, 2002.
  48. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
  49. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  50. J. Ho and T. Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
  51. Video diffusion models. Advances in Neural Information Processing Systems, 35:8633–8646, 2022.
  52. Mix and localize: Localizing sound sources in mixtures. Computer Vision and Pattern Recognition (CVPR), 2022.
  53. Egocentric audio-visual object localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22910–22921, 2023.
  54. Epic-sounds: A large-scale dataset of actions that sound. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.
  55. V. Iashin and E. Rahtu. Taming visually guided sound generation. arXiv preprint arXiv:2110.08791, 2021.
  56. Synchformer: Efficient synchronization from sparse cues. arXiv preprint arXiv:2401.16423, 2024.
  57. Denoising diffusion restoration models. Advances in Neural Information Processing Systems, 35:23593–23606, 2022.
  58. Pixels that sound. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 1, pages 88–95. IEEE, 2005.
  59. Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. Advances in neural information processing systems, 33:17022–17033, 2020.
  60. Av-nerf: Learning neural fields for real-world audio-visual scene synthesis. Advances in Neural Information Processing Systems, 36, 2024.
  61. Y.-B. Lin and G. Bertasius. Siamese vision transformers are scalable audio-visual learners. arXiv preprint arXiv:2403.19638, 2024.
  62. Vision transformers are parameter-efficient audio-visual learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2299–2309, 2023.
  63. Audioldm: Text-to-audio generation with latent diffusion models. arXiv preprint arXiv:2301.12503, 2023.
  64. Audioldm 2: Learning holistic audio generation with self-supervised pretraining. arXiv preprint arXiv:2308.05734, 2023.
  65. Compositional visual generation with composable diffusion models. In European Conference on Computer Vision, pages 423–439. Springer, 2022.
  66. S. Luo and W. Hu. Diffusion probabilistic models for 3d point cloud generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2837–2845, 2021.
  67. T-vsl: Text-guided visual sound source localization in mixtures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
  68. Learning spatial features from audio-visual correspondence in egocentric videos. arXiv preprint arXiv:2307.04760, 2023.
  69. Foleygen: Visually-guided audio generation. arXiv preprint arXiv:2309.10537, 2023.
  70. Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021.
  71. Learning state-aware visual representations from audible interactions. Advances in Neural Information Processing Systems, 35:23765–23779, 2022.
  72. S. Mo and P. Morgado. Localizing visual sounds the easy way. In European Conference on Computer Vision, pages 218–234. Springer, 2022.
  73. Self-supervised generation of spatial audio for 360 video. Advances in neural information processing systems, 31, 2018.
  74. Audio-visual instance discrimination with cross-modal agreement. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12475–12486, 2021.
  75. Glide: Towards photorealistic image generation and editing with text-guided diffusion models, 2021.
  76. Nine Inch Nails. Year zero, 2007. Music Album.
  77. Audio-visual glance network for efficient video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10150–10159, 2023.
  78. A. Owens and A. A. Efros. Audio-visual scene analysis with self-supervised multisensory features. In Proceedings of the European conference on computer vision (ECCV), pages 631–648, 2018.
  79. N. Oxman. Sympawnies: animal portraits made of musical notations. Available from: https://www.youtube.com/@Sympawnies, 2023. Youtube Channel.
  80. Rethinking cnn models for audio classification. arXiv preprint arXiv:2007.11154, 2020.
  81. Can clip help sound source localization? In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5711–5720, 2024.
  82. A fast griffin-lim algorithm. In 2013 IEEE workshop on applications of signal processing to audio and acoustics, pages 1–4. IEEE, 2013.
  83. Dreamfusion: Text-to-3d using 2d diffusion. ICLR, 2023.
  84. Fatezero: Fusing attentions for zero-shot text-based video editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15932–15942, 2023.
  85. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  86. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  87. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 conference proceedings, pages 1–10, 2022.
  88. Photorealistic text-to-image diffusion models with deep language understanding, 2022.
  89. Sound source localization is all about cross-modal alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7777–7787, 2023.
  90. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022.
  91. Deep unsupervised learning using nonequilibrium thermodynamics. In F. Bach and D. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 2256–2265, Lille, France, 07–09 Jul 2015. PMLR.
  92. Self-supervised visual acoustic matching. Advances in Neural Information Processing Systems, 36, 2024.
  93. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
  94. Y. Song and S. Ermon. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019.
  95. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
  96. Eventfulness for interactive video alignment. ACM Transactions on Graphics (TOG), 42(4):1–10, 2023.
  97. Sound to visual scene generation by audio-to-visual latent alignment. Computer Vision and Pattern Recognition (CVPR), 2023.
  98. Fourier features let networks learn high frequency functions in low dimensional domains. Advances in neural information processing systems, 33:7537–7547, 2020.
  99. The Beatles. Lucy in the sky with diamonds, 1967.
  100. Tool. 10,000 days, 2006. Volcano Entertainment.
  101. D. Ulyanov. Audio texture synthesis and style transfer. https://dmitryulyanov.github.io/audio-texture-synthesis-and-style-transfer, 2016.
  102. Zero-shot image restoration using denoising diffusion null-space model. The Eleventh International Conference on Learning Representations, 2023.
  103. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7623–7633, 2023.
  104. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.
  105. Sonicvisionlm: Playing sound with vision language models. arXiv preprint arXiv:2401.04394, 2024.
  106. Auffusion: Leveraging the power of diffusion and large language models for text-to-audio generation. arXiv preprint arXiv:2401.01044, 2024.
  107. Hiding video in audio via reversible generative models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1100–1109, 2019.
  108. Telling left from right: Learning spatial correspondence of sight and sound. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9932–9941, 2020.
  109. Scalable diffusion for materials generation. arXiv preprint arXiv:2311.09235, 2023.
  110. Cameras as rays: Pose estimation via ray diffusion. In International Conference on Learning Representations (ICLR), 2024.
  111. The sound of motions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1735–1744, 2019.
  112. Thinimg: Cross-modal steganography for presenting talking heads in images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5553–5562, 2024.
Authors (3)
  1. Ziyang Chen (91 papers)
  2. Daniel Geng (9 papers)
  3. Andrew Owens (52 papers)
Citations (5)