
Del Visual al Auditivo: Sonorización de Escenas Guiada por Imagen [From Visual to Auditory: Image-Guided Scene Sonorization] (2402.01385v1)

Published 2 Feb 2024 in eess.AS and cs.SD

Abstract: Recent advances in generative techniques for image, video, text and audio, and their adoption by the general public, are leading to new forms of content generation. Traditionally, each modality has been handled separately, which imposes limitations. Automatically adding sound to visual sequences (sonorization) is one of the greatest challenges in the automatic generation of multimodal content. We present a processing pipeline that, starting from images extracted from videos, is able to sonorize them. We work with pre-trained models that employ complex encoders, contrastive learning, and multiple modalities, allowing rich representations of the sequences for their sonorization. The proposed scheme offers several options for audio mapping and text guidance. We evaluated the scheme on a dataset of frames extracted from a commercial video game and sounds extracted from the Freesound platform. Subjective tests show that the proposed scheme is able to generate and assign audio to images automatically and appropriately. Moreover, it adapts well to user preferences, and the proposed objective metrics show a high correlation with the subjective ratings.
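
The abstract does not specify how images are mapped to sounds, so the following is only a minimal sketch of one common approach with contrastively trained multimodal encoders: rank candidate sounds by cosine similarity to the image embedding, optionally blended with a text embedding for guidance. The function names, the `text_weight` parameter, and the toy 3-dimensional embeddings are all hypothetical; in practice the embeddings would come from pre-trained encoders, and this is not presented as the authors' method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_sounds(image_emb, sound_embs, text_emb=None, text_weight=0.5):
    """Return sound names ordered from best to worst match.

    Hypothetical helper: assumes image, text, and sound embeddings
    already live in one shared space. If text_emb is given, the score
    is a weighted blend of image-sound and text-sound similarity.
    """
    scores = {}
    for name, emb in sound_embs.items():
        score = cosine(image_emb, emb)
        if text_emb is not None:
            score = (1 - text_weight) * score + text_weight * cosine(text_emb, emb)
        scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)


# Toy embeddings invented for illustration; not taken from the paper.
sounds = {
    "rain": [0.9, 0.1, 0.0],
    "footsteps": [0.1, 0.9, 0.1],
    "engine": [0.0, 0.2, 0.9],
}
frame = [0.8, 0.3, 0.1]

ranking = rank_sounds(frame, sounds)
guided = rank_sounds(frame, sounds, text_emb=[0.0, 1.0, 0.0], text_weight=0.8)
print(ranking)
print(guided)
```

With these toy values the image alone favors "rain", while a strongly weighted text prompt pointing along the second dimension moves "footsteps" to the top, which illustrates how text guidance could steer the assignment toward user preferences.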

Authors (9)
  1. María Sánchez (1 paper)
  2. Julián Arias (1 paper)
  3. Mateo Cámara (7 papers)
  4. Giulia Comini (7 papers)
  5. José Luis Blanco (5 papers)
  6. Juan Ignacio Godino (1 paper)
  7. Luis Alfonso Hernández (1 paper)
  8. Adam Gabrys (8 papers)
  9. Laura Fernández (4 papers)
Citations (1)
