UniBriVL: Robust Universal Representation and Generation of Audio Driven Diffusion Models (2307.15898v2)

Published 29 Jul 2023 in cs.SD, cs.AI, and eess.AS

Abstract: Multimodal large models have been recognized for their strong performance across a range of downstream tasks. Developing these models is a crucial step toward general artificial intelligence. In this paper, we propose a novel universal language representation learning method called UniBriVL, which is based on Bridging-Vision-and-Language (BriVL). UniBriVL embeds audio, image, and text into a shared space, enabling a variety of multimodal applications. Our approach addresses major challenges in robust language (both text and audio) representation learning and effectively captures the correlation between audio and image. Additionally, we present a qualitative evaluation of images generated with UniBriVL, which highlights the potential of our approach for creating images from audio. Overall, our experimental results demonstrate the efficacy of UniBriVL in downstream tasks and its ability to choose appropriate images from audio. The proposed approach has potential applications in areas such as speech recognition, music signal processing, and captioning systems.
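
To illustrate what "choosing appropriate images from audio" in a shared embedding space can look like, here is a minimal sketch. It is not the UniBriVL implementation: the encoders are omitted, the vectors are hand-written toy values, and the function names `cosine_similarity` and `rank_images_for_audio` are hypothetical. It only assumes that audio and images have already been mapped into the same vector space, so that retrieval reduces to ranking by cosine similarity.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_images_for_audio(audio_embedding, image_embeddings):
    """Return image names sorted from most to least similar to the audio query.

    `image_embeddings` maps an image name to its embedding vector.
    Both inputs are assumed to come from encoders that share one space.
    """
    scores = {
        name: cosine_similarity(audio_embedding, emb)
        for name, emb in image_embeddings.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy 3-dimensional embeddings (illustrative values, not model outputs).
audio_query = [0.9, 0.1, 0.0]
images = {
    "dog.jpg": [1.0, 0.0, 0.0],
    "piano.jpg": [0.0, 1.0, 0.0],
    "ocean.jpg": [0.0, 0.0, 1.0],
}

ranking = rank_images_for_audio(audio_query, images)
print(ranking)
```

In this toy setup the audio vector points mostly along the first axis, so the image whose embedding lies on that axis ranks first. A real system would replace the hand-written vectors with the outputs of trained audio and image encoders.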

Authors (4)
  1. Sen Fang (16 papers)
  2. Bowen Gao (14 papers)
  3. Yangjian Wu (2 papers)
  4. Teik Toe Teoh (4 papers)
Citations (1)
