C3LLM: Conditional Multimodal Content Generation Using Large Language Models (2405.16136v1)

Published 25 May 2024 in cs.AI, cs.CL, cs.LG, cs.SD, and eess.AS

Abstract: We introduce C3LLM (Conditioned-on-Three-Modalities LLMs), a novel framework that combines three tasks: video-to-audio, audio-to-text, and text-to-audio generation. C3LLM adapts the LLM structure as a bridge for aligning different modalities, synthesizing the given conditional information, and performing multimodal generation in a discrete manner. Our contributions are as follows. First, we adopt a hierarchical structure for audio generation tasks with pre-trained audio codebooks. Specifically, we train the LLM to generate audio semantic tokens from the given conditions, and then use a non-autoregressive transformer to generate the different levels of acoustic tokens layer by layer, improving the fidelity of the generated audio. Second, based on the intuition that LLMs were originally designed for discrete next-word prediction, we use discrete representations for audio generation and compress their semantic meaning into acoustic tokens, akin to adding an "acoustic vocabulary" to the LLM. Third, our method combines the previously separate tasks of audio understanding, video-to-audio generation, and text-to-audio generation into one unified, end-to-end model, providing greater versatility. C3LLM achieves improved results across various automated evaluation metrics, with better semantic alignment than previous methods.
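
The two-stage, discrete-token pipeline described in the abstract can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the module names (SemanticLM, AcousticNAR), the vocabulary sizes, and the number of codec levels are assumptions, and C3LLM itself builds the autoregressive stage on a pretrained LLM with an added "acoustic vocabulary" rather than the small transformer used here.

```python
# Minimal sketch (not the authors' code) of the pipeline the abstract describes:
# an autoregressive LM maps conditional inputs to audio *semantic* tokens, and a
# non-autoregressive transformer then predicts multi-level *acoustic* tokens
# (one prediction head per codec codebook level). All names and sizes are
# illustrative assumptions.
import torch
import torch.nn as nn

VOCAB_SEMANTIC = 1024    # assumed size of the "acoustic vocabulary" added to the LLM
VOCAB_ACOUSTIC = 1024    # assumed codebook size per residual level of the audio codec
NUM_LEVELS = 4           # assumed number of codec codebook levels


class SemanticLM(nn.Module):
    """Autoregressive stage: conditional embeddings -> audio semantic tokens."""
    def __init__(self, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SEMANTIC, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, VOCAB_SEMANTIC)

    def forward(self, cond_emb, sem_tokens):
        # cond_emb: (B, T_cond, d) from a video or text encoder, used as a prefix.
        x = torch.cat([cond_emb, self.embed(sem_tokens)], dim=1)
        # Causal mask so each position attends only to the prefix and its past.
        causal = torch.full((x.size(1), x.size(1)), float("-inf")).triu(1)
        h = self.backbone(x, mask=causal)
        return self.head(h[:, cond_emb.size(1):])  # next-token logits per step


class AcousticNAR(nn.Module):
    """Non-autoregressive stage: semantic tokens -> acoustic tokens per codec level."""
    def __init__(self, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SEMANTIC, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.heads = nn.ModuleList(
            nn.Linear(d_model, VOCAB_ACOUSTIC) for _ in range(NUM_LEVELS)
        )

    def forward(self, sem_tokens):
        h = self.backbone(self.embed(sem_tokens))   # no causal mask: fully parallel
        return [head(h) for head in self.heads]     # one logits tensor per codec level


# Toy usage: 2 clips, 16 conditional frames, 50 semantic-token steps.
cond = torch.randn(2, 16, 512)
sem = torch.randint(0, VOCAB_SEMANTIC, (2, 50))
sem_logits = SemanticLM()(cond, sem)        # (2, 50, VOCAB_SEMANTIC)
acoustic_logits = AcousticNAR()(sem)        # NUM_LEVELS tensors of (2, 50, VOCAB_ACOUSTIC)
```

In this reading of the abstract, only the semantic stage is autoregressive; the acoustic stage attends bidirectionally and emits all codebook levels in parallel, which is what lets the hierarchical refinement improve fidelity without slowing decoding. The resulting acoustic tokens would then be decoded to a waveform by the pre-trained audio codec.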

Authors (4)
  1. Zixuan Wang (82 papers)
  2. Qinkai Duan (2 papers)
  3. Yu-Wing Tai (123 papers)
  4. Chi-Keung Tang (81 papers)
Citations (2)