i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data (2305.12311v1)
Abstract: The convergence of text, visual, and audio data is a key step towards human-like artificial intelligence; however, the current Vision-Language-Speech landscape is dominated by encoder-only models that lack generative abilities. We propose closing this gap with i-Code V2, the first model capable of generating natural language from any combination of Vision, Language, and Speech data. i-Code V2 is an integrative system that leverages state-of-the-art single-modality encoders, combining their outputs with a new modality-fusing encoder to flexibly project combinations of modalities into a shared representational space. Language tokens are then generated from these representations via an autoregressive decoder. The whole framework is pretrained end-to-end on a large collection of dual- and single-modality datasets using a novel text completion objective that generalizes across arbitrary combinations of modalities. i-Code V2 matches or outperforms state-of-the-art single- and dual-modality baselines on 7 multimodal tasks, demonstrating the power of generative multimodal pretraining across a diversity of tasks and signals.
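The abstract describes a three-stage pipeline: pretrained single-modality encoders, a modality-fusing encoder that maps any subset of modalities into a shared space, and an autoregressive decoder trained with a text completion objective. The sketch below illustrates that flow in PyTorch; all module names, layer counts, and feature dimensions are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class ICodeV2Sketch(nn.Module):
    """Hypothetical sketch of the pipeline described in the abstract:
    single-modality features -> modality-fusing encoder -> autoregressive
    language decoder trained with a text-completion loss."""

    def __init__(self, d_model=512, vocab_size=32000,
                 n_fusion_layers=4, n_decoder_layers=4):
        super().__init__()
        # Stand-ins for features produced by pretrained single-modality
        # encoders (dimensions 768 / 1024 are assumptions for illustration).
        self.vision_proj = nn.Linear(768, d_model)
        self.speech_proj = nn.Linear(1024, d_model)
        self.text_embed = nn.Embedding(vocab_size, d_model)

        # Modality-fusing encoder: projects any combination of modalities
        # into a shared representational space.
        fusion_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, n_fusion_layers)

        # Autoregressive decoder that generates language tokens conditioned
        # on the fused multimodal representation.
        decoder_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, n_decoder_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, text_ids, vision_feats=None, speech_feats=None):
        # Gather whichever modalities are present (any combination works).
        streams = [self.text_embed(text_ids)]
        if vision_feats is not None:
            streams.append(self.vision_proj(vision_feats))
        if speech_feats is not None:
            streams.append(self.speech_proj(speech_feats))
        memory = self.fusion(torch.cat(streams, dim=1))

        # Teacher-forced autoregressive decoding with a causal mask.
        tgt = self.text_embed(text_ids)
        seq_len = tgt.size(1)
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(hidden)


# Text-completion style objective: predict the next token of the target text
# given the fused multimodal context (batch sizes and lengths are arbitrary).
model = ICodeV2Sketch()
text = torch.randint(0, 32000, (2, 16))
vision = torch.randn(2, 10, 768)
speech = torch.randn(2, 20, 1024)
logits = model(text[:, :-1], vision_feats=vision, speech_feats=speech)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), text[:, 1:].reshape(-1))
```

Because missing modalities are simply omitted from the concatenated encoder input, the same objective applies to single-, dual-, and triple-modality training examples, which is the property the abstract highlights.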
- Ziyi Yang
- Mahmoud Khademi
- Yichong Xu
- Reid Pryzant
- Yuwei Fang
- Chenguang Zhu
- Dongdong Chen
- Yao Qian
- Mei Gao
- Yi-Ling Chen
- Robert Gmyr
- Naoyuki Kanda
- Noel Codella
- Bin Xiao
- Yu Shi
- Lu Yuan
- Takuya Yoshioka
- Michael Zeng
- Xuedong Huang