i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data (2305.12311v1)
Abstract: The convergence of text, visual, and audio data is a key step towards human-like artificial intelligence; however, the current Vision-Language-Speech landscape is dominated by encoder-only models that lack generative abilities. We propose closing this gap with i-Code V2, the first model capable of generating natural language from any combination of Vision, Language, and Speech data. i-Code V2 is an integrative system that leverages state-of-the-art single-modality encoders, combining their outputs with a new modality-fusing encoder to flexibly project combinations of modalities into a shared representational space. Language tokens are then generated from these representations via an autoregressive decoder. The whole framework is pretrained end-to-end on a large collection of dual- and single-modality datasets using a novel text completion objective that generalizes across arbitrary combinations of modalities. i-Code V2 matches or outperforms state-of-the-art single- and dual-modality baselines on 7 multimodal tasks, demonstrating the power of generative multimodal pretraining across a diversity of tasks and signals.
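The abstract describes a three-stage pipeline: pretrained single-modality encoders, a modality-fusing encoder that maps any subset of modalities into a shared space, and an autoregressive decoder trained with a text completion objective. The sketch below illustrates that flow in PyTorch; all module names, layer counts, and feature dimensions are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class ICodeV2Sketch(nn.Module):
    """Hypothetical sketch of the pipeline described in the abstract:
    single-modality features -> modality-fusing encoder -> autoregressive
    language decoder trained with a text-completion loss."""

    def __init__(self, d_model=512, vocab_size=32000,
                 n_fusion_layers=4, n_decoder_layers=4):
        super().__init__()
        # Stand-ins for features produced by pretrained single-modality
        # encoders (dimensions 768 / 1024 are assumptions for illustration).
        self.vision_proj = nn.Linear(768, d_model)
        self.speech_proj = nn.Linear(1024, d_model)
        self.text_embed = nn.Embedding(vocab_size, d_model)

        # Modality-fusing encoder: projects any combination of modalities
        # into a shared representational space.
        fusion_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, n_fusion_layers)

        # Autoregressive decoder that generates language tokens conditioned
        # on the fused multimodal representation.
        decoder_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, n_decoder_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, text_ids, vision_feats=None, speech_feats=None):
        # Gather whichever modalities are present (any combination works).
        streams = [self.text_embed(text_ids)]
        if vision_feats is not None:
            streams.append(self.vision_proj(vision_feats))
        if speech_feats is not None:
            streams.append(self.speech_proj(speech_feats))
        memory = self.fusion(torch.cat(streams, dim=1))

        # Teacher-forced autoregressive decoding with a causal mask.
        tgt = self.text_embed(text_ids)
        seq_len = tgt.size(1)
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(hidden)


# Text-completion style objective: predict the next token of the target text
# given the fused multimodal context (batch sizes and lengths are arbitrary).
model = ICodeV2Sketch()
text = torch.randint(0, 32000, (2, 16))
vision = torch.randn(2, 10, 768)
speech = torch.randn(2, 20, 1024)
logits = model(text[:, :-1], vision_feats=vision, speech_feats=speech)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), text[:, 1:].reshape(-1))
```

Because missing modalities are simply omitted from the concatenated encoder input, the same objective applies to single-, dual-, and triple-modality training examples, which is the property the abstract highlights.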
- Ziyi Yang
- Mahmoud Khademi
- Yichong Xu
- Reid Pryzant
- Yuwei Fang
- Chenguang Zhu
- Dongdong Chen
- Yao Qian
- Mei Gao
- Yi-Ling Chen
- Robert Gmyr
- Naoyuki Kanda
- Noel Codella
- Bin Xiao
- Yu Shi
- Lu Yuan
- Takuya Yoshioka
- Michael Zeng
- Xuedong Huang