DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention (2309.14327v3)

Published 25 Sep 2023 in cs.CV and cs.CL

Abstract: Most existing multi-modal models, hindered by their inability to adeptly manage interleaved image-and-text inputs in multi-image, multi-round dialogues, face substantial constraints on training resources and data accessibility, limiting their adaptability and scalability across varied interaction settings. To address this, we present the DeepSpeed-VisualChat framework, designed to optimize LLMs by incorporating multi-modal capabilities, with a focus on enhancing the proficiency of Large Vision and Language Models (LVLMs) in handling interleaved inputs. Our framework is notable for (1) its open-source support for multi-round and multi-image dialogues, (2) introducing an innovative multi-modal causal attention mechanism, and (3) utilizing data blending techniques on existing datasets to ensure seamless interactions in multi-round, multi-image conversations. Compared to existing frameworks, DeepSpeed-VisualChat shows superior scalability up to a 70B-parameter LLM size, representing a significant advancement in multi-modal LLMs and setting a solid foundation for future explorations.
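
The multi-modal causal attention mechanism itself is detailed in the full paper; as a rough illustration of the general idea, the sketch below builds an attention mask for an interleaved image/text token sequence in which text tokens attend causally to all earlier text and image tokens, while image tokens attend only within their own image. This is a minimal, hypothetical sketch: the function name, segment layout, and masking details are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def build_interleaved_causal_mask(segments):
    """Build a boolean [seq, seq] attention mask for an interleaved sequence.

    segments: list of ("text" | "image", length) tuples in sequence order.
    Returns mask[q, k] == True when query position q may attend to key k.
    (Hypothetical helper, not DeepSpeed-VisualChat's actual code.)
    """
    # Expand segments into per-token modality labels and per-image ids.
    labels, image_ids = [], []
    img_id = -1
    for kind, length in segments:
        if kind == "image":
            img_id += 1
        for _ in range(length):
            labels.append(kind)
            image_ids.append(img_id if kind == "image" else -1)

    n = len(labels)
    mask = np.zeros((n, n), dtype=bool)
    for q in range(n):              # query position
        for k in range(q + 1):      # causal: keys at or before the query
            if labels[q] == "text":
                # Text queries see all earlier text and image tokens.
                mask[q, k] = True
            else:
                # Image queries see only tokens belonging to the same image.
                mask[q, k] = (image_ids[k] == image_ids[q])
    return mask

if __name__ == "__main__":
    # Hypothetical layout: <image_1><text><image_2><text>
    m = build_interleaved_causal_mask(
        [("image", 4), ("text", 3), ("image", 4), ("text", 3)]
    )
    print(m.astype(int))
```

Running the example prints a 14x14 binary mask for the hypothetical image/text/image/text layout, showing text positions attending across modalities while each image block stays self-contained.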

Authors (9)
  1. Zhewei Yao (64 papers)
  2. Xiaoxia Wu (30 papers)
  3. Conglong Li (15 papers)
  4. Minjia Zhang (54 papers)
  5. Heyang Qin (6 papers)
  6. Olatunji Ruwase (20 papers)
  7. Ammar Ahmad Awan (15 papers)
  8. Samyam Rajbhandari (21 papers)
  9. Yuxiong He (59 papers)