- The paper presents the Stark dataset, which addresses gaps in long-term multi-modal conversation through personalized image sharing and detailed persona commonsense knowledge.
- It introduces the Mcu framework that leverages ChatGPT and a Plan-and-Execute image aligner to generate realistic dialogues and aligned images.
- The Ultron 7B model, fine-tuned on Stark, outperforms current methods in dialogue-to-image retrieval with superior Recall@K and MRR scores.
Social Long-Term Multi-Modal Conversation with Persona Commonsense Knowledge
The paper presents a comprehensive study of enhancing multi-modal conversational agents through the development of a large-scale dataset named Stark, which focuses on long-term interactions and incorporates personalized image-sharing behavior. It addresses two significant gaps in current multi-modal conversational datasets: the limited representation of extended interaction sessions and the lack of personalization in image-sharing behaviors.
Dataset Construction
Stark is a pioneering dataset encompassing social personas, diverse time intervals, multiple interaction sessions, and image-sharing moments. It is constructed with a novel framework called Mcu (Multi-modal contextualization unit), which integrates ChatGPT for dialogue generation with a Plan-and-Execute image aligner for producing personalized images.
Key Contributions
- Stark Dataset: Introduces a large-scale social long-term multi-modal conversation dataset that includes detailed persona-based information, realistic time intervals, and personalized images.
- Mcu Framework: Proposes a new framework that leverages LLMs and image generation models to produce long-term multi-modal conversations grounded in personal demographics.
- Multi-Modal Conversation Model, Ultron 7B: Develops a capable multi-modal conversation model fine-tuned on Stark to understand conversational context, generate coherent dialogue, and retrieve images appropriate to that context.
Mcu Framework Details
Mcu operates in several stages:
- Demographic Initialization: Begins by collecting basic demographic information (age, gender, birthplace, residence).
- Persona Generation: Generates persona attributes and sentences with ChatGPT, reflecting the user's demographics and individual behaviors.
- Commonsense Knowledge Generation: Constructs commonsense knowledge graphs that provide insights into routines, goals, relationships, experiences, and characteristics, grounded in the demographic attributes.
- Event Sequence and Pre-Stored Device Images Generation: Uses ChatGPT to generate temporal event sequences and descriptions of images plausibly pre-stored on the user's device, relevant to their experiences.
- Multi-Modal Dialogue Generation: Constructs detailed dialogues using the temporal event sequences, with multiple sessions interlinked through episodic experiences and image sharing.
- Image Alignment: Runs a Plan-and-Execute procedure to align appropriate images to the dialogue context, combining generative models, image-database retrieval, and web search (see the sketches after this list).
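To make these stages concrete, below is a minimal Python sketch of how the pipeline could be chained together. The `chat` helper, prompt wording, and field names are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of chaining the Mcu stages described above.
# `chat`, the prompts, and the field names are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class UserProfile:
    age: int
    gender: str
    birthplace: str
    residence: str
    personas: list = field(default_factory=list)       # persona sentences
    knowledge: list = field(default_factory=list)      # commonsense inferences
    events: list = field(default_factory=list)         # temporal event sequence
    device_images: list = field(default_factory=list)  # pre-stored image descriptions

def chat(prompt: str) -> str:
    # Stand-in for a ChatGPT call; swap in a real API client here.
    return "stub response for: " + prompt[:40]

def build_profile(age: int, gender: str, birthplace: str, residence: str) -> UserProfile:
    p = UserProfile(age, gender, birthplace, residence)
    # Stage 2: persona attributes/sentences grounded in demographics.
    p.personas = chat(f"Write persona sentences for a {age}-year-old {gender} "
                      f"born in {birthplace}, living in {residence}.").splitlines()
    # Stage 3: persona commonsense knowledge (routines, goals, relationships, ...).
    p.knowledge = chat(f"List commonsense inferences for: {p.personas}").splitlines()
    # Stage 4: temporal event sequence plus plausible pre-stored device images.
    p.events = chat(f"Generate a dated event sequence from: {p.knowledge}").splitlines()
    p.device_images = chat(f"Describe photos already on this user's device, "
                           f"given: {p.events}").splitlines()
    return p
```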
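The Plan-and-Execute step can likewise be pictured as a two-phase routine: plan which image source fits a given image-sharing turn, then execute that choice. The routing heuristic and stub backends below are assumptions for illustration; the paper's actual aligner is more elaborate.

```python
# A hedged sketch of a Plan-and-Execute style image aligner:
# first plan which source suits an image-sharing turn, then execute it.

def plan(description: str, pre_stored: list[str]) -> str:
    """Choose an image source for one image-sharing turn."""
    if description in pre_stored:
        return "retrieve"       # reuse a pre-stored device image
    if "my" in description.lower().split():
        return "generate"       # personalized content -> text-to-image model
    return "web_search"         # generic content -> image web search

def execute(action: str, description: str) -> str:
    # Each backend is stubbed; in practice these would call an image
    # database, a diffusion model, and a search API respectively.
    backends = {
        "retrieve": lambda d: f"db://{d}",
        "generate": lambda d: f"t2i://{d}",
        "web_search": lambda d: f"web://{d}",
    }
    return backends[action](description)

turn = "a photo of my golden retriever at the park"
print(execute(plan(turn, pre_stored=[]), turn))  # -> t2i://a photo of my ...
```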
Dataset Analysis
Upon analysis, Stark shows significant variety and balance across different demographic attributes, age groups, and gender distributions. The dataset also ensures diverse persona information, portraying realistic and relatable user profiles. The wide range of device image categories and well-represented time intervals between session dialogues further enhance its authenticity.
Human Evaluation
Stark undergoes extensive human evaluation, scoring highly on coherence, consistency, and relevance of image-sharing turns. Notably, in head-to-head comparisons it surpasses existing datasets such as DialogCC and MMDialog in engagingness, specificity, and overall quality.
Dialogue-to-Image Retrieval Results
When tested on the dialogue-to-image retrieval task, the Ultron 7B model trained on Stark shows significant improvements over existing methods, including recent large multi-modal models such as LLaVA and GPT-4V. The gains appear in metrics such as Recall@K and MRR, underscoring how Stark enhances image-sharing behavior in conversational models.
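For reference, Recall@K and MRR are standard retrieval metrics: each dialogue context produces a ranking over candidate images, and the metrics summarize where the ground-truth image lands. A minimal sketch:

```python
# Standard Recall@K and MRR for dialogue-to-image retrieval, where
# `ranks` holds the 1-based rank of the gold image for each dialogue.

def recall_at_k(ranks: list[int], k: int) -> float:
    """Fraction of queries whose gold image is ranked within the top K."""
    return sum(r <= k for r in ranks) / len(ranks)

def mean_reciprocal_rank(ranks: list[int]) -> float:
    """Average of 1/rank of the gold image over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Example: gold image ranked 1st, 3rd, and 12th across three dialogues.
ranks = [1, 3, 12]
print(recall_at_k(ranks, 5))        # 0.666...
print(mean_reciprocal_rank(ranks))  # (1 + 1/3 + 1/12) / 3 ≈ 0.472
```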
Implications and Future Work
The work paves the way for more nuanced, contextually aware multi-modal conversational agents capable of personalized interaction over extended periods. The inclusion of detailed persona information and realistic image-sharing moments makes Stark a valuable resource for further research and development in human-AI interaction.
Future research could address limitations such as ensuring human-face consistency in personalized images and integrating finer-grained personality traits and conversational styles into AI assistants. Ongoing advances in generative models promise even richer and more contextually accurate datasets in the future.
In conclusion, the paper presents a robust and innovative dataset that substantially improves the multi-modal conversational capabilities of AI models, setting a new standard for future research in the domain.