- The paper finds that without explicit prompts, MLLMs do not naturally achieve efficient, human-like conversational adaptation.
- The ICCA framework automates the evaluation by comparing conversational behaviors of models against human interaction patterns.
- Experimental results show that while access to interaction history boosts listener accuracy, models struggle to handle larger or reshuffled image contexts and, as speakers, to maintain lexical consistency.
Evaluating In-context Conversational Adaptation in Multimodal LLMs
The paper "Talk Less, Interact Better: Evaluating In-context Conversational Adaptation in Multimodal LLMs" by Yilun Hua and Yoav Artzi investigates the capacity of multimodal LLMs (MLLMs) to adapt their linguistic behavior for efficient communication during interactions. This paper draws on the established phenomena in human communication where interlocutors progressively develop more concise and efficient ways to refer to concepts and objects through the formation of ad-hoc linguistic conventions. The authors have implemented an automated evaluation framework, ICCA (In-context Conversational Adaptation), to analyze this behavior in several state-of-the-art MLLMs.
Methodological Framework: ICCA
The ICCA framework simulates human interactions using a corpus of reference game interactions, enabling fully automated evaluation without additional data collection. In the reference game setup, a speaker describes a target image drawn from a shared set of images, and a listener selects the referenced image based on that description. The paper's core approach is to compare how a model's behavior changes over the course of an interaction, when it acts as either speaker or listener, against the changes observed in human-human interactions.
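To make the setup concrete, below is a minimal sketch of a single reference-game interaction loop in the spirit of ICCA. The API here (`speaker_model.describe`, `listener_model.select`) is a hypothetical placeholder, not the framework's actual interface.

```python
import random

def run_reference_game(speaker_model, listener_model, images, num_trials):
    """Simulate a reference game: per trial, a speaker describes a target image
    and a listener must pick it out of the shared context (hypothetical API)."""
    history = []   # running interaction history, shared with both roles
    correct = 0
    for trial in range(num_trials):
        random.shuffle(images)           # reshuffle the referential context
        target = random.choice(images)   # image the speaker must refer to
        # Speaker produces a referring expression conditioned on the history.
        message = speaker_model.describe(target, context=images, history=history)
        # Listener maps the message back to one of the candidate images.
        guess = listener_model.select(message, context=images, history=history)
        correct += int(guess == target)
        history.append({"trial": trial, "target": target,
                        "message": message, "guess": guess})
    return correct / num_trials, history
```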
Experimental Setup and Models
The paper evaluates five prominent MLLMs: IDEFICS, LLaVA-1.5, GPT4-vision, Gemini 1.0 Pro Vision, and Claude 3 Opus. The experiments fall into two main categories: model-as-speaker and model-as-listener.
Model-as-speaker experiments: These experiments assess whether MLLMs can spontaneously adapt their language to become more efficient, as human interlocutors do. Four prompt conditions are compared (a sketch of the prompt variants follows the list):
- S1 (Standard Speaker): Models are evaluated with the basic game instruction.
- S2 (Gricean Instruction): Models receive instructions to follow the Gricean quantity maxim, instructing them to be informative but not overly so.
- S3 (Explicit Instruction): Models are explicitly instructed to reduce message length as the interaction progresses.
- S4 (Explicit Instruction + Consistency Request): Besides reducing message length, models are instructed to maintain lexical consistency.
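The four speaker settings differ only in the instruction given to the model. The sketch below shows one way such prompt variants might be organized; the wording is an illustrative paraphrase, not the paper's verbatim prompts.

```python
# Illustrative paraphrases of the four speaker-prompt conditions (S1-S4).
# These strings are assumptions for demonstration, not the paper's exact prompts.
SPEAKER_PROMPTS = {
    "S1_standard": (
        "You are the speaker in a reference game. Describe the target image "
        "so the listener can pick it out of the shared set."
    ),
    "S2_gricean": (
        "Describe the target image. Be as informative as needed for the "
        "listener to identify it, but no more informative than necessary."
    ),
    "S3_explicit_shortening": (
        "Describe the target image. As the interaction progresses, make your "
        "descriptions progressively shorter."
    ),
    "S4_shortening_plus_consistency": (
        "Describe the target image. Make your descriptions progressively shorter "
        "and reuse the wording you used for this image in earlier trials."
    ),
}

def build_speaker_prompt(setting: str, trial_text: str) -> str:
    """Compose the condition-specific instruction with the per-trial task text."""
    return f"{SPEAKER_PROMPTS[setting]}\n\n{trial_text}"
```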
Model-as-listener experiments: These experiments evaluate the models' accuracy in identifying referenced images as interactions progress, under different levels of historical and contextual information (a sketch of these variants follows the list):
- L1 (Standard Listener): A setup where the referential context is reshuffled for each trial.
- L2 (No History): Models receive each trial in isolation.
- L3 (Images Once): The image context is shown only once at the beginning.
- L4 (No Shuffle): Images are shown every trial without shuffling.
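These listener settings vary how much interaction history and image context the model sees on each trial. A minimal sketch of how they could be encoded is shown below; the flags and names are hypothetical, not part of ICCA's actual interface, and details not stated in the summary (e.g., shuffling in L2) are assumptions.

```python
from dataclasses import dataclass

@dataclass
class ListenerSetting:
    """Hypothetical encoding of the listener-context variants (L1-L4)."""
    keep_history: bool        # include previous trials in the prompt
    images_every_trial: bool  # re-send the image set on each trial
    shuffle_images: bool      # reshuffle image order between trials

LISTENER_SETTINGS = {
    "L1_standard":    ListenerSetting(keep_history=True,  images_every_trial=True,  shuffle_images=True),
    "L2_no_history":  ListenerSetting(keep_history=False, images_every_trial=True,  shuffle_images=True),
    "L3_images_once": ListenerSetting(keep_history=True,  images_every_trial=False, shuffle_images=False),
    "L4_no_shuffle":  ListenerSetting(keep_history=True,  images_every_trial=True,  shuffle_images=False),
}
```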
Key Findings
Speaker Experiments: Without explicit instructions, MLLMs (including GPT4, Gemini, and Claude) do not naturally become more efficient communicators. Even when given the Gricean instruction to be informative but not excessively so, the models fail to show the nuanced pragmatic adaptation displayed by humans. Only with highly engineered, explicit prompts do the models begin to reduce message length and maintain lexical consistency in a manner akin to human interlocutors; such heavy-handed prompt interventions, however, are not a feasible long-term solution.
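The efficiency trends behind these observations can be quantified with simple per-trial metrics such as message length and lexical overlap between successive references to the same image. A rough sketch follows, using whitespace tokenization and Jaccard overlap as stand-ins; these particular metrics are simplifications, not necessarily the paper's exact measures.

```python
def message_length(message: str) -> int:
    """Length of a message in whitespace tokens (a simplification of token counts)."""
    return len(message.split())

def lexical_overlap(current: str, previous: str) -> float:
    """Jaccard overlap between word sets, a rough proxy for lexical consistency."""
    cur, prev = set(current.lower().split()), set(previous.lower().split())
    if not cur or not prev:
        return 0.0
    return len(cur & prev) / len(cur | prev)

# Example: descriptions of the same image across three trials of an interaction.
descriptions = [
    "the dog lying on a red blanket next to the window",
    "dog on the red blanket",
    "red blanket dog",
]
lengths = [message_length(d) for d in descriptions]          # -> [11, 5, 3]
overlaps = [lexical_overlap(descriptions[i], descriptions[i - 1])
            for i in range(1, len(descriptions))]            # -> [0.5, 0.6]
print(lengths, overlaps)
```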
Listener Experiments: Accuracy trajectories show that all models (GPT4, Gemini, and Claude) improve with access to interaction history, particularly when the referential context is simplified (L3). However, GPT4 and IDEFICS still face challenges when the number of images increases or when image positions are shuffled (L1); they exhibit lower accuracy and performance degradation under these more complex settings, suggesting limits in their multi-image handling capabilities.
Implications and Future Directions
The findings underscore that contemporary MLLMs are limited in their ability to adapt spontaneously in a manner consistent with human communicative efficiency. While they can understand and follow evolving language conventions as listeners, the ability to spontaneously generate increasingly concise and lexically stable language does not emerge naturally from their training.
Practical Implications: Enhancing the adaptive capacities of MLLMs could lead to more natural and efficient human-computer interactions. Reduced verbosity in responses would lower communication costs and improve the practical usability of conversational AI systems.
Theoretical Implications: The results imply an intrinsic disparity between human linguistic behavior and current MLLM architectures: unlike human interlocutors, the models incur no perceived effort or cognitive cost when communicating, and so face no internal pressure toward concision.
Future Research: Addressing this gap involves developing training paradigms that instill a sense of communicative effort and efficiency in models. Research could focus on better understanding and encoding pragmatic language principles, or on architectural changes that enhance models' capacity for spontaneous linguistic adaptation. Additionally, improving models' ability to handle larger image contexts without performance degradation will be crucial for advancing multimodal interaction capabilities.
In conclusion, while MLLMs have shown impressive abilities in various tasks, the need for developing their spontaneous conversational adaptation abilities remains critical for achieving more human-like and efficient interactions.