- The paper demonstrates that multimodal models can infer sound symbolism from text and imagery, though with less consistency than human judgement.
- It shows that closed-source models such as GPT-4 perform better on the Kiki-Bouba and Mil-Mal tasks, especially with enhanced-context prompts.
- The findings suggest that incorporating explicit sound-symbolism training could improve model accuracy in psycholinguistic applications.

Sound Symbolism Experiments with Multimodal LLMs
The paper "With Ears to See and Eyes to Hear: Sound Symbolism Experiments with Multimodal LLMs" by Loakman, Li, and Lin explores the augmented capabilities of Vision LLMs (VLMs) and LLMs to grasp sound-based phenomena through abstract reasoning derived from orthography and imagery alone. This line of inquiry is directed towards understanding if such models, having access only to vision and text modalities, can replicate human-like characteristics when interpreting sound symbolism. The paper focuses on the classical Kiki-Bouba shape symbolism and Mil-Mal magnitude symbolism tasks along with comparing human judgements of linguistic iconicity with those of LLMs.
Analysis of Classic Psycholinguistic Phenomena
Sound symbolism denotes a non-arbitrary relationship between speech sounds and the meanings of the words they constitute. The research leverages LLMs and VLMs to analyze sound symbolism by conducting a series of psycholinguistic tasks designed to test the models' abilities to implicitly understand these phenomena. The experiments considered include:
- Shape Symbolism (Kiki-Bouba Effect): This experiment asks models to associate pseudowords with shapes based on properties such as spikiness or roundness. Closed-source models like GPT-4 showed higher agreement with human judgements, especially when given additional task-specific prompts. Even so, none of the models aligned consistently with human responses, a discrepancy that might be attributed to factors such as data contamination or positional biases (a prompting sketch for this forced-choice setup follows this list).
 
- Magnitude Symbolism (Mil-Mal Effect): Magnitude symbolism associates certain vowels with perceived physical size (e.g., "mil" with smaller entities and "mal" with larger ones). Here, the closed-source GPT-4 again achieved higher accuracy than open-source counterparts such as the LLaVA models. Notably, models performed considerably better when given additional task context, suggesting a latent grasp of the relationship between sound and size (the sketch after this list covers this task as well).
 
- Iconicity Ratings: This task compared LLM judgements of linguistic iconicity, the degree to which a word's form resembles its meaning, against human ratings. Models including GPT-4, GPT-3.5-Turbo, and several sizes of LLaMA-2 were evaluated. The findings showed a positive correlation between model size and the ability to emulate human judgements, with GPT-4 demonstrating the highest agreement (a correlation sketch follows this list).
 
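To make the forced-choice setup behind the first two tasks concrete, here is a minimal sketch of how such a probe might be issued. It assumes the `openai` Python client (v1 API); the pseudowords, prompt wording, and answer parsing are illustrative, not the paper's exact stimuli or protocol.

```python
# Minimal sketch of a forced-choice sound-symbolism probe.
# Assumptions: the openai v1 Python client and an available chat model;
# the pseudowords and wording below are illustrative placeholders,
# not the paper's actual stimuli.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def forced_choice(pseudoword: str, option_a: str, option_b: str,
                  extra_context: str = "") -> str:
    """Ask the model to pair a pseudoword with one of two properties."""
    prompt = (
        f"{extra_context}"
        f"Consider the made-up word '{pseudoword}'. "
        f"Does it better describe something {option_a} or {option_b}? "
        f"Answer with exactly one word: {option_a} or {option_b}."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic choice for scoring
    )
    return response.choices[0].message.content.strip().lower()


# Shape symbolism (Kiki-Bouba): spiky vs. round.
print(forced_choice("kiki", "spiky", "round"))

# Magnitude symbolism (Mil-Mal): small vs. large, with the kind of
# added task context the paper found helpful.
context = ("Sound symbolism is the idea that the sounds of a word "
           "can suggest aspects of its meaning. ")
print(forced_choice("mil", "small", "large", extra_context=context))
```

In a real evaluation the order of the two options would also be swapped across trials, since the positional biases noted above can otherwise skew forced-choice answers.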
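For the iconicity-rating task, agreement with humans reduces to correlating two lists of scores. The sketch below uses Spearman correlation via `scipy`; the words and ratings are made-up placeholders, and the paper's actual scale and correlation measure may differ.

```python
# Sketch of comparing model iconicity ratings with human norms.
# The model_ratings would in practice come from prompting an LLM to
# rate each word's iconicity; all numbers here are invented examples.
from scipy.stats import spearmanr

human_ratings = {"click": 5.9, "squeak": 5.5, "table": 2.1, "idea": 1.8}
model_ratings = {"click": 6.0, "squeak": 4.8, "table": 2.5, "idea": 2.2}

words = sorted(human_ratings)
rho, p = spearmanr([human_ratings[w] for w in words],
                   [model_ratings[w] for w in words])
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```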
Implications and Future Directions
The paper reveals that VLMs and LLMs exhibit varying degrees of human-like sound-symbolism understanding depending on the context provided and the size of the model. These results suggest that models can implicitly learn sound symbolism from the orthographic sequences present in their training data, although not as effectively as humans, likely because they lack auditory input and any explicit training focused on sound attributes. This capability has implications for tasks such as sentiment analysis, creative content generation, and marketing.
From a theoretical standpoint, these results underline the value of multimodal training data in fostering more comprehensive language understanding in models. For future research, explicit pre-training on sound-symbolism-centric datasets could meaningfully improve model performance on related tasks. Additionally, enriching prompts with task-specific context could get more out of existing models, leading to more nuanced and versatile NLP systems (a simple illustration follows).
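As an illustration of the prompt-context manipulation discussed above, the two templates below contrast a bare forced-choice question with one that first defines the phenomenon. Both are hypothetical paraphrases, not the paper's prompts.

```python
# Hypothetical prompt templates: bare vs. enhanced task context.
BASELINE_PROMPT = (
    "Which word better matches a spiky shape: 'kiki' or 'bouba'? "
    "Answer with one word."
)
ENHANCED_PROMPT = (
    "You are taking part in a psycholinguistics experiment on sound "
    "symbolism, the non-arbitrary link between a word's sounds and its "
    "meaning. Which word better matches a spiky shape: 'kiki' or "
    "'bouba'? Answer with one word."
)
```

The enhanced variant simply front-loads a definition of the task, the kind of added context that raised agreement with human judgements in the paper's experiments.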
Conclusion
The paper offers an insightful exploration of how far modern VLMs and LLMs can mimic human sensitivity to sound symbolism without direct auditory input. Through careful experiments, it demonstrates the promising yet limited capabilities of these models, paving the way for further work in multimodal AI research. The findings advocate a combined approach, integrating sound-symbolism-focused training with refined task prompts, to better align model outputs with human perception in psycholinguistic domains.