- The paper shows that incorporating psychologically-informed chain-of-thought prompts improves metaphor interpretation, especially in smaller LLMs like Curie.
- Methodology includes using few-shot examples and two rationale types (QUD and Similarity) to guide structured reasoning in paraphrasing metaphorical expressions.
- Evaluation reveals that prompts reduce dependence on metaphor familiarity, enabling models to engage in systematic reasoning akin to human cognitive processes.
This paper investigates whether LLMs like GPT-3 can better understand metaphors when guided by chain-of-thought (CoT) prompts informed by cognitive psychology theories. Traditional probabilistic models of language understanding are interpretable but require hand-crafting for specific domains, while LLMs possess broad knowledge but lack structured reasoning and interpretability. The authors aim to bridge this gap by using CoT prompts to introduce structured reasoning, based on psychological models of metaphor comprehension, into LLMs.
Methodology
- Dataset: The researchers used metaphors from the Katz corpus (non-literary and comprehensible literary examples). For each metaphor (e.g., "A bagpipe is a newborn baby"), they created four candidate paraphrases ranked by appropriateness:
- Best (4): Captures the core meaning (e.g., "A bagpipe is loud.")
- Second-best (3): Transfers a less apt property (e.g., "A bagpipe is delicate.")
- Irrelevant (2): States a fact about the subject unrelated to the object (e.g., "A bagpipe is a musical instrument.")
- Worst (1): Expresses the opposite meaning (e.g., "A bagpipe is quiet.")
The dataset was split into training (for prompt examples), development (for tuning), and test sets.
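For concreteness, a single stimulus item from this design could be represented roughly as follows; the field names and option letters are illustrative assumptions, not the paper's actual data format.

```python
# One stimulus item: a metaphor plus four candidate paraphrases, each tagged with
# its appropriateness score (4 = best, 1 = worst). Field names are hypothetical.
item = {
    "metaphor": "A bagpipe is a newborn baby.",
    "paraphrases": {
        "a": ("A bagpipe is loud.", 4),                  # best: captures the core meaning
        "b": ("A bagpipe is delicate.", 3),              # second-best: less apt property
        "c": ("A bagpipe is a musical instrument.", 2),  # irrelevant fact about the subject
        "d": ("A bagpipe is quiet.", 1),                 # worst: opposite meaning
    },
}
```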
- Models: The experiments utilized two GPT-3 models via the OpenAI API: text-davinci-002 (larger, ~175B parameters) and text-curie-001 (smaller, ~6.7B parameters), with temperature set to 0.2 for consistency.
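A minimal sketch of such a query, assuming the legacy (pre-v1) openai Python client and its Completion endpoint; the max_tokens budget and return handling are illustrative assumptions rather than details from the paper.

```python
import openai  # legacy (pre-v1) client assumed; newer client versions use a different interface

def query_model(prompt: str, model: str = "text-davinci-002") -> str:
    """Send a few-shot prompt to a GPT-3 completion model at low temperature."""
    response = openai.Completion.create(
        model=model,       # "text-davinci-002" or "text-curie-001"
        prompt=prompt,
        temperature=0.2,   # low temperature for near-deterministic completions
        max_tokens=128,    # assumed budget: enough for a short rationale plus an answer
    )
    return response["choices"][0]["text"]
```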
- Prompting: Few-shot prompts with 10 examples were designed. Each example included the metaphor, the four paraphrase options, and an answer rationale. Two types of psychologically-informed rationales were tested (a prompt-assembly sketch follows the examples below):
- QUD (Question Under Discussion): Based on the idea that metaphors implicitly answer a question about the subject. The rationale identifies this question, the property transferred from the object to the subject, and links it to the best paraphrase.
Example Rationale (QUD): "The speaker is addressing the question 'How does a bagpipe sound?' The speaker answers this question by comparing a bagpipe to a newborn baby. A newborn baby is loud, so the speaker is saying a) a bagpipe is loud."
- Similarity: Based on comparison accounts, focusing on shared properties. The rationale highlights the relevant property of the object, states the subject shares it, and selects the paraphrase.
Example Rationale (Similarity): "A newborn baby is loud. A bagpipe is also loud, so the speaker is saying a) a bagpipe is loud."
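Putting the pieces together, a few-shot example with one of these rationales might be assembled along the following lines; the exact wording, labels, and ordering used in the paper are not reproduced here, only the overall structure.

```python
# Hypothetical prompt assembly: metaphor, lettered options, optional rationale, answer.
def format_example(metaphor, options, rationale="", answer=""):
    lines = [f"Metaphor: {metaphor}"]
    lines += [f"{label}) {text}" for label, (text, _) in options.items()]
    if rationale:
        lines.append(rationale)   # QUD or Similarity rationale; omitted for No Rationale
    if answer:
        lines.append(f"Answer: {answer}")
    return "\n".join(lines)

demo = format_example(
    "A bagpipe is a newborn baby.",
    {"a": ("A bagpipe is loud.", 4), "b": ("A bagpipe is delicate.", 3),
     "c": ("A bagpipe is a musical instrument.", 2), "d": ("A bagpipe is quiet.", 1)},
    rationale=("The speaker is addressing the question 'How does a bagpipe sound?' "
               "The speaker answers this question by comparing a bagpipe to a newborn baby. "
               "A newborn baby is loud, so the speaker is saying a) a bagpipe is loud."),
    answer="a",
)
# Ten such demonstrations are concatenated, followed by the test metaphor and its
# options, with the rationale and answer left for the model to generate.
```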
- Baselines: Performance was compared against:
- Random chance (expected mean score: 2.5)
- No Rationale (examples with only the correct answer)
- True Non-explanations (examples with true but irrelevant statements before the answer)
- Subject-Object (rationale only identifies the subject and object)
- Options Only (model sees only the paraphrases, not the metaphor)
- Evaluation: The primary metric was the mean appropriateness score (1-4) of the chosen paraphrase. They also analyzed failure rates (unparseable responses) and used Bayesian regression and correlations to compare conditions. The influence of metaphor familiarity (using existing human ratings) was also investigated.
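As an illustration of the scoring, the chosen option letter can be mapped back to its appropriateness score, with unparseable responses counted toward the failure rate; the parsing rule below is an assumption, since the paper's exact procedure is not spelled out here.

```python
import re

def score_response(response_text, item):
    """Map the model's chosen option letter to its appropriateness score (1-4).

    Returns None when no option letter can be found, which counts as a failure.
    """
    match = re.search(r"\b([abcd])\)", response_text.lower())
    if match is None:
        return None
    _, score = item["paraphrases"][match.group(1)]
    return score

item = {"paraphrases": {"a": ("A bagpipe is loud.", 4), "b": ("A bagpipe is delicate.", 3),
                        "c": ("A bagpipe is a musical instrument.", 2), "d": ("A bagpipe is quiet.", 1)}}
print(score_response("... so the speaker is saying a) a bagpipe is loud.", item))  # 4
# The mean of these scores over the test set gives the reported metric (random chance = 2.5).
```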
Results
- Model Differences: DaVinci performed significantly better than Curie across all conditions. Without rationales, DaVinci achieved a high mean score (3.71), while Curie was near random chance (2.49).
- Effect of Rationales:
- For Curie, the QUD and Similarity prompts improved performance above both chance and the No Rationale baseline, although the gain over No Rationale was only borderline significant for QUD. QUD prompts were particularly effective for Curie.
- For DaVinci, which already performed well, the Subject-Object prompt yielded the highest score (3.84), suggesting that even simply identifying the metaphor's components helped this powerful model. The QUD and Similarity prompts also maintained high performance but didn't surpass the Subject-Object or No Rationale baselines significantly.
- Familiarity Analysis: DaVinci's performance without rationales correlated significantly with metaphor familiarity, suggesting it may rely on having seen similar metaphors during training. This correlation disappeared under the psychologically-informed rationales (QUD, Similarity), hinting that these prompts encourage more systematic reasoning and reduce reliance on familiarity, much as humans engage deliberate thought for novel metaphors (an illustrative correlation sketch follows this list). This effect was less clear for Curie.
- Error Types: Errors typically involved either a lack of semantic nuance (e.g., understanding "roots hold soil" but failing to connect it to "stabilizing" for memories) or seemingly random disconnects between the reasoning steps and the final answer choice.
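To make the familiarity analysis concrete, the sketch below correlates per-metaphor scores with familiarity ratings using a rank correlation; the arrays are randomly generated stand-ins, not data from the paper, and this statistic is only one illustrative choice alongside the Bayesian regression analyses the authors report.

```python
import numpy as np
from scipy.stats import spearmanr

# Purely illustrative: correlate per-metaphor scores with human familiarity ratings.
rng = np.random.default_rng(0)
familiarity = rng.uniform(1, 7, size=40)                                 # stand-in ratings
scores = np.clip(2 + 0.3 * familiarity + rng.normal(0, 0.8, 40), 1, 4)   # stand-in scores

rho, p = spearmanr(scores, familiarity)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
# The reported pattern: a reliable correlation for DaVinci under No Rationale prompts,
# which disappears under the QUD and Similarity prompts.
```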
Conclusions
The paper demonstrates that psychologically-informed chain-of-thought prompts can improve LLM performance on metaphor understanding, particularly for less capable models like Curie. These prompts appear to guide models towards more structured, step-by-step reasoning analogous to cognitive theories. For more powerful models like DaVinci, while baseline performance is high, rationales might help reduce reliance on memorized examples (familiarity) and promote more generalizable, systematic reasoning, especially for novel metaphors. The authors suggest future work using more challenging and unfamiliar metaphors to further probe these effects.