- The paper shows that incorporating psychologically-informed chain-of-thought prompts improves metaphor interpretation, especially in smaller LLMs like Curie.
- Methodology includes using few-shot examples and two rationale types (QUD and Similarity) to guide structured reasoning in paraphrasing metaphorical expressions.
- Evaluation reveals that prompts reduce dependence on metaphor familiarity, enabling models to engage in systematic reasoning akin to human cognitive processes.
This paper investigates whether LLMs like GPT-3 can better understand metaphors when guided by chain-of-thought (CoT) prompts informed by cognitive psychology theories. Traditional probabilistic models of language understanding are interpretable but require hand-crafting for specific domains, while LLMs possess broad knowledge but lack structured reasoning and interpretability. The authors aim to bridge this gap by using CoT prompts to introduce structured reasoning, based on psychological models of metaphor comprehension, into LLMs.
Methodology
- Dataset: The researchers used metaphors from the Katz corpus (non-literary and comprehensible literary examples). For each metaphor (e.g., "A bagpipe is a newborn baby"), they created four candidate paraphrases ranked by appropriateness:
- Best (4): Captures the core meaning (e.g., "A bagpipe is loud.")
- Second-best (3): Transfers a less apt property (e.g., "A bagpipe is delicate.")
- Irrelevant (2): States a fact about the subject unrelated to the object (e.g., "A bagpipe is a musical instrument.")
- Worst (1): Expresses the opposite meaning (e.g., "A bagpipe is quiet.")
The dataset was split into training (for prompt examples), development (for tuning), and test sets.
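For concreteness, a single stimulus item from this design could be represented roughly as follows; the field names and option letters are illustrative assumptions, not the paper's actual data format.

```python
# One stimulus item: a metaphor plus four candidate paraphrases, each tagged with
# its appropriateness score (4 = best, 1 = worst). Field names are hypothetical.
item = {
    "metaphor": "A bagpipe is a newborn baby.",
    "paraphrases": {
        "a": ("A bagpipe is loud.", 4),                  # best: captures the core meaning
        "b": ("A bagpipe is delicate.", 3),              # second-best: less apt property
        "c": ("A bagpipe is a musical instrument.", 2),  # irrelevant fact about the subject
        "d": ("A bagpipe is quiet.", 1),                 # worst: opposite meaning
    },
}
```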
- Models: The experiments utilized two GPT-3 models via the OpenAI API: text-davinci-002 (larger, ~175B parameters) and text-curie-001 (smaller, ~6.7B parameters), with temperature set to 0.2 for consistency.
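A minimal sketch of such a query, assuming the legacy (pre-v1) openai Python client and its Completion endpoint; the max_tokens budget and return handling are illustrative assumptions rather than details from the paper.

```python
import openai  # legacy (pre-v1) client assumed; newer client versions use a different interface

def query_model(prompt: str, model: str = "text-davinci-002") -> str:
    """Send a few-shot prompt to a GPT-3 completion model at low temperature."""
    response = openai.Completion.create(
        model=model,       # "text-davinci-002" or "text-curie-001"
        prompt=prompt,
        temperature=0.2,   # low temperature for near-deterministic completions
        max_tokens=128,    # assumed budget: enough for a short rationale plus an answer
    )
    return response["choices"][0]["text"]
```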
- Prompting: Few-shot prompts with 10 examples were designed. Each example included the metaphor, the four paraphrase options, and an answer rationale. Two types of psychologically-informed rationales were tested (a prompt-assembly sketch follows the examples below):
- QUD (Question Under Discussion): Based on the idea that metaphors implicitly answer a question about the subject. The rationale identifies this question, the property transferred from the object to the subject, and links it to the best paraphrase.
Example Rationale (QUD): "The speaker is addressing the question 'How does a bagpipe sound?' The speaker answers this question by comparing a bagpipe to a newborn baby. A newborn baby is loud, so the speaker is saying a) a bagpipe is loud."
- Similarity: Based on comparison accounts, focusing on shared properties. The rationale highlights the relevant property of the object, states the subject shares it, and selects the paraphrase.
Example Rationale (Similarity): "A newborn baby is loud. A bagpipe is also loud, so the speaker is saying a) a bagpipe is loud."
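Putting the pieces together, a few-shot example with one of these rationales might be assembled along the following lines; the exact wording, labels, and ordering used in the paper are not reproduced here, only the overall structure.

```python
# Hypothetical prompt assembly: metaphor, lettered options, optional rationale, answer.
def format_example(metaphor, options, rationale="", answer=""):
    lines = [f"Metaphor: {metaphor}"]
    lines += [f"{label}) {text}" for label, (text, _) in options.items()]
    if rationale:
        lines.append(rationale)   # QUD or Similarity rationale; omitted for No Rationale
    if answer:
        lines.append(f"Answer: {answer}")
    return "\n".join(lines)

demo = format_example(
    "A bagpipe is a newborn baby.",
    {"a": ("A bagpipe is loud.", 4), "b": ("A bagpipe is delicate.", 3),
     "c": ("A bagpipe is a musical instrument.", 2), "d": ("A bagpipe is quiet.", 1)},
    rationale=("The speaker is addressing the question 'How does a bagpipe sound?' "
               "The speaker answers this question by comparing a bagpipe to a newborn baby. "
               "A newborn baby is loud, so the speaker is saying a) a bagpipe is loud."),
    answer="a",
)
# Ten such demonstrations are concatenated, followed by the test metaphor and its
# options, with the rationale and answer left for the model to generate.
```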
- Baselines: Performance was compared against:
- Random chance (expected mean score: 2.5)
- No Rationale (examples with only the correct answer)
- True Non-explanations (examples with true but irrelevant statements before the answer)
- Subject-Object (rationale only identifies the subject and object)
- Options Only (model sees only the paraphrases, not the metaphor)
- Evaluation: The primary metric was the mean appropriateness score (1-4) of the chosen paraphrase. They also analyzed failure rates (unparseable responses) and used Bayesian regression and correlations to compare conditions. The influence of metaphor familiarity (using existing human ratings) was also investigated.
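As an illustration of the scoring, the chosen option letter can be mapped back to its appropriateness score, with unparseable responses counted toward the failure rate; the parsing rule below is an assumption, since the paper's exact procedure is not spelled out here.

```python
import re

def score_response(response_text, item):
    """Map the model's chosen option letter to its appropriateness score (1-4).

    Returns None when no option letter can be found, which counts as a failure.
    """
    match = re.search(r"\b([abcd])\)", response_text.lower())
    if match is None:
        return None
    _, score = item["paraphrases"][match.group(1)]
    return score

item = {"paraphrases": {"a": ("A bagpipe is loud.", 4), "b": ("A bagpipe is delicate.", 3),
                        "c": ("A bagpipe is a musical instrument.", 2), "d": ("A bagpipe is quiet.", 1)}}
print(score_response("... so the speaker is saying a) a bagpipe is loud.", item))  # 4
# The mean of these scores over the test set gives the reported metric (random chance = 2.5).
```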
Results
- Model Differences: DaVinci performed significantly better than Curie across all conditions. Without rationales, DaVinci achieved a high mean score (3.71), while Curie was near random chance (2.49).
- Effect of Rationales:
- For Curie, the QUD and Similarity prompts improved performance above both chance and the No Rationale baseline, although the gain over No Rationale was only borderline significant for QUD. QUD prompts were particularly effective for Curie.
- For DaVinci, which already performed well, the Subject-Object prompt yielded the highest score (3.84), suggesting that even simply identifying the metaphor's components helped this powerful model. The QUD and Similarity prompts also maintained high performance but didn't surpass the Subject-Object or No Rationale baselines significantly.
- Familiarity Analysis: DaVinci's performance without rationales correlated significantly with metaphor familiarity, suggesting it may rely on having seen similar metaphors during training. This correlation disappeared under the psychologically-informed rationales (QUD, Similarity), hinting that these prompts encourage more systematic reasoning and reduce reliance on familiarity, much as humans engage deliberate thought for novel metaphors (an illustrative correlation sketch follows this list). This effect was less clear for Curie.
- Error Types: Errors typically involved either a lack of semantic nuance (e.g., understanding "roots hold soil" but failing to connect it to "stabilizing" for memories) or seemingly random disconnects between the reasoning steps and the final answer choice.
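To make the familiarity analysis concrete, the sketch below correlates per-metaphor scores with familiarity ratings using a rank correlation; the arrays are randomly generated stand-ins, not data from the paper, and this statistic is only one illustrative choice alongside the Bayesian regression analyses the authors report.

```python
import numpy as np
from scipy.stats import spearmanr

# Purely illustrative: correlate per-metaphor scores with human familiarity ratings.
rng = np.random.default_rng(0)
familiarity = rng.uniform(1, 7, size=40)                                 # stand-in ratings
scores = np.clip(2 + 0.3 * familiarity + rng.normal(0, 0.8, 40), 1, 4)   # stand-in scores

rho, p = spearmanr(scores, familiarity)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
# The reported pattern: a reliable correlation for DaVinci under No Rationale prompts,
# which disappears under the QUD and Similarity prompts.
```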
Conclusions
The paper demonstrates that psychologically-informed chain-of-thought prompts can improve LLM performance on metaphor understanding, particularly for less capable models like Curie. These prompts appear to guide models towards more structured, step-by-step reasoning analogous to cognitive theories. For more powerful models like DaVinci, while baseline performance is high, rationales might help reduce reliance on memorized examples (familiarity) and promote more generalizable, systematic reasoning, especially for novel metaphors. The authors suggest future work using more challenging and unfamiliar metaphors to further probe these effects.