Large Language Models Understand and Can be Enhanced by Emotional Stimuli (2307.11760v7)

Published 14 Jul 2023 in cs.CL, cs.AI, and cs.HC

Abstract: Emotional intelligence significantly impacts our daily behaviors and interactions. Although LLMs are increasingly viewed as a stride toward artificial general intelligence, exhibiting impressive performance in numerous tasks, it is still uncertain if LLMs can genuinely grasp psychological emotional stimuli. Understanding and responding to emotional cues gives humans a distinct advantage in problem-solving. In this paper, we take the first step towards exploring the ability of LLMs to understand emotional stimuli. To this end, we first conduct automatic experiments on 45 tasks using various LLMs, including Flan-T5-Large, Vicuna, Llama 2, BLOOM, ChatGPT, and GPT-4. Our tasks span deterministic and generative applications that represent comprehensive evaluation scenarios. Our automatic experiments show that LLMs have a grasp of emotional intelligence, and their performance can be improved with emotional prompts (which we call "EmotionPrompt" that combines the original prompt with emotional stimuli), e.g., 8.00% relative performance improvement in Instruction Induction and 115% in BIG-Bench. In addition to those deterministic tasks that can be automatically evaluated using existing metrics, we conducted a human study with 106 participants to assess the quality of generative tasks using both vanilla and emotional prompts. Our human study results demonstrate that EmotionPrompt significantly boosts the performance of generative tasks (10.9% average improvement in terms of performance, truthfulness, and responsibility metrics). We provide an in-depth discussion regarding why EmotionPrompt works for LLMs and the factors that may influence its performance. We posit that EmotionPrompt heralds a novel avenue for exploring interdisciplinary knowledge for human-LLMs interaction.


The paper explores the capacity of LLMs to comprehend and leverage emotional stimuli, addressing the question of whether LLMs are aligned with human emotional intelligence. The authors introduce EmotionPrompt, a method that incorporates emotional stimuli into the original prompts, drawing on psychological theories and phenomena such as self-monitoring, Social Cognitive Theory, and Cognitive Emotion Regulation Theory.
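As described, EmotionPrompt combines the original prompt with an emotional stimulus. The sketch below shows one way to build such a prompt. It assumes the stimulus is simply appended after the original prompt, separated by a space; the helper name `emotion_prompt` and the example task instruction are hypothetical, and only the stimulus sentence comes from the paper.

```python
def emotion_prompt(original_prompt: str, stimulus: str) -> str:
    """Hypothetical helper: append an emotional stimulus to the original prompt.

    Assumption: EmotionPrompt = original prompt + " " + stimulus sentence.
    """
    original_prompt = original_prompt.rstrip()
    stimulus = stimulus.strip()
    if not stimulus:
        # No stimulus given: fall back to the vanilla prompt unchanged.
        return original_prompt
    return f"{original_prompt} {stimulus}"


# Hypothetical task instruction; the stimulus is one reported in the paper.
vanilla = "Determine whether the word has the same meaning in both sentences."
print(emotion_prompt(vanilla, "This is very important to my career."))
```

Keeping the vanilla prompt as the untouched prefix makes it easy to score the vanilla and emotional variants side by side on the same task.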

The paper reports experiments on both deterministic and generative tasks. The deterministic tasks comprise 24 Instruction Induction tasks and 21 BIG-Bench tasks, evaluated with metrics such as accuracy and the normalized preferred metric. The models tested are Flan-T5-Large, Vicuna, Llama 2, BLOOM, ChatGPT, and GPT-4. For generative tasks, a human study with 106 participants assessed the quality of LLM outputs on performance, truthfulness, and responsibility metrics.
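BIG-Bench's normalized preferred metric is, to our understanding, a linear rescaling of each task's preferred metric between a task-specific low score (roughly chance-level performance) and high score. The sketch below assumes that definition; the function name and the toy values are hypothetical.

```python
def normalized_preferred_score(raw: float, low: float, high: float) -> float:
    """Rescale a raw task score so that `low` maps to 0 and `high` maps to 100.

    Assumption: normalized = 100 * (raw - low) / (high - low), with `low` and
    `high` taken from per-task metadata. Scores below `low` become negative.
    """
    if high == low:
        raise ValueError("high and low must differ")
    return 100.0 * (raw - low) / (high - low)


# Hypothetical binary-choice task: chance accuracy 0.5, perfect accuracy 1.0.
print(normalized_preferred_score(0.75, low=0.5, high=1.0))
```

Because chance-level performance maps to about 0, baseline scores on this scale can be small, which matters when results are later reported as relative improvements.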

Key findings include:

LLMs demonstrate emotional intelligence, and their performance improves when emotional stimuli are added. For example, EmotionPrompt yielded an 8.00% relative performance improvement on Instruction Induction and a 115% relative performance improvement on BIG-Bench.
The human study indicated that emotional prompts significantly improved performance on generative tasks, with an average improvement of 10.9% across the performance, truthfulness, and responsibility metrics.
Input attention analysis showed that emotional stimuli enrich the representation of original prompts, and positive words contribute to the final results.
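The improvements above are relative: the change in score divided by the vanilla-prompt score. The sketch below shows that arithmetic with hypothetical toy scores (not results from the paper). It also illustrates one general arithmetic reason a relative figure can exceed 100%: the same absolute gain is a much larger fraction of a small baseline. Whether that explains the BIG-Bench figure is not established here.

```python
def relative_improvement(vanilla: float, emotion: float) -> float:
    """Percentage change of the EmotionPrompt score over the vanilla score."""
    if vanilla == 0:
        raise ValueError("relative improvement is undefined for a zero baseline")
    return 100.0 * (emotion - vanilla) / vanilla


# Hypothetical toy scores, not values reported in the paper.
print(relative_improvement(50.0, 54.0))  # moderate baseline, modest relative gain
print(relative_improvement(2.0, 4.3))    # small baseline, large relative gain
```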

The authors also discuss why EmotionPrompt is effective for LLMs, the effects of combining multiple emotional stimuli, which emotional stimuli are most effective, and the factors that influence EmotionPrompt's performance, including model size and temperature.

Experiments combining multiple emotional stimuli on ChatGPT showed that more emotional stimuli generally lead to better performance, but combining stimuli brings little or no benefit when a single stimulus already achieves good performance. Combining stimuli drawn from different psychological theories can also boost performance.
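To compare combined stimuli against single ones, each candidate combination has to be turned into its own prompt. The sketch below enumerates every combination of a given size with `itertools.combinations` and appends the chosen stimuli, in order, to the original prompt. The helper name is hypothetical, the concatenation scheme is an assumption, and the third stimulus is a labeled placeholder; the first two are sentences quoted in the paper summary.

```python
from itertools import combinations


def combined_prompts(original_prompt: str, stimuli: list[str], size: int) -> list[str]:
    """Hypothetical helper: one prompt per combination of `size` stimuli.

    Assumption: stimuli are appended after the original prompt, space-separated,
    in the order they appear in `stimuli`.
    """
    prompts = []
    for combo in combinations(stimuli, size):
        prompts.append(" ".join([original_prompt.rstrip(), *combo]))
    return prompts


stimuli = [
    "This is very important to my career.",
    "This task is vital to my career, and I greatly value your insights.",
    "Hypothetical third stimulus.",  # placeholder, not from the paper
]
for prompt in combined_prompts("Classify the sentiment of the review.", stimuli, 2):
    print(prompt)
```

Scoring each combined prompt next to the best single-stimulus prompt is what would reveal the pattern described above, where combinations add little once a single stimulus already performs well.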

The paper found that in Instruction Induction, the emotional stimulus "This is very important to my career." is the most effective, while in BIG-Bench, a compound stimulus combining social influence with self-esteem and motivation, "Provide your answer and a confidence score explain the main reasons supporting your classification thought process. This task is vital to my career, and I greatly value your insights.", is the most effective.

Furthermore, the paper analyzes how the characteristics of the tested LLMs affect EmotionPrompt. The results suggest that larger models may derive greater benefit from EmotionPrompt. Training strategies, including supervised fine-tuning and reinforcement learning, also affect how much EmotionPrompt helps. Finally, EmotionPrompt is more effective in high-temperature settings and is less sensitive to temperature than vanilla prompts.
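The summary does not say how temperature sensitivity was quantified. One simple, hypothetical way to make the comparison concrete is to take the population standard deviation of a prompt's scores across the temperatures tested: a smaller spread means the score changes less as temperature varies. The metric choice, function name, and toy scores below are all assumptions, not the paper's method or data.

```python
from statistics import pstdev


def temperature_sensitivity(scores_by_temperature: dict[float, float]) -> float:
    """Population standard deviation of one prompt's scores across temperatures.

    Hypothetical metric: lower values mean the score is less sensitive to
    the sampling temperature.
    """
    if not scores_by_temperature:
        raise ValueError("need scores for at least one temperature")
    return pstdev(list(scores_by_temperature.values()))


# Hypothetical toy scores keyed by sampling temperature, not results from the paper.
vanilla_scores = {0.0: 60.0, 0.4: 56.0, 0.7: 50.0, 1.0: 42.0}
emotion_scores = {0.0: 62.0, 0.4: 61.0, 0.7: 59.0, 1.0: 56.0}
print(temperature_sensitivity(vanilla_scores))
print(temperature_sensitivity(emotion_scores))
```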