Chains-of-Affect: Affective Reasoning in Dialogue
- Chains-of-Affect is a framework that structures dialogue into sequential emotional appraisals and cause extraction based on appraisal theory.
- The approach employs a memory system, reasoning engine, and knowledge integration (e.g., COMET) to condition empathetic and coherent responses.
- Empirical evaluations show significant improvements in emotional understanding and response authenticity over traditional prompting methods.
Chains-of-Affect refers to a class of architectures and prompting strategies that guide LLMs to reason explicitly about affective states in dialogue by sequentially inferring emotions and their antecedents (causes), and by leveraging internal and external knowledge to condition empathetic response generation. This paradigm operationalizes chains of emotional inference—rooted in appraisal theory—within the generative process, yielding measurable improvements in the believability, coherence, and affective intelligence of artificial agents in both open-domain and task-specific scenarios (Chen et al., 2024, Croissant et al., 2023).
1. Theoretical Foundations and Formal Definitions
Chains-of-Affect (also called Chain-of-Emotion in appraisal-theoretic contexts) formalizes affective reasoning as a temporally-ordered sequence of appraisal and emotion states maintained as explicit, interpretable elements within the system's memory (Croissant et al., 2023). In appraisal theory, an agent's emotion emerges from evaluating the relevance of an event to its goals, agency, norms, and coping capacity. Chains-of-Affect extends this by having the model:
- Assess the current context and assign an emotion to a dialogue turn.
- Identify a cause span that justifies the inferred emotional state.
- Optionally, annotate with external commonsense or situational knowledge relevant to the cause.
- Chain these appraisals and responses in agent memory, conditioning subsequent actions or dialogue.
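These steps can be made concrete with a small data-structure sketch. The `AppraisalStep` and `AffectMemory` classes below, and all field names, are illustrative rather than drawn from either paper; the sketch only shows how sequential emotion-cause appraisals can be chained in memory and re-serialized into later prompts.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class AppraisalStep:
    """One link in the chain-of-affect: an utterance, the inferred emotion,
    the cause span that justifies it, and optional external knowledge."""
    utterance: str
    emotion: str
    cause_span: str
    knowledge: Optional[List[str]] = None   # e.g. COMET snippets about the cause

@dataclass
class AffectMemory:
    """Agent memory: profile, raw dialogue history, and the appraisal chain."""
    profile: str
    history: List[Tuple[str, str]] = field(default_factory=list)  # (speaker, utterance)
    chain: List[AppraisalStep] = field(default_factory=list)

    def append(self, step: AppraisalStep) -> None:
        self.chain.append(step)

    def as_context(self) -> str:
        # Serialize the chain so later prompts are conditioned on all prior appraisals.
        return "\n".join(
            f'He feels {s.emotion} because he says "{s.cause_span}".'
            for s in self.chain
        )
```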
In "Cause-Aware Empathetic Response Generation via Chain-of-Thought Fine-Tuning," the empathetic response generation task is formally decomposed as follows: given dialogue history , the model must infer and generate a response demonstrating understanding of both (Chen et al., 2024). Training optimizes cross-entropy loss over the output tokens, which interleave explicit reasoning ("He feels ... because he says ...") and the final empathetic reply:
where is the chain-of-thought prompt comprising instruction, context, demonstrations, and external knowledge.
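As a minimal illustration of this objective, the following sketch computes the masked next-token cross-entropy for a Hugging Face-style causal LM; the masking convention (label −100 on prompt positions) and the function name are assumptions, not details reported by Chen et al. (2024).

```python
import torch

def cot_finetune_loss(model, tokenizer, prompt: str, target: str) -> torch.Tensor:
    """Next-token cross-entropy over the structured output only.

    `target` interleaves the reasoning step and the reply, e.g.
    'He feels lonely because he says "...". I'm sorry to hear that. I will comfort him: ...'
    Prompt tokens are masked with -100 so they do not contribute to the loss.
    """
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(target, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)

    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100      # ignore the prompt positions

    return model(input_ids=input_ids, labels=labels).loss
```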
2. Architectural Components and Methodologies
Chains-of-Affect architectures consist of the following interacting modules:
- Memory System: Maintains agent profile, dialogue history, and sequential appraisals ("chain-of-affect"). Each step appends an explicit emotion-cause mapping.
- Appraisal/Reasoning Engine: After each observation (e.g., new utterance), the model is prompted to generate a natural-language appraisal—explicitly stating the current emotion and its justification.
- Knowledge Augmentation: Incorporates external knowledge—e.g., via COMET—about likely intents, needs, wants, effects, and reactions, conditioned not on superficial utterances but on extracted cause spans.
- Response Generator: Generates agent utterances conditioned on the current chain of appraisals, knowledge, and prescribed persona/role.
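Expressed as interfaces, the four modules might look as follows; class and method names are illustrative, not taken from either paper.

```python
from typing import List, Protocol, Tuple

class MemorySystem(Protocol):
    def append_appraisal(self, emotion: str, cause: str) -> None: ...
    def context(self) -> str: ...                 # profile + history + chain-of-affect

class AppraisalEngine(Protocol):
    def appraise(self, context: str, utterance: str) -> Tuple[str, str]: ...
    # returns (emotion, cause_span) for the latest utterance

class KnowledgeAugmenter(Protocol):
    def knowledge_for(self, cause_span: str) -> List[str]: ...
    # e.g. COMET snippets for xIntent, xNeed, xWant, xEffect, xReact

class ResponseGenerator(Protocol):
    def generate(self, context: str, emotion: str, cause: str,
                 knowledge: List[str], persona: str) -> str: ...
```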
In Chains-of-Affect applied to empathetic response generation with LLMs (Chen et al., 2024), prompts have four parts: a cause-aware instruction, $k$-shot demonstrations, the current dialogue context $C$, and concatenated cause-oriented external knowledge from COMET covering five relations (xIntent, xNeed, xWant, xEffect, xReact).
The output template is always structured as "He feels {emo} because he says {cau}. I'm {emo'} to hear that. I will {Intent} him: {response}."
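A sketch of how the four-part prompt and this fixed template fit together is shown below. Only the output template itself comes from the paper; the helper names, instruction wording, and parsing regex are assumptions ({emo2} and {intent} stand in for the paper's {emo'} and {Intent} fields, which are not valid Python format names).

```python
import re
from typing import Dict, List

OUTPUT_TEMPLATE = ("He feels {emo} because he says {cau}. "
                   "I'm {emo2} to hear that. I will {intent} him: {response}")

def build_prompt(instruction: str, demonstrations: List[str],
                 dialogue_context: str, comet_knowledge: List[str]) -> str:
    """Assemble the four-part prompt: instruction, k-shot demonstrations,
    current dialogue context, and concatenated cause-oriented COMET knowledge."""
    parts = [instruction]
    parts.extend(demonstrations)                  # each demo follows OUTPUT_TEMPLATE
    parts.append("Dialogue:\n" + dialogue_context)
    parts.append("Knowledge: " + " ".join(comet_knowledge))
    return "\n\n".join(parts)

def parse_output(text: str) -> Dict[str, str]:
    """Recover emotion, cause, and reply from the structured output (illustrative regex)."""
    m = re.search(r"He feels (.+?) because he says (.+?)\..*?: (.+)", text, re.S)
    if not m:
        return {"raw": text}
    return {"emotion": m.group(1), "cause": m.group(2), "reply": m.group(3)}
```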
3. Instruction Tuning and Inference Procedure
Cause-aware empathetic response generation via Chains-of-Affect is realized by instruction tuning LLaMA-7b (with LoRA adapters, rank 8) on annotated EmpatheticDialogues data (Chen et al., 2024). The key steps are as follows (a configuration sketch follows the list):
- Each dialogue is annotated with an emotion label $e$ and a cause span $c$ using a LLaMA-7b emotion-cause pair extractor fine-tuned on RECCON (macro-F$_1$ = 74.16%).
- For training, five few-shot demonstration examples are sampled, and for each cause span the COMET model is queried for the five cause-oriented knowledge relations.
- The full prompt is assembled as described above.
- The model is optimized using cross-entropy over the generated sequence.
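A minimal configuration sketch using the Hugging Face `transformers` and `peft` libraries is given below; only the base model (LLaMA-7b) and the LoRA rank of 8 come from the paper, while the checkpoint name, target modules, and remaining hyperparameters are assumptions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "huggyllama/llama-7b"                      # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_cfg = LoraConfig(
    r=8,                                          # rank 8, as reported in the paper
    lora_alpha=16,                                # assumed
    lora_dropout=0.05,                            # assumed
    target_modules=["q_proj", "v_proj"],          # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()

# Training then minimizes the cross-entropy objective from Section 1 over the
# assembled prompt + structured output pairs built from the annotated dialogues.
```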
On each inference turn, the flow is:
- User utterance appended to memory.
- Appraisal prompt (explicitly requesting both emotion and its cause) is presented.
- LLM produces the reasoning step and empathetic reply as a joined output.
- The sequence is updated with the latest affective state and response.
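Condensed into a single function, one turn might look like the sketch below; the callables stand in for the appraisal prompt builder, the fine-tuned LLM, and the output parser, and all names are illustrative.

```python
from typing import Callable, Dict, List

def inference_turn(history: List[str],
                   chain: List[Dict[str, str]],
                   user_utterance: str,
                   make_appraisal_prompt: Callable[[List[str]], str],
                   llm_generate: Callable[[str], str],
                   parse_output: Callable[[str], Dict[str, str]]) -> str:
    """One chain-of-affect turn, following the four steps listed above."""
    history.append(f"User: {user_utterance}")        # 1. append utterance to memory
    prompt = make_appraisal_prompt(history)           # 2. prompt asks for emotion + cause
    raw = llm_generate(prompt)                        # 3. joined reasoning step + reply
    step = parse_output(raw)                          #    -> {"emotion", "cause", "reply"}
    chain.append({"emotion": step["emotion"],         # 4. update the affective chain
                  "cause": step["cause"]})
    history.append(f"Agent: {step['reply']}")
    return step["reply"]
```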
4. Cause Reasoning and Knowledge Fusion
Appraisal theory-driven inference is supplemented with two mechanisms:
- Internal Inference: Emotion-cause extraction is performed by a separate model and provided explicitly for CoT prompting. The prompt design specifically elicits outputs of the form "He feels {emo} because he says {cau}."
- External Knowledge Fusion: Instead of querying commonsense models like COMET on raw utterances (which may yield contextually irrelevant knowledge), the extracted cause span is used. For each of the five relations (xIntent, xNeed, xWant, xEffect, xReact), the model obtains a natural-language snippet from COMET. These knowledge snippets are concatenated and appended to the CoT prompt, aligning the generation contextually and improving both coherence and diversity.
This approach enforces tight alignment across the model's internal and external representations of affective causality.
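A sketch of cause-conditioned knowledge fusion follows; `query_comet` is a placeholder for whatever COMET interface is available (the papers do not prescribe one), and only the five relation names come from the source.

```python
RELATIONS = ["xIntent", "xNeed", "xWant", "xEffect", "xReact"]

def cause_oriented_knowledge(cause_span: str, query_comet) -> str:
    """Query COMET on the extracted cause span (not the raw utterance)
    and concatenate one natural-language snippet per relation."""
    snippets = []
    for rel in RELATIONS:
        snippet = query_comet(cause_span, rel)      # e.g. xIntent -> "to look nice"
        snippets.append(f"{rel}: {snippet}")
    return " ".join(snippets)

# The concatenated string is appended to the CoT prompt, so generation is
# conditioned on knowledge about the cause rather than the surface utterance.
```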
5. Prompting Schemes and End-to-End Examples
Chains-of-Affect relies on explicitly structured prompt-response schemes. The prompt template integrates instruction, demonstrations, dialogue, and causally-oriented knowledge. The output follows with canonical CoT and response segments.
Two illustrative examples from (Chen et al., 2024):
- Example 1: The speaker describes burning their hair, yielding the extracted cause span $c$ = "I burned my hair with my hair dryer." COMET yields intent ("to look nice") and reaction ("feels embarrassed"). The system's response: "He feels embarrassed because he says 'I burned my hair with my hair dryer.' I'm sorry to hear that. I will reassure him: I understand how upsetting this must be, but it's only temporary—your hair will grow back. Perhaps a fresh haircut could help you feel more confident."
- Example 2: Prolonged home repairs result in the extracted cause span $c$ = "He dismantled the bathroom to do repairs and still hasn't completed them six months later!" The response reflects targeted empathy and practical advice.
In (Croissant et al., 2023), the chain-of-affect loop is similarly realized: after each player input in a game context, the agent performs appraisal (generating a natural-language summary of "how I feel now, and why"), and then generates a response conditioned on the entire chain of prior appraisals and current context.
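A sketch of this appraise-then-respond loop is given below; the two-stage structure follows the description in Croissant et al. (2023), while the prompt wording and function names are illustrative.

```python
from typing import Callable, List

def game_agent_turn(persona: str,
                    appraisals: List[str],
                    history: List[str],
                    player_input: str,
                    llm_generate: Callable[[str], str]) -> str:
    """Appraise first ('how I feel now, and why'), then respond conditioned on
    the full chain of prior appraisals plus the current context."""
    history.append(f"Player: {player_input}")

    appraisal_prompt = (
        f"{persona}\nConversation so far:\n" + "\n".join(history) +
        "\nIn one sentence, state how you feel now and why."
    )
    appraisal = llm_generate(appraisal_prompt)
    appraisals.append(appraisal)                   # the chain-of-emotion memory

    response_prompt = (
        f"{persona}\nYour feelings so far:\n" + "\n".join(appraisals) +
        "\nConversation so far:\n" + "\n".join(history) +
        "\nReply in character."
    )
    reply = llm_generate(response_prompt)
    history.append(f"Agent: {reply}")
    return reply
```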
6. Empirical Results and Evaluation
Empirical evaluation demonstrates that Chains-of-Affect architectures yield state-of-the-art results on both automatic and human metrics:
- Automatic metrics (EmpatheticDialogues):
  - Emotion accuracy (ACC): 52.73% (CFEG) vs. 48.74% (ChatGPT+CoT).
  - BLEU-2/4: 10.54/5.17 vs. 4.99/1.37.
  - Dist-1/2: 2.96/19.52 vs. 2.48/16.90.
- Human evaluation (Likert scale, 1–5, 200 contexts):
| Model | Coherence | Empathy | Informativeness | Fluency |
|---|---|---|---|---|
| LLaMA-7b +CoT | 3.32 | 3.70 | 3.23 | 4.09 |
| ChatGPT+CoT | 4.23 | 4.30 | 4.12 | 4.65 |
| CFEG (Chains-of-Affect) | 4.32 | 4.51 | 4.51 | 4.49 |
Pairwise preference in A/B testing favors Chains-of-Affect over ChatGPT+CoT: for coherence (54.0% vs. 38.0%), empathy (55.0% vs. 38.5%), and informativeness (53.5% vs. 37.0%) (Chen et al., 2024).
In interactive game-agent settings (Croissant et al., 2023):
- Emotional understanding measured with appraisal-based prompting reaches 83% accuracy on Situational Test of Emotional Understanding (STEU) items, compared to 57% with conventional prompting.
- Authenticity scores in fixed-content analysis are significantly higher for chain-of-emotion agents (82.6 vs. ≈61).
- User studies report higher ratings for sensitivity, naturalness, and adaptive reaction.
7. Broader Implications, Limitations, and Future Directions
Chains-of-Affect architectures ground affective dialogue in interpretable, sequential appraisals, bridging cognitive emotion theory and in-context LLM learning. This explicit decomposition supports transparency and improved emotional intelligence beyond latent autoregressive behaviors.
Noted limitations include:
- Appraisal depth and dimensions remain implicit; the models do not report which appraisal dimensions are used.
- The approach has so far been tested on manageable context lengths; scaling to longer interactions may introduce memory and retrieval bottlenecks.
- Few models outside GPT-3.5-turbo and LLaMA-7b have been systematically evaluated.
Promising directions include:
- Decomposing appraisal further with prompts targeting individual dimensions (goal relevance, agency, etc.).
- Incorporating retrieval-augmented memory for longer affective histories.
- Linking nonverbal behaviors by mapping chain-of-affect outputs to agent animations in multimodal environments.
By surfacing affective reasoning as explicit, contextual chains, Chains-of-Affect methods represent a significant step in the development of interpretable, cognitively grounded affective agents (Chen et al., 2024, Croissant et al., 2023).