Kardia-R1 RL Framework for Empathetic Dialogue
- Kardia-R1 is a reinforcement learning framework that applies Rubric-as-Judge ERL to train LLMs for interpretable, identity-aware empathetic dialogue.
- It utilizes a multi-module pipeline—including profile-conditioned user generators, empathetic responders, and rubric evaluators—to refine dialogue responses iteratively.
- Leveraging the KardiaBench dataset, the system achieves high empathy, persona consistency, and safety, outperforming baseline LLMs on key metrics.
Kardia-R1 is a reinforcement learning (RL) framework for LLMs focused on interpretable, identity-aware empathetic reasoning for emotional support dialogues. The core innovation lies in training LLMs via Rubric-as-Judge Empathetic Reinforcement Learning (Rubric-ERL), using human-aligned rubrics administered by specialized judge LLMs. Kardia-R1 leverages KardiaBench, a large-scale, user-grounded dataset synthesizing multi-turn conversations anchored to real personality profiles. The system achieves high performance across empathy, persona consistency, and safety on compact LLMs (2–7B parameters), with its methodology designed for extensible, interpretable deployment in personalized conversational agents (Yuan et al., 1 Dec 2025).
1. System Architecture and Workflow
Kardia-R1 is built on a multi-module conversational agent pipeline comprising three LLM systems:
- Profile-conditioned user generator: constructs the user utterance at each turn, conditioned on an MBTI-style profile, a situational context, and the full dialogue history.
- Empathetic responder: generates assistant replies informed by both the dialogue history and the user profile.
- Rubric evaluator: applies a conceptual rubric to assess each assistant turn, emitting a discrete verdict (e.g., pass / refine / solved).
The synthesis procedure involves a double-iteration loop over each conversation turn (inner) and dialogue trajectory (outer):
- For each turn, up to a fixed number of inner refinements is applied to the assistant reply until the rubric criteria are satisfied.
- Dialogues terminate early on a 'Solved' verdict or after a maximum number of turns.
The training pipeline partitions dialogue trajectories by rubric outcome, with high pass/solved rates forming the 'easy' set $\mathcal{D}_{\text{easy}}$ and the remainder forming the 'hard' set $\mathcal{D}_{\text{hard}}$. Stage 1 applies cold-start supervised fine-tuning (SFT) on $\mathcal{D}_{\text{easy}}$ to establish model grounding. Stage 2 instantiates Rubric-ERL with Group Relative Policy Optimization (GRPO) to further improve empathetic and persona-consistent dialogue enactment on $\mathcal{D}_{\text{hard}}$ [Sec. 4]; a minimal sketch of the synthesis-and-partition loop follows.
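The sketch below illustrates the double-iteration synthesis loop and easy/hard partition, assuming simple callable stand-ins `user_generator`, `responder`, and `rubric_judge` for the three LLM modules; the function names, signatures, and loop caps are illustrative, not the paper's API.

```python
# Illustrative sketch of the KardiaBench-style synthesis loop (hypothetical helper callables).
from typing import Callable, Dict, List

def synthesize_dialogue(
    profile: Dict,                     # MBTI-style user profile
    context: str,                      # situational context
    user_generator: Callable,          # LLM 1: profile-conditioned user utterances
    responder: Callable,               # LLM 2: empathetic assistant replies
    rubric_judge: Callable,            # LLM 3: rubric verdict ("pass" | "refine" | "solved")
    max_turns: int = 10,               # outer-loop cap (illustrative value)
    max_refinements: int = 3,          # inner-loop cap (illustrative value)
) -> Dict:
    history: List[Dict] = []
    all_passed = True
    for _turn in range(max_turns):                      # outer loop over conversation turns
        user_utt = user_generator(profile, context, history)
        history.append({"role": "user", "text": user_utt})
        reply, verdict = None, "refine"
        for _ in range(max_refinements):                # inner rubric-driven refinement loop
            reply = responder(profile, history)
            verdict = rubric_judge(profile, history, reply)
            if verdict in ("pass", "solved"):
                break
        all_passed = all_passed and verdict in ("pass", "solved")
        history.append({"role": "assistant", "text": reply})
        if verdict == "solved":                         # early termination on 'Solved'
            break
    # Rubric-passing trajectories go to the 'easy' (SFT) split, the rest to 'hard' (RL).
    return {"history": history, "split": "easy" if all_passed else "hard"}
```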
2. Rubric-as-Judge Empathetic Reinforcement Learning (Rubric-ERL)
Rubric-ERL constitutes the primary training regime for Kardia-R1, employing a structured RL objective with explicit rubrics. The optimization follows the GRPO framework:
- For each context $x$, a group of $G$ candidate outputs $\{y_i\}_{i=1}^{G}$ is sampled from the old policy $\pi_{\theta_{\text{old}}}$.
- Raw rewards $r_i$ are group-normalized into advantages $\hat{A}_i$:

$$\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1,\dots,r_G\})}{\operatorname{std}(\{r_1,\dots,r_G\})}.$$

- The clipped surrogate RL objective is

$$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\!\Big(\rho_i \hat{A}_i,\ \operatorname{clip}\big(\rho_i,\, 1-\epsilon,\, 1+\epsilon\big)\hat{A}_i\Big)\right] - \beta\, \mathbb{D}_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right),$$

with importance ratio $\rho_i = \pi_\theta(y_i \mid x) / \pi_{\theta_{\text{old}}}(y_i \mid x)$; a code sketch follows this list.
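A PyTorch-style sketch of the group normalization and clipped surrogate above; the sequence-level log-probabilities, the sample-based KL approximation, and the default hyperparameters are assumptions rather than the paper's exact implementation.

```python
import torch

def grpo_loss(logp_new, logp_old, rewards, logp_ref=None, eps=0.2, beta=0.01):
    """Clipped GRPO surrogate for one group of G sampled responses.

    logp_new / logp_old / logp_ref: summed log-probabilities of each sampled
    response under the current, old, and reference policies, shape (G,).
    rewards: scalar trifold rewards r_i, shape (G,).
    """
    # Group-normalized advantages: A_i = (r_i - mean) / std
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Importance ratios rho_i = pi_theta(y_i|x) / pi_theta_old(y_i|x)
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    loss = -torch.min(unclipped, clipped).mean()
    if logp_ref is not None:
        # Crude sample-based stand-in for the KL penalty toward the reference policy
        loss = loss + beta * (logp_new - logp_ref).mean()
    return loss
```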
The reward decomposes into three components:
- Format reward enforces the four-span output (<|understanding|>, <|reasoning|>, <|emotion|>, <|response|>), penalizing missing or misordered tags.
- Emotion-matching reward is binary, based on agreement between the predicted and ground-truth emotion labels.
- Rubric reward is delivered by a specialized judge LLM that scores the <|response|> span for relevance, fluency, empathy, persona consistency, and safety, normalized to $[0, 1]$.
Stepwise empathetic cognition mandates the explicit sequential generation of the four spans per turn. Policy gradients are steered by rubric score maximization [Sec. 4, Eqs. 2–8].
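A minimal sketch of the trifold reward and the four-span format check described above, under the equal $1/3$ weighting reported for training; the tag parsing and the `rubric_judge` callable are illustrative assumptions.

```python
import re

SPAN_ORDER = ["<|understanding|>", "<|reasoning|>", "<|emotion|>", "<|response|>"]

def format_reward(output: str) -> float:
    """1.0 if all four spans appear in the mandated order, else 0.0."""
    positions = [output.find(tag) for tag in SPAN_ORDER]
    return float(all(p >= 0 for p in positions) and positions == sorted(positions))

def emotion_reward(output: str, gold_emotion: str) -> float:
    """Binary agreement between the predicted <|emotion|> span and the gold label."""
    match = re.search(r"<\|emotion\|>(.*?)<\|", output + "<|", flags=re.S)
    predicted = match.group(1).strip().lower() if match else ""
    return float(predicted == gold_emotion.strip().lower())

def total_reward(output: str, gold_emotion: str, rubric_judge) -> float:
    """Equal-weight combination of format, emotion, and rubric-judge rewards."""
    r_rubric = rubric_judge(output)  # judge LLM score, assumed normalized to [0, 1]
    return (format_reward(output) + emotion_reward(output, gold_emotion) + r_rubric) / 3.0
```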
3. KardiaBench Dataset
KardiaBench is a user-anchored testbed synthesized via a model-in-the-loop pipeline (Algorithm 1):
- Profiles: 671 anonymized MBTI-style user profiles (639 train / 32 test).
- Dialogues: 22,080 multi-turn conversations (19,533 train / 2,547 test) comprising 178,080 utterances.
- Structure: Mean dialogue length ≈8.07 turns; 32 emotion labels; each assistant reply formatted with four explicit spans.
At each conversation step, user and assistant utterances are iteratively generated, rubric-evaluated, and refined. Trajectories exhibiting persona drift or implausible emotion transitions are discarded. The test split is verified and corrected by professional annotators to ensure fidelity [Sec. 3, App. A, Table 6].
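For concreteness, a hypothetical shape of a single KardiaBench example inferred from the reported structure; the field names and the specific profile, emotion, and text values are invented for illustration, not the released schema.

```python
# Hypothetical KardiaBench-style record (field names and values are illustrative).
example = {
    "profile": {                      # one of 671 anonymized MBTI-style profiles
        "mbti": "INFJ",
        "traits": ["values loyalty", "needs emotional safety"],
    },
    "situation": "Argument with a close friend during exam week.",
    "emotion": "anxious",             # one of the 32 emotion labels
    "dialogue": [                     # multi-turn; mean length ~8 turns
        {"role": "user", "text": "I can't stop replaying the argument in my head..."},
        {"role": "assistant", "text": (
            "<|understanding|> You feel torn between studying and repairing the friendship. "
            "<|reasoning|> Loyalty matters deeply to you, so unresolved conflict is draining. "
            "<|emotion|> anxious "
            "<|response|> It makes sense this weighs on you. Would a short, honest message help you refocus?"
        )},
    ],
}
```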
4. Training, Implementation, and Optimization
Kardia-R1 employs Qwen2.5-3B/7B-Instruct and Gemma-2B/7B backbones. Training proceeds in two stages:
- Stage 1 SFT: cold-start fine-tuning on the easy set $\mathcal{D}_{\text{easy}}$ for 2 epochs with AdamW at the reported learning rate, batch size 128, on 8 A100 GPUs.
- Stage 2 Rubric-ERL: GRPO on the hard set $\mathcal{D}_{\text{hard}}$ for 2 RL epochs, batch size 32, $G$ sampled outputs per context, and reward weights of $1/3$ per reward component (see the configuration sketch after this list).
- Software stack: Ms-Swift + vLLM for deployment; Qwen3-Embedding-0.6B for embedding rewards; RLHF ablation via Skywork-Reward-V2; rubric judge instantiated via Qwen3-8B.
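A compact configuration sketch consolidating the reported settings; the learning rates and the GRPO group size were not reproduced above and are left as placeholders rather than values from the paper.

```python
# Two-stage training configuration assembled from the reported settings
# (learning rates and group size are placeholders, not values from the paper).
STAGE1_SFT = {
    "data": "D_easy",                 # rubric-passing trajectories
    "epochs": 2,
    "optimizer": "AdamW",
    "learning_rate": None,            # placeholder
    "batch_size": 128,
    "hardware": "8x A100",
}
STAGE2_RUBRIC_ERL = {
    "data": "D_hard",                 # remaining trajectories
    "algorithm": "GRPO",
    "rl_epochs": 2,
    "batch_size": 32,
    "samples_per_context": None,      # group size G (placeholder)
    "reward_weights": {"format": 1/3, "emotion": 1/3, "rubric": 1/3},
    "rubric_judge": "Qwen3-8B",
    "serving": "Ms-Swift + vLLM",
}
```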
A plausible implication is that the efficient RL stage allows smaller models to acquire nuanced empathetic reasoning at a cost lower than scaling to larger parameter sizes [Sec. 5.1, App. B.1].
5. Evaluation and Empirical Results
Evaluation comprises both automatic and human-centered protocols:
- Automatic metrics: Emotion Accuracy (proportion of correct emotion spans) and GPT-5-mini as an LLM judge scoring five dimensions: relevance, fluency, empathy, persona consistency, and safety (1–5 scale); a minimal metric sketch follows this list.
- Human expert A/B testing: Annotators compare Kardia-R1 against top baselines using the same rubric over 160 sampled cases.
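A minimal sketch of the two automatic metrics; the dimension keys and input shapes are assumptions for illustration.

```python
from typing import Dict, List

def emotion_accuracy(pred_labels: List[str], gold_labels: List[str]) -> float:
    """Proportion of assistant turns whose predicted <|emotion|> label matches the gold label."""
    hits = sum(p.strip().lower() == g.strip().lower() for p, g in zip(pred_labels, gold_labels))
    return hits / max(len(gold_labels), 1)

def judge_dimension_means(case_scores: List[Dict[str, float]]) -> Dict[str, float]:
    """Average the 1-5 judge scores per dimension over all evaluated cases."""
    dims = ["relevance", "fluency", "empathy", "persona_consistency", "safety"]
    return {d: sum(case[d] for case in case_scores) / len(case_scores) for d in dims}
```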
Key results (Table 3); arrows show each backbone's scores before → after Kardia-R1 training:
| Backbone | Emotion Acc. | Empathy (1–5) | Persona (1–5) | Safety (1–5) |
|---|---|---|---|---|
| Qwen2.5-7B | 9.54% → 66.53% | 2.59 → 3.79 | 3.75 → 4.64 | 4.49 → 4.66 |
| Gemma-7B | 2.46% → 64.48% | 2.41 → 3.75 | 3.34 → 4.52 | 4.16 → 4.75 |
Kardia-R1 outperforms general LLMs (GPT-4o, DeepSeek-V3/R1) and specialized baselines (Harnessing, ReflectDiffu, PsyLLM) in most dimensions. Ablation findings indicate that SFT achieves strong emotion accuracy, but Rubric-ERL drives improvements in empathy, persona, and safety. Qualitative assessment (Table 7) shows Kardia-R1 leveraging persona traits (e.g., MBTI loyalty, emotional safety) for nuanced responses, as compared to generic baseline outputs [Sec. 5.2, Tables 3–4, Fig. 3].
6. Insights, Limitations, and Future Directions
Major insights include:
- Rubric-guided, reasoning-oriented RL allows compact LLMs to match much larger models in empathy and persona consistency.
- Human-aligned rubric rewards provide transparent and interpretable policy feedback.
- Multi-turn, user-grounded data (KardiaBench) is critical for personalized emotional support.
Limitations:
- KardiaBench data, while large and diverse, remain LLM-synthesized and constrained by fixed profile traits; real user interactions may display greater variability.
- Rubric-judge LLMs potentially encode biases; ongoing human-in-the-loop auditing is advocated.
- The current scope is single-profile; multi-user, demographically diverse, and open-world deployment, as well as scalable safety, require further study.
A plausible implication is that integrating identity and emotional reasoning in LLMs may drive future advances in personalized, psychologically plausible conversational AI, contingent on more diverse real-world evaluations and robust, bias-controlled reward mechanisms [Sec. 6].