
Kardia-R1 RL Framework for Empathetic Dialogue

Updated 14 December 2025
  • Kardia-R1 is a reinforcement learning framework that applies Rubric-as-Judge ERL to train LLMs for interpretable, identity-aware empathetic dialogue.
  • It utilizes a multi-module pipeline—including profile-conditioned user generators, empathetic responders, and rubric evaluators—to refine dialogue responses iteratively.
  • Leveraging the KardiaBench dataset, the system achieves high empathy, persona consistency, and safety, outperforming baseline LLMs on key metrics.

Kardia-R1 is a reinforcement learning (RL) framework for LLMs focused on interpretable, identity-aware empathetic reasoning for emotional support dialogues. The core innovation lies in training LLMs via Rubric-as-Judge Empathetic Reinforcement Learning (Rubric-ERL), using human-aligned rubrics administered by specialized judge LLMs. Kardia-R1 leverages KardiaBench, a large-scale, user-grounded dataset synthesizing multi-turn conversations anchored to real personality profiles. The system achieves high performance across empathy, persona consistency, and safety on compact LLMs (2–7B parameters), with its methodology designed for extensible, interpretable deployment in personalized conversational agents (Yuan et al., 1 Dec 2025).

1. System Architecture and Workflow

Kardia-R1 is built on a multi-module conversational agent pipeline comprising three LLM systems:

  • Profile-conditioned user generator ($\mathcal{U}$) constructs user utterances conditioned on an MBTI-style profile $u$, a situational context $s$, and the full dialogue history, producing $x_t$ at turn $t$.
  • Empathetic responder ($\mathcal{A}$) generates assistant replies $y_t$ informed by both the history and the user profile.
  • Rubric evaluator ($\mathcal{R}$) applies a conceptual rubric to assess each assistant turn, emitting a decision $d_t \in \{\text{Fail}, \text{Pass}, \text{Solved}\}$.

The synthesis procedure involves a double-iteration loop over each conversation turn (inner) and dialogue trajectory (outer):

  • For each $x_t$, up to $K_{\max}=5$ inner refinements are made on $y_t$ until the rubric criteria are satisfied.
  • Dialogues terminate early on 'Solved' or after $T_{\max}=10$ turns.

The training pipeline partitions dialogue trajectories by rubric outcome, with high pass/solved rates forming the 'easy' set $\mathcal{D}_{\text{easy}}$ and the remainder forming the 'hard' set $\mathcal{D}_{\text{hard}}$. Stage 1 applies cold-start supervised fine-tuning (SFT) on $\mathcal{D}_{\text{easy}}$ to establish model grounding. Stage 2 instantiates Rubric-ERL with Group Relative Policy Optimization (GRPO) to further improve empathetic and persona-consistent dialogue enactment on $\mathcal{D}_{\text{hard}}$ [Sec. 4].
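The synthesis and partitioning logic can be summarized in a short sketch. This is a minimal Python illustration under stated assumptions: `user_generator`, `responder`, and `rubric_judge` are hypothetical callables standing in for $\mathcal{U}$, $\mathcal{A}$, and $\mathcal{R}$, and the 0.8 easy/hard threshold is illustrative rather than taken from the paper.

```python
from typing import Callable, Dict, List, Tuple

K_MAX = 5   # inner refinements per assistant turn
T_MAX = 10  # maximum dialogue turns

def synthesize_dialogue(profile: Dict, situation: str,
                        user_generator: Callable,   # stands in for U: (profile, situation, history) -> user utterance
                        responder: Callable,        # stands in for A: (profile, history) -> assistant reply
                        rubric_judge: Callable) -> Dict:  # stands in for R: (history, reply) -> "Fail"|"Pass"|"Solved"
    """Outer loop over turns, inner loop over rubric-guided refinements."""
    history: List[Dict] = []
    passes = 0
    for _ in range(T_MAX):
        x_t = user_generator(profile, situation, history)
        history.append({"role": "user", "content": x_t})

        decision = "Fail"
        for _ in range(K_MAX):                      # refine y_t until the rubric is satisfied
            y_t = responder(profile, history)
            decision = rubric_judge(history, y_t)
            if decision in ("Pass", "Solved"):
                passes += 1
                break
        history.append({"role": "assistant", "content": y_t, "decision": decision})

        if decision == "Solved":                    # early termination on 'Solved'
            break

    pass_rate = passes / max(1, len(history) // 2)  # fraction of assistant turns that pass or solve
    return {"turns": history, "pass_rate": pass_rate}

def partition(dialogues: List[Dict], threshold: float = 0.8) -> Tuple[List[Dict], List[Dict]]:
    """Split trajectories into D_easy (high pass/solved rate) and D_hard (the rest)."""
    d_easy = [d for d in dialogues if d["pass_rate"] >= threshold]
    d_hard = [d for d in dialogues if d["pass_rate"] < threshold]
    return d_easy, d_hard
```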

2. Rubric-as-Judge Empathetic Reinforcement Learning (Rubric-ERL)

Rubric-ERL constitutes the primary training regime for Kardia-R1, employing a structured RL objective with explicit rubrics. The optimization follows the GRPO framework:

  • For each context $c$, $N$ candidate outputs $o_1,\ldots,o_N$ are sampled from the policy $\pi_\theta$.
  • Raw rewards $r_j$ are group-normalized into advantages $A_j$:

$$\mu_r = \tfrac{1}{N} \sum_{k=1}^N r_k,\quad \sigma_r = \sqrt{\tfrac{1}{N} \sum_{k=1}^N (r_k-\mu_r)^2},\quad A_j = \frac{r_j - \mu_r}{\sigma_r + \varepsilon}$$

  • The clipped surrogate RL objective is

$$J_{\rm GRPO}(\theta) = \mathbb{E}_c \left[ \tfrac{1}{N}\sum_{j=1}^N \min\left( \rho_j A_j,\, \mathrm{clip}(\rho_j, 1-\epsilon, 1+\epsilon)\,A_j \right) \right] - \beta\,\mathbb{E}_c\left[ D_{\rm KL}\big(\pi_\theta(\cdot\mid c)\,\|\,\pi_{\theta_0}(\cdot\mid c)\big) \right]$$

with $\rho_j = \frac{\pi_\theta(o_j\mid c)}{\pi_{\theta_0}(o_j\mid c)}$.
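The group normalization and clipped surrogate can be sketched as follows. This is an illustrative PyTorch-style reading of the equations above, not the authors' training code: `logp_new`/`logp_old` are assumed sequence-level log-probabilities, the KL estimate is supplied externally, and `clip_eps`/`beta` defaults are placeholders rather than reported values.

```python
import torch

def grpo_loss(logp_new: torch.Tensor,   # log pi_theta(o_j | c), shape (N,)
              logp_old: torch.Tensor,   # log pi_theta_0(o_j | c), shape (N,)
              rewards: torch.Tensor,    # raw scalar rewards r_j, shape (N,)
              kl_term: torch.Tensor,    # estimate of KL(pi_theta || pi_theta_0) for this context
              clip_eps: float = 0.2,
              beta: float = 0.01,
              eps: float = 1e-6) -> torch.Tensor:
    """Negative GRPO objective for one context c with N sampled outputs."""
    # Group-normalized advantages: A_j = (r_j - mu_r) / (sigma_r + eps)
    mu_r = rewards.mean()
    sigma_r = rewards.std(unbiased=False)
    adv = (rewards - mu_r) / (sigma_r + eps)

    # Importance ratios rho_j = pi_theta(o_j|c) / pi_theta_0(o_j|c)
    ratio = torch.exp(logp_new - logp_old)

    # Clipped surrogate, averaged over the group of N samples
    surrogate = torch.minimum(ratio * adv,
                              torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv).mean()

    # Maximize surrogate minus KL penalty, so minimize the negation
    return -(surrogate - beta * kl_term)
```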

The reward decomposes into three equally weighted components:

$$r_j = \lambda_f\,r^{\rm fmt}_j + \lambda_e\,r^{\rm emo}_j + \lambda_r\,r^{\rm rub}_j,\qquad \lambda_f=\lambda_e=\lambda_r=\tfrac{1}{3}$$

  • Format reward $r^{\rm fmt}_j$ enforces the four-span output (<|understanding|>, <|reasoning|>, <|emotion|>, <|response|>), penalizing missing or misordered tags.
  • Emotion-matching reward $r^{\rm emo}_j$ is binary, based on agreement between the predicted and ground-truth emotion labels.
  • Rubric reward $r^{\rm rub}_j$ is delivered by a specialized LLM scoring the <|response|> span for relevance, fluency, empathy, persona consistency, and safety, normalized to $[0,1]$.

Stepwise empathetic cognition requires the model to generate the four spans explicitly and sequentially in each turn; policy gradients are driven by maximization of the rubric score [Sec. 4, Eqs. 2–8].
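A minimal sketch of how the three components might be combined is given below, assuming a hypothetical `rubric_judge_score` callable and treating the tag-order check and emotion match as simple string operations; the paper's actual scoring prompts and parsers are not reproduced here.

```python
import re

TAGS = ["<|understanding|>", "<|reasoning|>", "<|emotion|>", "<|response|>"]

def format_reward(output: str) -> float:
    """1.0 if all four spans are present in the required order, else 0.0
    (a binary check is assumed here; partial-credit schemes are possible)."""
    positions = [output.find(tag) for tag in TAGS]
    ordered = all(p >= 0 for p in positions) and positions == sorted(positions)
    return 1.0 if ordered else 0.0

def emotion_reward(output: str, gold_emotion: str) -> float:
    """Binary agreement between the predicted <|emotion|> span and the gold label."""
    match = re.search(r"<\|emotion\|>(.*?)(<\|response\|>|$)", output, re.S)
    predicted = match.group(1).strip().lower() if match else ""
    return 1.0 if predicted == gold_emotion.lower() else 0.0

def total_reward(output: str, gold_emotion: str, rubric_judge_score) -> float:
    """r_j = (r_fmt + r_emo + r_rub) / 3, with the rubric score already in [0, 1]."""
    r_fmt = format_reward(output)
    r_emo = emotion_reward(output, gold_emotion)
    r_rub = rubric_judge_score(output)   # hypothetical judge-LLM call returning a [0, 1] score
    return (r_fmt + r_emo + r_rub) / 3.0
```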

3. KardiaBench Dataset

KardiaBench is a user-anchored testbed synthesized via a model-in-the-loop pipeline (Algorithm 1):

  • Profiles: 671 anonymized MBTI-style user profiles (639 train / 32 test).
  • Dialogues: 22,080 multi-turn conversations (19,533 train / 2,547 test) comprising 178,080 utterances.
  • Structure: Mean dialogue length ≈8.07 turns; 32 emotion labels; each assistant reply formatted with four explicit spans.

At each conversation step, user and assistant utterances are iteratively generated, rubric-evaluated, and refined. Trajectories exhibiting persona drift or implausible emotion transitions are discarded. The test split is verified and corrected by professional annotators to ensure fidelity [Sec. 3, App. A, Table 6].
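For orientation, a KardiaBench-style record might be represented as below; the field names are hypothetical and chosen only to mirror the structure described in this section (profile, situational context, emotion label, four-span assistant replies, rubric decision), not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AssistantTurn:
    understanding: str          # <|understanding|> span
    reasoning: str              # <|reasoning|> span
    emotion: str                # <|emotion|> span (one of the 32 emotion labels)
    response: str               # <|response|> span shown to the user

@dataclass
class DialogueTurn:
    user_utterance: str
    assistant: AssistantTurn
    rubric_decision: str        # "Fail" | "Pass" | "Solved"

@dataclass
class KardiaBenchDialogue:
    profile_id: str             # one of the 671 anonymized MBTI-style profiles
    situation: str              # situational context seeding the conversation
    turns: List[DialogueTurn] = field(default_factory=list)   # ~8 turns on average
```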

4. Training, Implementation, and Optimization

Kardia-R1 employs Qwen2.5-3B/7B-Instruct and Gemma-2B/7B backbones. Training proceeds in two stages:

  • Stage 1 SFT:
    • $\mathcal{D}_{\text{easy}}$, 2 epochs, AdamW, learning rate $1\times10^{-4}$, batch size 128, using 8 A100 GPUs.
  • Stage 2 Rubric-ERL:
    • $\mathcal{D}_{\text{hard}}$, GRPO, learning rate $1\times10^{-6}$, 2 RL epochs, batch size 32, $N=8$ sampled outputs per context, reward weights of $1/3$ per reward component (collected in the sketch below).
  • Software stack: Ms-Swift + vLLM for deployment; Qwen3-Embedding-0.6B for embedding rewards; RLHF ablation via Skywork-Reward-V2; rubric judge instantiated via Qwen3-8B.
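The two-stage recipe above can be captured in a small configuration sketch; values mirror those listed in this section, and any hyperparameter not stated in the text (e.g., warmup schedule or maximum sequence length) is deliberately omitted rather than guessed.

```python
# Stage-wise hyperparameters as reported above (illustrative layout only).
STAGE1_SFT = {
    "data": "D_easy",
    "epochs": 2,
    "optimizer": "AdamW",
    "learning_rate": 1e-4,
    "batch_size": 128,
    "hardware": "8x A100",
}

STAGE2_RUBRIC_ERL = {
    "data": "D_hard",
    "algorithm": "GRPO",
    "learning_rate": 1e-6,
    "epochs": 2,
    "batch_size": 32,
    "num_samples_per_context": 8,     # N = 8
    "reward_weights": {"format": 1/3, "emotion": 1/3, "rubric": 1/3},
}
```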

A plausible implication is that the efficient RL stage allows smaller models to acquire nuanced empathetic reasoning at a cost lower than scaling to larger parameter sizes [Sec. 5.1, App. B.1].

5. Evaluation and Empirical Results

Evaluation comprises both automatic and human-centered protocols:

  • Automatic metrics: Emotion Accuracy (proportion of correct emotion spans), and GPT-5-mini as judge on five dimensions: relevance, fluency, empathy, persona consistency, safety (scores 1–5).
  • Human expert A/B testing: Annotators compare Kardia-R1 against top baselines using the same rubric over 160 sampled cases.
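As an illustration of the automatic protocol, the sketch below computes emotion accuracy from the <|emotion|> spans and averages per-dimension judge scores; the judge outputs are assumed to be pre-collected, the dimension keys are hypothetical, and the actual GPT-5-mini prompts are not reproduced here.

```python
import re
from statistics import mean
from typing import Dict, List

def extract_emotion(output: str) -> str:
    """Pull the predicted label out of the <|emotion|> span."""
    match = re.search(r"<\|emotion\|>(.*?)(<\|response\|>|$)", output, re.S)
    return match.group(1).strip().lower() if match else ""

def emotion_accuracy(outputs: List[str], gold_labels: List[str]) -> float:
    """Proportion of turns whose predicted emotion matches the gold label."""
    hits = sum(extract_emotion(o) == g.lower() for o, g in zip(outputs, gold_labels))
    return hits / max(1, len(gold_labels))

def aggregate_judge_scores(per_turn_scores: List[Dict[str, float]]) -> Dict[str, float]:
    """Average 1-5 judge scores per dimension across evaluated turns."""
    dims = ["relevance", "fluency", "empathy", "persona_consistency", "safety"]
    return {d: mean(s[d] for s in per_turn_scores) for d in dims}
```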

Key results (Table 3):

Model      | Emotion Acc. (base → Kardia-R1) | Empathy     | Persona     | Safety
Qwen2.5-7B | 9.54 → 66.53%                   | 2.59 → 3.79 | 3.75 → 4.64 | 4.49 → 4.66
Gemma-7B   | 2.46 → 64.48%                   | 2.41 → 3.75 | 3.34 → 4.52 | 4.16 → 4.75

Kardia-R1 outperforms general LLMs (GPT-4o, DeepSeek-V3/R1) and specialized baselines (Harnessing, ReflectDiffu, PsyLLM) in most dimensions. Ablation findings indicate that SFT achieves strong emotion accuracy, but Rubric-ERL drives improvements in empathy, persona, and safety. Qualitative assessment (Table 7) shows Kardia-R1 leveraging persona traits (e.g., MBTI loyalty, emotional safety) for nuanced responses, as compared to generic baseline outputs [Sec. 5.2, Tables 3–4, Fig. 3].

6. Insights, Limitations, and Future Directions

Major insights include:

  • Rubric-guided, reasoning-oriented RL allows compact LLMs to match much larger models in empathy and persona consistency.
  • Human-aligned rubric rewards provide transparent and interpretable policy feedback.
  • Multi-turn, user-grounded data (KardiaBench) is critical for personalized emotional support.

Limitations:

  • KardiaBench data, while large and diverse, remain LLM-synthesized and constrained by fixed profile traits; real user interactions may display greater variability.
  • Rubric-judge LLMs potentially encode biases; ongoing human-in-the-loop auditing is advocated.
  • The current scope is single-profile; extensive multi-user, demographic, and open-world deployment, as well as scalable safety, require further study.

A plausible implication is that integrating identity and emotional reasoning in LLMs may drive future advances in personalized, psychologically plausible conversational AI, contingent on more diverse real-world evaluations and robust, bias-controlled reward mechanisms [Sec. 6].
