MEDIC: Multimodal Empathy Dataset in Counseling

  • MEDIC is a multimodal empathy dataset comprising synchronized video, audio, and text recordings of counseling sessions, annotated for client expression of experience (EE) and counselor emotional and cognitive reactions (ER, CR).
  • It features a multidimensional annotation schema with high interrater reliability (Fleiss' κ ≈ 0.7; ICC > 0.8), enabling rigorous benchmarking of empathic dynamics in therapeutic discourse.
  • Baseline models (TFN, SWAFN, and simple LSTM fusion) demonstrate that integrating multimodal data improves empathy prediction performance, highlighting the dataset’s research potential.

The Multimodal Empathy Dataset in Counseling (MEDIC) is a publicly available corpus comprising synchronized multimodal recordings and annotations of face-to-face counselor–client interactions, created to rigorously quantify and benchmark empathy in real-world psychotherapeutic discourse. MEDIC provides high-quality data covering video, audio, and text modalities, with each interaction labeled along three dimensions reflecting both client self-disclosure and counselor empathic response. This resource addresses a critical gap in computational psychotherapy research by enabling the systematic analysis and automated modeling of empathic processes in counseling sessions (Zhu et al., 2023).

1. Dataset Composition and Acquisition

MEDIC is constructed from educational re-enactments of real counseling cases, each performed by professional counselors on 10 discrete topics (marital discord, career dilemmas, family education, meaning of life, etc.). The corpus comprises 38 videos totaling 678 minutes (≈11 hours), with an average session lasting 17 minutes and 50 seconds. The videos are segmented into 771 "samples" in total, each corresponding to a client speaking turn immediately followed by a counselor response, i.e., one full "empathy cycle". Samples are not subdivided at vocalized pauses, ensuring semantically coherent dialog units.
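For orientation, a minimal sketch of how one such sample might be represented in code is given below; the field names and types are illustrative assumptions and do not reflect the schema of the released files.

```python
# Hypothetical container for one MEDIC "sample" (one empathy cycle:
# a client speaking turn followed by the counselor response).
# Field names/types are assumptions for exposition only.
from dataclasses import dataclass
import numpy as np

@dataclass
class EmpathyCycleSample:
    topic: str              # one of the 10 counseling topics
    video_pose: np.ndarray  # per-frame keypoint features (see the modality list below)
    audio_mfcc: np.ndarray  # frame-level acoustic features
    transcript: str         # proofread, turn-aligned text
    ee: int                 # client Expression of Experience label (Section 2)
    er: int                 # counselor Emotional Reaction label
    cr: int                 # counselor Cognitive Reaction label
```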

Multimodal capture includes:

  • Video: Frame-level pose data extracted via OpenPose, yielding 411-dimensional feature vectors (face, torso, hand keypoints) per frame.
  • Audio: Dual-channel stereo recording enables clean separation of speaker contributions. MFCC features (20 dimensions) are extracted at ≈100 Hz using librosa.
  • Text: Speech recognition (Wondershare UniConverter) and subsequent manual proofreading result in highly accurate, turn-aligned transcripts. Text is embedded with a Chinese BERT model (768 dimensions per token).
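The snippet below sketches how the audio and text features described above could be extracted with librosa and Hugging Face Transformers; the file path and the specific Chinese BERT checkpoint (bert-base-chinese) are assumptions rather than details taken from the MEDIC release.

```python
# A minimal sketch of per-modality feature extraction, assuming an audio file
# path and the "bert-base-chinese" checkpoint (both hypothetical here).
import librosa
import torch
from transformers import BertModel, BertTokenizer

# Audio: 20-dimensional MFCCs at roughly 100 Hz (10 ms hop at the native rate).
y, sr = librosa.load("session_audio.wav", sr=None, mono=True)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, hop_length=sr // 100)
print(mfcc.shape)  # (20, n_frames), about 100 frames per second of audio

# Text: 768-dimensional token embeddings from a Chinese BERT model.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")
inputs = tokenizer("咨询师的回应文本", return_tensors="pt")  # "the counselor's reply text"
with torch.no_grad():
    token_embeddings = bert(**inputs).last_hidden_state
print(token_embeddings.shape)  # (1, seq_len, 768)
```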

All personally identifying information is explicitly redacted, and only abstracted keypoint data and ASR transcripts are made available for research (Zhu et al., 2023).

2. Multidimensional Annotation Schema

Empathic interaction is annotated at each sample (client turn + counselor response) using three interrelated scales:

  • Expression of Experience (EE; client): Captures the extent of client self-disclosure that could evoke empathy, on a 3-point ordinal scale (0 = none, 1 = weak, 2 = strong).
  • Emotional Reaction (ER; counselor): Measures affective resonance via verbal and nonverbal markers (0 = none, 1 = weak). Overtly strong emotional displays did not occur in the data, resulting in a binary scale.
  • Cognitive Reaction (CR; counselor): Assesses reflection, interpretation, or exploration of client content (0 = none, 1 = weak prompt, 2 = strong/exploratory response).

Five expert annotators (advanced counseling students) label each sample via video, audio, and transcript, following a three-stage instruction protocol. Dual annotations are reconciled by a third adjudicator. Interrater reliability demonstrates strong agreement: Fleiss' κ is 0.756 (EE), 0.699 (ER), and 0.710 (CR); ICC > 0.8 for all dimensions, indicating high consistency and annotation quality (Zhu et al., 2023).
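To illustrate the agreement statistics cited above, the following sketch computes Fleiss' κ with statsmodels on a toy label matrix; the labels are invented, and the exact MEDIC procedure (five annotators, adjudicated pairs) may differ in detail.

```python
# Toy example of Fleiss' kappa; the label matrix is illustrative, not MEDIC data.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = samples, columns = annotators; entries are EE labels in {0, 1, 2}
ee_labels = np.array([
    [2, 2, 2],
    [1, 1, 2],
    [0, 0, 0],
    [2, 1, 2],
])
counts, _ = aggregate_raters(ee_labels)  # (n_samples, n_categories) count table
print(fleiss_kappa(counts))              # MEDIC reports κ = 0.756 for EE
```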

3. Dataset Statistics and Empathic Dynamics

Quantitative analysis reveals that each turn comprises on average 4.29 speech segments and 129.5 words, with a mean audio duration of 52.8 s and a mean length of 1,137 video frames. Clients tend to contribute longer utterances (mean 11.5 s) than counselors (6.6 s). The distribution of empathy-cycle labels is as follows:

| Dimension      | 0   | 1   | 2   |
|----------------|-----|-----|-----|
| EE (client)    | 24% | 32% | 44% |
| ER (counselor) | 60% | 40% | —   |
| CR (counselor) | 25% | 34% | 41% |

The Pearson correlation between ER and CR is 0.45, indicating moderately related but distinct constructs. Sessions on marital discord are associated with extended client disclosure and balanced ER/CR, while career dilemmas prompt the most cognitive responses from counselors.
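The label distribution and the ER–CR correlation can be reproduced from per-sample annotations along the following lines; the arrays here are random placeholders standing in for the actual labels, so the printed numbers will not match those reported.

```python
# Placeholder label arrays stand in for the 771 per-sample annotations.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
er = rng.integers(0, 2, size=771)   # ER labels in {0, 1}
cr = rng.integers(0, 3, size=771)   # CR labels in {0, 1, 2}

# Label distribution per dimension (cf. the table above)
for name, labels in (("ER", er), ("CR", cr)):
    values, counts = np.unique(labels, return_counts=True)
    print(name, dict(zip(values.tolist(), np.round(counts / counts.sum(), 2))))

# Correlation between the two counselor dimensions (MEDIC reports r ≈ 0.45)
r, p = pearsonr(er, cr)
print(round(r, 2), p)
```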

4. Baseline Modeling and Empathy Prediction

MEDIC supports multimodal empathy prediction tasks using synchronized video, audio, and text input streams. Three neural models are benchmarked:

  • Tensor Fusion Network (TFN): Employs unimodal subnets and an outer-product fusion architecture to explicitly model all uni-, bi-, and tri-modal feature interactions.
  • Sentimental Words Aware Fusion Network (SWAFN): Integrates cross-modal co-attention and sentiment-word auxiliary guidance, emphasizing textual features most related to empathy.
  • Simple Concatenation Model: Uni-directional LSTM encodings for each modality are concatenated for classification.
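To make the simplest of these concrete, the sketch below implements a concatenation baseline in PyTorch with one unidirectional LSTM per modality, assuming the feature dimensions from Section 1 (411-d pose, 20-d MFCC, 768-d BERT tokens); the hidden size and classification head are illustrative choices, not the authors' exact configuration.

```python
# A minimal sketch of the simple concatenation baseline (not the TFN/SWAFN
# architectures); hyperparameters are assumptions.
import torch
import torch.nn as nn

class ConcatFusionBaseline(nn.Module):
    def __init__(self, dims=(411, 20, 768), hidden=128, n_classes=3):
        super().__init__()
        # One unidirectional LSTM encoder per modality (video, audio, text).
        self.encoders = nn.ModuleList(
            nn.LSTM(d, hidden, batch_first=True) for d in dims
        )
        self.classifier = nn.Linear(hidden * len(dims), n_classes)

    def forward(self, video, audio, text):
        feats = []
        for x, enc in zip((video, audio, text), self.encoders):
            _, (h_n, _) = enc(x)   # final hidden state summarizes the sequence
            feats.append(h_n[-1])  # (batch, hidden)
        return self.classifier(torch.cat(feats, dim=-1))

model = ConcatFusionBaseline()
logits = model(torch.randn(4, 50, 411),   # video: (batch, frames, 411)
               torch.randn(4, 200, 20),   # audio: (batch, frames, 20)
               torch.randn(4, 30, 768))   # text:  (batch, tokens, 768)
print(logits.shape)  # torch.Size([4, 3]), e.g. the three EE classes
```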

Evaluation is conducted with a 7:1:2 train:validation:test split. Macro-F1 scores for SWAFN (multimodal) on the test set are: EE = 0.863, ER = 0.743, CR = 0.785. Textual features alone are most informative, but multimodal fusion yields consistent, though moderate, performance gains for all tasks. Audio features particularly aid in detecting emotional reactions, while video, though weaker alone, yields a 2–4% F1 boost when fused with text and audio (Zhu et al., 2023).
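The evaluation protocol itself is straightforward to mirror; the sketch below applies a random 7:1:2 split and computes macro-F1 with scikit-learn, using placeholder labels and predictions rather than outputs of the benchmarked models.

```python
# Minimal sketch of a 7:1:2 split and macro-F1 scoring; labels/predictions
# are placeholders, not model outputs on MEDIC.
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

indices = np.arange(771)                                  # one index per sample
train_idx, rest = train_test_split(indices, test_size=0.3, random_state=0)
val_idx, test_idx = train_test_split(rest, test_size=2 / 3, random_state=0)
print(len(train_idx), len(val_idx), len(test_idx))        # roughly 7:1:2

y_true = np.random.randint(0, 3, size=len(test_idx))      # e.g. EE labels
y_pred = np.random.randint(0, 3, size=len(test_idx))
print(f1_score(y_true, y_pred, average="macro"))
```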

5. Significance, Limitations, and Recommendations

MEDIC constitutes the first publicly available, tri-modal, finely annotated corpus of face-to-face counseling sessions with explicit empathy mechanism labeling. Its high inter-annotator reliability (ICC > 0.8; κ ≈ 0.7) and robust baseline evaluation establish it as a standard computational resource for analyzing, modeling, and ultimately enhancing empathic behavior in therapeutic agents.

Limitations include the relatively small sample size (771 samples) and label imbalance (e.g., few "strong" emotional reactions), restricting deep model generalizability. Automated models may overemphasize utterance length or simple prosodic features, potentially missing nuanced empathic dynamics. Incorporating hierarchical dialog context and pre-training on larger corpora are proposed directions for improving performance (Zhu et al., 2023).

Comparison to subsequent resources, such as E-THER, underscores differences in theoretical grounding and annotation depth. MEDIC annotates expressions of experience and counselor reactions but does not systematically label verbal–visual incongruence, engagement, or valence–arousal–dominance dimensions, which have been introduced in more granular, theory-driven datasets like E-THER (Tahir et al., 2025). A plausible implication is that combining the structured empathy scales of MEDIC with incongruence- and engagement-focused schemas may enable richer modeling of authentic empathic understanding.

6. Future Expansions and Integration with Multimodal Therapy Research

Recommendations for advancing MEDIC, grounded in design experiences from adjacent multimodal counseling datasets (such as EMMI), include:

  • Expanding coverage beyond video, audio, and text to include physiological measures (eye-tracking, heart rate, skin conductance, fMRI) for affective state modeling.
  • Adding new annotation layers: global session ratings (therapeutic alliance, overall empathy), fine-grained empathy ratings per turn, and markers for cultural or linguistic context.
  • Standardizing codebooks and releasing curated toolkits for scalable, reproducible annotation.
  • Structured training and calibration of annotators, and inclusion of role-played edge cases (e.g., crisis interventions).
  • Open publication of feature-aligned, multimodal datasets and APIs supporting downstream research (Galland et al., 2024).

Adherence to these recommendations would make MEDIC a richer, extensible foundation for adaptive, empathic virtual therapists and empirical inquiry into the mechanisms of psychotherapy.

7. Relation to the Broader Multimodal Empathy Dataset Landscape

MEDIC bridges the empirical gap between raw behavioral counseling recordings and structurally annotated empathy corpora. It is methodologically distinct from datasets such as EMMI—which employs a hierarchical labeling of motivational interviewing acts and social behaviors, supports clustering of patient "types," and enables therapeutic adaptation modeling—and E-THER, which foregrounds person-centered therapy (PCT) constructs, records verbal–visual incongruence, and explicitly rates engagement and affect. The following table contrasts key properties:

| Dataset | Modality Count | Empathy Dimensions                      | Incongruence Marking | Theory-Grounded Labels   |
|---------|----------------|-----------------------------------------|----------------------|--------------------------|
| MEDIC   | 3              | EE, ER, CR                              | No                   | No unified theory        |
| E-THER  | 3              | 5 (incl. VAD, engagement, incongruence) | Yes                  | PCT (Rogers)             |
| EMMI    | 3+             | MI + social/empathic                    | No                   | MI theory, patient types |

This landscape situates MEDIC as an essential empirical anchor for computational empathy, while also indicating the need for ongoing enhancement of both annotation depth and theoretical integration (Zhu et al., 2023; Galland et al., 2024; Tahir et al., 2025).
