
Personality Recognition in Conversations

Updated 24 March 2026
  • PRC is the computational task of inferring the Big Five personality traits from dialogues using supervised and unsupervised learning approaches.
  • It employs annotated multimodal datasets and advanced neural architectures such as BERT, GNNs, and transformers to model context and speaker dynamics.
  • Recent research emphasizes explainability and multimodal fusion, which improve trait prediction accuracy and support adaptive conversational systems.

Personality Recognition in Conversations (PRC) refers to the computational task of inferring stable psychological personality dimensions, most commonly the Big Five (OCEAN: Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism), from stretches of spoken or written conversational data. Conversational personality recognition is foundational for dialogue systems, user modeling, and computational behavioral science, providing the backbone for applications such as adaptive conversational agents, human-robot interaction, and large-scale social signal processing.

1. Conceptual Foundations and Formal Task Definition

PRC is generally framed as a supervised or unsupervised learning problem: given input in the form of dialogues—possibly monologues, dyads, or multiparty conversations—the system predicts a per-speaker vector of trait values. While traditionally “personality recognition” in NLP was based on isolated texts (e.g., essays, social-media posts), Jiang et al. were the first to explicitly formalize PRC in the context of multi-turn dialogues, defining the task as predicting, for each stretch of text or sub-scene, a set of binary (or continuous) values for the Big Five (Jiang et al., 2019).

Let $D = (u_1, u_2, \ldots, u_n)$ denote a sequence of utterances in a conversation, with annotated speakers $s_i$. The task is to learn, for each target speaker $s$, the mapping:

$$\mathrm{PRC}: \{ D \mid s \} \to (y_{AGR}, y_{CON}, y_{EXT}, y_{OPN}, y_{NEU}),$$

where $y_k \in \{0,1\}$ (binary classification) or $y_k \in [0,1]$ (continuous/regression).

Recent advances formalize PRC not only as direct trait prediction but also as a reasoning chain from short-term personality “states” to long-term “traits,” with supporting evidence extracted from local utterances (Sun et al., 2024).
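The mapping above can be sketched in Python. The `score_fn` below is a placeholder (any trained model would supply real per-trait scores in $[0,1]$), and thresholding at 0.5 recovers the binary formulation; all names here are illustrative, not from any cited system.

```python
from dataclasses import dataclass

# Big Five trait keys, matching (y_AGR, y_CON, y_EXT, y_OPN, y_NEU)
TRAITS = ("AGR", "CON", "EXT", "OPN", "NEU")

@dataclass
class Utterance:
    speaker: str
    text: str

def prc_predict(dialogue, target_speaker, score_fn):
    """Map a dialogue D and target speaker s to a Big Five vector.

    score_fn returns a score in [0, 1] per trait (the regression view);
    thresholding at 0.5 recovers the binary labels y_k in {0, 1}.
    """
    scores = {t: score_fn(dialogue, target_speaker, t) for t in TRAITS}
    labels = {t: int(v >= 0.5) for t, v in scores.items()}
    return scores, labels

# Placeholder scorer: the target speaker's share of utterances
# (illustrative only, not a real personality model).
def toy_score(dialogue, speaker, trait):
    mine = sum(1 for u in dialogue if u.speaker == speaker)
    return mine / max(len(dialogue), 1)

dlg = [Utterance("A", "hi"), Utterance("B", "hello"), Utterance("A", "how are you?")]
scores, labels = prc_predict(dlg, "A", toy_score)
```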

2. Datasets and Annotation Protocols

Significant progress in PRC is due to the development of annotated conversational corpora across multiple modalities and cultures:

  • FriendsPersona: The first dialogue-based English PRC corpus, constructed from Friends TV-series scripts using a sliding-window MSF algorithm to segment sub-scenes, with three independent raters labeling each trait per dialogue using a seven-point scale; labels obtained via median split (Jiang et al., 2019).
  • CPED: Chinese Personalized and Emotional Dialogue Dataset featuring 11.8k dialogues from 392 TV characters, every trait annotated as high/low per speaker by expert psychologists according to the BFI-2 inventory, with three-way majority voting and multimodal alignment (text, audio, video) (Chen et al., 2022).
  • Vyaktitv: Peer-to-peer Hindi dyads with full audiovisual capture and self-report Big Five, richly annotated for socio-demographic profiles and code-switched Hinglish phenomena (Khan et al., 2020).
  • PersonalityEvd: Constructed around the Chain-of-Personality-Evidence protocol for explainable PRC, includes utterance-level chain-of-thought evidence for states and traits per Big Five facet (Sun et al., 2024).
  • Personality in Speech (e.g., PersonaTAB, Teams corpus): Telephonic, multiparty, or dyadic settings with synchronized audio, turn-taking and emotion/backchannel annotation, and human ratings for personality alignment (Inoue et al., 20 May 2025, Yu et al., 2019).

Inter-annotator agreement in these corpora varies; e.g., FriendsPersona reports average pairwise Cohen’s κ ≈ 0.55 and Fleiss’ κ ≈ 0.21, consistent with the subjectivity of text-based trait inference (Jiang et al., 2019). CPED uses expert review but does not report κ (Chen et al., 2022).
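The median-split labeling and pairwise agreement used for FriendsPersona can be sketched without dependencies. This is a simplification: the real protocol operates per trait over per-rater seven-point scores.

```python
import statistics

def median_split(ratings):
    """Binarize seven-point trait ratings at the corpus median:
    above-median scores become 1 (high trait), the rest 0."""
    med = statistics.median(ratings)
    return [int(r > med) for r in ratings]

def cohens_kappa(a, b):
    """Pairwise Cohen's kappa for two binary annotators."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    pa, pb = sum(a) / n, sum(b) / n
    pe = pa * pb + (1 - pa) * (1 - pb)                  # chance agreement
    return 1.0 if pe == 1 else (po - pe) / (1 - pe)

labels = median_split([1, 2, 3, 4, 5, 6, 7])
```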

3. Modeling Approaches: Architectures, Features, and Modalities

3.1 Text-Based Models

Early PRC models used shallow classifiers over lexical–psycholinguistic features (LIWC, BoW), but modern systems overwhelmingly rely on neural architectures:

  • Contextual Embeddings + Attention: Fine-tuned BERT/RoBERTa encoders, optionally with token- or utterance-level additive attention, provide strong text-only baselines. RoBERTa improves the state of the art by 2.49% on average for monologue settings, and exhibits robust performance for dialogue (Jiang et al., 2019, Chen et al., 2022).
  • Hierarchical/Utterance Structures: Hierarchical CNN/LSTM or Hierarchical Attention Networks (HANs) explicitly model utterance and speaker order; transformers with hierarchical attention are advocated for modeling “who said what when” (Jiang et al., 2019).
  • Sequential and Dyadic Contexts: Augmented GRU architectures inject both speaker and interlocutor embeddings at each gate, with unsupervised learning of Personal Conversational Embeddings (PCEs) for downstream trait inference. This dyadic context modeling yields +4.3% overall accuracy over prior methods (Liu, 2020).
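The utterance-level additive attention mentioned above can be sketched over precomputed utterance embeddings; shapes and random initialization here are illustrative, not taken from any cited paper.

```python
import numpy as np

def additive_attention(utt_embs, w, v):
    """Utterance-level additive attention over a dialogue.

    utt_embs: (n_utts, d) utterance embeddings; w: (d, h) and v: (h,)
    are learned projections. Returns the attended context vector and
    the per-utterance attention weights.
    """
    scores = np.tanh(utt_embs @ w) @ v          # (n_utts,)
    weights = np.exp(scores - scores.max())     # numerically stable softmax
    weights = weights / weights.sum()
    return weights @ utt_embs, weights

rng = np.random.default_rng(0)
embs = rng.normal(size=(3, 4))                  # 3 utterances, d = 4
ctx, weights = additive_attention(embs, rng.normal(size=(4, 4)), rng.normal(size=4))
```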

3.2 Conversational Structure and Graph Neural Networks

  • Overlap Dynamics: Behavioral features such as overlap counts (interruptive, non-interruptive) significantly correlate with Extraversion and, to a lesser degree, Agreeableness, as shown by ANOVA and improved classifier F₁ over baseline (Yu et al., 2019).
  • Heterogeneous Conversational Graph Neural Networks (HC-GNN): Distinguish intra- from inter-speaker relations, employing separate graph attention and convolution per edge type, then fusing via self-attention. Monologue accuracy increased from 57.4% to 61.2%, and dialogue-level accuracy from 57.1% to 60.2% with synthetic data (Fu et al., 2024).
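The intra-/inter-speaker edge distinction underlying HC-GNN can be illustrated with a small helper that splits dialogue edges by speaker identity; the windowed connectivity is an assumption for illustration, not the paper’s exact graph construction.

```python
def speaker_edge_sets(speakers, window=2):
    """Split dialogue edges into intra- vs. inter-speaker relations.

    Each utterance i is connected to the `window` preceding utterances;
    an edge (j, i) is intra-speaker when both utterances share a speaker,
    inter-speaker otherwise. Each edge type would then get its own graph
    attention/convolution before fusion.
    """
    intra, inter = [], []
    for i, si in enumerate(speakers):
        for j in range(max(0, i - window), i):
            (intra if speakers[j] == si else inter).append((j, i))
    return intra, inter

intra, inter = speaker_edge_sets(["A", "B", "A"])
```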

3.3 Multimodal and Behavioral Integration

  • Audio-Visual Fusion: Person-specific CNNs, discovered via neural architecture search, predict the speaker’s facial reactions from the partner’s audio-visual cues; the resulting architecture graphs encode cognitive individuality and support accurate personality prediction (ACC 0.92; PCC 0.35) (Song et al., 2021).
  • Speech/Paralinguistic Features: Handcrafted acoustic features (eGeMAPS) and non-verbal behaviors (head nods, turn-taking) outperform speaker embeddings; loudness, spectral flux, and pause statistics robustly correlate with Extraversion, Agreeableness, and Conscientiousness, but under stress, only Neuroticism correlates (Zhang et al., 25 Jul 2025).
  • Prompt-Based LLMs: PersonaTAB fuses turn-taking, laughter rate, backchannel types, and automated emotion/sentiment tags via prompt-based GPT-4o inferencing, providing higher correlation to human labels than text-focused systems (Inoue et al., 20 May 2025).
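Several of the findings above rest on correlating handcrafted acoustic features with trait scores across speakers, for which a plain Pearson correlation is the workhorse. A dependency-free sketch, with illustrative feature values:

```python
def pearson_r(x, y):
    """Pearson correlation between an acoustic feature (e.g., mean
    loudness) and a trait score across speakers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Illustrative values: mean loudness per speaker vs. Extraversion score.
loudness = [0.2, 0.5, 0.9, 0.4, 0.7]
extraversion = [0.1, 0.6, 0.95, 0.3, 0.8]
r = pearson_r(loudness, extraversion)
```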

3.4 Early-Stage and Explainable Recognition

  • Unsupervised and Early-Turn Prediction: Unsupervised clustering of code-mixed dialogue encodings assigns dominant traits, validated against expert annotation (κ = 0.78, per-speaker accuracy >50%) (Kumar et al., 2024). Affective-NLI hybridizes text+emotion templates with NLI over trait descriptions, yielding up to 6% higher accuracy and substantial gains in few-turn inference (22%–34% over baselines at 25% dialogue length) (Wen et al., 2024).
  • Explainable Reasoning via CoPE: Chain-of-Personality-Evidence (CoPE) structures the task as c → s → t (contexts-to-states-to-traits), with explicit chain-of-thought support at both levels. PersonalityEvd benchmarks LLMs for evidence extraction and trait inference, reporting state-level (EPR-S) accuracy averaging 66.45% and trait-level (EPR-T) accuracy of 76.59%–77.78% for top finetuned models (Sun et al., 2024).
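The Affective-NLI reformulation can be sketched as pair construction: the dialogue plus automatic emotion tags form the premise, and each trait description is a hypothesis whose entailment probability becomes the trait score. The template strings below are hypothetical, not the paper’s exact wording.

```python
# Illustrative hypothesis templates; the paper's trait descriptions differ.
TRAIT_HYPOTHESES = {
    "EXT": "This speaker is outgoing and talkative.",
    "NEU": "This speaker is anxious and easily upset.",
}

def build_nli_pairs(dialogue_text, emotion_tags):
    """Reframe trait prediction as NLI: the dialogue plus automatic emotion
    tags form the premise; each trait description is a hypothesis. A trained
    NLI model's entailment probability would then serve as the trait score.
    """
    premise = f"{dialogue_text} [Emotions: {', '.join(emotion_tags)}]"
    return [(premise, hyp, trait) for trait, hyp in TRAIT_HYPOTHESES.items()]

pairs = build_nli_pairs("A: I love parties! B: Me too!", ["joy", "joy"])
```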

4. Evaluation Protocols and Key Empirical Results

Table: Select PRC Benchmarks (Binary Trait Prediction)

| Corpus | Model | Acc / Macro F1 | Notable Findings |
|---|---|---|---|
| FriendsPersona | RoBERTa | 59.7% / -- | Outperforms baselines by +2.49% avg.; S-only input best (Jiang et al., 2019) |
| CPED | BERT_ssenetc | 67.25% / 74.08 | Extraversion and Agreeableness easiest; context helps Neuroticism (Chen et al., 2022) |
| Teams corpus | Naive Bayes | F₁ = 0.56* | Overlap features boost Extraversion F₁ by 14% rel. over baseline (Yu et al., 2019) |
| RealPersonaChat | HC-GNN (+aug) | 60.2% / -- | Relation-specific graphs outperform GCN by 2–3%; data augmentation raises acc. by 4% (Fu et al., 2024) |
| PersonaTAB | GPT-4o prompt | Corr. 0.18 | Stronger alignment to human ratings than BERT or MiniLM (Inoue et al., 20 May 2025) |
| PersonalityEvd | Qwen-32k (EPR-T) | 76.59% | CoT evidence improves accuracy vs. direct finetuning; evidence F1 up to 77.09 (Sun et al., 2024) |

Qualitative findings include trait-dependent error sources: Conscientiousness is hardest in multiparty dialogue due to weak conversational markers; Openness and Neuroticism show more annotation “uncertain” states and lower model performance (Jiang et al., 2019, Sun et al., 2024). For English, text-based models dominate; for Chinese and Hindi conversational corpora, multilingual models and culture-adapted pipelines are under active investigation (Chen et al., 2022, Khan et al., 2020).
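Since benchmarks above report macro F1 alongside accuracy, it is worth noting what the macro average does: per-class F1 is computed first and then averaged unweighted, so the rarer trait label counts as much as the frequent one. A dependency-free sketch:

```python
def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Macro-averaged F1 for binary trait labels: per-class F1, then a
    plain mean, so class imbalance does not hide minority-class errors."""
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```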

5. Challenges and Open Research Directions

Key challenges identified across the literature:

  • Context fragmentation and speaker tracking: Interleaved dialogue complicates long-range dependence modeling; robust identification of speaker turns and coreference is critical (Jiang et al., 2019, Fu et al., 2024).
  • Data scarcity and speaker diversity: Small speaker populations foster overfitting; augmentation strategies (interpolation, chunk fusion) and transfer learning are essential for robustness (Fu et al., 2024).
  • Privacy and self-report reliability: Most corpora are based on scripted or public data, limiting generalizability and clinical applicability (Wen et al., 2024).
  • Trait-dependence of conversational cues: Acoustic and non-verbal signals correlate strongly with some traits (e.g., Extraversion) and are context-dependent, especially for Neuroticism under stress (Zhang et al., 25 Jul 2025).
  • Multimodal fusion: Text-only pipelines saturate; future work will emphasize integration of prosody, facial affect, and contextually synchronized modalities (Khan et al., 2020, Chen et al., 2022, Song et al., 2021).
  • Explainability and evidence: Movement toward chain-of-thought and evidence-supported PRC is ongoing; LLMs can generate natural language justifications and map local dialogue evidence to BFI-2 itemizations (Sun et al., 2024).
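Of the augmentation strategies mentioned above, chunk fusion can be sketched as follows; the fixed-size windowing and shuffling scheme is an assumption for illustration, not the exact procedure of Fu et al.

```python
import random

def chunk_fusion(dlg_a, dlg_b, chunk=3, seed=0):
    """Hypothetical chunk-fusion augmentation: split two dialogues whose
    target speakers share a trait label into fixed-size utterance chunks,
    then shuffle the chunks into one synthetic training dialogue."""
    rng = random.Random(seed)
    chunks = [dlg_a[i:i + chunk] for i in range(0, len(dlg_a), chunk)]
    chunks += [dlg_b[i:i + chunk] for i in range(0, len(dlg_b), chunk)]
    rng.shuffle(chunks)
    return [u for c in chunks for u in c]

synthetic = chunk_fusion(list("abcdef"), list("uvwxyz"))
```

Because chunks preserve local utterance order, the synthetic dialogue keeps short-range conversational coherence while varying global context.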

6. Interpretability, Applications, and Impact

Recent systems are trending toward interpretable PRC via natural language inference over trait descriptions (Affective-NLI), chain-of-thought evidence for every trait/state (CoPE/PersonalityEvd), and dialogue-level explainability via LLM-generated summaries. These advances are especially relevant in HCI, personalized robotics, and mental health conversational agents, where transparent reasoning about user traits is imperative (Wen et al., 2024, Sun et al., 2024).

Moreover, PRC is foundational for adaptive response generation, both in monolingual and code-mixed dialogue, as shown by personality-conditioned sequence models (RoBERTa-based pseudo-labeling, Personality-Aware Axial Attention) which boost downstream generation metrics (e.g., ROUGE/BLEU) (Kumar et al., 2024).

7. Future Directions

Research in PRC foregrounds several priorities:

  • Expansion of multimodal datasets, especially in underrepresented languages and domains (Khan et al., 2020).
  • Development of unsupervised and semi-supervised evidence extraction, removing bias from partner labels and scripted sources (Yu et al., 2019, Kumar et al., 2024).
  • Hierarchical, context-sensitive, and trait-aware architectures integrating turn-level, dialogue-level, and cross-modal signal fusion (Jiang et al., 2019, Fu et al., 2024, Song et al., 2021).
  • Explainable PRC with unified state-to-trait inference, high-quality intermediate reasoning, and symbolic taxonomy grounding (Sun et al., 2024).
  • Situational and context-aware modeling: explicit treatment of changing trait expression under stress and sociocultural context, per empirical evidence for situational dependence of perceived personality (Zhang et al., 25 Jul 2025).

Overall, PRC research demonstrates rapid methodological innovation, diversification of evaluation protocols and modalities, and alignment with core directions in interpretable and adaptive conversational AI.
