
PU Metrics in Temporal Emotion Dynamics

Updated 11 November 2025
  • Positive-Unlabeled (PU) metrics are quantitative tools that define and measure temporal dynamics of emotional states through occurrence, duration, and transition probabilities.
  • They are applied across EEG, dialogue, and music arrangement systems to enhance the analysis and modeling of shifting affective and behavioral signals.
  • Advanced implementations integrate EST features into learning architectures via graph models and dynamic attention, improving system adaptability and emotional coherence.

Positive-Unlabeled (PU) metrics—often referred to more generally as emotion state transition (EST) features or emotion-shift metrics—characterize the dynamics of emotional states as they evolve over time or across conversational turns, neural microstates, or other behavioral signals. Unlike classical static emotion classification metrics, EST metrics quantify not only the prevalence but also the transition patterns, durations, and contextual dependencies of emotional states. In recent years, the formalization and application of PU/EST metrics have enabled advanced modeling of temporal affect dynamics in domains as diverse as EEG-based emotion recognition, music generation, dialogue systems, and conversational emotion recognition. EST metrics enable the systematic study of both within-state stationarity and between-state lability, supporting both fine-grained empirical analysis and the development of systems optimized for emotional coherence and adaptability.

1. Mathematical Formulations of EST Metrics

Across domains, EST metrics formalize and quantify the micro- and macro-dynamics of emotional states as time series or stochastic processes over discrete or continuous emotion representations. Fundamental EST metrics and their analytic expressions include:

  • State Occurrence: Given a sequence of state labels $\{s_t\}$, the occurrence of state $i$ is

$$\mathrm{Occ}_i = \frac{N_i}{T}$$

where $N_i$ is the number of distinct visits to state $i$ and $T$ is the total number of segments or time windows analyzed.

  • Mean Duration: For state $i$,

$$\mathrm{Dur}_i = \frac{1}{N_i} \sum_{k=1}^{N_i} d_{i,k}$$

where $d_{i,k}$ is the length of the $k$-th visit.

  • Coverage:

$$\mathrm{Cov}_i = \frac{\sum_{k=1}^{N_i} d_{i,k}}{D_{\mathrm{total}}}$$

with $D_{\mathrm{total}}$ the total time analyzed.

  • Transition Probabilities: The empirical transition probability from state $i$ to state $j$ is

$$P_{i \to j} = \frac{N_{i\to j}}{\sum_{\ell} N_{i\to \ell}}$$

with $N_{i\to j}$ the count of direct transitions from $i$ to $j$.
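Since all four quantities derive from the same run-length encoding of a label sequence, they can be computed together. Below is a minimal Python sketch (NumPy only); function and variable names are illustrative, not taken from any of the cited systems.

```python
import numpy as np

def est_metrics(labels: np.ndarray, n_states: int):
    """Occurrence, mean duration, coverage, and transition matrix
    for a 1-D sequence of integer state labels (illustrative sketch)."""
    T = len(labels)
    # Collapse the sequence into runs of (state, run_length).
    change = np.flatnonzero(np.diff(labels)) + 1
    starts = np.concatenate(([0], change))
    ends = np.concatenate((change, [T]))
    runs = [(labels[s], e - s) for s, e in zip(starts, ends)]

    occ = np.zeros(n_states)                 # N_i / T
    dur = np.zeros(n_states)                 # mean visit length per state
    cov = np.zeros(n_states)                 # fraction of time in state
    trans = np.zeros((n_states, n_states))   # N_{i->j}

    for state, length in runs:
        occ[state] += 1
        dur[state] += length
        cov[state] += length
    dur = np.divide(dur, occ, out=np.zeros_like(dur), where=occ > 0)
    cov /= T
    occ /= T

    for (s1, _), (s2, _) in zip(runs, runs[1:]):
        trans[s1, s2] += 1
    row = trans.sum(axis=1, keepdims=True)
    trans = np.divide(trans, row, out=np.zeros_like(trans), where=row > 0)
    return occ, dur, cov, trans
```

For example, `est_metrics(np.array([0, 0, 1, 1, 1, 0]), n_states=2)` yields two visits to state 0 (mean duration 1.5), one visit to state 1 (duration 3), coverage 0.5 each, and transition probabilities $P_{0\to 1} = P_{1\to 0} = 1$.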

In vector-valued or continuous representations (e.g., VAD space), EST metrics are derived from differences

$$\Delta e = e_{\mathrm{response}} - e_{\mathrm{preceding}}$$

and modulated by external factors (e.g., personality weights) for more individualized modeling.
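As a sketch of the continuous case, assuming the personality modulation acts as an elementwise weight on the VAD variation (the exact modulation form varies by model):

```python
import numpy as np

def next_emotion(e_prev, delta_e, w):
    """VAD update: e_t = e_{t-1} + w * delta_e, with w a per-user
    personality weight vector (elementwise modulation is an assumption)."""
    return np.asarray(e_prev) + np.asarray(w) * np.asarray(delta_e)

# e.g., a personality profile that damps arousal swings:
e_t = next_emotion(e_prev=[0.2, 0.1, 0.5],
                   delta_e=[0.3, 0.6, -0.1],
                   w=[1.0, 0.4, 1.0])
```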

Hybrid metrics incorporate music-theoretic features (harmonic color, contour factor), spatiotemporal EEG attentions, and cross-modal context shifts, integrating EST signatures deeply into the data representation and model architectures.

2. Extraction Methodologies Across Modalities

EEG Microstate Analysis

EEG EST metrics are extracted via microstate segmentation: continuous EEG is partitioned into segments characterized by stable topographies (microstates), typically via GFP-peak-driven k-means clustering. Emotional and neutral epochs are compared on occurrence, duration, coverage, and transition probabilities of identified microstates (e.g., MS1–MS4). Temporal smoothing ensures physiological plausibility (segment duration threshold: 30 ms). Transition matrices are built empirically from assigned microstate label sequences.
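A condensed sketch of that extraction pipeline, assuming a preprocessed EEG array of shape (n_channels, n_samples); the absolute spatial correlation approximates the polarity-invariant template matching standard in microstate analysis, and all parameter choices here are illustrative:

```python
import numpy as np
from scipy.signal import find_peaks
from sklearn.cluster import KMeans

def microstate_labels(eeg, sfreq, n_states=4, min_ms=30):
    """GFP-peak-driven k-means microstate segmentation (illustrative)."""
    gfp = eeg.std(axis=0)                 # global field power per sample
    peaks, _ = find_peaks(gfp)            # candidate stable topographies
    maps = eeg[:, peaks].T                # (n_peaks, n_channels)
    km = KMeans(n_clusters=n_states, n_init=10).fit(maps)

    # Assign every sample to the template with the highest absolute
    # spatial correlation (absolute value ignores topography polarity).
    X = eeg - eeg.mean(axis=0)            # average-reference each sample
    C = km.cluster_centers_ - km.cluster_centers_.mean(axis=1, keepdims=True)
    corr = np.abs(C @ X) / (np.linalg.norm(C, axis=1)[:, None]
                            * np.linalg.norm(X, axis=0)[None, :])
    labels = corr.argmax(axis=0)

    # Temporal smoothing: merge runs shorter than min_ms into the
    # preceding state, enforcing physiological plausibility.
    min_len = int(sfreq * min_ms / 1000)
    change = np.flatnonzero(np.diff(labels)) + 1
    starts = np.concatenate(([0], change))
    ends = np.concatenate((change, [len(labels)]))
    for s, e in zip(starts, ends):
        if e - s < min_len and s > 0:
            labels[s:e] = labels[s - 1]
    return labels   # feed into est_metrics() above, per condition
```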

In dynamic-attention EEG architectures, EST features take the form of attention-weighted spatiotemporal latent components $\tilde X^{\mathrm{latent}}_{k,t} = \alpha_{k,t} X^{\mathrm{latent}}_{k,t}$, where time-varying $\alpha_{k,t}$ values encode the activation of latent neural processes; the EST trajectory is captured directly as the $K \times T$ matrix of these attended activations.
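In code this is just an elementwise product; a toy sketch with assumed shapes (not the DAEST implementation):

```python
import numpy as np

K, T = 8, 200                       # latent components x time windows (assumed)
X_latent = np.random.randn(K, T)    # spatiotemporal latent activations
alpha = np.random.rand(K, T)
alpha /= alpha.sum(axis=0, keepdims=True)  # normalize attention over components
est_trajectory = alpha * X_latent   # K x T matrix of attended activations
```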

Dialogue and Text-based Interaction

In conversational modeling, EST features are extracted as transition information between emotion embeddings or predicted emotion labels of consecutive turns:

  • In VAD-space-based models, context encoding via pretrained LLMs yields an aggregated representation, which is mapped to a VAD variation $\Delta e$ and modulated by personality parameters before being summed with the previous emotion vector.
  • In multi-turn dialogue modeling, EST metrics are explicitly computed as transition vectors between emotion predictions or semantic/keyword features across utterances. These transitions are formalized (e.g., via squared differences and Hadamard products) and passed through learned projections and non-linearities for integration into the system, as sketched below.
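A minimal PyTorch sketch of such a transition feature between consecutive utterance embeddings; the hidden sizes, the concatenation order, and the ReLU non-linearity are assumptions rather than details of any specific model:

```python
import torch
import torch.nn as nn

class TransitionFeature(nn.Module):
    """EST feature between consecutive utterance embeddings (illustrative)."""
    def __init__(self, dim: int, out_dim: int):
        super().__init__()
        # Learned projection + non-linearity over the combined transition signal.
        self.proj = nn.Sequential(nn.Linear(2 * dim, out_dim), nn.ReLU())

    def forward(self, h_prev: torch.Tensor, h_curr: torch.Tensor) -> torch.Tensor:
        sq_diff = (h_curr - h_prev) ** 2   # squared differences
        hadamard = h_curr * h_prev         # Hadamard product
        return self.proj(torch.cat([sq_diff, hadamard], dim=-1))

# Usage: embeddings of two consecutive utterances (batch of 1, 768-d).
feat = TransitionFeature(dim=768, out_dim=128)
shift_vec = feat(torch.randn(1, 768), torch.randn(1, 768))
```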

In graph-based state propagation (e.g., TransESC), transition and interaction relations are represented as edge-typed updates among semantics, strategy, and emotion states, tracked across sliding windows of dialogue history with dedicated multi-head attention for within-type (same emotion/strategy/semantics) and cross-type (emotion–strategy, etc.) influences.

Music Arrangement Systems

Systems such as REMAST compute EST features for musical bars by fusing recognized prior emotion (in V–A space) and emergent target emotion either via simple arithmetic (V–A midpoint) or via learned concatenation/projection of both V–A and rich music-theoretic features (embeddings of previous bar). Multiple metrics—harmonic color, contour factor, form factor, rhythm pattern—are computed at each segment to reinforce emotional salience and are pivotal for both recognition and generation tasks.
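The simple-arithmetic variant of this fusion reduces to a valence-arousal midpoint, as in the toy sketch below; the learned variant would instead concatenate both V-A vectors with the music-theoretic embeddings and pass them through a trained projection.

```python
import numpy as np

def fuse_emotion(prior_va, target_va):
    """Simple-arithmetic fusion: midpoint of recognized prior and
    emergent target emotion in valence-arousal space."""
    return (np.asarray(prior_va, float) + np.asarray(target_va, float)) / 2.0

bar_condition = fuse_emotion([0.6, -0.2], [0.1, 0.5])  # -> array([0.35, 0.15])
```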

3. Integration of EST Metrics Into Learning Architectures

In both discriminative and generative contexts, EST metrics are leveraged to improve system adaptability and coherence.

Sequence Models and GRUs

In dialogue emotion recognition, EST metrics (emotion-shift probabilities) directly parameterize the forget/retain gates of global emotion GRUs, replacing learned gates with data-driven inertia/shift probabilities. For example, a learned scalar $p_{\text{shift}}$ derived by a Siamese subnetwork replaces both reset and update gates:

$$e_t = (1 - p_{\text{shift}}) \odot e_{t-1} + p_{\text{shift}} \odot \widetilde{e}_t$$

enabling adaptive memory of prior emotional context only when no shift is detected.
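A sketch of this shift-gated update in PyTorch; the Siamese scorer here is a stand-in (a single linear head over the absolute difference of the two turn encodings), since the description above only specifies that $p_{\text{shift}}$ comes from a Siamese subnetwork:

```python
import torch
import torch.nn as nn

class ShiftGatedCell(nn.Module):
    """Global emotion-state update gated by an emotion-shift probability."""
    def __init__(self, dim: int):
        super().__init__()
        self.candidate = nn.Linear(2 * dim, dim)   # produces the candidate e~_t
        self.shift_head = nn.Linear(dim, 1)        # stand-in Siamese scorer

    def forward(self, e_prev, u_prev, u_curr):
        # p_shift in (0, 1): compares shared-encoder views of both turns.
        p_shift = torch.sigmoid(self.shift_head(torch.abs(u_curr - u_prev)))
        e_tilde = torch.tanh(self.candidate(torch.cat([e_prev, u_curr], dim=-1)))
        # e_t = (1 - p_shift) * e_{t-1} + p_shift * e~_t
        return (1.0 - p_shift) * e_prev + p_shift * e_tilde

cell = ShiftGatedCell(dim=256)
e_t = cell(torch.zeros(1, 256), torch.randn(1, 256), torch.randn(1, 256))
```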

Graph and Attention Mechanisms

Graph-based EST frameworks propagate both transition and interaction signals of emotion, semantics, and response strategy over edges within one or more dialogue windows. Gated fusion of transit and interact outputs enables nuanced generation conditioned on smooth turn-level state transitions.

Dynamic attention schemes over EEG spatiotemporal components in DAEST systems yield a temporally evolving mixture of neural subprocesses, with learned weights $\alpha_{k,t}$ marking the onset/offset of specific emotional substates. This supports both cross-subject invariance (via contrastive objectives) and interpretable decoding.

Conditioning and Feature Fusion

Generative systems—e.g., emotion-aligned music arrangement—use EST features as fused conditioning to ensure soft, continuous emotional evolution of outputs. In dialogue generation, emotion and strategy transitions are injected at multiple points (inputs, cross-attention, hidden state fusion), operationalizing EST metrics as controlling variables for content and affective flow.

4. Empirical Comparisons and Quantitative Impact

EST metrics consistently drive quantitative improvements in both recognition and generation tasks:

  • In EEG classification, microstate-based EST features (occurrence, coverage, transition) differentiate emotional from neutral epochs with Bonferroni-Holm-corrected significance ($p < 0.001$ to $p < 0.05$). Dynamic-attention ESTs yield state-of-the-art cross-subject accuracy: 75.4% (FACED binary), 88.1% (SEED 3-way), and substantial gains on fine-grained, multi-class datasets.
  • In music arrangement, fusion-based EST conditioning via REMAST achieves best-in-class overall coherence (6.21), similarity (7.60), and real-time fit (2.02), with statistically significant subjective gains ($p < 0.03$) in transition smoothness and emotional fidelity.
  • In conversational recognition, integrating emotion-shift (EST) features with multimodal ERC models yields improvements of up to +3 F1 points overall, with particularly strong gains (up to +20 points) on utterances featuring actual emotional polarity switches.
  • In dialogue generation, turn-level EST modeling (via graphs or explicit feature differences) leads to superior perplexity, diversity, empathy, and relevance scores compared to state-of-the-art baselines. For example, Emp-RFT achieves the best top-1 next-emotion prediction accuracy (42.08%) and competitive perplexity (13.59, against baselines ranging from 13.21 to 16.96).

Empirical ablations confirm that removal of explicit EST recognition degrades performance on key metrics, especially for multi-turn or context-dependent tasks.

5. Domain-Specific Implementation Considerations

The practical use of EST metrics necessitates careful alignment of methodology to modality:

  • EEG: Reliable microstate segmentation requires band-pass filtering, ICA-based artifact removal, and robust temporal smoothing. Online implementations must replace manual or offline steps (e.g., artifact detection) with real-time compatible surrogates. Choice of microstate templates and window length critically influence reliability, with trade-offs between resolving transient states and maintaining stability.
  • Dialogue/Text: EST extraction depends on high-precision emotion and semantic annotation; models may rely on external detectors whose errors can propagate. Window length for feature transitions (number of previous turns compared) affects the ability to track long-range mood evolution. In personality-affected EST workflows, mapping Big-Five traits to VAD weights via established regression models is operational, but requires reliable personality annotation.
  • Music: Downsampling granularity (e.g., quarter-note) balances real-time fit against signal stability and interpretability of transitions. Fusion strategy (simple arithmetic vs. learned feature concat) directly influences the alignment between emotional target curves and musical coherence.

In all domains, real-time or streaming instantiations of EST estimation must minimize computational cost (e.g., nearest-prototype search, linear fusion) and handle annotation noise.

6. Statistical Testing and Interpretation

Rigorous validation of EST metric differences employs permutation tests, Bonferroni-Holm correction for multiple comparisons, and bootstrap resampling where applicable. For source localization in EEG, voxel-wise inference with 5,000 bootstrap iterations is used. In dialogue and music applications, human subject evaluations complement automatic metrics, with significance assessed via appropriate $p$-value corrections.
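For concreteness, a minimal sketch of this testing recipe: a two-sample permutation test on one EST metric, followed by Holm's step-down correction across metrics. Only the procedure's skeleton is shown; effect measures and iteration counts should follow the study design.

```python
import numpy as np

def perm_test(x, y, n_perm=5000, seed=0):
    """Two-sample permutation test on the difference of means."""
    rng = np.random.default_rng(seed)
    obs = x.mean() - y.mean()
    pooled = np.concatenate([x, y])
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        count += abs(pooled[:len(x)].mean() - pooled[len(x):].mean()) >= abs(obs)
    return (count + 1) / (n_perm + 1)

def holm(pvals, alpha=0.05):
    """Bonferroni-Holm step-down correction: returns rejection decisions."""
    pvals = np.asarray(pvals)
    order = np.argsort(pvals)              # test smallest p-values first
    reject = np.zeros(len(pvals), dtype=bool)
    for rank, idx in enumerate(order):
        if pvals[idx] <= alpha / (len(pvals) - rank):
            reject[idx] = True
        else:
            break                           # all larger p-values also fail
    return reject

# e.g., p-values for occurrence, coverage, and a transition probability:
decisions = holm([0.004, 0.030, 0.200])
```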

EST metrics are not in themselves classification or prediction targets, but their distributions across emotion classes or contexts provide critical evidence for the dynamic organization of affect—in particular, supporting constructionist views in affective neuroscience (as seen in the microstate data pointing to asymmetric bottom-up/top-down contributions) and enabling the operationalization of emotional smoothness in synthetic systems.

7. Limitations and Future Research Directions

Several limitations emerge from current EST metric methodologies:

  • Modality-Specificity: Template-based segmentation, feature fusion, and annotation mapping are highly specific to the chosen domain. Subject- and context-dependence necessitate calibration or adaptation for robust real-time use.
  • Temporal Resolution Constraints: Estimation windows and smoothing parameters constrain the granularity at which transitions can be detected, potentially missing rapid or long-term shifts depending on configuration.
  • Error Propagation from Upstream Annotation: In dialogue and multi-modal systems, reliance on external supervised detectors for annotation introduces noise which is not explicitly accounted for in transition modeling.
  • Empirical Coverage: Some systems (e.g., MSTN-based tourist navigation) report per-transition costs and intuitionistic results but lack direct quantitative accuracy evaluation of transition prediction itself.

Current and future research focuses on: end-to-end EST learning (removing dependence on pre-annotators); dynamic, context-driven window sizing; richer fusion of commonsense, pragmatic, and affective signals; and domain adaptation or meta-learning frameworks for improved cross-subject or cross-corpus EST transferability.


Collectively, formal EST metrics and PU-derived methodologies have become central to the rigorous, quantitative study and application of emotion dynamics across neurophysiological, behavioral, and computational systems, forming a foundation for temporally sensitive, context-aware, and human-aligned affective technologies.
