Valence-Arousal Space in Affective Modeling
- Valence–arousal space is a two-dimensional framework that maps emotions using continuous measures of positivity (valence) and intensity (arousal).
- It employs human annotation, lexicon-based mapping, and algorithmic extraction to transform subjective ratings into quantitative coordinates.
- The framework supports multimodal affective computing, with prediction quality typically assessed via metrics such as CCC and RMSE across diverse applications.
The valence–arousal (VA) space is a foundational two-dimensional coordinate system for continuous emotion representation, widely adopted across affective science, computational modeling, and machine learning. Valence quantifies the hedonic axis—ranging from negative/unpleasant to positive/pleasant affect—while arousal indexes the intensity or activation level of the affective state, usually spanning from calm to excited. This circumplex framework underpins both theoretical models of emotion and practical affect recognition systems, allowing for fine-grained, language-independent, and modality-agnostic mapping of affective phenomena in humans and non-human agents.
1. Mathematical Formalization and Ranges
Valence–arousal space is canonically formalized as a Cartesian or polar plane. Let $(v, a)$ denote the continuous coordinates:
- Valence $v$: typically $v \in [-1, 1]$ (negative to positive), but $[0, 1]$ or ordinal 1–9 scales can appear depending on domain or dataset normalization (Kollias et al., 2018, Wagner et al., 23 Apr 2024, Zhou et al., 2023, Wrobel, 16 Nov 2025).
- Arousal $a$: typically $a \in [-1, 1]$ (calm to excited) (Wagner et al., 23 Apr 2024, Kollias et al., 2018), $[0, 1]$ for pet vocalizations (Huang et al., 9 Oct 2025), or a 1–10 scale for subjective ratings (Wrobel, 16 Nov 2025).
In Russell’s circumplex model, all affective states are points $(v, a)$ in this plane; the origin $(0, 0)$ is neutral (neither positive nor negative, neither energetic nor passive) (Nath et al., 2020, Wagner et al., 23 Apr 2024). The polar decomposition uses
$$v = r\cos\theta, \qquad a = r\sin\theta,$$
with $r = \sqrt{v^2 + a^2}$ representing intensity and $\theta$ an affective angle.
Distance metrics:
- Euclidean: $d_E = \sqrt{(v_1 - v_2)^2 + (a_1 - a_2)^2}$
- Manhattan: $d_M = |v_1 - v_2| + |a_1 - a_2|$ (Nath et al., 2020, Won et al., 2021, Zhao et al., 2020)
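To make these quantities concrete, the following is a minimal NumPy sketch of the polar decomposition and the two distance metrics above (illustrative only; the helper names are not taken from any cited implementation).

```python
import numpy as np

def polar_decomposition(v, a):
    """Decompose a (valence, arousal) point into intensity r and affective angle theta."""
    r = np.hypot(v, a)         # intensity: distance from the neutral origin
    theta = np.arctan2(a, v)   # affective angle in radians from the +valence axis
    return r, theta

def euclidean_distance(p, q):
    """Euclidean distance between two VA points p = (v1, a1) and q = (v2, a2)."""
    return float(np.linalg.norm(np.subtract(p, q)))

def manhattan_distance(p, q):
    """Manhattan (L1) distance between two VA points."""
    return float(np.sum(np.abs(np.subtract(p, q))))

# Example: "excited" vs. "calm" in a [-1, 1] x [-1, 1] VA plane
excited, calm = (0.6, 0.8), (0.4, -0.7)
print(polar_decomposition(*excited))       # (1.0, ~0.93 rad) -> high intensity
print(euclidean_distance(excited, calm))   # ~1.51
print(manhattan_distance(excited, calm))   # 1.7
```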
2. Methodologies for VA Annotation and Mapping
Several methodological paradigms have been established for assigning or learning VA coordinates:
2.1 Human Annotation
Human raters provide continuous valence and arousal assessments, often on visually anchored scales such as the Self-Assessment Manikin (SAM) (1–9 or 1–10) (Mendes et al., 2023, Wrobel, 16 Nov 2025). These anchors are explicitly defined:
- For valence: 1 = “very unpleasant”, 10 = “very pleasant”
- For arousal: 1 = “very calm”, 10 = “very excited” (Wrobel, 16 Nov 2025)
Standard practice applies linear rescaling, e.g., to fit $[0, 1]$ or $[-1, 1]$ targets (Mendes et al., 2023, Wagner et al., 23 Apr 2024).
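A minimal sketch of such rescaling, assuming SAM-style 1–10 ratings as input (the function name and defaults are illustrative):

```python
def rescale(x, src=(1.0, 10.0), dst=(-1.0, 1.0)):
    """Linearly map a rating x from the source scale (e.g., SAM 1-10) to a target range."""
    lo, hi = src
    a, b = dst
    return a + (x - lo) * (b - a) / (hi - lo)

print(rescale(5.5))                           # 0.0 -> scale midpoint maps to neutral
print(rescale(9.0, src=(1, 9), dst=(0, 1)))   # 1.0 -> "very pleasant" on a 1-9 SAM
```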
2.2 Data-Driven or Proxy Mapping
- Lexicon-based mapping: Discrete emotion labels are mapped to $(v, a)$ coordinates using published resources such as the NRC VAD Lexicon (Won et al., 2021), or via empirical means and standard deviations computed from reference corpora (Nath et al., 2020); see the sketch after this list.
- Proxy/animation methods: Participants create an expressive animation for a discrete label, then rate it themselves on VA axes, aggregating responses to derive coordinates (Wrobel, 16 Nov 2025).
- Anchored dimensionality reduction: Latent speech, text, or image features are projected into 2D with class anchoring, blending high-dimensional similarity preservation with psychological constraints (Zhou et al., 2023, Nath et al., 2020).
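As a sketch of the lexicon-based variant referenced above, the snippet below loads word-level VA entries from a tab-separated file and looks up discrete labels; the file layout and helper names are assumptions for illustration and may not match the published NRC VAD Lexicon distribution exactly.

```python
import csv

def load_vad_lexicon(path):
    """Load a VA(D) lexicon into {word: (valence, arousal)}.

    Assumes tab-separated rows of word, valence, arousal[, dominance];
    adjust parsing to the actual distribution format as needed.
    """
    lexicon = {}
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            try:
                word, v, a = row[0], float(row[1]), float(row[2])
            except (IndexError, ValueError):
                continue  # skip header or malformed rows
            lexicon[word] = (v, a)
    return lexicon

def label_to_va(label, lexicon, default=(0.5, 0.5)):
    """Map a discrete emotion label (e.g., 'joy') to lexicon VA coordinates.

    The default assumes a [0, 1] scale where 0.5 is neutral.
    """
    return lexicon.get(label.lower(), default)
```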
2.3 Algorithmic Extraction from Signal
For non-human vocalizations, acoustic energy, spectral features, and emotion-specific priors generate VA coordinates via normalization and weighted scoring algorithms (Huang et al., 9 Oct 2025).
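The cited algorithm is not reproduced here; the following is a loose, hypothetical sketch of the general recipe (energy-based arousal, a spectral heuristic blended with an emotion-specific prior for valence), with all feature choices, priors, and weights invented for illustration.

```python
import numpy as np
import librosa

# Hypothetical per-class priors (v, a); the cited work derives its own emotion-specific priors.
PRIORS = {"bark_alert": (-0.3, 0.9), "purr_content": (0.7, 0.2)}

def estimate_va(path, emotion_hint, w_signal=0.6, w_prior=0.4):
    """Crude VA estimate: arousal from normalized RMS energy, valence from a
    spectral-centroid heuristic blended with an emotion-specific prior."""
    y, sr = librosa.load(path, sr=None)
    rms = librosa.feature.rms(y=y).mean()
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr).mean()

    arousal_sig = float(np.clip(rms / 0.1, 0.0, 1.0))                  # ad hoc energy normalization
    valence_sig = float(np.clip(1.0 - centroid / (sr / 2), -1.0, 1.0))  # lower centroid -> "softer" sound

    v_prior, a_prior = PRIORS.get(emotion_hint, (0.0, 0.5))
    valence = w_signal * valence_sig + w_prior * v_prior
    arousal = w_signal * arousal_sig + w_prior * a_prior
    return valence, arousal
```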
3. Multi-Modal and Multi-Task Learning in VA Frameworks
Recent VA modeling leverages multimodal and multi-task architectures:
- Joint regression: Simultaneous prediction of and from deep representations, using shared encoders with separate regression heads, typically trained with mean squared error (MSE) or concordance correlation coefficient (CCC)-based losses (Liang et al., 13 Mar 2025, Zhang et al., 2020, Wagner et al., 23 Apr 2024).
- Fusion strategies: Audio-visual fusion via joint cross-attention mechanisms; such models attend to both intra- and inter-modal correlations for robust VA inference, showing improved CCC on both lab-controlled and in-the-wild datasets (Praveen et al., 2022, Zhang et al., 2020).
- Multi-task setups: VA regression is jointly supervised with categorical emotion classification and even auxiliary tasks (e.g., body size, gender, action units) for enhanced feature learning (Huang et al., 9 Oct 2025, Zhang et al., 2020, Wagner et al., 23 Apr 2024).
- Loss architectures: Weighted loss compositions of the form $\mathcal{L} = \lambda_{\mathrm{VA}} \mathcal{L}_{\mathrm{VA}} + \lambda_{\mathrm{cls}} \mathcal{L}_{\mathrm{cls}} + \lambda_{\mathrm{aux}} \mathcal{L}_{\mathrm{aux}}$ enforce primacy of VA prediction while harnessing auxiliary supervision (Huang et al., 9 Oct 2025, Liang et al., 13 Mar 2025, Wagner et al., 23 Apr 2024, Zhang et al., 2020); a minimal sketch follows this list.
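A minimal PyTorch sketch of a CCC-based VA loss and a weighted multi-task composition follows; the weighting scheme, dictionary keys, and auxiliary classification head are illustrative assumptions, since the exact compositions vary across the cited works.

```python
import torch

def ccc_loss(pred, target, eps=1e-8):
    """1 - CCC over a batch; pred and target are 1-D tensors of valence or arousal."""
    pred_mean, target_mean = pred.mean(), target.mean()
    pred_var, target_var = pred.var(unbiased=False), target.var(unbiased=False)
    cov = ((pred - pred_mean) * (target - target_mean)).mean()
    ccc = 2 * cov / (pred_var + target_var + (pred_mean - target_mean) ** 2 + eps)
    return 1.0 - ccc

def multitask_loss(out, batch, lambda_va=1.0, lambda_cls=0.5):
    """Weighted composition: CCC-based VA regression plus auxiliary emotion classification."""
    l_va = ccc_loss(out["valence"], batch["valence"]) + ccc_loss(out["arousal"], batch["arousal"])
    l_cls = torch.nn.functional.cross_entropy(out["logits"], batch["emotion"])
    return lambda_va * l_va + lambda_cls * l_cls
```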
4. Comparative Evaluation, Quantitative Metrics, and Expressiveness
Performance in VA prediction is evaluated via several quantitative criteria:
- Pearson correlation ($r$): Linear agreement between predicted and true $v$, $a$ values (Huang et al., 9 Oct 2025, Mendes et al., 2023, Wagner et al., 23 Apr 2024); see the sketch after this list.
- Concordance Correlation Coefficient (CCC): Measures both accuracy and precision; rewards high correlation and low mean/variance bias (Liang et al., 13 Mar 2025, Praveen et al., 2022, Mendes et al., 2023, Wagner et al., 23 Apr 2024).
- RMSE/MAE: Root mean square and mean absolute error from ground-truth (Mendes et al., 2023, Wagner et al., 23 Apr 2024).
- Downstream/cross-modal retrieval metrics: Macro-Precision@5, Macro-MRR in tasks like image–music retrieval (Zhao et al., 2020, Won et al., 2021).
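For reference, Pearson correlation and RMSE can be computed as below (plain NumPy; CCC is as in the loss sketch above, applied to pooled predictions). Benchmark implementations may differ in detail, e.g., per-sequence versus pooled computation.

```python
import numpy as np

def pearson_r(pred, true):
    """Linear agreement between predicted and ground-truth values."""
    return float(np.corrcoef(pred, true)[0, 1])

def rmse(pred, true):
    """Root mean square error against ground truth."""
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(true)) ** 2)))

pred, true = np.array([0.2, 0.5, 0.9]), np.array([0.1, 0.6, 0.8])
print(pearson_r(pred, true), rmse(pred, true))  # ~0.95 correlation, 0.1 RMSE
```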
Table: Example state-of-the-art scores for continuous VA regression
| Domain | Model / Data | Valence (corr. or CCC) | Arousal (corr. or CCC) | RMSE (valence, arousal) |
|---|---|---|---|---|
| Pet vocalization | Audio Transformer (Huang et al., 9 Oct 2025) | 0.9024 | 0.7155 | 0.1124 |
| Vision (facial) | MaxViT (Wagner et al., 23 Apr 2024) | 0.716 (CCC) | 0.642 (CCC) | 0.331, 0.305 |
| Multilingual text | XLM-RoBERTa (Mendes et al., 2023) | 0.810 | 0.695 | 0.109, 0.120 |
| Multimodal HCI | JCA/ABAW (Praveen et al., 2022) | 0.728 (CCC) | 0.842 (CCC) | – |
Advantages of continuous VA are consistently reported: resolution of boundary ambiguities between discrete categories, greater expressivity, direct human interpretability, and improved domain transfer (e.g., only ~2% MAE degradation across groups versus a 10–15% drop for discrete labels) (Huang et al., 9 Oct 2025, Won et al., 2021, Wagner et al., 23 Apr 2024).
5. Theoretical Grounding: Free Energy, Information Dynamics, and Computational Accounts
Theoretical formalizations integrate VA space into probabilistic and information-theoretic models of affect:
- Free energy decomposition: In active inference and Bayesian thermodynamic accounts, arousal is mapped to the posterior entropy over hidden states (uncertainty), and valence is the difference between current utility and expected utility (risk reduction) (Pattisapu et al., 2 Jul 2024, Yanagisawa et al., 2022). Explicit formulas take the form
$$\text{arousal} \propto H[q(s)], \qquad \text{valence} \propto U(o) - \mathbb{E}_{q(s)}[U],$$
where $U$ is utility, $H$ is entropy, and $q(s)$ the posterior over states; see the sketch after this list.
- Emotional dynamics: Changes in free energy ($\Delta F$) induce valence shifts; successful reduction yields positive valence, increases yield negative valence; arousal is identified with “arousal potential” (complexity/novelty/conflict) (Pattisapu et al., 2 Jul 2024, Yanagisawa et al., 2022). Gaussian Bayesian models formalize how prior mean distance, variance, and prediction error shape VA coordinates into regions of “interest,” “confusion,” and “boredom.”
- Rate–distortion trade-off: Models such as LeVAsa explicitly demonstrate the representation-theoretic tension between densely aligning latent codes to the VA axes (improved alignment, interpretability) and preserving high reconstruction fidelity (rate–distortion principle) (Nath et al., 2020).
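As a toy illustration of these accounts (not the formal models of the cited papers), the sketch below uses Gaussian surprise as a stand-in for free energy, maps its reduction after a belief update to valence, and identifies arousal with posterior entropy; every quantity and number here is hypothetical.

```python
import numpy as np

def gaussian_surprise(x, mu, sigma):
    """Negative log-likelihood of observation x under a Gaussian belief N(mu, sigma^2);
    used here as a crude stand-in for free energy."""
    return 0.5 * np.log(2 * np.pi * sigma ** 2) + (x - mu) ** 2 / (2 * sigma ** 2)

def va_from_dynamics(f_before, f_after, posterior_probs):
    """Valence from the reduction in free energy; arousal from posterior entropy over states."""
    valence = f_before - f_after  # successful reduction -> positive valence
    p = np.asarray(posterior_probs)
    arousal = float(-np.sum(p * np.log(p + 1e-12)))  # Shannon entropy (uncertainty)
    return float(valence), arousal

# Toy episode: a large prediction error (x far from mu) is partially resolved by a belief update.
f0 = gaussian_surprise(x=2.0, mu=0.0, sigma=1.0)   # before update: high surprise
f1 = gaussian_surprise(x=2.0, mu=1.5, sigma=1.0)   # after update: lower surprise
print(va_from_dynamics(f0, f1, posterior_probs=[0.7, 0.2, 0.1]))  # positive valence, moderate arousal
```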
6. Application Domains and Empirical Coverage
Valence–arousal frameworks are implemented across an expanding set of application contexts:
- Vision: Continuous facial affect synthesis, facial expression regression, affective image retrieval (Kollias et al., 2018, Wagner et al., 23 Apr 2024, Nath et al., 2020).
- Audio: Speech emotion recognition via anchored dimensionality reduction from categorical labels, direct acoustic mapping in pet vocalizations (Huang et al., 9 Oct 2025, Zhou et al., 2023).
- Text: Multilingual VA regression in short texts, lexicon construction for affective computing (Mendes et al., 2023, Won et al., 2021).
- Cross-modal: Image-music matching by embedding both modalities in the same VA space and minimizing metric distance (Zhao et al., 2020, Won et al., 2021).
- Proxy-based mapping: Human-judged animation proxies as a self-grounded interface for mapping between discrete and VA representations (Wrobel, 16 Nov 2025).
- Agent models: Simulation of emotional trajectories in artificial agents employing expected free energy decomposition to VA signals (Pattisapu et al., 2 Jul 2024, Yanagisawa et al., 2022).
These approaches consistently demonstrate that the VA framework enables domain-agnostic affective modeling, enhances fine-grained emotion inference, and serves as a bridge between discrete and continuous taxonomies for both research and practical deployment.
7. Limitations, Extensions, and Open Issues
Despite widespread adoption, important caveats remain:
- Vocabulary restriction: Lexicon-based or manual mapping approaches rely on pre-existing word lists, limiting their flexibility for novel or multilingual domains; this motivates data-driven metric learning and more sophisticated transfer strategies (Won et al., 2021, Nath et al., 2020, Wrobel, 16 Nov 2025).
- Subjectivity and generalizability: Proxy-based mapping is human-centric and may not generalize across populations or cultures; standard deviations in VA self-ratings hover around 2–2.5 on 10-point scales (Wrobel, 16 Nov 2025).
- Supervised data scarcity: Dimensional VA annotations are harder to acquire than categorical labels, prompting hybrid solutions that leverage classification finetuning followed by reduction to VA space via anchored dimensionality reduction (Zhou et al., 2023).
- Model limitations: Current systems underperform on highly contextual, metaphoric, or low-resource language data, and struggle with ambiguous cases lying near the origin; extensions to dominance or other extra axes are proposed but not universally adopted (Mendes et al., 2023, Kollias et al., 2018, Wrobel, 16 Nov 2025).
- Theoretical modeling: Probabilistic and free-energy-based models show promise for unifying cognitive, affective, and computational paradigms but require further empirical validation and benchmarking (Pattisapu et al., 2 Jul 2024, Yanagisawa et al., 2022).
Future work aims to extend the VA paradigm to hierarchical or temporally recursive emotion accounting, integrate uncertainty quantification, improve multimodal generalization, and broaden cross-cultural coverage.
The valence–arousal space has become the de facto standard for dimensional affect modeling across disciplines, providing a compact, interpretable, and theoretically principled substrate for both cognitive science and modern affective machine learning (Kollias et al., 2018, Wagner et al., 23 Apr 2024, Wrobel, 16 Nov 2025, Huang et al., 9 Oct 2025, Mendes et al., 2023, Pattisapu et al., 2 Jul 2024).