EmoBank: Benchmark for Emotion Analysis
- EmoBank is a large-scale English corpus featuring 10,062 sentences annotated for both writer and reader emotions using dimensional (VAD) and categorical formats.
- Its dual-perspective design, with robust inter-annotator agreement (e.g., Pearson r up to 0.738), enables precise emotion regression and mapping studies.
- EmoBank underpins affective computing by benchmarking regression models and powering real-time affective memory in reflective agent architectures.
EmoBank is a large-scale English-language corpus, methodological framework, and applied affective memory system, developed for the fine-grained analysis, modeling, and operationalization of emotions in both human and artificial agent contexts. Its central contribution arises from a bi-perspectival and bi-representational annotation design, encompassing both dimensional (Valence–Arousal–Dominance, VAD) and categorical (Basic Emotions) formats under both writer and reader perspectives. EmoBank has established itself as a benchmark corpus for emotion regression and mapping studies, while also supplying a robust substrate for runtime affective state tracking in reflective agent architectures.
1. Corpus Design and Annotation Schema
EmoBank consists of 10,062 English sentences after filtering, sourced from seven balanced genres: news headlines, blogs, essays, fiction, letters, newspapers, and travel guides. Composition included a stringent filtering phase that discarded the roughly 10% of ratings judged implausible (all three VAD dimensions at their lower bound) to improve data integrity (Buechel et al., 2022).
Each sentence is annotated twice: once for the emotion expressed by the writer (“Expressing Emotion”) and once for the emotion likely to be evoked in an average reader (“Evoking Emotion”). Ratings employ a five-point variation of the Self-Assessment Manikin (SAM), yielding for each dimension:
- Valence (Pleasure): 1 = very unhappy, 3 = neutral, 5 = very happy
- Arousal: 1 = very calm/sluggish, 3 = neutral, 5 = very excited/aroused
- Dominance: 1 = very submissive/controlled, 3 = neutral, 5 = very dominant/in-control
Annotation tasks were distributed as independent CrowdFlower jobs, with five pre-screened annotators per dimension and perspective, ensuring high label reliability (Buechel et al., 2022).
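The schema above can be made concrete as a small data structure. This is an illustrative sketch only: the field names and the `is_neutral` helper are hypothetical, not the official distribution format of the corpus.

```python
from dataclasses import dataclass

# Hypothetical record layout for one EmoBank annotation; field names are
# illustrative, not the official release format.
@dataclass
class VADAnnotation:
    sentence_id: str
    perspective: str   # "writer" or "reader"
    valence: float     # 1 = very unhappy ... 5 = very happy
    arousal: float     # 1 = very calm ... 5 = very excited
    dominance: float   # 1 = very submissive ... 5 = very dominant

    def is_neutral(self, tol: float = 0.5) -> bool:
        """True if all three dimensions lie within `tol` of the neutral point 3."""
        return all(abs(v - 3.0) <= tol
                   for v in (self.valence, self.arousal, self.dominance))

ann = VADAnnotation("s001", "reader", 3.2, 3.0, 3.1)
print(ann.is_neutral())  # → True
```

Keeping the perspective as an explicit field mirrors the corpus design, where every sentence carries one writer-side and one reader-side rating.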
2. Annotation Perspectives, Reliability, and Statistical Properties
The dual-perspective annotation (writer, reader) permits investigation of subjectivity and evoked affect. Inter-annotator agreement (IAA), evaluated by Pearson's r and mean absolute error (MAE), reveals:

| Perspective | Valence (r) | Arousal (r) | Dominance (r) | Avg (r) | Valence (MAE) | Arousal (MAE) | Dominance (MAE) | Avg (MAE) |
|:-----------:|:-----------:|:-----------:|:-------------:|:-------:|:-------------:|:-------------:|:---------------:|:---------:|
| Writer | 0.698 | 0.578 | 0.540 | 0.605 | 0.300 | 0.388 | 0.316 | 0.335 |
| Reader | 0.738 | 0.595 | 0.570 | 0.634 | 0.349 | 0.441 | 0.367 | 0.386 |
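Both agreement metrics are standard and easy to compute from two annotators' rating vectors; a minimal pure-Python sketch (the sample vectors are illustrative, not corpus data):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample Pearson correlation between two annotators' rating vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def mae(xs, ys):
    """Mean absolute error between the same two rating vectors."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)

# Illustrative five-point ratings from two annotators.
a = [1, 2, 3, 4, 5]
b = [1.2, 2.1, 2.9, 4.3, 4.8]
print(round(pearson_r(a, b), 3), round(mae(a, b), 3))
```

Note that r and MAE can disagree: the reader perspective in the table has both higher r (better rank agreement) and higher MAE (larger absolute deviations), which is exactly the pattern the emotionality analysis below the table addresses.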
The reader perspective achieves significantly higher correlation-based IAA (most markedly for Valence) and greater “emotionality”—operationalized as average absolute deviation from the neutral scale point—without introducing unexplained annotation error. A regression analysis confirms that differences in annotation error between perspectives are fully explained by increased emotionality, not noise, as evidenced by a near-zero regression intercept (Buechel et al., 2022).
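The intercept argument can be illustrated with a toy ordinary-least-squares fit. The data below are fabricated for illustration (not the corpus statistics): annotation error grows proportionally with emotionality, so the fitted line passes near the origin, meaning emotionality alone accounts for the error difference.

```python
def ols_fit(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

# Toy data: error rising proportionally with emotionality.
emotionality = [0.2, 0.4, 0.6, 0.8, 1.0]
error        = [0.10, 0.21, 0.29, 0.41, 0.50]
a, b = ols_fit(emotionality, error)
print(round(a, 3), round(b, 3))  # → 0.002 0.5  (near-zero intercept)
```

A substantially positive intercept, by contrast, would indicate residual error not attributable to emotionality, i.e. genuine annotation noise.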
3. Bi-Representational Format: Dimensional and Categorical Emotion Mapping
A critical feature is the integration of categorical emotion data. For a subset overlapping with SemEval-2007 Task 14, annotations are available both in VAD and in Ekman’s six Basic Emotions (Joy, Anger, Sadness, Fear, Disgust, Surprise) on a [0,100] scale. This supports direct machine learning-based mapping between dimensional and categorical affect representations.
k-nearest neighbor regression models, trained to predict each basic emotion score from VAD vectors, realize “near-human” mapping performance. Combined writer+reader input attains Pearson correlations matching or surpassing the average inter-annotator agreement (IAA) for categorical emotion labeling by experts. In the cases of Joy and Sadness, the regression-based mapping outperforms human agreement (Buechel et al., 2022). This finding substantiates the claim that high-quality dimensional VAD annotations can substitute for resource-intensive categorical labeling.
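A minimal sketch of this kNN mapping, with toy VAD vectors and hypothetical Joy scores standing in for the SemEval-overlapping subset (the data, k, and distance choice are illustrative assumptions):

```python
def knn_predict(train_X, train_y, query, k=3):
    """Predict a basic-emotion score from a VAD vector by averaging the
    targets of the k nearest training points (squared Euclidean distance)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), y)
        for x, y in zip(train_X, train_y)
    )
    return sum(y for _, y in dists[:k]) / k

# Toy VAD vectors (1-5 scale) paired with hypothetical Joy scores on [0, 100].
vad = [(4.5, 3.8, 3.6), (4.2, 3.5, 3.4), (1.8, 3.9, 2.2), (2.0, 4.1, 2.4)]
joy = [85.0, 78.0, 5.0, 8.0]

print(knn_predict(vad, joy, (4.4, 3.7, 3.5), k=2))  # → 81.5
```

A high-valence query lands in the high-Joy neighborhood, which is the intuition behind dimensional-to-categorical mapping: regions of VAD space correspond to basic-emotion profiles.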
4. EmoBank in Model Development and Benchmarking
EmoBank’s format, genre diversity, and reliability metrics make it foundational for benchmarking regression models and multi-task frameworks for emotion detection and estimation. VADEC, a multi-task architecture, exemplifies model usage: it treats each of EmoBank’s VAD dimensions as a continuous per-dimension regression target.
Experimentally, state-of-the-art models using VADEC and its variants, including configurations trained on SenWave, achieve Pearson correlations for Valence, Arousal, and Dominance of up to 0.823, 0.583, and 0.511, respectively. The pipeline leverages a shared BERTweet encoder and mean squared error loss over the three VAD dimensions, supporting the assertion of EmoBank’s value for affective computing tasks at sentence-level granularity (Mukherjee et al., 2021).
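The regression objective itself is straightforward. A minimal sketch of the MSE loss over the three VAD dimensions, assuming predictions and targets as plain 3-vectors (the real model computes this over BERTweet sentence encodings):

```python
def vad_mse_loss(pred, target):
    """Mean squared error averaged over the three VAD dimensions
    (sketch of the multi-task regression objective)."""
    assert len(pred) == len(target) == 3
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / 3

# A near-neutral prediction against a neutral target incurs a small loss.
print(vad_mse_loss([3.1, 2.9, 3.0], [3.0, 3.0, 3.0]))
```

Averaging over dimensions (rather than summing) keeps the loss scale comparable when some dimensions are dropped, e.g. in ablations that regress Valence and Arousal only.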
5. EmoBank as Affective Memory in Agent Systems
Distinct from its utility as a natural language corpus, EmoBank has been repurposed for affective state tracking in reflective agentic runtimes, where it operates as a persistent “affective memory” module. In the VIGIL architecture, EmoBank maintains a log of emotional appraisals, each tagged by emotion type (e.g., frustration, anxiety, pride), signed valence, intensity in $[0, 1]$, timestamp, cause, and episodic hash.
Key mechanisms include:
- Exponential half-life decay: an entry deposited with intensity $I_0$ at time $t_0$ has decayed intensity $I(t) = I_0 \cdot 2^{-(t - t_0)/h}$ at time $t$, where the half-life $h$ defaults to a fixed number of hours.
- Deposit policies: noise-floor filtering (entries below a minimum intensity are dropped), coalescing of same-type events within 5 min (intensity boosted, capped at 1.0), and rebound injection (e.g., inserting “determination” when relief follows frustration).
- Aggregation: Read-time snapshots compute dominant emotions and composite metrics for downstream diagnosis (e.g., Roses/Buds/Thorns, RBT) in system introspection (Cruz, 8 Dec 2025).
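The decay, deposit, and aggregation mechanics above can be sketched as a minimal in-memory module. The constants (`HALF_LIFE_HOURS`, `NOISE_FLOOR`, the coalescing boost) are illustrative assumptions, since the source does not fix their exact values here; rebound injection is omitted for brevity.

```python
HALF_LIFE_HOURS = 6.0    # assumed default; the source specifies hours but not the value
NOISE_FLOOR = 0.05       # assumed minimum intensity for a deposit to be kept
COALESCE_WINDOW_S = 300  # 5-minute coalescing window, per the deposit policy

class AffectiveMemory:
    """Minimal sketch of an EmoBank-style affective memory (VIGIL usage)."""

    def __init__(self):
        self.entries = []  # dicts: emotion, valence, intensity, t (seconds)

    def decayed(self, entry, now):
        # Exponential half-life decay: I(t) = I0 * 2**(-(t - t0) / h)
        hours = (now - entry["t"]) / 3600.0
        return entry["intensity"] * 2.0 ** (-hours / HALF_LIFE_HOURS)

    def deposit(self, emotion, valence, intensity, now):
        if intensity < NOISE_FLOOR:
            return  # noise-floor filtering: drop negligible appraisals
        for e in self.entries:
            # Coalesce same-type events inside the window: boost, cap at 1.0.
            if e["emotion"] == emotion and now - e["t"] <= COALESCE_WINDOW_S:
                e["intensity"] = min(1.0, e["intensity"] + 0.5 * intensity)
                return
        self.entries.append({"emotion": emotion, "valence": valence,
                             "intensity": intensity, "t": now})

    def dominant(self, now):
        # Read-time snapshot: emotion with the highest decayed intensity.
        return max(self.entries, key=lambda e: self.decayed(e, now))["emotion"]

mem = AffectiveMemory()
mem.deposit("frustration", -0.6, 0.8, now=0.0)
mem.deposit("pride", 0.7, 0.4, now=0.0)
print(mem.dominant(0.0))  # → frustration
```

Because reads recompute decay on demand, no background process is needed: stale appraisals simply fade below fresher ones at snapshot time.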
This operationalization ensures that decaying affective traces guide adaptation and error correction, and that emotional context remains synchronized with ongoing self-healing adaptation cycles in reflective agent designs.
6. Key Findings, Implications, and Applications
Empirical analyses highlight the “supremacy of the reader perspective” for reliable VAD annotation, the sufficiency of dimensional annotations for reconstructing categorical emotion labels, and the tractability of sentence-level affect regression for both evaluation and operational self-reflection pipelines. In neural and pipeline-based affect modeling, EmoBank’s rigorous design and transparency support reproducibility and extensibility in both academic and applied settings.
By bridging corpus-driven affect representation, model benchmarking, and real-time affective memory, EmoBank has advanced the methodological foundation for systematic emotion analysis and its integration into agentic cognitive architectures (Buechel et al., 2022, Mukherjee et al., 2021, Cruz, 8 Dec 2025).