ABCDE Dataset: Comprehensive Text Corpus
- ABCDE dataset is a comprehensive corpus of 403M text instances annotated for affect, cognition, body, and demographics.
- Its deterministic, lexicon-based pipeline enables auditable feature extraction from social media, literature, blogs, and AI-generated texts.
- The resource supports longitudinal and subgroup analyses, facilitating rigorous studies in computational affective science and digital humanities.
The ABCDE dataset (Affect, Body, Cognition, Demographics, and Emotion) is a large-scale, multi-source, feature-rich corpus designed for computational affective science, computational social science, and digital humanities research. Comprising over 400 million text utterances annotated with a comprehensive suite of affective, cognitive, demographic, and linguistic features, ABCDE provides standardized, lexicon-based annotations drawn from social media, books, blogs, and AI-generated texts. Its deterministic, auditable feature extraction pipeline and open-access distribution facilitate rigorous aggregate and subgroup analyses of affect, emotion, cognition, and demographic language patterns in large-scale textual data (Wahle et al., 19 Dec 2025).
1. Corpus Composition and Statistics
ABCDE contains 403 million text instances collected from five sources, each representing distinct temporal, topical, and register domains:
| Source | Instances (M) | Temporal Coverage |
|---|---|---|
| Twitter | 45.2 | 2015–2021 |
| Reddit | 78.6 | 2010–2022 |
| Books (Fiction) | 177.1 | 1800–2012 |
| Blogs | 34.2 | snapshot of 2008 |
| AI-generated | 68.9 | 2022–2025 |
Textual content spans informal social media (tweets, Reddit posts, personal blogs), fictional literature (5-grams from English Fiction), and a broad selection of AI outputs (LLM dialogues, reasoning chains, essays, detection benchmarks). Each instance represents either a single utterance (social media, blog post) or a short chunk (5-gram).
This heterogeneous composition enables temporally resolved studies from the 19th to 21st centuries and comparative analyses across human- and AI-generated language.
2. Feature Taxonomy and Annotation Approach
A total of 136 features are deterministically extracted for each text instance, grouped into five primary dimensions: Affect (VAD), Emotion, Body, Cognition, and Demographics.
Affect (Valence, Arousal, Dominance; VAD)
- Annotation uses the NRC VAD Lexicon (Mohammad, 2018) to assign continuous valence, arousal, and dominance scores in [0, 1].
- For each instance and feature (e.g., valence), three aggregates are computed:
- valence_count (raw count of lexicon hits)
- valence_presence (binary presence flag)
- valence_avg (mean lexicon score over matched tokens)
- Binarized flags for high (≥ 0.66) and low (≤ 0.33) VAD.
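As an illustration, the three aggregates and the binarized flags reduce to a plain dictionary lookup. The mini-lexicon below is hypothetical; the actual pipeline uses the full NRC VAD Lexicon.

```python
# Sketch of the per-instance valence aggregates (hypothetical mini-lexicon;
# the real pipeline uses the full NRC VAD Lexicon).
VALENCE = {"happy": 0.95, "dull": 0.20, "calm": 0.60}

def valence_features(text):
    tokens = text.lower().split()
    hits = [VALENCE[t] for t in tokens if t in VALENCE]
    count = len(hits)                                 # valence_count
    presence = int(count > 0)                         # valence_presence
    avg = sum(hits) / count if count else None        # valence_avg
    high = int(avg is not None and avg >= 0.66)       # high-valence flag
    low = int(avg is not None and avg <= 0.33)        # low-valence flag
    return {"valence_count": count, "valence_presence": presence,
            "valence_avg": avg, "valence_high": high, "valence_low": low}
```

The same pattern applies unchanged to arousal and dominance with their respective lexicon columns.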
Emotion (Discrete)
- Lexicons: NRC Emotion Intensity (Plutchik’s eight: anger, anticipation, disgust, fear, joy, sadness, surprise, trust), WCST (warmth, competence, sociability, trust), NRC WorryWords (anxiety, calm).
- Word-level dictionary lookup with the same three aggregates as for VAD features.
Body
- 292 anatomical terms (unigrams, bigrams, trigrams) from Zhuang et al. (2024) and Wu et al. (2025).
- Features:
- body_pres (global 0/1)
- For each possessive pronoun (my, your, his, her, our, their): lists of co-occurring body parts (e.g., my_bpm).
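The possessive-pronoun body features can be sketched as a scan over adjacent pronoun + body-part pairs. This is a simplification under stated assumptions: the lexicon subset is hypothetical (the released lexicon covers 292 uni-, bi-, and trigram terms), and the real extraction may match non-adjacent co-occurrences.

```python
# Hypothetical subset of the 292-term body lexicon.
BODY_PARTS = {"head", "heart", "hands"}
POSSESSIVES = ("my", "your", "his", "her", "our", "their")

def body_features(text):
    tokens = text.lower().split()
    # body_pres: global 0/1 flag for any body-part mention.
    feats = {"body_pres": int(any(t in BODY_PARTS for t in tokens))}
    for pron in POSSESSIVES:
        # e.g. my_bpm: body parts directly following the possessive "my".
        feats[f"{pron}_bpm"] = [b for p, b in zip(tokens, tokens[1:])
                                if p == pron and b in BODY_PARTS]
    return feats
```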
Cognition
- 98 “thinking verbs” from Bloom’s Taxonomy and Queensland’s lists, grouped into 11 categories (e.g., Analyzing, Remembering, Deciding).
- For each category c: cognition_&lt;c&gt;_pres is set to 1 if any associated verb is present.
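A minimal sketch of the category presence flags, assuming a small hypothetical category-to-verb mapping (the released pipeline uses the full 98-verb lists across 11 categories):

```python
# Hypothetical subset of the cognition categories and their "thinking verbs".
COGNITION = {
    "analyzing": {"compare", "contrast", "examine"},
    "remembering": {"recall", "recognize", "list"},
}

def cognition_flags(text):
    tokens = set(text.lower().split())
    # cognition_<category>_pres: 1 if any associated verb occurs.
    return {f"cognition_{cat}_pres": int(bool(verbs & tokens))
            for cat, verbs in COGNITION.items()}
```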
Demographics
- Demographic attributes (e.g., age, occupation) are extracted via hand-written regexes and dictionary lookups.
Additional focus features include pronoun presence (first, second, third person) and tense (past, present, future from UniMorph morphological annotations).
All feature extraction is deterministic, relying on public lexicons and hand-written regexes; there is no crowdsourcing. The annotation pipeline is fully auditable, enabling direct review of its recall and precision.
3. Computational Metrics and Aggregation Formulas
Feature computation adheres to lexicon-based aggregate metrics without using neural or ML models (except for morphological tagging). For an instance $t$ with tokens $w_1, \dots, w_n$ and a lexicon $L$ with value mapping $s : L \to [0, 1]$:
- Raw count: $\mathrm{count}(t) = \sum_{i=1}^{n} \mathbb{1}[w_i \in L]$
- Binary presence: $\mathrm{pres}(t) = \mathbb{1}[\mathrm{count}(t) > 0]$
- Average score: $\mathrm{avg}(t) = \frac{1}{\mathrm{count}(t)} \sum_{i:\, w_i \in L} s(w_i)$
- Length-normalized density: $\mathrm{density}(t) = \mathrm{count}(t) / n$ (with $n$ = token count)
Thresholds for VAD features are defined as high (≥ 0.66) and low (≤ 0.33). Each feature is accompanied by the relevant count, intensity, and presence flag, enabling consistent cross-text and cross-source analysis.
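The four aggregates translate directly into pure functions; a minimal sketch, with a hypothetical lexicon passed in as a plain score dictionary:

```python
def aggregates(tokens, lexicon):
    """Lexicon-based aggregates for one tokenized instance.
    `lexicon` maps matched words to scores in [0, 1] (hypothetical values)."""
    hits = [lexicon[t] for t in tokens if t in lexicon]
    count = len(hits)                                   # raw count
    presence = int(count > 0)                           # binary presence
    avg = sum(hits) / count if count else None          # average score
    density = count / len(tokens) if tokens else 0.0    # length-normalized
    return count, presence, avg, density
```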
4. Data Format, Distribution, and Access
ABCDE is openly distributed on Hugging Face and GitHub under a CC-BY-4.0 license. Data are organized as Apache Arrow tables and accessible as CSV/JSON through the Hugging Face datasets library. Each source is a separate data split (e.g., “twitter”, “reddit”, “books”, “blogs”, “ai_text”).
Schema highlights:
| Field | Type | Description |
|---|---|---|
| id | string | instance ID |
| text | string | raw text |
| source | string | source split |
| timestamp | datetime | post time (if available) |
| user_id | string | author (if available) |
| age, occupation, etc. | various | demographic attributes |
| valence_avg, ... | float/int/0/1 | VAD and emotion aggregate features |
| body_pres, my_bpm, ... | int/list | body features |
| cognition_*_pres | int (0/1) | cognition categories |
| pronoun/tense features | int (0/1) | focus/timing features |
There is no standardized train/dev/test partitioning; the corpus is intended for aggregate or longitudinal analyses with filtering and stratification as needed. Installation and loading (Python example):
```python
from datasets import load_dataset

ds = load_dataset("jpwahle/abcde")                      # all splits
reddit = load_dataset("jpwahle/abcde", split="reddit")  # single split
```
5. Analytical Usage and Best Practice Recommendations
ABCDE enables computational affective scientists to immediately conduct aggregate or subgroup analyses without additional annotation or model training. Key usage paradigms illustrated in the documentation include:
- Temporal affective trend analysis, e.g., tracking mean valence of tweets by year.
- Disentangling domain-specific focus, e.g., contrasting prevalence of body vs. cognition terms across literary history.
- Cross-feature demographic studies, e.g., identifying associations between occupational/age demographics and cognitive verb usage.
Code samples (Python/pandas) are provided for filtering, aggregation, and visualization, including stratification by time, source, or demographic attribute.
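As an illustration of the temporal-trend paradigm, a mean-valence-by-year aggregation in pandas might look as follows. The tiny synthetic frame stands in for the Twitter split; column names follow the schema above, and `timestamp` availability varies by source.

```python
import pandas as pd

# Synthetic stand-in for the "twitter" split; in practice the frame would
# come from load_dataset("jpwahle/abcde", split="twitter").to_pandas().
tweets = pd.DataFrame({
    "timestamp": ["2015-03-01", "2015-07-12", "2016-01-05"],
    "valence_avg": [0.40, 0.60, 0.70],
})

# Stratify by year and track mean valence over time.
tweets["year"] = pd.to_datetime(tweets["timestamp"]).dt.year
trend = tweets.groupby("year")["valence_avg"].mean()
print(trend)
```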
Best practices:
- Select presence vs. intensity features based on text length and research intent.
- Normalize feature counts by instance length for cross-domain comparability.
- Treat missing demographics as “unknown,” never as a negative label.
- For longitudinal analysis, utilize timestamp or year fields for chronological stratification.
- Lexicon-based features are robust for aggregate-level trends, but for high-precision individual prediction, fine-tuned models may be preferable.
6. Scope and Interdisciplinary Impact
By providing a unified, feature-rich annotation over diverse textual domains and timespans, ABCDE substantially lowers the barrier for affective, cognitive, sociolinguistic, and digital humanities research requiring large-scale, attribute-labeled text. The deterministic methodology and open-access infrastructure facilitate replicable, cross-disciplinary studies involving affect, cognition, and demographic processes in language (Wahle et al., 19 Dec 2025).
This resource enables, for example, computational social scientists to quantify change over time in emotional expression, track demographic-specific language trends, investigate AI-human language convergence, and conduct hypothesis-driven analysis without the confounds of manual annotation and opaque pipelines. A plausible implication is that ABCDE will stimulate more standardized, transparent research methodologies across affective sciences and neighboring fields.