
ABCDE Dataset: Comprehensive Text Corpus

Updated 26 December 2025
  • The ABCDE dataset is a comprehensive corpus of 403 million text instances annotated for affect, body, cognition, demographics, and emotion.
  • Its deterministic, lexicon-based pipeline enables auditable feature extraction from social media, literature, blogs, and AI-generated texts.
  • The resource supports longitudinal and subgroup analyses, facilitating rigorous studies in computational affective science and digital humanities.

The ABCDE dataset (Affect, Body, Cognition, Demographics, and Emotion) is a large-scale, multi-source, feature-rich corpus designed for computational affective science, computational social science, and digital humanities research. Comprising over 400 million text utterances annotated with a comprehensive suite of affective, cognitive, demographic, and linguistic features, ABCDE provides standardized, lexicon-based annotations drawn from social media, books, blogs, and AI-generated texts. Its deterministic, auditable feature extraction pipeline and open-access distribution facilitate rigorous aggregate and subgroup analyses of affect, emotion, cognition, and demographic language patterns in large-scale textual data (Wahle et al., 19 Dec 2025).

1. Corpus Composition and Statistics

ABCDE contains 403 million text instances collected from five sources, each representing distinct temporal, topical, and register domains:

Source | Instances (M) | Temporal Coverage
Twitter | 45.2 | 2015–2021
Reddit | 78.6 | 2010–2022
Books (Fiction) | 177.1 | 1800–2012
Blogs | 34.2 | 2008 snapshot
AI-generated | 68.9 | 2022–2025

Textual content spans informal social media (tweets, Reddit posts, personal blogs), fictional literature (5-grams from English Fiction), and a broad selection of AI outputs (LLM dialogues, reasoning chains, essays, detection benchmarks). Each instance represents either a single utterance (social media, blog post) or a short chunk (5-gram).

This heterogeneous composition enables temporally resolved studies from the 19th to 21st centuries and comparative analyses across human- and AI-generated language.

2. Feature Taxonomy and Annotation Approach

A total of 136 features are deterministically extracted for each text instance, grouped into five primary dimensions: Affect (VAD), Emotion, Body, Cognition, and Demographics.

Affect (Valence, Arousal, Dominance; VAD)

  • Annotation uses the NRC VAD Lexicon (Mohammad, 2018) for continuous valence, arousal, and dominance scores in $[0, 1]$.
  • For each instance $I$ and feature (e.g., valence), three aggregates are computed (see the sketch after this list):
    • $\mathrm{Valence}(I) = \frac{1}{|I \cap L|} \sum_{w \in I \cap L} s_{\mathrm{valence}}(w)$
    • valence_count $= |I \cap L|$
    • valence_presence $= 1_{|I \cap L| > 0}$
  • Binarized flags for high (≥ 0.66) and low (≤ 0.33) VAD.
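
A minimal sketch of these per-feature aggregates, assuming a tiny illustrative stand-in for the NRC VAD valence scores (the VALENCE dictionary below is hypothetical; the full lexicon is far larger):

# Illustrative valence scores in [0, 1]; the real NRC VAD Lexicon is much larger.
VALENCE = {"happy": 0.92, "sad": 0.11, "party": 0.78, "rain": 0.45}

def vad_aggregates(text, lexicon=VALENCE, hi=0.66, lo=0.33):
    tokens = text.lower().split()
    matched = [lexicon[w] for w in tokens if w in lexicon]
    count = len(matched)                              # valence_count = |I ∩ L|
    presence = int(count > 0)                         # valence_presence
    avg = sum(matched) / count if count else 0.0      # Valence(I)
    return {
        "valence_avg": avg,
        "valence_count": count,
        "valence_presence": presence,
        "valence_high": int(presence and avg >= hi),  # binarized high flag
        "valence_low": int(presence and avg <= lo),   # binarized low flag
    }

print(vad_aggregates("A happy party in the rain"))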

Emotion (Discrete)

  • Lexicons: NRC Emotion Intensity (Plutchik's eight basic emotions: anger, anticipation, disgust, fear, joy, sadness, surprise, trust), WCST (warmth, competence, sociability, trust), NRC WorryWords (anxiety, calm).
  • Word-level dictionary lookup with the same three aggregates as for VAD features.

Body

  • 292 anatomical terms (unigrams, bigrams, trigrams) from Zhuang et al. (2024) and Wu et al. (2025).
  • Features (see the sketch after this list):
    • body_pres (global 0/1)
    • For each possessive pronoun (my, your, his, her, our, their): lists of co-occurring body parts (e.g., my_bpm).
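
A hedged sketch of how such body features could be derived; the term list is a tiny illustrative subset, and the co-occurrence rule (a body part immediately following the possessive pronoun) is an assumption, since the exact matching window is not specified here:

import re

# Illustrative subset of the 292 anatomical terms (full list from Zhuang et al. 2024 / Wu et al. 2025).
BODY_TERMS = ["head", "hands", "heart", "lower back"]
POSSESSIVES = ["my", "your", "his", "her", "our", "their"]

def body_features(text):
    normalized = " ".join(re.findall(r"[a-z]+", text.lower()))
    feats = {"body_pres": int(any(re.search(rf"\b{t}\b", normalized) for t in BODY_TERMS))}
    for pron in POSSESSIVES:
        # Assumption: a body part "co-occurs" if it directly follows the possessive pronoun.
        feats[f"{pron}_bpm"] = [t for t in BODY_TERMS
                                if re.search(rf"\b{pron} {t}\b", normalized)]
    return feats

print(body_features("My heart races and my lower back aches."))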

Cognition

  • 98 “thinking verbs” from Bloom’s Taxonomy and Queensland’s lists, grouped into 11 categories (e.g., Analyzing, Remembering, Deciding).
  • For each category $C_j$: cognition_$C_j$_pres $= 1$ if any verb associated with $C_j$ is present (see the sketch below).
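
A minimal sketch of these per-category presence flags, using an illustrative fragment of the category-to-verb mapping rather than the full 98-verb inventory:

# Illustrative fragment of the 11 cognition categories and their "thinking verbs".
COGNITION = {
    "Analyzing": {"compare", "examine", "differentiate"},
    "Remembering": {"recall", "recognize", "remember"},
    "Deciding": {"choose", "decide", "judge"},
}

def cognition_features(text):
    # Note: plain token matching; inflected forms (e.g., "decided") would need lemmatization.
    tokens = set(text.lower().split())
    return {f"cognition_{cat}_pres": int(bool(verbs & tokens))
            for cat, verbs in COGNITION.items()}

print(cognition_features("I remember why we had to decide quickly"))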

Demographics

  • Extraction via regexes and dictionary lookups (see the sketch after this list):
    • Age: explicit patterns (e.g., “I am 25”) and temporal metadata.
    • Occupation: self-disclosures parsed via BLS SOC codes.
    • Gender, country, city, religion: mapped from controlled vocabularies (Wikipedia, Geonames, world religions).
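
A hedged sketch of the regex-plus-vocabulary extraction for a few attributes; the patterns and vocabularies below are illustrative stand-ins for the BLS SOC, Wikipedia, and Geonames resources:

import re

# Illustrative controlled vocabularies; the dataset maps against BLS SOC, Wikipedia, and Geonames.
COUNTRIES = {"germany", "brazil", "japan"}
OCCUPATIONS = {"nurse", "teacher", "software engineer"}

AGE_RE = re.compile(r"\bi(?:'m| am)\s+(\d{1,2})\b", re.IGNORECASE)

def demographic_features(text):
    lower = text.lower()
    age = AGE_RE.search(text)
    return {
        # Missing attributes stay None ("unknown"), never a negative label.
        "age": int(age.group(1)) if age else None,
        "country": next((c for c in COUNTRIES if c in lower), None),
        "occupation": next((o for o in OCCUPATIONS if o in lower), None),
    }

print(demographic_features("I am 25, a nurse living in Japan."))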

Additional focus features include pronoun presence (first, second, and third person) and tense (past, present, and future, derived from UniMorph morphological annotations).

All feature extraction is deterministic, based on public lexicons and hand-written regexes; there is no crowdsourcing. The annotation pipeline is auditable, enabling direct review of recall and precision.

3. Computational Metrics and Aggregation Formulas

Feature computation adheres to lexicon-based aggregate metrics without using neural or ML models (except for morphological tagging). For an instance $I$ and lexicon $L$ (with value mapping $s(w)$):

  • Raw count: $\mathrm{count}_L(I) = \sum_{w \in I} 1_{w \in L}$
  • Binary presence: $\mathrm{presence}_L(I) = \min(1, \mathrm{count}_L(I))$
  • Average score: $\mathrm{average\_score}_L(I) = \frac{1}{\max(1, \mathrm{count}_L(I))} \sum_{w \in I \cap L} s(w)$
  • Length-normalized density: $\mathrm{density}_L(I) = \frac{\mathrm{count}_L(I)}{|I|}$, where $|I|$ is the token count

Thresholds for VAD features are defined as high (≥ 0.66) and low (≤ 0.33). Each feature is accompanied by the relevant count, intensity, and presence flag, enabling consistent cross-text and cross-source analysis.
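
A compact sketch of these four aggregates for an arbitrary scored lexicon; the two-word lexicon in the example is a placeholder, not a shipped resource:

def lexicon_metrics(tokens, lexicon):
    """tokens: list of lowercased tokens; lexicon: dict mapping word -> score s(w)."""
    matched = [lexicon[w] for w in tokens if w in lexicon]
    count = len(matched)                               # count_L(I)
    presence = min(1, count)                           # presence_L(I)
    average = sum(matched) / max(1, count)             # average_score_L(I)
    density = count / len(tokens) if tokens else 0.0   # density_L(I), |I| = token count
    return {"count": count, "presence": presence,
            "average_score": average, "density": density}

print(lexicon_metrics("the storm felt calm afterwards".split(),
                      {"storm": 0.2, "calm": 0.9}))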

4. Data Format, Distribution, and Access

ABCDE is openly distributed on Hugging Face and GitHub under a CC-BY-4.0 license. Data are organized as Apache Arrow tables and accessible as CSV/JSON through the Hugging Face datasets library. Each source is a separate data split (e.g., “twitter”, “reddit”, “books”, “blogs”, “ai_text”).

Schema highlights:

Field | Type | Description
id | string | instance ID
text | string | raw text
source | string | source split
timestamp | datetime | post time (if available)
user_id | string | author (if available)
age, occupation, etc. | various | demographic attributes
valence_avg, ... | float / int (0/1) | VAD and emotion aggregate features
body_pres, my_bpm, ... | int / list | body features
cognition_*_pres | int (0/1) | cognition category presence
pronoun/tense features | int (0/1) | focus (pronoun) and tense features

There is no standardized train/dev/test partitioning; the corpus is intended for aggregate or longitudinal analyses with filtering and stratification as needed. Installation and loading (Python example):

from datasets import load_dataset
ds = load_dataset("jpwahle/abcde")  # all splits
reddit = load_dataset("jpwahle/abcde", split="reddit")

5. Analytical Usage and Best Practice Recommendations

ABCDE enables computational affective scientists to immediately conduct aggregate or subgroup analyses without additional annotation or model training. Key usage paradigms illustrated in the documentation include:

  • Temporal affective trend analysis, e.g., tracking mean valence of tweets by year.
  • Disentangling domain-specific focus, e.g., contrasting prevalence of body vs. cognition terms across literary history.
  • Cross-feature demographic studies, e.g., identifying associations between occupational/age demographics and cognitive verb usage.

Code samples (Python/pandas) are provided for filtering, aggregation, and visualization, including stratification by time, source, or demographic attribute.
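
For instance, a minimal sketch of a yearly valence trend on the Twitter split, assuming the split name and the timestamp / valence_avg fields match the schema described above:

import pandas as pd
from datasets import load_dataset

# Convert one source split to pandas for aggregation (field names follow the schema above).
tweets = load_dataset("jpwahle/abcde", split="twitter").to_pandas()

tweets["year"] = pd.to_datetime(tweets["timestamp"]).dt.year
yearly_valence = (tweets.dropna(subset=["valence_avg"])
                        .groupby("year")["valence_avg"]
                        .mean())
print(yearly_valence)   # mean tweet valence per year, 2015-2021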

Best practices:

  • Select presence vs. intensity features based on text length and research intent.
  • Normalize feature counts by instance length for cross-domain comparability.
  • Treat missing demographics as “unknown,” never as a negative label.
  • For longitudinal analysis, utilize timestamp or year fields for chronological stratification.
  • Lexicon-based features are robust for aggregate-level trends, but for high-precision individual prediction, fine-tuned models may be preferable.

6. Scope and Interdisciplinary Impact

By providing a unified, feature-rich annotation over diverse textual domains and timespans, ABCDE substantially lowers the barrier for affective, cognitive, sociolinguistic, and digital humanities research requiring large-scale, attribute-labeled text. The deterministic methodology and open-access infrastructure facilitate replicable, cross-disciplinary studies involving affect, cognition, and demographic processes in language (Wahle et al., 19 Dec 2025).

This resource enables, for example, computational social scientists to quantify change over time in emotional expression, track demographic-specific language trends, investigate AI-human language convergence, and conduct hypothesis-driven analysis without the confounds of manual annotation and opaque pipelines. A plausible implication is that ABCDE will stimulate more standardized, transparent research methodologies across affective sciences and neighboring fields.
