
Mental Health Reddit Wellbeing Dataset Overview

Updated 12 December 2025
  • The Mental Health Reddit Wellbeing Dataset is a large, multifaceted collection of Reddit posts covering self-reported mental states, social support dynamics, and wellness factors.
  • It employs strict privacy protocols, thorough cleaning methods, and diverse annotation frameworks (clinical, TLDR, and explanatory) to ensure data quality and ethical compliance.
  • Advanced NLP models, including transformer-based architectures, leverage this dataset to enhance prediction accuracy in symptom detection and mental health analysis.

Reddit is the source of multifaceted, large-scale datasets capturing self-narratives and peer interactions relevant to mental health and well-being. The "Mental Health Reddit Wellbeing Dataset" label is often used to refer to resources constructed for analysis of mental distress, social support, symptomatology, well-being events, and prediction tasks in these open online communities. Datasets span single posts, longitudinal sequences, annotated summaries, predictive signals, and expert-curated concept spans, and cover constructs ranging from clinical depression (DSM-5, BDI-II), lonesomeness, stress and its causal antecedents, and social support effectiveness to holistic wellness theory. Across these datasets, collection and annotation rely on strict privacy protocols, IRB-guided ethics, and substantial expert involvement to enable robust computational modeling of mental health indicators in Reddit text.

1. Principal Reddit Mental Health Wellbeing Datasets

The landscape comprises both large and targeted corpora distinguished by coverage, labels, and theoretical basis:

Dataset    | Size (posts) | Label scope                | Annotation type
MentSum    | 24,119       | Post-TLDR pairs            | User-written TLDR summary
RedditESS  | 59,666       | Effective social support   | Ensemble, expert, LLM
SMHD       | 1.3M         | Self-reported diagnosis    | Pattern-based, manual QA
Dreaddit   | 190,000      | Stress (acute/chronic)     | MTurk segment annotation
WellXplain | 3,092        | Wellness dimension (4-way) | Human span + label
Holistix   | 1,420        | Wellness dimension (6-way) | Multi-span, consensus
BeCOPE     | 10,118       | Intent, criticism, emotion | Manual + pseudo-label
CAMS       | 5,051        | Causal reason (6-way)      | Human span + category
LonXplain  | 3,521        | Lonesomeness (binary)      | Span labeled (explainable)
ReDSM5     | 1,484        | DSM-5 depression symptoms  | Sentence + rationale
ReDepress  | 2,600        | Relapse, cognitive bias    | Clinician-coded timeline

These datasets target the analysis of individual distress, peer dynamics, symptom expression, causality, and holistic wellness. Some, such as MentSum, prioritize summarization, while RedditESS systematically evaluates support effectiveness through reciprocal feedback and crowd reception (Sotudeh et al., 2022, Alghamdi et al., 27 Mar 2025).

2. Data Acquisition, Cleaning, and Privacy Safeguards

Most repositories use scraping tools (for example the Python library PRAW, or ParseHub) to extract posts and comments from specific mental-health subreddits (e.g., r/depression, r/anxiety, r/SuicideWatch, r/MentalHealth, r/PTSD). Standard cleaning steps include removing user IDs, URLs, and other PII, lowercasing, de-duplication, and discarding off-topic records (Joseph et al., 2021, Sotudeh et al., 2022, Garg, 2023, Alghamdi et al., 27 Mar 2025, Saeed et al., 6 Mar 2025). Some resources additionally filter for explicit self-disclosures (SMHD, eRisk), for post length or first-person narrative, or by time window (typically 2010–2022).
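The cleaning steps listed above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the pipeline of any specific dataset: the regular expressions, the function names `clean_post` and `clean_corpus`, and the `min_tokens` threshold are all assumptions made for the example, and the length filter is only a crude stand-in for real off-topic filtering.

```python
import re

# Illustrative patterns (assumptions, not taken from any dataset's code).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
USER_RE = re.compile(r"(?<!\w)/?u/[A-Za-z0-9_-]+")  # Reddit user mentions
WS_RE = re.compile(r"\s+")


def clean_post(text: str) -> str:
    """Strip URLs, e-mail addresses and user mentions, then lowercase."""
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = USER_RE.sub(" ", text)
    text = text.lower()
    return WS_RE.sub(" ", text).strip()


def clean_corpus(posts, min_tokens=3):
    """Clean every post, drop very short ones, and remove exact duplicates."""
    seen = set()
    cleaned = []
    for raw in posts:
        text = clean_post(raw)
        if len(text.split()) < min_tokens:
            continue  # hypothetical length filter
        if text in seen:
            continue  # exact duplicate after normalization
        seen.add(text)
        cleaned.append(text)
    return cleaned
```

Pattern-based scrubbing of this kind only catches identifiers with a predictable surface form; names, locations, and other free-text PII would need additional review or dedicated de-identification tooling.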

Ethical considerations are paramount: datasets are restricted to public posts, anonymize all identifiers, and require agreement to usage terms aligned with Reddit's policies and IRB guidance. Further measures include excluding deleted content and patient-specific information (Sotudeh et al., 2022, Garg, 2023, Mao et al., 6 Dec 2025).

3. Annotation Frameworks and Labeling Strategies

Annotation approaches reflect the theoretical underpinning and research goals:

  • Emotion, affect, and wellness: Cognitive network analysis links word-stems to affective scores and NRC emotion tags to characterize frames of feeling (Joseph et al., 2021). Wellness labels in WellXplain and Holistix follow Dunn/Hettler models, assigning posts to physical, social, intellectual, spiritual/emotional, or vocational aspects, with labeled spans justifying each tag (Garg, 2023, Shakeel et al., 13 Jul 2025).
  • Social support and peer interaction: RedditESS fuses expert annotation with ensemble heuristics, applying regular expressions and sentiment scoring for reciprocity, community feedback, and gratitude, establishing binary "effective support" (ESS) and rich taxonomies of support type (emotional, appraisal, informational, instrumental) (Alghamdi et al., 27 Mar 2025).
  • Symptoms and clinical criteria: ReDSM5 aligns each sentence of long-form posts with DSM-5 depressive symptoms and clinical justification, providing gold rationales for explanation-based modeling (Bao et al., 5 Aug 2025). SMHD annotates users for clinical diagnosis via explicit self-reporting, pairing each case with overall activity-matched controls (Chen et al., 2023).
  • Causal and cognitive markers: CAMS categorizes posts into six causal classes (abuse, relationship, alienation, medication, work, none) with span-based justifications. ReDepress annotates user timelines for cognitive dimensions—attention, memory, interpretation bias, rumination—employing clinician consensus and mapping each post to a vector of quantitative cognitive markers (Agarwal et al., 22 Sep 2025, Garg et al., 2022).
  • Event and well-being scoring: The CLPsych 2025 dataset and its CFD-enriched variant combine manual annotation of life-event taxonomies (mental/physical health, relationship, career, lifestyle, etc.) with well-being scores, using both human and multi-agent LLM consensus frameworks (Mao et al., 6 Dec 2025).

Inter-annotator agreement metrics (Cohen's/Fleiss' κ) are routinely reported, with substantial reliability in most dimensions (e.g., κ=0.74 for WellXplain's 4-class labeling, κ=0.76 for user-level relapse in ReDepress) (Garg, 2023, Agarwal et al., 22 Sep 2025, Mao et al., 6 Dec 2025).
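For readers unfamiliar with the agreement statistic, Cohen's κ for two annotators compares observed agreement with the agreement expected by chance from each annotator's label frequencies: κ = (p_o − p_e) / (1 − p_e). The sketch below computes it directly; the function name `cohens_kappa` and the toy labels in the usage note are illustrative assumptions, not data from any of the cited datasets.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label rates.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)
```

With the toy inputs `["physical", "social", "social", "emotional"]` and `["physical", "social", "emotional", "emotional"]`, observed agreement is 3/4 and chance agreement is 5/16, giving κ = 7/11 ≈ 0.64. Fleiss' κ generalizes the same idea to more than two annotators.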

4. Feature Engineering, Modeling, and Benchmarks

Feature construction incorporates linguistic (word stems, TF–IDF, embeddings), affective (valence, emotion sets), structural (co-occurrence, centrality), and cognitive cues. Model pipelines range from classic statistical methods (Logistic Regression, SVM, Random Forest) to advanced neural architectures (BiLSTM, CNN, transformer-based encoders such as BERT, ALBERT, RoBERTa, MentalBERT) (Garg, 2023, Shakeel et al., 13 Jul 2025, Chen et al., 2023).
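As an illustration of the linguistic features mentioned above, the following sketch builds L2-normalized TF–IDF vectors in pure Python and compares posts by cosine similarity. It assumes whitespace tokenization and a smoothed IDF, which are choices made for the example rather than settings reported by the cited papers; in practice a library vectorizer and a classifier such as logistic regression would typically sit on top of these features.

```python
import math
from collections import Counter


def tokenize(text):
    """Whitespace tokenization after lowercasing (an assumption)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse, L2-normalized TF-IDF vector (dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0
           for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vec = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Cosine similarity of two already-normalized sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())
```

Sparse dictionaries keep the example dependency-free; posts that share no vocabulary get a similarity of exactly zero, which is one reason dense embeddings are often used alongside TF–IDF.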

Multimodal and temporal models, such as those in Early Detection or ReDepress, fuse textual features with posting intervals, affect trajectories, and cognitive markers. Transformer-based methods generally show significant accuracy and F1 improvements over shallow models (e.g., MentalBERT F1 ≈ 0.78 for wellness dimension detection in WellXplain and Holistix; SBERT-CNN F1 = 0.86 for depression user detection) (Chen et al., 2023, Garg, 2023, Shakeel et al., 13 Jul 2025).

Ensemble labeling and attention mechanisms support nuanced understanding of context and modality importance (e.g., cross-modal attention weights text versus temporal features per instance) (Saeed et al., 6 Mar 2025).

5. Explanatory and Interpretability Mechanisms

Datasets such as WellXplain, Holistix, CAMS, and LonXplain explicitly include human-marked textual spans with character offsets, enabling sequence labeling and model interpretability. Post-hoc explanation tools such as LIME extract model rationales, and overlap metrics such as ROUGE-L and BLEU measure how closely those rationales align with expert spans, supporting transparency and trust in model outputs (Garg, 2023, Shakeel et al., 13 Jul 2025, Garg et al., 2023).
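To make the span-alignment idea concrete, the sketch below scores a model-produced rationale against an expert-marked span with a ROUGE-L style F1 based on the longest common subsequence of tokens. It assumes both spans are available as strings (for instance, sliced from the post via character offsets) and uses plain whitespace tokenization; the function names are hypothetical and the scoring is a simplified variant, not the exact evaluation script of any cited dataset.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(predicted, reference):
    """ROUGE-L style F1 between a predicted rationale and an expert span."""
    pred = predicted.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, a predicted rationale "i have no one to talk to" against the expert span "no one to talk to" shares a five-token subsequence, giving precision 5/7, recall 1, and F1 = 5/6. A rationale that over-selects text is penalized through precision, while one that misses part of the expert span is penalized through recall.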

ReDSM5 advances explainability by associating every symptom label with clinical rationale and enables explanation generation tasks (e.g., LLM semantic similarity, judge aggregation), providing a foundation for interpretable, rationale-aware mental health models (Bao et al., 5 Aug 2025).

6. Applications, Limitations, and Ethical Guidelines

Key use cases include:

Limitations reflect Reddit’s user demographics and selection bias, English-language dominance, possible annotation error (especially in pseudo-labeled and crowd-sourced components), and potential domain drift across data vintages and platforms. Researchers are cautioned that these datasets do not substitute for clinical diagnosis or therapy and must not be used for direct interventional decisions without professional oversight. Uses must strictly adhere to privacy and data-sharing statutes (Alghamdi et al., 27 Mar 2025, Garg, 2023).

7. Open Access, Licensing, and Directions for Extension

Most datasets are released under academic or open licenses (e.g., CC-BY) and follow FAIR principles, with links to repositories and data usage agreements ensuring ethical compliance. Researchers may access resources via GitHub, Hugging Face, project websites, or upon request per IRB protocol (see URLs in dataset descriptions) (Sotudeh et al., 2022, Alghamdi et al., 27 Mar 2025, Shakeel et al., 13 Jul 2025, Mao et al., 6 Dec 2025, Garg et al., 2023).

Extending the coverage and utility of Reddit mental health wellbeing datasets plausibly requires:

  • Expansion to additional diagnoses (DSM-5 anxiety/PTSD, BDI, PHQ-9, GAD-7)
  • Multi-timescale or multi-modal representations (images, voice, explicit self-report questionnaires)
  • Multilingual or cross-cultural annotation
  • Federated learning or differential privacy to further shield user identity
  • Generalization and calibration for clinical populations, potentially by integrating these posts with electronic health record-anchored surveys.

These efforts collectively position the Reddit mental health wellbeing dataset family as a central analytic substrate for computational psychiatry and digital mental health research.
