SMHD: A Large-Scale Resource for Exploring Online Language Usage for Multiple Mental Health Conditions

Published 13 Jun 2018 in cs.CL | (1806.05258v2)

Abstract: Mental health is a significant and growing public health concern. As language usage can be leveraged to obtain crucial insights into mental health conditions, there is a need for large-scale, labeled, mental health-related datasets of users who have been diagnosed with one or more of such conditions. In this paper, we investigate the creation of high-precision patterns to identify self-reported diagnoses of nine different mental health conditions, and obtain high-quality labeled data without the need for manual labelling. We introduce the SMHD (Self-reported Mental Health Diagnoses) dataset and make it available. SMHD is a novel large dataset of social media posts from users with one or multiple mental health conditions along with matched control users. We examine distinctions in users' language, as measured by linguistic and psychological variables. We further explore text classification methods to identify individuals with mental conditions through their language.


Citations (129)


Summary

The paper introduces the SMHD dataset, which uses high-precision patterns to identify users who self-report a diagnosis of one or more of nine mental health conditions.
Among the text classification methods evaluated, FastText is reported as the most effective at distinguishing the language of users with mental health conditions from that of control users.
The resource, up to two orders of magnitude larger than previous datasets, enables scalable analyses and advances automated mental health detection.


Introduction

The paper "SMHD: A Large-Scale Resource for Exploring Online Language Usage for Multiple Mental Health Conditions" (1806.05258) presents the SMHD (Self-reported Mental Health Diagnoses) dataset, a large collection of social media posts curated for studying language patterns associated with mental health conditions. The resource is designed to help researchers uncover linguistic and psychological signals that may indicate various mental health conditions, and it is labeled from users' self-reported diagnoses rather than by manual annotation.

Dataset Construction

The SMHD dataset is constructed using high-precision patterns that detect self-reported diagnoses in users' posts. This method yields high-quality labeled data without the extensive labor that manual annotation typically requires. The dataset comprises posts from users who report a diagnosis of one or more of nine mental health conditions, together with posts from matched control users. According to this summary, SMHD is up to two orders of magnitude larger than previous datasets in this research area, which makes it suitable for large-scale linguistic analysis and comparison.
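To make the idea of a self-reported diagnosis pattern concrete, the following is a minimal sketch in Python. It is not the paper's pattern set: the condition names, synonym lists, and the regular expression itself are illustrative assumptions, and a real high-precision pipeline would need many more variants and exclusion rules.

```python
import re

# Hypothetical synonym lists for illustration only; the paper's actual
# patterns and condition lexicons are not reproduced here.
CONDITION_TERMS = {
    "depression": ["depression", "major depressive disorder"],
    "adhd": ["adhd", "attention deficit hyperactivity disorder"],
    "ptsd": ["ptsd", "post-traumatic stress disorder"],
}


def build_pattern(terms):
    """Compile a first-person 'I was diagnosed with <condition>' pattern."""
    # Longest terms first so multi-word names are tried before short ones.
    alternatives = "|".join(
        re.escape(t) for t in sorted(terms, key=len, reverse=True)
    )
    return re.compile(
        r"\bI\s+(?:was|am|have\s+been|got)\s+(?:recently\s+|just\s+)?"
        r"diagnosed\s+with\s+(?:\w+\s+){0,3}?(?:" + alternatives + r")\b",
        re.IGNORECASE,
    )


PATTERNS = {name: build_pattern(terms) for name, terms in CONDITION_TERMS.items()}


def self_reported_conditions(post):
    """Return the sorted list of conditions a post self-reports, if any."""
    return sorted(name for name, pat in PATTERNS.items() if pat.search(post))
```

Requiring a first-person subject is one simple way to favor precision over recall: a post such as "My brother was diagnosed with depression" does not match, at the cost of missing some genuine self-reports phrased differently.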

Language Use Analysis

The paper examines how language use differs between users diagnosed with mental health conditions and control users. Using linguistic and psychological variables, the authors characterize the language patterns of users with specific conditions. The analysis shows which linguistic categories are used more or less often by each group, contributing to a richer understanding of how mental health manifests in language.
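A group comparison of this kind can be sketched as follows. This is only an illustration: the first-person pronoun lexicon, the tokenizer, and the function names are assumptions made for the example, and the paper's actual variables and statistical tests are not described in this summary.

```python
import re
from statistics import mean

# Hypothetical word category for illustration; not the paper's lexicon.
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}


def category_rate(text, lexicon):
    """Fraction of tokens in `text` that belong to `lexicon`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(tok in lexicon for tok in tokens) / len(tokens)


def group_mean_rate(texts, lexicon):
    """Average per-text category rate over a non-empty group of texts."""
    return mean(category_rate(t, lexicon) for t in texts)
```

Comparing `group_mean_rate` for a diagnosed group against its matched controls gives a raw difference in usage; a real analysis would also test whether that difference is statistically significant.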

Methodology for Text Classification

Among several text classification methodologies evaluated, FastText is identified as the most effective approach for distinguishing individuals with mental health conditions based on their language use. This finding underlines the importance of selecting appropriate machine learning techniques tailored to the nature of social media text data. The efficacy of FastText in this context highlights its potential utility in other applications involving large-scale text classification and feature extraction within similar datasets.
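As a rough illustration of how such a classifier might be set up, the sketch below writes labeled examples in the plain-text format that fastText's supervised mode reads, where each line begins with a `__label__` prefix. The label names, helper functions, and file layout are assumptions made for this example; the paper's preprocessing and hyperparameters are not given in this summary.

```python
def to_fasttext_line(label, text):
    """Format one example as '__label__<label> <text>' on a single line."""
    # Collapse all whitespace (including newlines) so each example stays
    # on one line, as the fastText input format expects.
    clean = " ".join(text.split())
    return f"__label__{label} {clean}"


def write_training_file(examples, path):
    """Write (label, text) pairs to `path`, one example per line."""
    with open(path, "w", encoding="utf-8") as fh:
        for label, text in examples:
            fh.write(to_fasttext_line(label, text) + "\n")
    return path
```

With the `fasttext` Python package installed, a file produced this way could be passed to `fasttext.train_supervised(input=path)` to train a classifier; that step is left out of the block so the sketch runs with the standard library alone.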

Implications and Future Work

The availability of the SMHD dataset is poised to support ongoing research by fostering reproducibility and enabling comparative studies. The dataset's size and the authors' transparent methodology allow for diverse applications, from theoretical work to the development of practical diagnostic tools. Future research may develop models that generalize better across platforms and may integrate multimodal data sources for a more holistic understanding of mental health as reflected in digital footprints.

Conclusion

The SMHD dataset provides a foundation for the computational study of mental health through language analysis. It supports a broad range of research on how mental health conditions are expressed online and can inform the development of automated tools for early detection and intervention. Its scale and its labeling method, which requires no manual annotation, make SMHD a valuable contribution to the computational social science and health informatics communities.
