Not Just Depressed: Bipolar Disorder Prediction on Reddit (1811.04655v2)

Published 12 Nov 2018 in cs.CL and cs.SI

Abstract: Bipolar disorder, an illness characterized by manic and depressive episodes, affects more than 60 million people worldwide. We present a preliminary study on bipolar disorder prediction from user-generated text on Reddit, which relies on users' self-reported labels. Our benchmark classifiers for bipolar disorder prediction outperform the baselines and reach accuracy and F1-scores of above 86%. Feature analysis shows interesting differences in language use between users with bipolar disorders and the control group, including differences in the use of emotion-expressive words.

Authors (3)

Matej Gjurković (2 papers)
Jan Šnajder (24 papers)
Ivan Sekulić (12 papers)

Citations (42)


Summary

Insights into "Not Just Depressed: Bipolar Disorder Prediction on Reddit"

This paper investigates whether bipolar disorder can be predicted from user-generated text on Reddit, using users' self-reported diagnoses as labels. The goal is to distinguish users with bipolar disorder from a control group. The work builds on earlier mental health prediction research by combining psycholinguistic and lexical features of online text. The authors' benchmark classifiers outperform the baselines and reach accuracy and F1-scores above 86%.

Methodological Approach

The paper aims to capture textual cues indicative of bipolar disorder by analyzing the words and phrases in user comments. The authors chose Reddit as a data source because of its anonymity and its large volume of mental health discussion. They built the dataset by identifying users who self-declared bipolar disorder, based on participation in specific subreddits and on comment content. To improve reliability, they pruned the dataset to mitigate bias and kept only users who had contributed more than 1000 words of text.
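The word-count filter described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `(user, text)` comment schema, the whitespace tokenization, and the function names are assumptions made for the example; only the 1000-word threshold comes from the summary.

```python
from collections import defaultdict

MIN_WORDS = 1000  # threshold mentioned in the summary


def total_words_per_user(comments):
    """Sum whitespace-separated token counts per user.

    `comments` is an iterable of (user, text) pairs (hypothetical schema).
    """
    totals = defaultdict(int)
    for user, text in comments:
        totals[user] += len(text.split())
    return dict(totals)


def users_with_enough_text(comments, min_words=MIN_WORDS):
    """Return the set of users whose combined comments exceed `min_words`."""
    totals = total_words_per_user(comments)
    return {user for user, n_words in totals.items() if n_words > min_words}


# Toy data: user_a has 1100 words across two comments, user_b has 2.
comments = [
    ("user_a", "word " * 600),
    ("user_a", "word " * 500),
    ("user_b", "short comment"),
]
kept = users_with_enough_text(comments)
```

Here `kept` contains only `user_a`, because the threshold applies to a user's combined text and not to any single comment.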

Feature Engineering and Classification Models

A central contribution of the paper is its feature set, which combines psycholinguistic variables, tf-idf-based lexical features, and additional Reddit user behavior indicators. The authors evaluate three models (a support vector machine, logistic regression, and a random forest ensemble) using nested cross-validation. Of the three, the random forest classifier achieved the best accuracy.
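Nested cross-validation separates hyperparameter tuning (inner loop) from performance estimation (outer loop), so the reported score is not inflated by tuning on the test folds. The sketch below shows the general pattern with scikit-learn on synthetic data. The hyperparameter grids, fold counts, and data are assumptions chosen for illustration, not the settings used in the paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

# Synthetic stand-in for a user-by-feature matrix (not real Reddit data).
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

# Illustrative grids; the paper's actual grids are not given in this summary.
candidates = {
    "svm": (LinearSVC(max_iter=5000), {"C": [0.1, 1.0, 10.0]}),
    "logreg": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [50, 100]},
    ),
}

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

scores = {}
for name, (model, grid) in candidates.items():
    # Inner loop: pick hyperparameters using only the outer training folds.
    search = GridSearchCV(model, grid, cv=inner_cv, scoring="f1")
    # Outer loop: estimate F1 of the whole tune-then-fit procedure.
    outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="f1")
    scores[name] = outer_scores.mean()
```

Passing the `GridSearchCV` object itself to `cross_val_score` is what makes the procedure nested: each outer fold reruns the full hyperparameter search on its own training portion.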

In the feature analysis, tf-idf features emerged as significant contributors to model performance, surpassing the psycholinguistic features derived from tools such as LIWC and Empath. This suggests that fine-grained lexical representations matter for distinguishing mental health conditions.
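For readers unfamiliar with tf-idf, the following sketch computes one common variant: raw term frequency multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term. The weighting variant, the tokenization, and the toy documents are assumptions for illustration; the summary does not state which variant the authors used.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Weight = raw term count * log(N / document frequency).
    Smoothed or normalized variants are also common.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents in which each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return weights


docs = [
    "i feel great today",
    "i feel tired today",
    "i watched the game",
]
weights = tfidf(docs)
```

A term that appears in every document, such as "i" here, receives weight zero under this variant, while a term unique to one document, such as "tired", receives the highest weight. This is why tf-idf emphasizes vocabulary that discriminates between users.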

Emotional and Linguistic Insights

The paper examines linguistic characteristics associated with bipolar disorder and reports more frequent use of first-person pronouns and emotion-laden terms in the bipolar group. These users also showed higher lexical variance over time, which may reflect the disorder's cyclical nature. Such patterns are consistent with existing psychological accounts of emotional expression in people with bipolar disorder. The higher use of positive emotion words among bipolar users also raises the question of how manic episodes manifest in text, which future explanatory models could address.
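Features of this kind are typically computed as the share of a user's tokens that fall into a word category. The sketch below illustrates the idea with two tiny hand-written word lists. These lists are placeholders invented for the example and are not the LIWC or Empath lexicons, and the regular-expression tokenizer is likewise an assumption.

```python
import re

FIRST_PERSON = {"i", "me", "my", "mine", "myself"}
# Tiny illustrative list, NOT the LIWC or Empath positive-emotion category.
POSITIVE_WORDS = {"happy", "great", "love", "good"}


def tokenize(text):
    """Lowercase the text and extract runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def rate(tokens, vocabulary):
    """Fraction of tokens that belong to `vocabulary` (0.0 for empty input)."""
    if not tokens:
        return 0.0
    return sum(token in vocabulary for token in tokens) / len(tokens)


text = "I love my new job and I feel great"
tokens = tokenize(text)
first_person_rate = rate(tokens, FIRST_PERSON)      # 3 of 9 tokens
positive_rate = rate(tokens, POSITIVE_WORDS)        # 2 of 9 tokens
```

Normalizing by token count makes the rates comparable across users who have written very different amounts of text.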

Implications and Future Directions

The study, which the authors describe as preliminary, has practical implications for early detection systems that could operate on social media text. By showing that bipolar disorder can be predicted from Reddit data, the paper points to broader applications of NLP in mental health monitoring and motivates further work on automated mental health assessment. The findings on emotional variability and the user-level analyses invite further investigation into distinguishing manic from depressive phases, which could in turn inform interventions and therapeutic strategies.

The paper provides a benchmark for future work on automatic mental health analysis and shows the value of combining psycholinguistic analysis with lexical features of online text. Future work could refine the predictive models by incorporating temporal dynamics and by examining cross-platform linguistic patterns, with the aim of improving specificity and generalizing the findings to more diverse user populations.
