Chinese MentalBERT: Domain-Adaptive Pre-training on Social Media for Chinese Mental Health Text Analysis

Published 14 Feb 2024 in cs.CL and cs.LG (arXiv:2402.09151v2)

Abstract: In the current environment, psychological issues are prevalent and widespread, with social media serving as a key outlet for individuals to share their feelings. This results in the generation of vast quantities of data daily, where negative emotions have the potential to precipitate crisis situations. There is a recognized need for models capable of efficient analysis. While pre-trained LLMs have demonstrated their effectiveness broadly, there's a noticeable gap in pre-trained models tailored for specialized domains like psychology. To address this, we have collected a huge dataset from Chinese social media platforms and enriched it with publicly available datasets to create a comprehensive database encompassing 3.36 million text entries. To enhance the model's applicability to psychological text analysis, we integrated psychological lexicons into the pre-training masking mechanism. Building on an existing Chinese LLM, we performed adaptive training to develop a model specialized for the psychological domain. We evaluated our model's performance across six public datasets, where it demonstrated improvements compared to eight other models. Additionally, in the qualitative comparison experiment, our model provided psychologically relevant predictions given the masked sentences. Due to concerns regarding data privacy, the dataset will not be made publicly available. However, we have made the pre-trained models and codes publicly accessible to the community via: https://github.com/zwzzzQAQ/Chinese-MentalBERT.

Summary

  • The paper presents a domain-adaptive pre-training approach using a depression lexicon and guided masking to improve Chinese mental health text analysis.
  • It further pre-trains Chinese-BERT-wwm-ext on 3.36 million Chinese social media posts, achieving an F1-score improvement of approximately 2% over previous baselines.
  • The study sets a benchmark for mental health NLP by demonstrating significant quantitative gains and qualitative improvements in detecting psychological cues.

Introduction

The study "Chinese MentalBERT: Domain-Adaptive Pre-training on Social Media for Chinese Mental Health Text Analysis" (2402.09151) addresses a critical niche in natural language processing by focusing on the mental health domain, specifically targeting Chinese social media text. This work seeks to bridge the gap between general-purpose pre-trained models and the nuanced requirements of psychological text analysis. By leveraging a substantial dataset sourced from Chinese social media and integrating a specialized lexicon, this model aims to more accurately detect and interpret psychological signals within text data.

Methodology

A significant portion of the paper is dedicated to the domain-adaptive pre-training process behind Chinese MentalBERT. The process begins with Chinese-BERT-wwm-ext as the foundational model, which is then further pre-trained on data from various Chinese social media sources covering a wide range of mental health expressions (Figure 1). To strengthen domain relevance, a guided masking strategy informed by a depression lexicon is employed. This strategy biases the learning process towards key vocabulary in the psychological domain, allowing the model to capture subtle emotional nuances often missed by general-purpose LLMs.

Figure 1: Overview of the domain-adaptive pretraining process. The process initiates with the foundational pretrained LLM (Chinese-BERT-wwm-ext), followed by further pretraining on 3.36 million mental health tweets/comments sourced from social media. The pretraining phase integrates knowledge from a depression lexicon to guide the masking process.
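The lexicon-guided masking can be pictured with a short sketch. The snippet below is a minimal illustration under stated assumptions, not the authors' released implementation: the lexicon file name, the priority heuristic, and the 15% masking budget are assumptions, while the tokenizer corresponds to the foundational model named above.

```python
import random
from transformers import AutoTokenizer

# Tokenizer of the foundational model (Chinese-BERT-wwm-ext).
tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-bert-wwm-ext")

def load_lexicon(path="depression_lexicon.txt"):
    """Hypothetical lexicon file: one depression-related term per line."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def guided_mask(text, lexicon, mask_prob=0.15):
    """Mask lexicon-related tokens first; spend any leftover budget randomly."""
    tokens = tokenizer.tokenize(text)
    labels = [-100] * len(tokens)          # -100 = position ignored by the MLM loss
    budget = max(1, int(len(tokens) * mask_prob))

    # 1) Positions whose token occurs inside a lexicon entry get priority.
    priority = [i for i, tok in enumerate(tokens)
                if any(tok.lstrip("#") in term for term in lexicon)]
    random.shuffle(priority)

    # 2) Any remaining budget falls back to ordinary random masking.
    rest = [i for i in range(len(tokens)) if i not in priority]
    random.shuffle(rest)

    for i in (priority + rest)[:budget]:
        labels[i] = tokenizer.convert_tokens_to_ids(tokens[i])
        tokens[i] = tokenizer.mask_token
    return tokens, labels
```

This bias towards lexicon terms is what distinguishes the guided strategy from the random whole-word masking used in the original Chinese-BERT-wwm-ext pre-training.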

Experimental Results

The paper presents extensive evaluations of Chinese MentalBERT on six public mental health datasets. The results indicate that the model consistently surpasses general LLMs as well as domain-specific baselines from fields such as finance and medicine. Notably, it achieves superior performance on cognitive distortion multi-label classification, suicide risk classification, and sentiment analysis tasks. The guided masking strategy also yields a clear gain over conventional random masking, highlighting the benefit of tailoring the pre-training process to a specific lexical domain within Chinese text; overall, the model shows an F1-score improvement of approximately 2% over previous baselines.
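As a usage note, the released checkpoint can be dropped into a standard sequence-classification setup and scored with macro-F1, as in the hedged sketch below. The Hugging Face hub ID, the two-label configuration, and the toy sentences are assumptions for illustration; the actual checkpoints are linked from the authors' GitHub repository, and the classification head still requires task-specific fine-tuning before the reported numbers are reachable.

```python
import torch
from sklearn.metrics import f1_score
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "zwzzzQAQ/Chinese-MentalBERT"   # assumed hub ID; see the authors' repository
tokenizer = AutoTokenizer.from_pretrained(model_id)
# num_labels=2 is an assumption; the classification head is freshly initialized.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)
model.eval()

texts = ["我最近什么都不想做",   # "Lately I don't feel like doing anything" (toy example)
         "今天心情很好"]          # "I'm in a good mood today" (toy example)
labels = [1, 0]

enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    preds = model(**enc).logits.argmax(dim=-1).tolist()

print("macro-F1 on toy data:", f1_score(labels, preds, average="macro"))
```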

Qualitative Evaluation

Beyond quantitative analysis, the study explores qualitative evaluation to further establish the model's efficacy. The emphasis is on evaluating the ability of the model to predict psychologically relevant masked words, especially words indicating emotional distress or cognitive distortions. The results provide compelling evidence that the lexicon-guided strategy significantly refines the model's predictions of such masked words, aligning them more closely with the psychological undertones present in context, compared to the outputs from models applying random masking.
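This probe can be reproduced with the standard fill-mask pipeline, as in the minimal sketch below; the hub ID is assumed and the example sentence is invented for illustration rather than taken from the paper.

```python
from transformers import pipeline

# Assumed hub ID; a locally downloaded copy of the checkpoint works the same way.
fill = pipeline("fill-mask", model="zwzzzQAQ/Chinese-MentalBERT")

# Illustrative input: "I feel worthless; every day I am very [MASK]."
for cand in fill("我觉得自己一无是处，每天都很[MASK]。", top_k=5):
    print(cand["token_str"], round(cand["score"], 3))
```

A domain-adapted model is expected to fill the mask with distress-related terms more often than a general-purpose model would, which is exactly what the qualitative comparison examines.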

Implications and Future Work

The implications of this research stretch across both practical applications in mental health interventions and theoretical advancements in domain-specific model training. By making pre-trained models and code publicly accessible, this work facilitates further research and development in the domain of Chinese mental health. Future developments could see the application of similar strategies across other languages and specific domains, potentially expanding the scope of domain-adaptive pre-training methodologies to new tasks and datasets, thereby enriching AI's capacity to support mental health initiatives globally.

Conclusion

The paper presents a meticulous approach to developing a pre-trained LLM tailored for the Chinese mental health domain. By integrating domain-specific data and adopting a novel lexicon-guided masking mechanism, Chinese MentalBERT sets a new benchmark in psychological text analysis, providing valuable insights into the early detection and understanding of mental health signals from social media data. This work represents a significant step towards improving the applicability of AI in sensitive domains, paving the way for more personalized and accurate interventions.
