- The paper presents a framework and methodology for automating the analysis of Problem-Solving Therapy (PST) sessions, applying large language models (LLMs) and fine-tuned transformer models to a large dataset of annotated therapy transcripts.
- In zero-shot evaluation, proprietary large language models such as GPT-4o achieve strong performance, with a weighted F1 score of 0.76 for strategy annotation, and produce consistent annotations, although performance drops when conversational context is provided.
- Fine-tuned transformer models like MentalBERT and DeBERTa also demonstrate competitive performance (combined F1 ~0.68), highlighting their potential as accessible alternatives for scalable and efficient automated analysis to support real-time clinical applications in mental health.
The paper presents a comprehensive examination of automating the analysis of Problem-Solving Therapy (PST) by leveraging both large language models (LLMs) and fine-tuned transformer-based architectures. The study is grounded in the detailed annotation of anonymized therapy transcripts: 240 PST sessions yielding 68,306 dialogue exchanges, from which 14,417 therapist utterances were extracted for analysis. A dual-dimensional coding framework is introduced that comprises traditional PST strategies (termed the "PS core" dimension) and an augmented set of communication strategies, thereby enriching the characterization of therapist-client interactions.
The methodology is two-fold. First, the paper evaluates the performance of several LLMs, including two proprietary models (GPT-4 and GPT-4o) and two open-source models (Llama-3.1 and Yi-1.5), using a zero-shot prompting paradigm with and without additional conversational context. Notably, GPT-4o achieved a weighted F1 score of 0.76 in the "no context" condition, outperforming its counterparts; adding context actually reduced performance (for example, GPT-4o's F1 dropped to 0.61 when supplementary dialogue was included). Annotation consistency was quantified via the entropy of each prediction,

$$u_i = -\sum_{j=1}^{k} P_\theta(a_{ij} \mid p_{ij}) \ln P_\theta(a_{ij} \mid p_{ij}),$$

where $P_\theta(a_{ij} \mid p_{ij})$ denotes the probability of a given prediction. These calculations revealed that GPT-4o provided reliable and consistent annotations, with a low mean entropy (0.035), despite exhibiting slightly higher variability than some open-source models.
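The entropy measure above is straightforward to compute. A minimal sketch in plain Python (the probability vectors below are illustrative, not values from the paper):

```python
import math

def prediction_entropy(probs):
    """Shannon entropy (in nats) of a model's probability distribution
    over k candidate strategy labels for a single utterance.
    Lower entropy indicates a more confident, consistent annotation."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# A near-deterministic prediction yields low entropy...
confident = [0.97, 0.01, 0.01, 0.01]
# ...while a uniform distribution over k labels yields the maximum, ln(k).
uncertain = [0.25, 0.25, 0.25, 0.25]

print(prediction_entropy(confident))
print(prediction_entropy(uncertain))  # ln(4) ≈ 1.386
```

Averaging this quantity over all annotated utterances gives a mean entropy like the 0.035 the paper reports for GPT-4o.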
Second, the study explores fine-tuning three transformer-based models—DeBERTa, MentalBERT, and FLAN-T5—on a subset of 5,000 LLM-annotated therapist utterances with evaluation conducted on 500 human-annotated examples. Here, MentalBERT achieved an F1 score of 0.78 for the PS core strategies, while DeBERTa performed best on communication strategies (F1 = 0.73). Overall, these fine-tuned models reached combined F1 scores around 0.68, indicating that while proprietary LLMs provide strong performance in zero-shot settings, fine-tuning domain-adapted models can yield comparable and potentially more accessible alternatives for sensitive healthcare applications.
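Since all results are reported as weighted F1 scores, it may help to recall how that metric is computed: per-class F1 values are averaged, weighted by each class's support in the gold labels. A minimal stdlib sketch (in practice one would use a library implementation such as scikit-learn's):

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Weighted F1: per-class F1, averaged with weights equal to
    each class's share of the gold (true) labels."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        score += (n / total) * f1
    return score
```

Weighting by support matters here because strategy labels in therapy transcripts are imbalanced (e.g., "Defining Problems and Goals" is far more frequent than other strategies).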
Additional contributions include:
- Annotation Framework and Strategy Codebook:
The paper details a coding scheme based on the ADAPT model of PST that delineates five core steps (ranging from establishing a positive mindset to trying out solution plans), while also introducing new categories to capture nuances of interpersonal communication, such as session management and therapeutic engagement.
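A codebook like this is naturally represented as a small mapping from dimension to label set. The sketch below is illustrative: the two labels named in the text ("Session Management," "Therapeutic Engagement") and the PS core steps named elsewhere in this summary are taken from the paper, while the middle step names follow the standard ADAPT acronym and are assumptions, not the paper's exact wording:

```python
# Hypothetical encoding of the dual-dimensional codebook.
# PS core steps follow the ADAPT model; the "Predicting and Selecting
# Solutions" label is an illustrative placeholder.
CODEBOOK = {
    "ps_core": [
        "Establishing a Positive Mindset",    # A - Attitude
        "Defining Problems and Goals",        # D - Define
        "Generating Alternative Solutions",   # A - Alternatives
        "Predicting and Selecting Solutions", # P - Predict (assumed label)
        "Trying Out Solution Plans",          # T - Try out
    ],
    "communication": [
        "Session Management",
        "Therapeutic Engagement",
    ],
}

def lookup_dimension(strategy):
    """Return the coding dimension a strategy label belongs to, or None."""
    for dimension, labels in CODEBOOK.items():
        if strategy in labels:
            return dimension
    return None
```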
- Linguistic Feature Analysis:
Using the Linguistic Inquiry and Word Count (LIWC) tool, the paper analyzes the lexical and psycholinguistic characteristics of each therapeutic strategy. For instance, "Defining Problems and Goals" is most prevalent, supported by LIWC features such as "reward," while "Generating Alternative Solutions" is associated with indicators of insight and curiosity. Bigrams extracted from the data further clarify the linguistic markers that differentiate each strategy.
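The bigram analysis amounts to counting adjacent word pairs per strategy and ranking them by frequency. A rough sketch, using invented utterances rather than transcripts from the dataset:

```python
from collections import Counter

def top_bigrams(utterances, n=5):
    """Count adjacent word pairs across a list of utterances and
    return the n most frequent, as (bigram, count) tuples."""
    counts = Counter()
    for text in utterances:
        tokens = text.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(n)

# Illustrative (invented) therapist utterances.
sample = [
    "can we define the problem together",
    "what is the problem you want to solve",
]
print(top_bigrams(sample, n=3))  # ("the", "problem") appears in both
```

Running this per strategy label, as the paper does, surfaces the pairs that distinguish, say, problem definition from solution generation.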
- Progression Across Therapy Sessions:
Analysis of the distribution of strategies across multiple therapy visits reveals a logical evolution in therapist behavior. Early sessions focus on establishing mindset and problem definition; subsequent sessions emphasize the exploration of alternatives and actionable planning; and later phases see an increased concentration on implementing and testing solutions.
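The progression analysis reduces to computing, for each visit, the proportion of utterances assigned to each strategy. A minimal sketch with toy (invented) annotations:

```python
from collections import Counter, defaultdict

def strategy_distribution(annotations):
    """Given (session_index, strategy) pairs, return the proportion
    of each strategy within each session."""
    per_session = defaultdict(Counter)
    for session, strategy in annotations:
        per_session[session][strategy] += 1
    return {
        session: {s: c / sum(counts.values()) for s, c in counts.items()}
        for session, counts in per_session.items()
    }

# Toy data: an early visit dominated by problem definition,
# a later visit dominated by trying out solutions.
toy = [
    (1, "Defining Problems and Goals"),
    (1, "Defining Problems and Goals"),
    (1, "Generating Alternative Solutions"),
    (3, "Trying Out Solution Plans"),
    (3, "Trying Out Solution Plans"),
]
dist = strategy_distribution(toy)
```

Plotting these per-session proportions over visit index yields the progression pattern the paper describes.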
- Practical Significance and Limitations:
The paper discusses the potential of integrating automated PST dialogue analysis into real-time therapeutic support systems, which may enhance clinical documentation and decision-making. However, limitations are noted regarding the focus on text-only analysis, potential biases in LLM training data, and the restriction to English-language transcripts, suggesting careful consideration when extending these methods to diverse cultural and clinical contexts.
Overall, the study demonstrates that leveraging LLMs can offer scalable and efficient annotation of therapeutic dialogues, ultimately supporting more precise, data-driven mental health interventions while also highlighting the need for further improvements (particularly in the domain of interpersonal communication) within open-source models and fine-tuning strategies.