- The paper demonstrates that preference-based optimization, notably ORPO with class rebalancing, outperforms traditional baselines and SFT methods.
- The study rigorously compares classical models, LoRA/QLoRA fine-tuning, and diverse optimizers, highlighting the impact of objective design and class-balancing.
- Robust evaluation across primary and secondary mental health tasks confirms that optimization strategy is crucial for clinical text classification.
Comparative Analysis of Optimization Strategies for Mental Health Text Classification
Introduction
Assessing mental health conditions from text is an established, high-impact NLP task with clinical utility in screening, triage, and risk assessment. Despite advances in model architectures, it remains unclear which adaptation and optimization strategies are most effective for such tasks. This paper provides a controlled comparative study of baseline classifiers, parameter-efficient supervised fine-tuning (SFT) with LoRA/QLoRA, and preference-optimization approaches (DPO, ORPO, KTO) for mental health text classification using the DAIC-WOZ dataset and derived PHQ-4 style screening targets (2604.00773). The work foregrounds the optimization pathway over architectural novelty, explicitly disentangling gains due to tuning, objective formulation, adaptation method, optimizer choice, and class balancing.
Baseline Establishment: Classical and Encoder Models
The study begins with classical and transformer encoder baselines. XGBoost, trained on sparse features, is compared with BERT-family transformers (BERT-base, DistilBERT, RoBERTa) across several context-window configurations. XGBoost peaks at a macro-F1 of 0.3375, while the optimized, temperature-scaled BERT-base reaches a best macro-F1 of 0.3518 and also shows superior mean performance across windows and greater stability under changing context sizes. This underscores that, for the PHQ-4 derived multi-class task, standard discriminative encoder models remain highly competitive. In contrast, the other encoders (DistilBERT, RoBERTa) exhibit greater variability and inferior peak and mean metrics, especially as context expands.
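Because every comparison here is reported in macro-F1, it may help to see how the metric is computed. The following is a minimal sketch in plain Python, not the study's evaluation code; the function name `macro_f1` and the convention that an absent class scores 0 are assumptions made for illustration.

```python
from collections import Counter


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 over the given label set."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    f1_scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # Assumption: a class with no true or predicted instances scores 0.
        f1_scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)
```

Each class contributes equally regardless of its frequency, so a model that always predicts the majority class scores poorly even when its accuracy is high. This is why macro-F1 is a common choice for imbalanced screening targets.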
Parameter-Efficient Supervised Fine-Tuning: LoRA/QLoRA
Shifting to SFT with LoRA and QLoRA adapters, the experimental matrix spans discriminative/generative objectives, three optimizers (AdamW, Adafactor, Adam8bit), and multiple output schemas (label_only, label_confidence, label_rationale). A key finding is the superior performance of generative objectives paired with LoRA/QLoRA adapters. The QLoRA vanilla generative setup attains a best macro-F1 of 0.3399, and optimized LoRA generative with Adafactor yields high mean stability. Notably, adapter choice does not yield dramatic changes in performance when compared to the influence of objective and optimizer interactions. Control of chunk/window parameters is crucial: mid-to-large windows (384–512) are consistently superior, whereas output schemas providing auxiliary confidence or rationale tend to introduce variance and degrade stability.
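The window sizes discussed above imply that long transcripts are split into bounded chunks before modeling. A minimal sketch of one common way to do this follows, assuming a pre-tokenized sequence and overlapping windows. The function name `chunk_tokens` and the `stride` parameter are hypothetical; the study's actual chunking procedure is not described here.

```python
def chunk_tokens(tokens, window=512, stride=256):
    """Split a token sequence into overlapping chunks of at most `window` tokens."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if not tokens:
        return []
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        # Stop once the current chunk reaches the end of the sequence.
        if start + window >= len(tokens):
            break
        start += stride
    return chunks
```

For example, `chunk_tokens(list(range(1000)), window=384, stride=192)` produces chunks that each overlap the previous one by 192 tokens, with a shorter final chunk.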
Preference-Based Optimization: DPO, ORPO, KTO
The preference optimization stage examines whether post-SFT preference-based methods confer robust additional improvements. Three approaches are examined: DPO, ORPO, and KTO.
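For orientation, the per-pair DPO and ORPO losses can be sketched in plain Python from the formulations in the original DPO and ORPO papers. This is an illustrative sketch, not the study's training code: the function names and the default values of `beta` and `lam` are assumptions, and KTO is omitted because its objective works on unpaired examples and needs additional reference terms.

```python
import math


def neg_log_sigmoid(z):
    """Numerically stable -log(sigmoid(z)) = log(1 + exp(-z))."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1):
    """DPO loss for one pair, from sequence log-probs of the chosen (w) and
    rejected (l) responses under the policy and a frozen reference model."""
    margin = (pol_w - ref_w) - (pol_l - ref_l)
    return neg_log_sigmoid(beta * margin)


def log_odds(avg_logp):
    """log(p / (1 - p)) for p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(avg_logp_w, avg_logp_l, lam=0.1):
    """ORPO loss for one pair: NLL on the chosen response plus a weighted
    odds-ratio penalty. Inputs are length-normalized log-probs; no reference
    model is needed."""
    sft_term = -avg_logp_w
    or_term = neg_log_sigmoid(log_odds(avg_logp_w) - log_odds(avg_logp_l))
    return sft_term + lam * or_term
```

The structural difference is visible in the signatures: DPO compares the policy against a reference model, whereas ORPO folds the preference signal into the supervised loss itself.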
Results indicate strong method-level variance. In vanilla setups, ORPO substantially outperforms DPO (macro-F1: 0.3181 vs 0.2474), with KTO consistently underperforming (0.1200). However, applying class rebalancing during preference training yields substantial gains for ORPO and DPO. Rebalanced ORPO achieves a macro-F1 of 0.3798, the highest among all methods and profiles, while rebalanced DPO with QLoRA closes much of the gap (0.3493). KTO remains unaffected by rebalancing, which points to limitations intrinsic to the objective for this task. These results demonstrate that preference-based optimization requires explicit attention to class distribution: the gains are neither automatic nor uniform across objectives.
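The rebalancing procedure is not detailed here, so the following is only a sketch of one plausible approach: oversampling preference pairs from under-represented classes until every class matches the largest one. The pair format (a dict with a `label` key) and the function name `rebalance_pairs` are hypothetical.

```python
import random
from collections import defaultdict


def rebalance_pairs(pairs, label_key="label", seed=0):
    """Oversample preference pairs so each class has as many pairs as the largest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for pair in pairs:
        by_class[pair[label_key]].append(pair)
    if not by_class:
        return []
    target = max(len(group) for group in by_class.values())
    balanced = []
    for label in sorted(by_class):
        group = by_class[label]
        balanced.extend(group)
        # Draw extra copies with replacement until the class reaches the target size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced
```

Oversampling keeps every original pair, at the cost of repeating minority-class examples. Undersampling the majority class or reweighting the loss are alternatives with different trade-offs.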
Secondary Transfer: Derived Anxiety and Depression Tasks
To assess clinical relevance beyond the 4-class setup, binary anxiety and depression outcomes are derived from the same prediction outputs and evaluated. The method-level pattern from the main task largely persists, although the top position differs: optimized BERT leads here (macro-F1 0.7199 and 0.7043 for anxiety and depression, respectively), followed by vanilla SFT and rebalanced ORPO, whereas rebalanced ORPO had the highest macro-F1 on the 4-class task. Preference optimization methods exhibit the same method dependency, with ORPO robust across both secondary outcomes and KTO remaining ineffective.
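Deriving binary outcomes from multi-class predictions amounts to collapsing the label set. The sketch below shows the mechanics only. The class encoding is a purely hypothetical assumption, since the study's 4-class definition is not given here, and the names `derive_binary`, `ANXIETY_POSITIVE`, and `DEPRESSION_POSITIVE` are illustrative.

```python
def derive_binary(preds, positive_classes):
    """Map multi-class predictions to 0/1 given the classes counted as positive."""
    return [1 if p in positive_classes else 0 for p in preds]


# Hypothetical encoding, NOT taken from the study:
# 0 = neither, 1 = anxiety only, 2 = depression only, 3 = both.
ANXIETY_POSITIVE = {1, 3}
DEPRESSION_POSITIVE = {2, 3}
```

Under this assumed encoding, `derive_binary(preds, ANXIETY_POSITIVE)` and `derive_binary(preds, DEPRESSION_POSITIVE)` give the two binary label sequences, which can then be scored with the same macro-F1 metric as the main task.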
Implications and Theoretical Perspectives
The central implication is that optimization strategy matters more than raw architectural choice for robust mental health NLP. While strong baselines (BERT, XGBoost) provide a competitive foundation, parameter-efficient fine-tuning (PEFT) methods (LoRA/QLoRA) can close the gap under the right objective and optimizer regimes but do not universally dominate. Most decisively, preference-based optimization is not uniformly beneficial: gains are highly contingent on objective design (ORPO > DPO ≫ KTO) and on class-balance intervention. This contradicts the assumption that preference learning directly and uniformly strengthens clinical text classification. The results call for rigorous ablation and evaluation of optimization pathways, not just architecture selection.
Practically, the study recommends initializing with transparent, interpretable baselines, conducting controlled SFT with explicit profiling of objectives and optimizers, and deploying preference optimization only after demonstrated, method-consistent improvement—especially with explicit class balancing in the data regime. These findings are consistent with and extend the evidence base regarding the context- and objective-sensitivity of mental health NLP [jmirai2026llmhealth, 11106680].
Future Directions
This framework motivates further investigation of (1) preference-learning methods that are inherently robust to class imbalance, (2) transfer and generalization under more severe data scarcity, and (3) cross-domain application to other clinically sensitive NLP tasks. Additionally, nuanced evaluation protocols, as utilized here (reporting both peaks and means, analyzing window and chunk effects), should be extended to more complex, multi-task, and federated mental health diagnostics.
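The peak-and-mean reporting protocol mentioned above can be sketched as a small helper that summarizes a metric across context-window sizes. This is an illustrative sketch, assuming scores are stored in a dict keyed by window size; the function name `summarize_windows` and the returned field names are hypothetical.

```python
from statistics import mean, pstdev


def summarize_windows(scores_by_window):
    """Return peak, mean, and spread of a metric across context-window sizes."""
    if not scores_by_window:
        raise ValueError("no scores provided")
    best_window = max(scores_by_window, key=scores_by_window.get)
    values = list(scores_by_window.values())
    return {
        "peak": scores_by_window[best_window],
        "peak_window": best_window,
        "mean": mean(values),
        "std": pstdev(values),  # population standard deviation across windows
    }
```

Reporting the mean and spread alongside the peak makes it harder for a single favorable window setting to dominate a comparison between methods.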
Conclusion
This work provides an empirical optimization roadmap for mental health text classification. Model family selection matters less than rigorous analysis of the optimization pathway. On the primary task, preference-based training with ORPO and class rebalancing is the only approach that clearly surpasses the best encoder and SFT configurations. Other preference methods and unbalanced variants offer no consistent gain and can substantially underperform. Such ablation-driven methodology is critical for supplying actionable model selection strategies and for the continued clinical translation of NLP pipelines in sensitive domains (2604.00773).