DepFlow: Depression-Conditioned TTS Framework
- The DepFlow TTS framework is a three-stage system that integrates a depression acoustic encoder, flow-matching synthesis, and prototype-based severity mapping for controlled speech modulation.
- It employs FiLM-based conditioning to decouple depression cues from linguistic sentiment and speaker identity, ensuring precise, attribute-specific control.
- The CDoA augmentation procedure improves depression detection macro-F1 by up to 12% (relative), supporting its robustness against spurious acoustic-semantic correlations.
DepFlow is a three-stage depression-conditioned text-to-speech (TTS) framework that generates speech with controllable depressive severity and is designed to be robust to spurious correlations between linguistic sentiment and clinical depression labels. Widely used depression datasets such as DAIC-WOZ exhibit a strong coupling between sentiment and diagnostic labels. To address this, DepFlow (i) disentangles depression-relevant acoustic attributes from speaker and content variables, (ii) implements a flow-matching TTS pipeline that controls depression severity through FiLM-based conditioning, and (iii) uses a prototype-based severity mapping for smooth manipulation along the depression continuum. The system further enables construction of the Camouflage Depression-oriented Augmentation (CDoA) dataset, which introduces mismatched acoustic-semantic pairings to improve robustness in clinical depression detection (Li et al., 1 Jan 2026).
1. Depression Acoustic Encoder (DAE)
The DAE takes a raw speech utterance, downsampled to 22.05 kHz, and extracts per-frame features with a frozen WavLM-Large module. The features pass through a linear projection with ReLU and dropout, followed by attention-based statistical pooling that computes a weighted mean ($\mu$) and standard deviation ($\sigma$) over frames and concatenates them into an utterance-level statistics vector $[\mu; \sigma]$.
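The pooling step can be illustrated with a minimal NumPy sketch of attention-based statistical pooling. The attention parameterization (`w`, `v`, a tanh scorer) and all dimensions are assumptions chosen for illustration; the source does not specify them.

```python
import numpy as np

def attentive_stats_pooling(frames, w, v):
    """Pool per-frame features (T, D) into a (2*D,) vector [mu; sigma].

    w (D, A) and v (A,) are hypothetical attention parameters.
    """
    scores = np.tanh(frames @ w) @ v                 # one score per frame, shape (T,)
    scores = scores - scores.max()                   # stabilize the softmax
    alpha = np.exp(scores) / np.exp(scores).sum()    # attention weights, sum to 1
    mu = alpha @ frames                              # weighted mean, shape (D,)
    var = alpha @ (frames ** 2) - mu ** 2            # weighted variance
    sigma = np.sqrt(np.clip(var, 1e-8, None))        # weighted standard deviation
    return np.concatenate([mu, sigma])

rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 16))                   # stand-in for projected WavLM frames
stats = attentive_stats_pooling(frames, rng.normal(size=(16, 8)), rng.normal(size=8))
# With zero attention parameters the weights are uniform,
# so the result reduces to the plain mean and standard deviation.
uniform = attentive_stats_pooling(frames, np.zeros((16, 8)), np.zeros(8))
```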
A shared depression embedding is produced using a multi-layer perceptron (structure: FC → LayerNorm → SiLU → dropout → FC). Four downstream heads are attached:
- Ordinal-regression head: predicts PHQ-8 severity over five ordered levels (4 thresholds), trained with binary cross-entropy over monotonic thresholds.
- Speaker-ID head (non-adversarial): classifies speaker identity on an L2-normalized embedding.
- Speaker-adversarial head: predicts speaker ID through a Gradient Reversal Layer (GRL) for adversarial speaker invariance.
- Content-adversarial head: predicts HuBERT-based pseudo-phoneme classes through a GRL for content disentanglement.
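The combined objective is a weighted sum of the four head losses:
$$\mathcal{L}_{\text{DAE}} = \lambda_{\text{ord}}\,\mathcal{L}_{\text{ord}} + \lambda_{\text{spk}}\,\mathcal{L}_{\text{spk}} + \lambda_{\text{adv-spk}}\,\mathcal{L}_{\text{adv-spk}} + \lambda_{\text{adv-con}}\,\mathcal{L}_{\text{adv-con}},$$
with one scalar weight per head. The two adversarial terms are backpropagated through gradient reversal, so minimizing the sum pushes the encoder toward speaker and content invariance.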
Disentanglement is evidenced empirically by EER=0.355, similarity gap=0.27 (speaker), MSE=2.83, CKA=0.014 (content), and ROC-AUC=0.693 for depression classification.
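The ordinal-regression head described above can be sketched as four cumulative binary targets, one per threshold. The cut-points below are the conventional PHQ-8 severity bands (0-4, 5-9, 10-14, 15-19, 20-24) and are an assumption; the source states only that there are 4 thresholds.

```python
import numpy as np

# Assumed cut-points between the five conventional PHQ-8 severity bands.
THRESHOLDS = np.array([5, 10, 15, 20])

def ordinal_targets(phq):
    """Encode PHQ-8 scores as cumulative binary targets: target[j] = 1 if phq >= THRESHOLDS[j]."""
    return (np.asarray(phq)[..., None] >= THRESHOLDS).astype(float)

def ordinal_bce(logits, targets):
    """Numerically stable binary cross-entropy with logits, averaged over all thresholds."""
    return float(np.mean(np.maximum(logits, 0) - logits * targets
                         + np.log1p(np.exp(-np.abs(logits)))))

def decode_level(logits):
    """Predicted severity level = number of thresholds whose logit is positive."""
    return (np.asarray(logits) > 0).sum(axis=-1)

targets = ordinal_targets([3, 12, 22])       # three example scores
confident_logits = 10.0 * (2 * targets - 1)  # logits that agree strongly with the targets
levels = decode_level(confident_logits)
loss = ordinal_bce(confident_logits, targets)
```

Counting positive logits is one simple decoding rule; it does not by itself force the threshold outputs to be monotonic.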
2. Flow-Matching TTS Synthesis and FiLM-Based Depression Control
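The TTS subsystem employs a Matcha-TTS backbone, which generates mel-spectrograms by numerically solving an ODE from Gaussian noise to data space under flow-matching principles. For a ground-truth mel $x_1$, a noise sample $x_0 \sim \mathcal{N}(0, I)$, and a time-varying mixing coefficient $t \in [0, 1]$, the interpolation is $x_t = (1 - t)\,x_0 + t\,x_1$, with target velocity $u_t = x_1 - x_0$. The model minimizes:
$$\mathcal{L}_{\text{FM}} = \mathbb{E}_{t,\,x_0,\,x_1}\left\| v_\theta(x_t, t) - (x_1 - x_0) \right\|^2$$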
Prior and duration losses are combined with the flow-matching loss in a weighted sum to form the overall TTS training objective.
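The flow-matching construction can be sketched in a few lines of NumPy. The sketch assumes a plain linear path between noise and data with a constant target velocity; the conditional flow-matching formulation in Matcha-TTS also includes a small minimum-noise term, which is omitted here. The function names and the "oracle" velocity field are hypothetical stand-ins for the trained network.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Velocity of the straight path; constant in t."""
    return x1 - x0

def flow_matching_loss(v_pred, x0, x1):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((v_pred - target_velocity(x0, x1)) ** 2))

def euler_solve(velocity_fn, x0, n_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x, dt = x0.copy(), 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

rng = np.random.default_rng(0)
x1 = rng.normal(size=(80, 120))          # stand-in for a ground-truth mel (bins x frames)
x0 = rng.normal(size=x1.shape)           # Gaussian noise sample

# An oracle that returns the exact target velocity; a trained network would replace it.
oracle = lambda x, t: target_velocity(x0, x1)
x_gen = euler_solve(oracle, x0)
```

With the exact velocity field, Euler integration recovers `x1` from `x0`, which is the behavior the trained model approximates at synthesis time.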
FiLM-based depression conditioning maps the 32-dim depression embedding through a FiLM generator MLP to produce scaling ($\gamma$) and bias ($\beta$) parameters for each decoder block, which modulate the block activations $h$:
$$\text{FiLM}(h) = \gamma \odot h + \beta$$
The method enables control over depressive severity while preserving phoneme content and speaker identity, with a synthetic-speech WER of 13.93% ± 0.23%, comparable to the 14.06% measured on natural DAIC-WOZ speech.
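A minimal sketch of FiLM modulation for one decoder block follows. Only the 32-dim embedding size comes from the text; the generator's hidden width, the channel count, and the two-layer MLP shape are assumptions for illustration.

```python
import numpy as np

def film_generator(dep_emb, w1, b1, w2, b2):
    """Hypothetical two-layer MLP mapping a depression embedding to (gamma, beta)."""
    hidden = np.maximum(dep_emb @ w1 + b1, 0.0)       # ReLU hidden layer
    gamma, beta = np.split(hidden @ w2 + b2, 2)       # first half scale, second half bias
    return gamma, beta

def film(h, gamma, beta):
    """Apply per-channel scale and bias to activations h of shape (channels, frames)."""
    return gamma[:, None] * h + beta[:, None]

rng = np.random.default_rng(0)
channels = 64                                         # assumed decoder-block width
dep_emb = rng.normal(size=32)                         # 32-dim depression embedding
w1, b1 = rng.normal(size=(32, 128)) * 0.1, np.zeros(128)
w2, b2 = rng.normal(size=(128, 2 * channels)) * 0.1, np.zeros(2 * channels)

gamma, beta = film_generator(dep_emb, w1, b1, w2, b2)
h = rng.normal(size=(channels, 100))                  # stand-in block activations
h_mod = film(h, gamma, beta)
```

Because the modulation is a per-channel affine map, setting `gamma` to ones and `beta` to zeros leaves the activations unchanged, so the conditioning path can act as a no-op.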
3. Prototype-Based Severity Mapping for Controllable Synthesis
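For smooth depression severity control, DepFlow introduces a prototype-based interpolation mechanism. Per-speaker embeddings are averaged to subject-level vectors $\bar{e}_s$, grouped by PHQ-8 bin, and averaged within each bin $\mathcal{S}_k$ to give one prototype per bin:
$$p_k = \frac{1}{|\mathcal{S}_k|}\sum_{s \in \mathcal{S}_k} \bar{e}_s$$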
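A continuous severity scalar $s$ is mapped to its two adjacent prototype bins $p_k$ and $p_{k+1}$ with interpolation weight $\alpha \in [0, 1]$. Spherical linear interpolation (SLERP) is used:
$$e(s) = \frac{\sin\big((1-\alpha)\,\Omega\big)}{\sin\Omega}\,p_k + \frac{\sin(\alpha\,\Omega)}{\sin\Omega}\,p_{k+1}, \qquad \Omega = \arccos\frac{\langle p_k, p_{k+1}\rangle}{\|p_k\|\,\|p_{k+1}\|}$$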
Severity control is measured by a Concordance Index of 0.744 together with Spearman's rank correlation. Acoustic measures that change consistently with severity include formant frequencies (median F1 and F2), the silence–speech ratio, and other paralinguistic cues.
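The severity mapping can be sketched as a lookup of the two neighboring prototypes followed by SLERP. The bin centers, the random unit-norm prototypes, and the rule for deriving the interpolation weight from the bin centers are all assumptions made for this illustration.

```python
import numpy as np

def slerp(p, q, alpha):
    """Spherical linear interpolation between vectors p (alpha=0) and q (alpha=1)."""
    cos_omega = (p @ q) / (np.linalg.norm(p) * np.linalg.norm(q))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if omega < 1e-6:                                   # nearly parallel: fall back to linear
        return (1.0 - alpha) * p + alpha * q
    return (np.sin((1.0 - alpha) * omega) * p + np.sin(alpha * omega) * q) / np.sin(omega)

def severity_to_embedding(s, centers, prototypes):
    """Map a continuous severity s to an embedding via its two adjacent prototypes."""
    s = float(np.clip(s, centers[0], centers[-1]))
    k = int(np.clip(np.searchsorted(centers, s, side="right") - 1, 0, len(centers) - 2))
    alpha = (s - centers[k]) / (centers[k + 1] - centers[k])
    return slerp(prototypes[k], prototypes[k + 1], alpha)

rng = np.random.default_rng(0)
centers = np.array([2.0, 7.0, 12.0, 17.0, 22.0])      # assumed PHQ-8 bin centers
prototypes = rng.normal(size=(5, 32))                 # placeholder prototypes
prototypes /= np.linalg.norm(prototypes, axis=1, keepdims=True)

emb_mid = severity_to_embedding(9.5, centers, prototypes)   # halfway between bins 1 and 2
```

For unit-norm prototypes the interpolated embedding also has unit norm, which is the practical reason to prefer SLERP over a straight linear blend.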
4. Data Augmentation via CDoA
The Camouflage Depression-oriented Augmentation (CDoA) procedure synthesizes audio exhibiting mismatches between acoustic depression cues and neutral/positive semantic content. Transcriptions from the DAIC-WOZ corpus are sentiment-classified using DeepSeek R1 into benign (positive/neutral) and depressive (negative) banks. For each subject (PHQ score $s$):
- The score $s$ is mapped to a depression embedding using the SLERP prototypes.
- Benign text is sampled from the benign bank.
- DepFlow synthesizes speech with depressed acoustics injected into benign text, producing novel acoustic-semantic mismatches.
Sampling yields 5,760 synthetic utterances, balanced across depressive/healthy conditions with stratified per-severity quotas.
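The pairing logic of the procedure above can be sketched as a sampling plan that assigns benign text to each target severity under a quota. The function name, the quota values, the example sentences, and the PHQ-8 >= 10 cut-point for the depressed condition are assumptions for illustration; the synthesis call itself is left out.

```python
import random

DEPRESSED_CUTOFF = 10  # assumed binary cut-point on the PHQ-8 score

def build_cdoa_plan(benign_bank, quotas, seed=0):
    """Pair each target severity with benign text, n_utts times per severity.

    quotas maps a target PHQ-8 severity to the number of utterances to synthesize.
    """
    rng = random.Random(seed)
    plan = []
    for severity, n_utts in sorted(quotas.items()):
        for _ in range(n_utts):
            plan.append({
                "severity": severity,                        # drives the depression embedding
                "depressed": severity >= DEPRESSED_CUTOFF,   # condition label for balancing
                "text": rng.choice(benign_bank),             # benign (positive/neutral) content
            })
    return plan

# Invented placeholder sentences, not DAIC-WOZ transcripts.
benign_bank = [
    "i went hiking with my family last weekend",
    "my job is going pretty well",
    "i like cooking dinner for friends",
]
quotas = {2: 3, 7: 3, 12: 2, 17: 2, 22: 2}   # hypothetical stratified quotas
plan = build_cdoa_plan(benign_bank, quotas)
n_depressed = sum(item["depressed"] for item in plan)
```

Each plan entry would then be passed to the synthesizer together with the severity-derived depression embedding.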
When evaluated on three depression detection models (DepAudioNet, NUSD, HAREN-CTC), CDoA improves subject-level macro-F1 by 9%, 12%, and 5%, respectively, consistently outperforming FrAUG, SpecAugment, Mixup, and Instruct-TTS augmentation baselines.
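For reference, subject-level macro-F1 can be computed as sketched below. The aggregation rule (mean of segment probabilities per subject, thresholded at 0.5) is an assumption; the source does not state how segment predictions are pooled.

```python
import numpy as np

def subject_predictions(segment_probs, subject_ids, threshold=0.5):
    """Average segment-level probabilities per subject, then threshold (assumed rule)."""
    subjects = sorted(set(subject_ids))
    preds = [
        int(np.mean([p for p, s in zip(segment_probs, subject_ids) if s == subj]) >= threshold)
        for subj in subjects
    ]
    return subjects, np.array(preds)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores for binary labels {0, 1}."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for c in (0, 1):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

# Toy data: three subjects with two segments each.
subjects, preds = subject_predictions(
    [0.9, 0.7, 0.2, 0.4, 0.6, 0.3], ["a", "a", "b", "b", "c", "c"]
)
score = macro_f1([1, 0, 1], preds)
```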
5. Training Regimes and Evaluation Results
The DAE is trained on the DAIC-WOZ train+dev splits with AdamW and weight decay (batch size 64, dropout 0.2) for up to 500 epochs, with early stopping on dev AUC. Matcha-TTS is first pretrained on CSTR VCTK and then fine-tuned on DAIC-WOZ together with the FiLM generator. The DAIC-WOZ split is 107/35/47 subjects for train/dev/test (strict). No synthetic data is used for validation or testing.
Key results:
| Model | Macro-F1 (No-aug) | Macro-F1 (CDoA) | %Δ |
|---|---|---|---|
| DepAudioNet | 0.482 | 0.526 | +9% |
| NUSD | 0.514 | 0.577 | +12% |
| HAREN-CTC | 0.525 | 0.551 | +5% |
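The %Δ column is a relative change with respect to the no-augmentation baseline, which the short check below reproduces from the macro-F1 values in the table.

```python
# Macro-F1 values copied from the table: (no augmentation, with CDoA).
results = {
    "DepAudioNet": (0.482, 0.526),
    "NUSD": (0.514, 0.577),
    "HAREN-CTC": (0.525, 0.551),
}

# Relative gain in percent: 100 * (augmented - baseline) / baseline.
relative_gain = {
    model: 100.0 * (aug - base) / base for model, (base, aug) in results.items()
}
rounded = {model: round(gain) for model, gain in relative_gain.items()}
```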
TTS and speaker similarity: Natural DAIC-WOZ WER=14.06%, DepFlow synthetic WER=13.93%±0.23%, speaker SIM-o ≈56.97% (stable across severity).
6. Applications, Constraints, and Future Directions
DepFlow supports several applications, including robustifying depression detectors by decoupling sentiment from diagnosis, providing a controllable synthesis platform for depression-aware conversational agents and simulation-based evaluation, and enabling controlled synthesis for perceptual or clinician-in-the-loop studies.
Constraints include:
- The prototype severity axis has not yet been clinically validated or evaluated in additional languages.
- Ethical risks of misuse (fabrication of "depressed" speech, reinforcement of cultural stereotypes) necessitate caution.
Future work is envisaged in clinical and perceptual validation of generated cues, expansion to multilingual and demographically diverse settings, and deployment of provenance tracking, watermarking, and governance mechanisms to mitigate misuse (Li et al., 1 Jan 2026).