DepFlow: Depression-Conditioned TTS Framework

Updated 2 March 2026

The DepFlow TTS framework is a three-stage system that integrates a depression acoustic encoder, flow-matching synthesis, and prototype-based severity mapping for controlled speech modulation.
It employs FiLM-based conditioning to decouple depression cues from linguistic sentiment and speaker identity, giving attribute-specific control over the synthesized speech.
Augmenting training data with the CDoA procedure improves depression detection macro-F1 by up to 12% (relative), indicating greater robustness against spurious acoustic-semantic correlations.

DepFlow is a three-stage depression-conditioned text-to-speech (TTS) framework that generates speech whose depressive severity can be controlled, and that is designed to be robust to spurious correlations between linguistic sentiment and clinical depression labels. Widely used depression datasets such as DAIC-WOZ exhibit a strong coupling between sentiment and diagnostic labels. To address this, DepFlow disentangles depression-relevant acoustic attributes from speaker and content variables, implements a flow-matching TTS synthesis pipeline with FiLM-based conditioning for control over depression severity, and uses a prototype-based severity mapping for smooth manipulation along the depression continuum. The system further enables the construction of the Camouflage Depression-oriented Augmentation (CDoA) dataset, which introduces mismatched acoustic-semantic pairings to improve the robustness of clinical depression detectors (Li et al., 1 Jan 2026).

1. Depression Acoustic Encoder (DAE)

The DAE receives as input a raw speech utterance, downsampled to 22.05 kHz, with per-frame features $x_t \in \mathbb{R}^{1024}$ extracted by a frozen WavLM-Large module. The feature extraction pipeline includes a linear projection with ReLU and dropout, followed by attention-based statistical pooling to obtain mean ( $\mu$ ) and standard deviation ( $\sigma$ ), yielding concatenated statistics $\bar{h} \in \mathbb{R}^{512}$ :

$h_t = \operatorname{ReLU}(W_p x_t + b_p), \quad \alpha_t = \operatorname{softmax}(f_{attn}(h_t)), \quad \mu = \sum_t \alpha_t h_t, \quad \sigma = \sqrt{ \sum_t \alpha_t (h_t - \mu)^2 }, \quad \bar{h} = [\mu; \sigma].$
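The pooling equations above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the projection width of 256 is inferred from $\bar{h} \in \mathbb{R}^{512}$, the attention scorer $f_{attn}$ is assumed to be a single linear map (`w_attn`), dropout is omitted, and the small `eps` inside the square root is an added numerical safeguard.

```python
import numpy as np

def attentive_stats_pooling(x, W_p, b_p, w_attn, eps=1e-8):
    """x: (T, 1024) frame features. Returns h_bar = [mu; sigma] of shape (2 * D,)."""
    h = np.maximum(x @ W_p.T + b_p, 0.0)            # h_t = ReLU(W_p x_t + b_p), shape (T, D)
    scores = h @ w_attn                             # assumed linear f_attn: one score per frame
    scores = scores - scores.max()                  # stabilise the softmax
    alpha = np.exp(scores) / np.exp(scores).sum()   # attention weights over time, sum to 1
    mu = (alpha[:, None] * h).sum(axis=0)           # attention-weighted mean
    var = (alpha[:, None] * (h - mu) ** 2).sum(axis=0)
    sigma = np.sqrt(var + eps)                      # attention-weighted standard deviation
    return np.concatenate([mu, sigma])

# Toy inputs standing in for frozen WavLM-Large features (random, for shape checking only).
rng = np.random.default_rng(0)
T, D_in, D = 50, 1024, 256
x = rng.standard_normal((T, D_in))
W_p = rng.standard_normal((D, D_in)) * 0.01
b_p = np.zeros(D)
w_attn = rng.standard_normal(D)
h_bar = attentive_stats_pooling(x, W_p, b_p, w_attn)
```

With these toy shapes, `h_bar` has 512 entries: the first 256 hold the weighted mean and the last 256 the weighted standard deviation.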

A shared depression embedding $d = \operatorname{MLP}(\bar{h}) \in \mathbb{R}^{32}$ is produced using a multi-layer perceptron (structure: FC → LayerNorm → SiLU → dropout → FC). Four downstream heads are attached:

Ordinal-regression head: predicts PHQ-8 severity using $K=5$ levels (4 thresholds), with binary cross-entropy over monotonic thresholds.
Speaker-ID head (non-adversarial): classifies speaker identity on $d$ normalized by L2 norm.
Speaker-adversarial head: predicts speaker ID wrapped by a Gradient Reversal Layer (GRL) for adversarial invariance.
Content-adversarial head: infers one of $C$ pseudo-phoneme classes (HuBERT-based) via GRL for content disentanglement.
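As an illustration of the ordinal-regression head, the sketch below encodes a severity level as $K-1=4$ cumulative binary targets and scores threshold logits with binary cross-entropy. This is one common formulation of threshold-based ordinal regression and is an assumption here; the source does not specify how monotonicity is enforced or how levels are decoded, and the function names are hypothetical.

```python
import numpy as np

K = 5  # severity levels, hence K - 1 = 4 thresholds

def ordinal_targets(level, num_levels=K):
    """Level k in {0, ..., K-1} -> binary targets [level > 0, level > 1, ...]."""
    return (level > np.arange(num_levels - 1)).astype(np.float64)

def ordinal_bce(logits, level):
    """Mean binary cross-entropy over the K - 1 threshold logits."""
    t = ordinal_targets(level, logits.shape[0] + 1)
    p = 1.0 / (1.0 + np.exp(-logits))       # sigmoid probability per threshold
    p = np.clip(p, 1e-7, 1.0 - 1e-7)        # avoid log(0)
    return float(-(t * np.log(p) + (1.0 - t) * np.log(1.0 - p)).mean())

def decode_level(logits):
    """Predicted level = number of thresholds whose probability exceeds 0.5."""
    return int((logits > 0).sum())

targets_for_level_2 = ordinal_targets(2)    # thresholds 0 and 1 are exceeded
```

Under this encoding a level-2 subject has targets `[1, 1, 0, 0]`, so errors on distant thresholds are penalised more than errors on adjacent ones.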

The combined objective is:

$L_{total} = \lambda_{sup} L_{sup} + \lambda_{id} L_{id} + \lambda_{spk} L_{adv-spk} + \lambda_{con} L_{adv-con},$

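with weights $\lambda_{sup}=1.0$ , $\lambda_{id}=0.2$ , $\lambda_{spk}=0.2$ , $\lambda_{con}=0.1$ . The two adversarial heads sit behind gradient reversal, so minimizing $L_{total}$ trains each head to predict speaker or content while pushing the shared embedding $d$ toward invariance to those factors.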

Reported disentanglement results are EER=0.355, similarity gap=0.27 (speaker), MSE=2.83, $R^2=0.21$ , CKA=0.014 (content), and ROC-AUC=0.693 for depression classification.

2. Flow-Matching TTS Synthesis and FiLM-Based Depression Control

The TTS subsystem employs a Matcha-TTS backbone, which generates mel-spectrograms by numerically solving the ODE $dx/dt = f_\theta(x, t)$ from Gaussian noise $z \sim \mathcal{N}(0, I)$ to data space, using flow-matching principles. For a ground-truth mel $y$ and time-varying mixing $\alpha(t) = \sigma_{min} + t\cdot(1-\sigma_{min})$ , the interpolation is $x_t = (1-\alpha(t))z + \alpha(t)y$ , with target velocity $u(x_t, t) = y - z$ . The model minimizes:

$L_{fm} = \mathbb{E}_{t, y, z} \left[ \| f_\theta(x_t, t) - (y - z) \|^2 \right].$
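A single-sample Monte Carlo estimate of this objective might look like the sketch below. It follows the interpolation and target velocity exactly as written above. The uniform sampling of $t$, the default `sigma_min=1e-4`, and the per-element averaging of the squared error are assumptions, and `f_theta` is a hypothetical callable standing in for the decoder.

```python
import numpy as np

def flow_matching_loss(f_theta, y, rng, sigma_min=1e-4):
    """One-sample estimate of L_fm for a single mel-spectrogram y."""
    z = rng.standard_normal(y.shape)            # z ~ N(0, I)
    t = rng.uniform(0.0, 1.0)                   # assumed: t drawn uniformly from [0, 1)
    alpha = sigma_min + t * (1.0 - sigma_min)   # time-varying mixing alpha(t)
    x_t = (1.0 - alpha) * z + alpha * y         # interpolation between noise and data
    target = y - z                              # target velocity u(x_t, t)
    residual = f_theta(x_t, t) - target
    return float(np.mean(residual ** 2))        # squared error, averaged per element

def zero_field(x_t, t):
    """Placeholder vector field that always predicts zero velocity."""
    return np.zeros_like(x_t)

y = np.random.default_rng(1).standard_normal((80, 120))   # toy (n_mels, frames) target
loss = flow_matching_loss(zero_field, y, np.random.default_rng(0))
```

In training, `zero_field` would be replaced by the learned network and the estimate averaged over a batch of utterances and time samples.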

Prior and duration losses,

$L_{prior} = \frac{1}{|\mathcal{M}| n_f} \sum_{(i,j)} \frac{1}{2}\left[ (y_{i,j} - \mu_{i,j})^2 + \log 2 \pi \right], \quad L_{dur} = \operatorname{MSE}(\log w, \log \hat{w}),$

are combined as $L_{tts} = L_{dur} + \lambda_p L_{prior} + L_{fm}$ with $\lambda_p=1.0$ .
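The auxiliary losses can be sketched as below. The sketch assumes that $|\mathcal{M}|$ counts the non-padded frames, that $n_f$ is the number of mel bins, and that mels are stored as `(n_f, T)` arrays; the function names are hypothetical.

```python
import numpy as np

def prior_loss(y, mu, frame_mask):
    """y, mu: (n_f, T) mel arrays; frame_mask: (T,) bool, True for non-padded frames."""
    n_f = y.shape[0]
    nll = 0.5 * ((y - mu) ** 2 + np.log(2.0 * np.pi))   # unit-variance Gaussian NLL per bin
    return float(nll[:, frame_mask].sum() / (frame_mask.sum() * n_f))

def duration_loss(w, w_hat):
    """Mean squared error between log-durations."""
    return float(np.mean((np.log(w) - np.log(w_hat)) ** 2))

def tts_loss(l_dur, l_prior, l_fm, lambda_p=1.0):
    """L_tts = L_dur + lambda_p * L_prior + L_fm."""
    return l_dur + lambda_p * l_prior + l_fm

rng = np.random.default_rng(0)
y = rng.standard_normal((80, 6))
mask = np.array([True, True, True, True, False, False])   # last two frames are padding
perfect_prior = prior_loss(y, y.copy(), mask)              # only the constant term remains
```

When the predicted mean equals the target, the prior loss reduces to the constant $\tfrac{1}{2}\log 2\pi$, so only differences in this loss are informative.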

FiLM-based depression conditioning is realized by mapping the 32-dim depression embedding $c_{dep}$ via a FiLM generator MLP to produce scaling ( $\gamma_i$ ) and bias ( $\beta_i$ ) parameters for each decoder block, modulating activations $h_i$ :

$\hat{h}_i = \gamma_i(c_{dep}) \odot h_i + \beta_i(c_{dep}).$
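The modulation above amounts to a per-channel affine transform of each decoder block's activations. The sketch below illustrates it for one block; for brevity the FiLM generator is reduced to a single linear layer (the source describes an MLP), and the channel count and all weights are made-up toy values.

```python
import numpy as np

def film_params(c_dep, W, b):
    """Map the depression embedding to per-channel (gamma, beta) for one block."""
    out = W @ c_dep + b              # (2 * C,)
    gamma, beta = np.split(out, 2)   # first half scales, second half shifts
    return gamma, beta

def film(h, gamma, beta):
    """h: (C, T) activations. Applies gamma * h + beta channel-wise over all frames."""
    return gamma[:, None] * h + beta[:, None]

rng = np.random.default_rng(0)
C, T = 64, 100                                   # toy channel and frame counts
c_dep = rng.standard_normal(32)                  # 32-dim depression embedding
W = rng.standard_normal((2 * C, 32)) * 0.1
b = np.concatenate([np.ones(C), np.zeros(C)])    # bias starts near the identity transform
h = rng.standard_normal((C, T))
gamma, beta = film_params(c_dep, W, b)
h_mod = film(h, gamma, beta)
```

Because the same `gamma` and `beta` are broadcast over every frame, the conditioning changes how channels respond without touching the time alignment of the phoneme content.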

The method enables control over depressive severity while preserving phoneme content and speaker identity. Synthesized speech reaches a WER of $13.93\% \pm 0.23\%$ , comparable to the 14.06% measured on natural DAIC-WOZ recordings.

3. Prototype-Based Severity Mapping for Controllable Synthesis

For smooth depression severity control, DepFlow introduces a prototype-based interpolation mechanism. Per-speaker embeddings are averaged to subject-level vectors $d_j^{(subj)}$ and grouped by PHQ-8 bins:

$\bar{p}_k = \frac{1}{N_k}\sum_{j \in S_k} d_j^{(subj)}, \quad p_k = \frac{\bar{p}_k}{\|\bar{p}_k\|_2}.$
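Prototype construction can be sketched as follows. The bin edges used here are the conventional PHQ-8 severity cut-offs (0-4, 5-9, 10-14, 15-19, 20-24) and are an assumption, since the source does not state which bins are used; the embeddings are random toy data.

```python
import numpy as np

PHQ8_BIN_EDGES = np.array([5, 10, 15, 20])   # assumed cut-offs giving five severity bins

def build_prototypes(subject_embeddings, phq_scores, edges=PHQ8_BIN_EDGES):
    """Average subject-level embeddings per PHQ-8 bin and L2-normalise each mean."""
    bins = np.digitize(phq_scores, edges)    # bin index 0..len(edges) per subject
    prototypes = []
    for k in range(len(edges) + 1):
        members = subject_embeddings[bins == k]
        if len(members) == 0:
            raise ValueError(f"no subjects fall in severity bin {k}")
        p_bar = members.mean(axis=0)
        prototypes.append(p_bar / np.linalg.norm(p_bar))
    return np.stack(prototypes)

rng = np.random.default_rng(0)
scores = np.array([0, 3, 5, 9, 10, 14, 15, 19, 20, 24])   # two toy subjects per bin
embeddings = rng.standard_normal((len(scores), 32))
P = build_prototypes(embeddings, scores)                   # shape (5, 32), unit-norm rows
```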

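A continuous severity scalar $\alpha(s) = \operatorname{clip}((s-12)/12, -1,1)$ is computed from the PHQ-8 score $s$ and located between two adjacent prototype bins $i, i+1$ , with interpolation weight $\tau(s) = (\alpha(s) - \alpha_i) / (\alpha_{i+1} - \alpha_i)$ , where $\alpha_i$ and $\alpha_{i+1}$ are the positions of those prototypes on the same scale. Spherical linear interpolation (SLERP) is used: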

$z_s = \operatorname{slerp}(p_i, p_{i+1}; \tau) = \frac{\sin((1-\tau)\Omega)}{\sin \Omega}p_i + \frac{\sin(\tau\Omega)}{\sin \Omega}p_{i+1}, \quad \Omega=\arccos(p_i \cdot p_{i+1}).$
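The mapping from a PHQ-8 score to an interpolated embedding can be sketched as below. The prototype anchor positions $\alpha_i$ are not given in the source, so evenly spaced anchors on $[-1, 1]$ are assumed purely for illustration, and the prototypes are random unit vectors rather than learned ones.

```python
import numpy as np

def severity_to_alpha(s):
    """alpha(s) = clip((s - 12) / 12, -1, 1) for a PHQ-8 score s."""
    return float(np.clip((s - 12.0) / 12.0, -1.0, 1.0))

def slerp(p_i, p_j, tau, eps=1e-6):
    """Spherical linear interpolation between two unit vectors."""
    omega = np.arccos(np.clip(p_i @ p_j, -1.0, 1.0))
    if omega < eps:                       # nearly identical prototypes: nothing to interpolate
        return p_i.copy()
    return (np.sin((1.0 - tau) * omega) * p_i + np.sin(tau * omega) * p_j) / np.sin(omega)

def severity_embedding(s, prototypes, anchors):
    """Pick the adjacent prototype pair bracketing alpha(s) and interpolate."""
    a = severity_to_alpha(s)
    i = int(np.clip(np.searchsorted(anchors, a, side="right") - 1, 0, len(anchors) - 2))
    tau = (a - anchors[i]) / (anchors[i + 1] - anchors[i])
    return slerp(prototypes[i], prototypes[i + 1], tau)

rng = np.random.default_rng(0)
P = rng.standard_normal((5, 32))
P /= np.linalg.norm(P, axis=1, keepdims=True)   # toy unit-norm prototypes
anchors = np.linspace(-1.0, 1.0, 5)             # assumed anchor positions alpha_i
z_mid = severity_embedding(15, P, anchors)      # lies between prototypes 2 and 3
```

Because SLERP moves along the great circle between two unit vectors, the interpolated embedding keeps unit norm, which linear interpolation would not guarantee.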

Severity control is quantified by a concordance index of 0.744 and Spearman’s $\rho=0.598$ . Acoustic measures that change consistently with severity include formant frequency (median $\rho=0.800$ for both F1 and F2), silence–speech ratio ( $\rho=0.866$ ), and other paralinguistic cues.

4. Data Augmentation via CDoA

The Camouflage Depression-oriented Augmentation (CDoA) procedure synthesizes audio exhibiting mismatches between acoustic depression cues and neutral/positive semantic content. Transcriptions from the DAIC-WOZ corpus are sentiment-classified using DeepSeek R1 into benign (positive/neutral) and depressive (negative) banks. For each subject (PHQ score $s$ ):

$s$ is mapped to depression embedding $c_{dep}$ using SLERP prototypes.
Benign text is sampled from the benign bank.
DepFlow synthesizes speech with depressed acoustics injected into benign text, producing novel acoustic-semantic mismatches.
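The three steps above can be sketched as a job-planning loop. Everything here is hypothetical scaffolding: the function name, the per-subject quota, and the toy data are illustrative, and the actual synthesis call is deliberately left out because the source does not describe its interface.

```python
import random

def plan_cdoa_jobs(subject_scores, benign_bank, per_subject, seed=0):
    """Pair each subject's PHQ score with benign texts sampled without replacement."""
    rng = random.Random(seed)
    jobs = []
    for subject_id, phq in sorted(subject_scores.items()):
        k = min(per_subject, len(benign_bank))
        for text in rng.sample(benign_bank, k=k):
            # Downstream, `phq` would be mapped to c_dep via the SLERP prototypes and
            # `text` synthesized with that embedding (synthesis step not shown).
            jobs.append({"subject": subject_id, "phq": phq, "text": text})
    return jobs

# Toy stand-ins for DAIC-WOZ subjects and the benign (positive/neutral) sentence bank.
subject_scores = {"subj_a": 3, "subj_b": 17}
benign_bank = ["i had a nice walk today", "the weather is fine", "i like cooking dinner"]
jobs = plan_cdoa_jobs(subject_scores, benign_bank, per_subject=2)
```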

Sampling yields 5,760 synthetic utterances, balanced across depressive/healthy conditions with stratified per-severity quotas.

When evaluated on three depression detection models (DepAudioNet, NUSD, HAREN-CTC), CDoA improves subject-level macro-F1 by a relative 9%, 12%, and 5%, respectively, consistently outperforming the FrAUG, SpecAugment, Mixup, and Instruct-TTS augmentation baselines.

5. Training Regimes and Evaluation Results

The DAE is trained on the DAIC-WOZ train+dev splits using AdamW (lr= $1 \times 10^{-4}$ , weight decay $=3 \times 10^{-3}$ , batch=64, dropout=0.2) for up to 500 epochs, with early stopping on dev AUC. Matcha-TTS is first pretrained on CSTR VCTK and then fine-tuned on DAIC-WOZ together with the FiLM generator. The DAIC-WOZ split is 107/35/47 subjects for train/dev/test (strict). No synthetic data is used for validation or testing.

Key results:

Model	Macro-F1 (No-aug)	Macro-F1 (CDoA)	%Δ
DepAudioNet	0.482	0.526	+9%
NUSD	0.514	0.577	+12%
HAREN-CTC	0.525	0.551	+5%
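The %Δ column is consistent with a relative (not absolute) improvement over the no-augmentation baseline, which a quick check on the table values confirms:

```python
# Macro-F1 values copied from the table: (no augmentation, with CDoA).
results = {
    "DepAudioNet": (0.482, 0.526),
    "NUSD": (0.514, 0.577),
    "HAREN-CTC": (0.525, 0.551),
}

# Relative change in percent, rounded to the nearest integer.
relative_gain = {
    model: round(100.0 * (cdoa - base) / base)
    for model, (base, cdoa) in results.items()
}
```

The absolute gains are smaller, between roughly 0.03 and 0.06 macro-F1 points.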

TTS and speaker similarity: Natural DAIC-WOZ WER=14.06%, DepFlow synthetic WER=13.93%±0.23%, speaker SIM-o ≈56.97% (stable across severity).

6. Applications, Constraints, and Future Directions

DepFlow supports several applications, including robustifying depression detectors by decoupling sentiment from diagnosis, providing a controllable synthesis platform for depression-aware conversational agents and simulation-based evaluation, and enabling controlled synthesis for perceptual or clinician-in-the-loop studies.

Constraints include:

The prototype severity axis is not yet clinically validated nor evaluated in additional languages.
Ethical risks of misuse (fabrication of "depressed" speech, reinforcement of cultural stereotypes) necessitate caution.

Future work is envisaged in clinical and perceptual validation of generated cues, expansion to multilingual and demographically diverse settings, and deployment of provenance tracking, watermarking, and governance mechanisms to mitigate misuse (Li et al., 1 Jan 2026).


References (1)

Li et al. DepFlow: Disentangled Speech Generation to Mitigate Semantic Bias in Depression Detection (2026)
