DepFlow: Depression-Conditioned TTS Framework

Updated 2 March 2026
  • The DepFlow TTS framework is a three-stage system that integrates a depression acoustic encoder, flow-matching synthesis, and prototype-based severity mapping for controlled speech modulation.
  • It employs FiLM-based conditioning to decouple depression cues from linguistic sentiment and speaker identity, ensuring precise, attribute-specific control.
  • Its CDoA augmentation procedure improves the macro-F1 of depression detectors by up to 12% (relative), indicating greater robustness to spurious acoustic-semantic correlations.

DepFlow is a three-stage depression-conditioned text-to-speech (TTS) framework that generates speech with controllable depressive severity and is designed to be robust to spurious correlations between linguistic sentiment and clinical depression labels. Widely used depression datasets such as DAIC-WOZ exhibit strong coupling between sentiment and diagnostic labels. DepFlow addresses this coupling in three ways: it disentangles depression-relevant acoustic attributes from speaker and content variables, it implements a flow-matching TTS synthesis pipeline whose depression severity is controlled through FiLM-based conditioning, and it uses a prototype-based severity mapping for smooth manipulation along the depression continuum. The system also enables construction of the Camouflage Depression-oriented Augmentation (CDoA) dataset, which introduces mismatched acoustic-semantic pairings to improve robustness in clinical depression detection (Li et al., 1 Jan 2026).

1. Depression Acoustic Encoder (DAE)

The DAE receives as input a raw speech utterance, downsampled to 22.05 kHz, with per-frame features $x_t \in \mathbb{R}^{1024}$ extracted by a frozen WavLM-Large module. The feature extraction pipeline applies a linear projection with ReLU and dropout, followed by attention-based statistical pooling to obtain the mean ($\mu$) and standard deviation ($\sigma$), yielding concatenated statistics $\bar{h} \in \mathbb{R}^{512}$:

$$h_t = \operatorname{ReLU}(W_p x_t + b_p), \quad \alpha_t = \operatorname{softmax}(f_{attn}(h_t)), \quad \mu = \sum_t \alpha_t h_t, \quad \sigma = \sqrt{ \sum_t \alpha_t (h_t - \mu)^2 }, \quad \bar{h} = [\mu; \sigma].$$
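The pooling step can be illustrated with a minimal pure-Python sketch. It assumes the projected frames and the unnormalized attention scores are already available; the function names and the toy 2-dimensional frames are illustrative and not taken from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attentive_stats_pooling(frames, attn_scores):
    """Pool T frame vectors into [mu; sigma].

    frames: list of T vectors (lists of D floats), standing in for h_t.
    attn_scores: list of T unnormalized scalars, standing in for f_attn(h_t).
    """
    alpha = softmax(attn_scores)
    dim = len(frames[0])
    # Attention-weighted mean per dimension.
    mu = [sum(a * h[d] for a, h in zip(alpha, frames)) for d in range(dim)]
    # Attention-weighted variance per dimension, then square root.
    var = [sum(a * (h[d] - mu[d]) ** 2 for a, h in zip(alpha, frames))
           for d in range(dim)]
    sigma = [math.sqrt(max(v, 0.0)) for v in var]
    return mu + sigma  # concatenation, length 2 * D

# Toy input: two frames with equal attention scores.
frames = [[1.0, 0.0], [3.0, 2.0]]
h_bar = attentive_stats_pooling(frames, [0.0, 0.0])
```

With projected frames of dimension 256, the same concatenation would give the 512-dimensional statistics vector described above.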

A shared depression embedding $d = \operatorname{MLP}(\bar{h}) \in \mathbb{R}^{32}$ is produced using a multi-layer perceptron (structure: FC → LayerNorm → SiLU → dropout → FC). Four downstream heads are attached:

  • Ordinal-regression head: predicts PHQ-8 severity using $K=5$ levels (4 thresholds), trained with binary cross-entropy over monotonic thresholds.
  • Speaker-ID head (non-adversarial): classifies speaker identity from the L2-normalized embedding $d$.
  • Speaker-adversarial head: predicts speaker ID through a Gradient Reversal Layer (GRL) for adversarial speaker invariance.
  • Content-adversarial head: predicts one of $C$ pseudo-phoneme classes (HuBERT-based) through a GRL for content disentanglement.
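The ordinal-regression head can be illustrated with the common cumulative-threshold encoding, sketched below. This is an assumption about the encoding: the text only states that there are 5 levels, 4 thresholds, and a binary cross-entropy loss. The helper names are hypothetical, and the mapping from raw PHQ-8 scores to the 5 levels is not specified, so the sketch starts from a level index.

```python
import math

K = 5  # number of ordinal severity levels, as stated in the text

def ordinal_targets(level, num_levels=K):
    """Level k in {0..K-1} -> K-1 cumulative binary targets [k > 0, k > 1, ...]."""
    return [1.0 if level > j else 0.0 for j in range(num_levels - 1)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def ordinal_bce(logits, level):
    """Mean binary cross-entropy over the K-1 threshold logits."""
    targets = ordinal_targets(level, len(logits) + 1)
    eps = 1e-12  # avoids log(0)
    total = 0.0
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        total -= t * math.log(p + eps) + (1.0 - t) * math.log(1.0 - p + eps)
    return total / len(logits)

def decode_level(logits):
    """Predicted level = number of thresholds with probability above 0.5."""
    return sum(1 for z in logits if sigmoid(z) > 0.5)

good_logits = [3.0, 1.0, -1.0, -4.0]  # consistent with level 2
loss = ordinal_bce(good_logits, level=2)
```

Under this encoding, a prediction that crosses the first two thresholds but not the last two decodes to level 2.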

The combined objective is:

$$L_{total} = \lambda_{sup} L_{sup} + \lambda_{id} L_{id} + \lambda_{spk} L_{adv\text{-}spk} + \lambda_{con} L_{adv\text{-}con},$$

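with weights $\lambda_{sup}=1.0$, $\lambda_{id}=0.2$, $\lambda_{spk}=0.2$, $\lambda_{con}=0.1$. The two adversarial heads are trained through gradient reversal: each head minimizes its own classification loss, while the reversed gradient pushes the encoder toward embeddings that are invariant to speaker and content.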
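A minimal sketch of how the weighted objective and a gradient reversal layer behave is given below. The weights are those stated in the text; the dictionary keys, function names, and the reversal scale of 1.0 are assumptions made for illustration, and a real implementation would express the reversal inside an autograd framework.

```python
# Loss weights as stated in the text; the key names are hypothetical.
WEIGHTS = {"sup": 1.0, "id": 0.2, "adv_spk": 0.2, "adv_con": 0.1}

def total_loss(losses, weights=WEIGHTS):
    """Weighted sum of the four per-head losses."""
    return sum(weights[name] * losses[name] for name in weights)

def grl_forward(x):
    """Gradient reversal is the identity in the forward pass."""
    return x

def grl_backward(grad_output, scale=1.0):
    """In the backward pass the gradient is negated (and optionally scaled).

    The scale value is an assumption; the text does not give one.
    """
    return [-scale * g for g in grad_output]

example = {"sup": 1.0, "id": 1.0, "adv_spk": 1.0, "adv_con": 1.0}
combined = total_loss(example)
reversed_grad = grl_backward([0.5, -2.0])
```

Because the adversarial heads sit behind the reversal, the encoder receives the negated gradient of their losses even though those losses enter the sum with positive weights.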

The achieved disentanglement is evidenced by EER = 0.355, similarity gap = 0.27 (speaker), MSE = 2.83, $R^2 = 0.21$, CKA = 0.014 (content), and ROC-AUC = 0.693 for depression classification.

2. Flow-Matching TTS Synthesis and FiLM-Based Depression Control

The TTS subsystem employs a Matcha-TTS backbone, which generates mel-spectrograms by numerically solving the ODE $dx/dt = f_\theta(x, t)$ from Gaussian noise $z \sim \mathcal{N}(0, I)$ to data space, using flow-matching principles. For a ground-truth mel $y$ and time-varying mixing coefficient $\alpha(t) = \sigma_{min} + t\cdot(1-\sigma_{min})$, the interpolation is $x_t = (1-\alpha(t))z + \alpha(t)y$, with target velocity $u(x_t, t) = y - z$. The model minimizes:

$$L_{fm} = \mathbb{E}_{t, y, z} \left[ \| f_\theta(x_t, t) - (y - z) \|^2 \right].$$
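The construction of one flow-matching training example can be sketched as follows, using the interpolation and target velocity exactly as written above. The value of sigma_min, the toy 3-dimensional "mel" vector, and all function names are assumptions for illustration; a trained network would supply the predicted velocity.

```python
import random

SIGMA_MIN = 1e-4  # assumed value; the text does not state sigma_min

def alpha(t, sigma_min=SIGMA_MIN):
    """Time-varying mixing coefficient alpha(t)."""
    return sigma_min + t * (1.0 - sigma_min)

def interpolate(z, y, t):
    """x_t = (1 - alpha(t)) * z + alpha(t) * y, element-wise."""
    a = alpha(t)
    return [(1.0 - a) * zi + a * yi for zi, yi in zip(z, y)]

def target_velocity(z, y):
    """Regression target u = y - z."""
    return [yi - zi for zi, yi in zip(z, y)]

def fm_loss(pred, z, y):
    """Squared L2 error between a predicted velocity and the target."""
    u = target_velocity(z, y)
    return sum((p - ui) ** 2 for p, ui in zip(pred, u))

rng = random.Random(0)
y = [0.5, -1.0, 2.0]                    # toy stand-in for a mel frame
z = [rng.gauss(0.0, 1.0) for _ in y]    # Gaussian noise sample
t = rng.random()                        # time drawn uniformly from [0, 1)
x_t = interpolate(z, y, t)              # network input for this example
```

Averaging `fm_loss` over many sampled triples of time, data, and noise gives a Monte Carlo estimate of the expectation in the objective.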

Prior and duration losses,

$$L_{prior} = \frac{1}{|\mathcal{M}| n_f} \sum_{(i,j)} \frac{1}{2}\left[ (y_{i,j} - \mu_{i,j})^2 + \log 2 \pi \right], \quad L_{dur} = \operatorname{MSE}(\log w, \log \hat{w}),$$

are combined as $L_{tts} = L_{dur} + \lambda_p L_{prior} + L_{fm}$ with $\lambda_p=1.0$.
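The auxiliary losses can be sketched in plain Python as below. Two readings are assumptions, since the text does not define them: the normalizer is taken to be the number of unmasked frames times the number of mel features, and a small epsilon is added inside the logarithm of the durations. All function names are illustrative.

```python
import math

def prior_loss(y, mu, mask):
    """Unit-variance Gaussian negative log-likelihood, averaged over valid cells.

    y, mu: nested lists indexed [feature][frame].
    mask: list of 0/1 flags per frame (assumed meaning of the normalizer set).
    """
    n_f = len(y)
    n_valid = sum(mask)
    total = 0.0
    for i in range(n_f):
        for j, m in enumerate(mask):
            if m:
                total += 0.5 * ((y[i][j] - mu[i][j]) ** 2
                                + math.log(2.0 * math.pi))
    return total / (n_valid * n_f)

def duration_loss(w, w_hat, eps=1e-8):
    """Mean squared error between log-durations (eps is an assumption)."""
    diffs = [(math.log(a + eps) - math.log(b + eps)) ** 2
             for a, b in zip(w, w_hat)]
    return sum(diffs) / len(diffs)

def tts_loss(l_dur, l_prior, l_fm, lambda_p=1.0):
    """L_tts = L_dur + lambda_p * L_prior + L_fm."""
    return l_dur + lambda_p * l_prior + l_fm

y = [[0.0, 1.0, 5.0], [2.0, 3.0, 5.0]]
mu = [[0.0, 1.0, 0.0], [2.0, 3.0, 0.0]]
l_prior = prior_loss(y, mu, mask=[1, 1, 0])  # last frame is padding
l_dur = duration_loss([1.0, 2.0, 4.0], [1.0, 2.0, 4.0])
```

When the predicted means match the targets on every unmasked frame, the prior loss reduces to the constant term, half the logarithm of 2π.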

FiLM-based depression conditioning is realized by mapping the 32-dimensional depression embedding $c_{dep}$ through a FiLM generator MLP to produce scaling ($\gamma_i$) and bias ($\beta_i$) parameters for each decoder block, which modulate the activations $h_i$:

$$\hat{h}_i = \gamma_i(c_{dep}) \odot h_i + \beta_i(c_{dep}).$$
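The modulation can be illustrated with a small sketch in which a single linear layer stands in for the FiLM generator MLP. That substitution, the 2-dimensional toy embedding (the text uses 32 dimensions), and every weight value are assumptions chosen only to make the arithmetic easy to follow.

```python
def linear(x, weight, bias):
    """Dense layer: weight is a list of rows, one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def film_params(c_dep, w_gamma, b_gamma, w_beta, b_beta):
    """Map the conditioning vector to per-channel scale and shift."""
    gamma = linear(c_dep, w_gamma, b_gamma)
    beta = linear(c_dep, w_beta, b_beta)
    return gamma, beta

def film(h, gamma, beta):
    """Apply gamma * h + beta per channel, broadcast over frames.

    h is indexed [channel][frame]; gamma and beta are indexed [channel].
    """
    return [[g * v + b for v in row] for row, g, b in zip(h, gamma, beta)]

c_dep = [1.0, -1.0]                     # toy depression embedding
gamma, beta = film_params(
    c_dep,
    w_gamma=[[0.5, 0.0], [0.0, 0.5]], b_gamma=[1.0, 1.0],
    w_beta=[[0.0, 0.0], [0.0, 0.0]], b_beta=[0.0, 2.0],
)
h = [[1.0, 2.0], [3.0, 4.0]]            # 2 channels, 2 frames
h_hat = film(h, gamma, beta)
```

Initializing the scale bias at 1 and the shift at 0, as in the toy weights above, means a zero conditioning vector leaves the activations unchanged.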

The method enables control over depressive severity while preserving phoneme content and speaker identity. Synthesized speech attains WER = 13.93% ± 0.23%, comparable to the 14.06% measured on natural DAIC-WOZ recordings.

3. Prototype-Based Severity Mapping for Controllable Synthesis

For smooth depression severity control, DepFlow introduces a prototype-based interpolation mechanism. Per-speaker embeddings are averaged to subject-level vectors $d_j^{(subj)}$ and grouped by PHQ-8 bins; the mean over the $N_k$ subjects in bin $S_k$ is L2-normalized to give the prototype $p_k$:

$$\bar{p}_k = \frac{1}{N_k}\sum_{j \in S_k} d_j^{(subj)}, \quad p_k = \frac{\bar{p}_k}{\|\bar{p}_k\|_2}.$$
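Prototype construction amounts to a grouped mean followed by L2 normalization, as in the sketch below. The dictionary-based data layout, the function names, and the toy 2-dimensional embeddings are assumptions for illustration; the bin boundaries are not given in the text, so bin indices are passed in directly.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def l2_normalize(v):
    """Scale a non-zero vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def build_prototypes(subject_embeddings, subject_bins):
    """Return {bin index: unit-norm prototype}.

    subject_embeddings: {subject_id: subject-level vector}.
    subject_bins: {subject_id: severity bin index}.
    """
    groups = {}
    for sid, emb in subject_embeddings.items():
        groups.setdefault(subject_bins[sid], []).append(emb)
    return {k: l2_normalize(mean_vector(vs)) for k, vs in sorted(groups.items())}

embeddings = {"a": [1.0, 0.0], "b": [3.0, 0.0], "c": [0.0, 2.0]}
bins = {"a": 0, "b": 0, "c": 1}
prototypes = build_prototypes(embeddings, bins)
```

Normalizing the prototypes places them on the unit sphere, which is what makes spherical interpolation between neighbouring bins well defined.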

A continuous severity scalar $\alpha(s) = \operatorname{clip}((s-12)/12, -1, 1)$ of the PHQ-8 score $s$ is mapped to adjacent prototype bins $i, i+1$ with interpolation weight $\tau(s) = (\alpha(s) - \alpha_i) / (\alpha_{i+1} - \alpha_i)$. Spherical linear interpolation (SLERP) is used:

$$z_s = \operatorname{slerp}(p_i, p_{i+1}; \tau) = \frac{\sin((1-\tau)\Omega)}{\sin \Omega}p_i + \frac{\sin(\tau\Omega)}{\sin \Omega}p_{i+1}, \quad \Omega=\arccos(p_i \cdot p_{i+1}).$$
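The severity-to-embedding mapping can be sketched as follows. The anchor positions assigned to each prototype on the [-1, 1] axis are an assumption (the text does not list them), as are the function names and the fallback used when the two prototypes are nearly parallel or antiparallel.

```python
import math

def severity_to_alpha(s):
    """alpha(s) = clip((s - 12) / 12, -1, 1)."""
    return max(-1.0, min(1.0, (s - 12.0) / 12.0))

def slerp(p, q, tau):
    """Spherical linear interpolation between unit vectors p and q."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(p, q))))
    omega = math.acos(dot)
    so = math.sin(omega)
    if so < 1e-8:
        # Degenerate case (parallel or antiparallel): return p unchanged.
        return list(p)
    w_p = math.sin((1.0 - tau) * omega) / so
    w_q = math.sin(tau * omega) / so
    return [w_p * a + w_q * b for a, b in zip(p, q)]

def severity_embedding(s, prototypes, anchors):
    """Interpolate between the two prototypes that bracket alpha(s).

    prototypes: list of unit vectors p_k, ordered by severity.
    anchors: increasing alpha_k values, one per prototype (assumed).
    """
    a = max(anchors[0], min(anchors[-1], severity_to_alpha(s)))
    i = 0
    while i < len(anchors) - 2 and a > anchors[i + 1]:
        i += 1
    tau = (a - anchors[i]) / (anchors[i + 1] - anchors[i])
    return slerp(prototypes[i], prototypes[i + 1], tau)

protos = [[1.0, 0.0], [0.0, 1.0]]
z_mid = severity_embedding(12, protos, anchors=[-1.0, 1.0])
```

In this toy setup a score of 12 lands halfway between two orthogonal prototypes, and the interpolated vector stays on the unit circle, unlike a plain linear blend.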

Severity control is supported by a Concordance Index of 0.744 and Spearman’s $\rho=0.598$. Acoustic features that change consistently with severity include formant frequencies (median $\rho=0.800$ for both F1 and F2), the silence–speech ratio ($\rho=0.866$), and other paralinguistic cues.

4. Data Augmentation via CDoA

The Camouflage Depression-oriented Augmentation (CDoA) procedure synthesizes audio exhibiting mismatches between acoustic depression cues and neutral/positive semantic content. Transcriptions from the DAIC-WOZ corpus are sentiment-classified using DeepSeek R1 into benign (positive/neutral) and depressive (negative) banks. For each subject (PHQ score $s$):

  1. $s$ is mapped to the depression embedding $c_{dep}$ using SLERP prototypes.
  2. Benign text is sampled from the benign bank.
  3. DepFlow synthesizes speech with depressed acoustics injected into benign text, producing novel acoustic-semantic mismatches.
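The three steps above can be sketched as a loop that prepares synthesis requests. Everything here is hypothetical scaffolding: the function and field names, the toy embedding function, the example sentences, and the fixed per-subject count are illustrative, the stratified per-severity quotas are not modeled, and the actual synthesis call is left out.

```python
import random

def build_cdoa_requests(subject_scores, benign_bank, per_subject, embed_fn, seed=0):
    """Pair each subject's depression embedding with sampled benign text.

    subject_scores: {subject_id: PHQ score}.
    benign_bank: list of positive/neutral transcripts.
    embed_fn: maps a PHQ score to a depression embedding (step 1).
    """
    rng = random.Random(seed)
    requests = []
    for subject_id, phq in subject_scores.items():
        c_dep = embed_fn(phq)                    # step 1: score -> embedding
        for _ in range(per_subject):
            text = rng.choice(benign_bank)       # step 2: sample benign text
            # Step 3 would pass (text, c_dep) to the TTS model; only the
            # request is recorded here.
            requests.append({"subject": subject_id, "phq": phq,
                             "c_dep": c_dep, "text": text})
    return requests

def toy_embed(phq):
    """Stand-in embedding: the clipped severity scalar as a 1-d vector."""
    return [max(-1.0, min(1.0, (phq - 12.0) / 12.0))]

scores = {"s01": 4, "s02": 18}                   # made-up subjects
bank = ["I went hiking last weekend.", "The weather has been nice."]
requests = build_cdoa_requests(scores, bank, per_subject=3, embed_fn=toy_embed)
```

Fixing the random seed keeps the text sampling reproducible, which helps when the same augmentation set must be regenerated for several detectors.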

Sampling yields 5,760 synthetic utterances, balanced across depressive and healthy conditions with stratified per-severity quotas.

When evaluated on three depression detection models (DepAudioNet, NUSD, HAREN-CTC), CDoA improves subject-level macro-F1 by 9%, 12%, and 5%, respectively, consistently outperforming FrAUG, SpecAugment, Mixup, and Instruct-TTS augmentation baselines.

5. Training Regimes and Evaluation Results

The DAE is trained on the DAIC-WOZ train+dev splits using AdamW (learning rate $1 \times 10^{-4}$, weight decay $3 \times 10^{-3}$, batch size 64, dropout 0.2) for up to 500 epochs with early stopping on dev AUC. Matcha-TTS is first pretrained on CSTR VCTK and then fine-tuned on DAIC-WOZ together with the FiLM generator. The DAIC-WOZ split is 107/35/47 subjects for train/dev/test (strict). No synthetic data is used for validation or testing.

Key results:

| Model | Macro-F1 (no augmentation) | Macro-F1 (CDoA) | Relative gain |
|---|---|---|---|
| DepAudioNet | 0.482 | 0.526 | +9% |
| NUSD | 0.514 | 0.577 | +12% |
| HAREN-CTC | 0.525 | 0.551 | +5% |
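The reported gains are consistent with relative (not absolute) improvements in macro-F1, which a short check makes explicit. The scores are copied from the results above; the function and variable names are illustrative.

```python
# Macro-F1 pairs copied from the results: (no augmentation, with CDoA).
results = {
    "DepAudioNet": (0.482, 0.526),
    "NUSD": (0.514, 0.577),
    "HAREN-CTC": (0.525, 0.551),
}

def relative_gain_percent(baseline, augmented):
    """Relative improvement over the baseline, in percent."""
    return 100.0 * (augmented - baseline) / baseline

gains = {name: relative_gain_percent(b, a) for name, (b, a) in results.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain:.1f}% relative")
```

Rounded to whole percentages, the three values match the 9%, 12%, and 5% figures quoted for the detectors.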

TTS quality and speaker similarity: natural DAIC-WOZ speech has WER = 14.06%, DepFlow synthetic speech has WER = 13.93% ± 0.23%, and speaker similarity (SIM-o) is ≈ 56.97%, stable across severity levels.

6. Applications, Constraints, and Future Directions

DepFlow supports several applications, including robustifying depression detectors by decoupling sentiment from diagnosis, providing a controllable synthesis platform for depression-aware conversational agents and simulation-based evaluation, and enabling controlled synthesis for perceptual or clinician-in-the-loop studies.

Constraints include:

  • The prototype severity axis has not yet been clinically validated or evaluated in additional languages.
  • Ethical risks of misuse (fabrication of "depressed" speech, reinforcement of cultural stereotypes) call for caution.

Future work is envisaged in clinical and perceptual validation of generated cues, expansion to multilingual and demographically diverse settings, and deployment of provenance tracking, watermarking, and governance mechanisms to mitigate misuse (Li et al., 1 Jan 2026).

References (1)