
Depression Acoustic Encoder: Disentangled Modeling

Updated 8 January 2026
  • Depression Acoustic Encoder is a neural module that extracts depression severity embeddings while minimizing speaker identity and verbal content effects through adversarial training.
  • It underpins the creation of a Camouflage Depression–oriented Augmentation dataset by pairing depressed acoustic patterns with mismatched sentiment transcripts to reduce bias.
  • Its embeddings are used in TTS synthesis with FiLM-based conditioning and prototype severity manifolds to enable smooth, controlled modulation of depressive cues.

A Depression Acoustic Encoder (DAE) is a neural module designed to extract embeddings from speech that reflect depression severity while suppressing spurious confounding factors such as speaker identity and verbal content. This approach provides a core building block for more robust, semantically disentangled modeling of depression from vocal signals. The encoder forms the foundation of the DepFlow text-to-speech (TTS) framework for generating synthetic speech samples in which depressive acoustic patterns can be paired with sentiment-stratified transcript content, enabling the generation of datasets with mismatched semantic and acoustic cues—a configuration particularly important for addressing the challenges of detecting "camouflaged depression" (Li et al., 1 Jan 2026).

1. Extracting Depression-Relevant Embeddings

The DAE is trained on clinical conversational data such as the DAIC-WOZ corpus, which is annotated with PHQ-8 diagnostic severity levels. At the acoustic front-end, utterances are represented by frame-level WavLM-Large features. These features are aggregated to produce a 32-dimensional depression embedding $\mathbf{d}$ for each utterance. During inference, natural utterances $u$ are passed through the frozen encoder to extract corresponding embeddings $\mathbf{d}_u$.
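
The aggregation step can be pictured as pooling followed by a projection. The following PyTorch sketch assumes simple mean pooling over frames and an illustrative 256-unit hidden layer; neither detail is specified above, so treat it as a schematic rather than the paper's exact front-end.

```python
# Schematic DAE front-end: frame-level WavLM-Large features (1024-dim)
# are pooled to one vector per utterance and projected to the 32-dim
# depression embedding d. Mean pooling and the hidden width are
# assumptions for illustration.
import torch
import torch.nn as nn

class DAEFrontEnd(nn.Module):
    def __init__(self, feat_dim: int = 1024, emb_dim: int = 32):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(feat_dim, 256),
            nn.ReLU(),
            nn.Linear(256, emb_dim),
        )

    def forward(self, wavlm_feats: torch.Tensor) -> torch.Tensor:
        # wavlm_feats: (batch, frames, feat_dim) frame-level features
        pooled = wavlm_feats.mean(dim=1)  # utterance-level mean pooling
        return self.proj(pooled)          # (batch, 32) embedding d
```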

The core design challenge is to ensure that the embedding $\mathbf{d}$ robustly encodes depression severity while remaining largely invariant to speaker identity and spoken content. This is achieved through adversarial training incorporating multiple head modules:

  • An Ordinal Regression Head predicts PHQ-8 severity thresholds using binary cross-entropy loss:

$$\mathcal{L}_{\rm sup} = -\sum_{k=1}^{K-1}\Bigl[t_k\log\sigma(o_k)+(1-t_k)\log\bigl(1-\sigma(o_k)\bigr)\Bigr],$$

where $t_k$ are monotonic targets for the ordinal bins and $o_k$ are the corresponding threshold logits.

  • A Speaker ID Head uses cross-entropy to encourage the embedding to retain speaker information.
  • A Speaker Adversarial Head applies a gradient-reversal layer (GRL) and cross-entropy to explicitly discourage encoding of speaker information (a minimal GRL implementation is sketched after this list).
  • A Content Adversarial Head similarly applies GRL and cross-entropy, using HuBERT-derived phoneme labels, to minimize sensitivity to linguistic content.
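
Both adversarial heads rely on the gradient-reversal layer. A minimal PyTorch implementation looks like the following; the fixed scaling constant `lambda_` is an assumption, as the paper may schedule it during training.

```python
# Minimal gradient-reversal layer (GRL): identity in the forward pass,
# negated (and scaled) gradient in the backward pass, so the encoder is
# trained to *remove* whatever the adversarial head can predict.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x: torch.Tensor, lambda_: float) -> torch.Tensor:
        ctx.lambda_ = lambda_
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor):
        # Reverse and scale the gradient flowing back into the encoder;
        # no gradient is needed for lambda_ itself.
        return -ctx.lambda_ * grad_output, None

def grad_reverse(x: torch.Tensor, lambda_: float = 1.0) -> torch.Tensor:
    return GradReverse.apply(x, lambda_)
```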

The total joint optimization problem is expressed as:

$$\min_{\text{DAE enc + sup. head}} \;\max_{\substack{\text{adv spk}\\ \text{adv con}}}\; \mathcal{L}_{\rm total} = \lambda_{\rm sup}\,\mathcal{L}_{\rm sup} + \lambda_{\rm id}\,\mathcal{L}_{\rm id} + \lambda_{\rm spk}\,\mathcal{L}_{\rm adv\text{-}spk} + \lambda_{\rm con}\,\mathcal{L}_{\rm adv\text{-}con},$$

with typical weights $\lambda_{\rm sup}=1.0$, $\lambda_{\rm id}=0.2$, $\lambda_{\rm spk}=0.2$, $\lambda_{\rm con}=0.1$. The model is optimized using AdamW (lr = 1e-4, wd = 3e-3), batch size 64, and dropout 0.2, with early stopping on a linear depression probe's dev-set ROC-AUC. In the referenced study, the DAE achieves a depression discriminability of ROC-AUC 0.693 (Li et al., 1 Jan 2026).
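
A hedged sketch of how these terms combine, reusing `grad_reverse` from above. The `heads` dict of callables, the label shapes, and the single utterance-level content label (the paper uses frame-level HuBERT phoneme targets) are simplifying assumptions.

```python
# Illustrative assembly of the joint objective with the quoted weights.
import torch
import torch.nn.functional as F

def ordinal_targets(severity_bin: torch.Tensor, num_bins: int) -> torch.Tensor:
    # severity_bin: (batch,) integer bin index in [0, num_bins).
    # Returns monotonic targets t_k = 1[severity_bin > k], shape (batch, K-1).
    thresholds = torch.arange(num_bins - 1, device=severity_bin.device)
    return (severity_bin.unsqueeze(1) > thresholds).float()

def total_loss(d, heads, labels, lam=(1.0, 0.2, 0.2, 0.1)):
    lam_sup, lam_id, lam_spk, lam_con = lam
    o = heads["ordinal"](d)                                   # (batch, K-1) logits
    t = ordinal_targets(labels["severity_bin"], o.size(1) + 1)
    l_sup = F.binary_cross_entropy_with_logits(o, t)
    l_id  = F.cross_entropy(heads["speaker_id"](d), labels["speaker"])
    l_spk = F.cross_entropy(heads["adv_spk"](grad_reverse(d)), labels["speaker"])
    l_con = F.cross_entropy(heads["adv_con"](grad_reverse(d)), labels["content"])
    return lam_sup * l_sup + lam_id * l_id + lam_spk * l_spk + lam_con * l_con
```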

2. Role in Camouflage Depression–Oriented Augmentation (CDoA)

The DAE underpins the construction of the Camouflage Depression–oriented Augmentation (CDoA) dataset, which explicitly pairs depressed acoustic patterns with positive or neutral (benign) text. This counteracts the semantic bias present in datasets like DAIC-WOZ, where linguistic sentiment and depression labels are often strongly correlated.

The process involves:

  • Extracting $\mathbf{d}_u$ embeddings from DAIC-WOZ utterances using the frozen DAE.
  • Using a sentiment classifier to stratify transcripts into benign and depressive text banks.
  • For each depressed subject, sampling benign texts for each severity bin and pairing them with the same speaker embedding and a depression condition embedding $\mathbf{c}_{\rm dep}$ derived from DAE outputs.
  • For each healthy subject, similarly pairing neutral texts with near-zero depression embeddings.
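
A schematic sketch of this pairing step follows; all names (`subjects`, `text_banks`, the per-bin sample count) are illustrative assumptions, and zeroing the condition embedding for healthy subjects stands in for "near-zero".

```python
# Hypothetical CDoA pairing loop: mismatched benign text is combined with
# each depressed subject's speaker embedding and DAE-derived condition
# embedding c_dep.
import random

def build_cdoa_pairs(subjects, text_banks, n_per_bin: int = 4):
    pairs = []
    for subj in subjects:
        bank = text_banks["benign"] if subj.depressed else text_banks["neutral"]
        for text in random.sample(bank, n_per_bin):
            pairs.append({
                "text": text,                     # mismatched semantic cue
                "speaker_emb": subj.speaker_emb,  # same speaker identity
                "c_dep": subj.c_dep if subj.depressed else 0.0 * subj.c_dep,
            })
    return pairs
```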

The resulting CDoA dataset comprises 5,760 utterances, balanced between healthy and depressed samples and exhibiting a strong acoustic-semantic mismatch, a configuration nearly absent in the natural corpus.

3. Depression Style Modulation in TTS

The DAE’s embeddings inform modulation of depressive severity in TTS synthesis using a flow-matching decoder with FiLM-based conditioning. Here, the 32-dimensional depression embedding is input into a small MLP, outputting per-layer FiLM scale–shift parameters $(\gamma_i, \beta_i)$.

At each TTS decoder block, the feature map $\mathbf{h}_i$ is modulated as:

$$\widehat{\mathbf{h}}_i = \gamma_i \odot \mathbf{h}_i + \beta_i,$$

enabling global and uniform injection of depression style without perturbing linguistic or speaker traits. The decoder learns a time-continuous transport from Gaussian noise to mel spectrograms, optimizing a per-frame flow-matching loss, duration MSE, and prior regularization.
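
A minimal sketch of the FiLM pathway; the decoder depth (6 layers) and channel width (256) are illustrative assumptions, not values from the paper.

```python
# An MLP maps the 32-dim depression embedding to per-layer (gamma_i, beta_i)
# pairs; each decoder block then applies gamma_i * h_i + beta_i.
import torch
import torch.nn as nn

class FiLMConditioner(nn.Module):
    def __init__(self, emb_dim: int = 32, n_layers: int = 6, channels: int = 256):
        super().__init__()
        self.n_layers, self.channels = n_layers, channels
        self.mlp = nn.Sequential(
            nn.Linear(emb_dim, 128),
            nn.SiLU(),
            nn.Linear(128, n_layers * 2 * channels),
        )

    def forward(self, c_dep: torch.Tensor):
        # c_dep: (batch, emb_dim) -> one (gamma, beta) pair per decoder layer
        p = self.mlp(c_dep).view(-1, self.n_layers, 2, self.channels)
        return [(p[:, i, 0], p[:, i, 1]) for i in range(self.n_layers)]

def film(h: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor) -> torch.Tensor:
    # h: (batch, frames, channels); broadcast the modulation over frames
    return gamma.unsqueeze(1) * h + beta.unsqueeze(1)
```

At synthesis time, decoder block $i$ would apply `film(h_i, gamma_i, beta_i)` with its own parameter pair, keeping the modulation global and uniform across frames.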

This architecture enables fine-grained, interpretable, and clinically validated control of depressive acoustic cues in synthetic speech.

4. Prototype-Based Severity Manifold

To support continuous control of depressive severity, subject-level embeddings $\mathbf{d}^{(\rm subj)}_j$ (averaged over utterances per subject) are grouped by PHQ-8 bins to compute prototype embeddings:

$$\bar{\mathbf{p}}_k = \frac{1}{N_k} \sum_{j\in\mathcal{S}_k} \mathbf{d}^{(\rm subj)}_j, \qquad \mathbf{p}_k = \frac{\bar{\mathbf{p}}_k}{\|\bar{\mathbf{p}}_k\|_2}.$$

For any target PHQ-8 score $s$, the two nearest prototypes $\mathbf{p}_i$ and $\mathbf{p}_{i+1}$ are interpolated in latent space by spherical linear interpolation (SLERP) to yield

$$\mathbf{c}_{\rm dep}(s) = \mathrm{slerp}\!\bigl(\mathbf{p}_i,\,\mathbf{p}_{i+1};\,\tau(s)\bigr),$$

where $\tau(s)\in[0,1]$ is the interpolation coefficient determined by the position of $s$ between the two prototypes' bins, supporting continuous, smooth, and interpretable severity manipulation in synthetic samples.
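
A NumPy sketch of the prototype construction and the SLERP step; the linear mapping of $s$ to $\tau(s)$ within the bracketing bin pair is an assumption, as the paper's exact mapping is not given above.

```python
# Prototype severity manifold: per-bin mean of subject embeddings,
# L2-normalized, then SLERP between the two bracketing prototypes.
import numpy as np

def prototypes(subj_embs: np.ndarray, bins: np.ndarray, num_bins: int) -> np.ndarray:
    # subj_embs: (subjects, 32) subject-level embeddings; bins: (subjects,)
    protos = []
    for k in range(num_bins):
        p_bar = subj_embs[bins == k].mean(axis=0)     # per-bin mean
        protos.append(p_bar / np.linalg.norm(p_bar))  # L2-normalize
    return np.stack(protos)

def slerp(p0: np.ndarray, p1: np.ndarray, tau: float) -> np.ndarray:
    omega = np.arccos(np.clip(np.dot(p0, p1), -1.0, 1.0))
    if omega < 1e-6:  # nearly parallel prototypes: fall back to lerp
        return (1.0 - tau) * p0 + tau * p1
    return (np.sin((1.0 - tau) * omega) * p0 + np.sin(tau * omega) * p1) / np.sin(omega)
```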

5. Empirical Impact and Evaluation Metrics

The DAE and the CDoA dataset derived from it are evaluated in depression detection settings using three baselines: DepAudioNet (CNN+LSTM), NUSD (ECAPA-TDNN with speaker-disentangling branches), and HAREN-CTC (hierarchical WavLM+CTC). Metrics include subject-level Macro-F1, sensitivity, specificity, embedding disentanglement (speaker EER, Similarity Gap, content MSE, $R^2$, CKA), depression ROC-AUC, synthesized TTS word error rate (WER, via Whisper-Large), speaker similarity (SIM-o), and controllability via Concordance Index and Spearman's $\rho$.

Augmentation with CDoA raises subject-level Macro-F1 to 0.526 (+9.1%) for DepAudioNet, 0.577 (+12.3%) for NUSD, and 0.551 (+5.0%) for HAREN-CTC relative to their respective baselines, and yields more balanced sensitivity and specificity under "camouflaged" evaluation. Ablation studies confirm the necessity of clinically informed acoustic styles and FiLM-based conditioning for maintaining controllable severity and robust detection (Li et al., 1 Jan 2026).

6. Broader Implications and Future Directions

By providing embeddings that are disentangled from semantic and speaker confounds, the DAE enables construction of datasets and synthesis systems that force downstream depression detectors to rely on invariant vocal biomarkers rather than text sentiment. This substantially improves robustness in scenarios such as camouflaged depression.

Beyond data augmentation, the approach offers a controllable depression synthesizer for:

  • Simulation of clinical dialogues where real depressed speech is scarce.
  • Training conversational agents capable of empathically modulating spoken cues.
  • Clinical perceptual studies validating acoustic severity levels.

Planned extensions encompass multilingual adaptation, demographic balancing, clinician-in-the-loop prototype calibration, and audio watermarking for safe usage monitoring. Reproduction requires access to DAIC-WOZ, pretrained WavLM-Large and HuBERT, the Matcha-TTS codebase, implementation of GRL branches, and standard evaluation pipelines (Li et al., 1 Jan 2026).
