Depression Acoustic Encoder: Disentangled Modeling

This presentation explores a neural approach to extracting depression severity from speech while suppressing confounding factors like speaker identity and verbal content. The Depression Acoustic Encoder uses adversarial training to produce embeddings that capture vocal biomarkers of depression, enabling the creation of synthetic datasets where acoustic cues and linguistic sentiment are deliberately mismatched—a configuration critical for detecting camouflaged depression and building more robust clinical systems.
Script
Speech carries hidden signatures of mental health, but traditional systems confuse depression with the words someone chooses or whose voice is speaking. The Depression Acoustic Encoder isolates the vocal biomarkers of depression severity while stripping away speaker identity and linguistic content, creating embeddings that reveal what conventional methods miss.
Natural clinical datasets present a threefold problem. Depression labels correlate strongly with the sentiment of spoken words, so detectors learn to cheat by reading transcripts. Speaker identity bleeds into acoustic embeddings, confounding severity with voice characteristics. And the actual words spoken obscure the prosodic and timbral cues that signal depression across all content. To build robust systems, we must disentangle these factors at the encoding stage.
The encoder achieves this through adversarial multi-task learning.
The encoder accepts frame-level WavLM features and produces a 32-dimensional depression embedding. It trains four heads simultaneously: an ordinal regression head predicting PHQ-8 severity, a speaker ID head that encourages identity signal, and two adversarial heads that use gradient reversal layers to actively suppress speaker and phoneme information. This min-max game forces the embedding to encode depression while rejecting confounds, reaching 0.693 ROC-AUC discriminability on clinical data.
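The gradient reversal trick at the heart of this min-max game is simple to sketch. The following is a minimal, framework-free illustration (a real implementation would use an autograd framework's custom backward pass); the class name and the fixed λ here are illustrative assumptions, not the paper's code:

```python
class GradientReversal:
    """Identity in the forward pass; scales gradients by -lambda going backward.

    Placed between the shared encoder and an adversarial head (speaker or
    phoneme classification), it makes the encoder *maximize* the adversary's
    loss while the head itself still minimizes it.
    """

    def __init__(self, lam=1.0):
        self.lam = lam  # reversal strength (often annealed during training)

    def forward(self, x):
        # The adversarial head sees the embedding unchanged.
        return x

    def backward(self, grad_output):
        # Sign flip: gradient descent on the encoder becomes gradient
        # ascent on the adversary's objective.
        return [-self.lam * g for g in grad_output]


grl = GradientReversal(lam=1.0)
emb = [0.1, 0.2]
assert grl.forward(emb) == emb      # unchanged on the way forward
print(grl.backward([0.5, -2.0]))    # reversed on the way back: [-0.5, 2.0]
```

Because the reversal only touches the backward pass, the adversarial heads can be trained with ordinary cross-entropy losses; no alternating optimization is needed.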
Using frozen encoder embeddings, we construct a synthetic dataset that traditional methods cannot handle: depressed vocal patterns paired with benign transcripts, and healthy speech paired with neutral language. This Camouflage Depression oriented Augmentation dataset contains 5,760 utterances where acoustic and semantic cues are deliberately misaligned, forcing detectors to rely on the invariant vocal biomarkers rather than text sentiment. It's the acoustic equivalent of a patient who sounds depressed while saying everything is fine.
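The mismatched pairing can be illustrated with a small sketch. Assume two pools of frozen encoder embeddings, one from depressed and one from healthy speech, plus transcript pools tagged by sentiment; the function name, field names, and toy data below are hypothetical, not the paper's actual pipeline:

```python
import random

def build_camouflage_pairs(depressed_embs, healthy_embs,
                           benign_texts, neutral_texts, n_pairs, seed=0):
    """Pair acoustic embeddings with deliberately mismatched transcripts,
    so a detector cannot lean on transcript sentiment."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs // 2):
        # Camouflaged positive: sounds depressed, says everything is fine.
        pairs.append({"embedding": rng.choice(depressed_embs),
                      "text": rng.choice(benign_texts),
                      "label": 1})
        # Control negative: healthy voice with neutral content.
        pairs.append({"embedding": rng.choice(healthy_embs),
                      "text": rng.choice(neutral_texts),
                      "label": 0})
    rng.shuffle(pairs)
    return pairs

dataset = build_camouflage_pairs(
    depressed_embs=[[0.9, 0.1], [0.8, 0.2]],
    healthy_embs=[[0.1, 0.9], [0.2, 0.8]],
    benign_texts=["I feel great today", "Everything is fine"],
    neutral_texts=["The meeting is at noon", "It rained yesterday"],
    n_pairs=8,
)
print(len(dataset))  # 8 utterance-level pairs with balanced labels
```

The key property is that the label is carried entirely by the acoustic embedding: within each class, the transcripts offer no sentiment signal to exploit.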
These embeddings also enable precise control of depression severity in synthetic speech.
The 32-dimensional embeddings drive a flow-matching text-to-speech decoder through Feature-wise Linear Modulation. At each block, the embedding generates scale and shift parameters that uniformly inject depressive acoustic style without perturbing speaker or linguistic traits. Subject-level embeddings grouped by severity bin form prototypes; spherical interpolation between prototypes produces a smooth, continuous severity manifold, enabling fine-grained control from healthy to severely depressed speech while holding everything else constant.
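The two mechanisms above, FiLM conditioning and spherical interpolation, can be sketched in a few lines. This is a simplified illustration in which the learned linear maps that produce scale and shift are folded into plain weight matrices; the function names and weights are assumptions for the sketch:

```python
import math

def film(features, embedding, w_scale, w_shift):
    """Feature-wise Linear Modulation: the depression embedding generates a
    per-channel scale (gamma) and shift (beta) applied to decoder features."""
    gamma = [sum(w * e for w, e in zip(row, embedding)) for row in w_scale]
    beta = [sum(w * e for w, e in zip(row, embedding)) for row in w_shift]
    return [g * f + b for g, f, b in zip(gamma, features, beta)]

def slerp(p0, p1, t):
    """Spherical interpolation between two unit-norm severity prototypes;
    t sweeps a continuous severity control from p0 (t=0) to p1 (t=1)."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(p0, p1))))
    theta = math.acos(dot)
    if theta < 1e-6:  # prototypes (nearly) coincide
        return list(p0)
    s = math.sin(theta)
    return [(math.sin((1 - t) * theta) * a + math.sin(t * theta) * b) / s
            for a, b in zip(p0, p1)]

# Interpolating halfway between a "healthy" and a "severe" prototype
healthy, severe = [1.0, 0.0], [0.0, 1.0]
mid = slerp(healthy, severe, 0.5)
print(mid)  # stays on the unit sphere: [0.707..., 0.707...]
```

Because slerp interpolates along the sphere rather than the chord, intermediate points keep unit norm, which is why the resulting severity manifold stays smooth rather than collapsing toward the origin between prototypes.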
When three baseline depression detectors are augmented with the camouflaged dataset, performance jumps dramatically. DepAudioNet gains 9.1 percent in Macro-F1, NUSD improves by 12.3 percent, and HAREN-CTC sees a 5 percent lift. More importantly, sensitivity and specificity become balanced under camouflaged evaluation conditions, meaning the models can now detect depression even when linguistic sentiment provides no useful signal. The disentangled embeddings force systems to learn the acoustic biomarkers that generalize.
Disentangled modeling addresses a core vulnerability in mental health AI: the tendency to rely on easily gamed textual cues. By isolating vocal biomarkers, the encoder enables detectors that work when patients camouflage their depression linguistically, a scenario common in clinical practice. It also provides a principled way to synthesize training data for rare conditions, run perceptual studies with clinician-validated severity levels, and train conversational agents to modulate empathic vocal cues. The architecture transforms depression detection from pattern matching on sentiment to true acoustic biomarker discovery.
When speech signals are disentangled, the voice reveals what words conceal. Visit EmergentMind.com to explore more research at the intersection of clinical AI and robust representation learning, and create your own video presentations.