
Camouflage Depression–Oriented Augmentation

Updated 2 March 2026
  • Camouflage Depression–Oriented Augmentation (CDoA) is a data-centric framework that decouples depressive acoustic markers from underlying text sentiment.
  • It utilizes the DepFlow architecture with disentangled depression embeddings, FiLM-based TTS, and prototype severity mapping to generate camouflaged depression speech samples.
  • Evaluations on DAIC-WOZ show that CDoA improves detection robustness by reducing sentiment-induced biases, with relative gains in subject-level macro-F1 of up to 12%.

Camouflage Depression–Oriented Augmentation (CDoA) refers to a data-centric framework designed to overcome sentiment-induced biases in automatic depression detection from speech. Rather than relying on superficial correlations between negative sentiment and depression—as found in widely used datasets like DAIC-WOZ—CDoA deliberately synthesizes utterances exhibiting “camouflaged” depression: speech with depressive acoustic signatures paired with positive or neutral semantics. This breaks the semantic shortcuts typically learned by models, compelling them to attend to invariant acoustic markers of depression. CDoA is operationalized using the DepFlow neural architecture, which disentangles depression acoustics from identity and content, thus enabling explicit control over semantic and affective properties in synthesized speech (Li et al., 1 Jan 2026).

1. Motivation and Statistical Shortcut Problem

The core impetus for CDoA derives from empirical observations within DAIC-WOZ and similar mental health corpora, where linguistic sentiment is strongly coupled to depression annotation. In the DAIC-WOZ training set, healthy subjects produce 35.2% positive utterances and 30.2% negative; depressed subjects invert this (30.0% positive, 36.2% negative), confirming sentiment–diagnosis entanglement. Classifiers trained on such data often acquire spurious “shortcuts” by equating negative sentiment with depression, undermining robustness—especially in cases of camouflaged depression, where affected individuals maintain socially positive or neutral surface language despite underlying pathology. The primary objective of CDoA is thus to construct training data exhibiting genuine depressive acoustics devoid of negative linguistic sentiment, thereby enforcing reliance on acoustic—not semantic—biomarkers.

2. Dataset Construction and Sampling Strategy

CDoA begins by stratifying DAIC-WOZ utterances into a benign text bank (positive/neutral utterances) and a depressive text bank (negative utterances), as labeled by DeepSeek-R1. Synthetic utterances are then generated by pairing depressed (and healthy) acoustic patterns with only the benign text bank, resulting in acoustic-semantic mismatches absent from natural corpora. Class balance is strictly enforced: 2,880 depressed and 2,880 healthy utterances, further stratified by PHQ-8 severity bins—Normal, Mild, Moderate, Moderately Severe, Severe—with per-bin quotas (e.g., Severe: 194 utterances per subject). Each synthetic utterance thus aligns a speaker’s depression severity (as encoded by PHQ-8 score) with content atypical for that class, amplifying underrepresented cases and coverage across the depression continuum.

Severity bin | Normal | Mild | Moderate | Mod. Severe | Severe
Utterances per subject | 13 | 34 | 2 | 91 | 194
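The sampling strategy above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the function names `split_text_banks` and `plan_synthesis`, the sentiment label strings, and sampling with replacement are all assumptions made for the example. Only the per-bin quotas come from the table.

```python
import random

# Per-subject utterance quotas by PHQ-8 severity bin (values from the table above).
QUOTAS = {"Normal": 13, "Mild": 34, "Moderate": 2, "Mod. Severe": 91, "Severe": 194}

def split_text_banks(utterances):
    """Split (text, sentiment) pairs into a benign and a depressive text bank.

    The label strings "positive", "neutral", "negative" are assumed here.
    """
    benign = [text for text, label in utterances if label in ("positive", "neutral")]
    depressive = [text for text, label in utterances if label == "negative"]
    return benign, depressive

def plan_synthesis(subjects, benign_bank, seed=0):
    """Assign each subject benign transcripts according to its severity-bin quota.

    `subjects` maps a subject id to its severity bin. Transcripts are drawn
    with replacement (an assumption, so that small text banks still work).
    """
    rng = random.Random(seed)
    plan = []
    for subject_id, severity_bin in sorted(subjects.items()):
        for _ in range(QUOTAS[severity_bin]):
            plan.append((subject_id, severity_bin, rng.choice(benign_bank)))
    return plan
```

Each entry of the returned plan names a subject, its severity bin, and the benign transcript to synthesize, so only positive or neutral text is ever paired with a severity condition.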

3. DepFlow Architecture and CDoA Synthesis Workflow

The DepFlow framework underpins CDoA data synthesis through three components:

I. Depression Acoustic Encoder (DAE):

  • Inputs frame-level WavLM-Large features $\{x_t\}_{t=1}^T$; projects via ReLU and attention pooling to a 32-dimensional depression embedding $d$.
  • Multi-head, multitask objective: ordinal regression for PHQ-8 (supervised loss $L_{sup}$); speaker and content adversarial heads (losses $L_{adv-spk}$, $L_{adv-con}$ via gradient reversal) to enforce speaker/content invariance.
  • Aggregated loss: $L_{total} = \lambda_{sup} L_{sup} + \lambda_{id} L_{id} + \lambda_{spk} L_{adv-spk} + \lambda_{con} L_{adv-con}$.
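  • Empirical properties: ROC-AUC ≈ 0.693 for depression, EER ≈ 0.355 for speaker verification, and R² ≈ 0.21 for content regression. The high speaker EER and low content R² indicate that the embedding retains little speaker or content information, which is the intended disentanglement.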
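To make the encoder's pooling step and aggregated loss concrete, here is a toy sketch in plain Python. It is not the DepFlow implementation: the real encoder operates on WavLM-Large features and outputs 32 dimensions, the scoring of frames by a single vector is an assumed form of attention pooling, and the gradient-reversal heads are omitted because they require an autograd framework. All function names are hypothetical.

```python
import math

def relu_project(frame, weight, bias):
    """Linear projection followed by ReLU: max(0, W x + b)."""
    return [max(0.0, sum(w * x for w, x in zip(row, frame)) + b)
            for row, b in zip(weight, bias)]

def attention_pool(frames, score_vector):
    """Softmax-weighted average of frame vectors along the time axis.

    Scoring each frame with one vector is an assumption for illustration.
    """
    scores = [sum(v * h for v, h in zip(score_vector, frame)) for frame in frames]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frames[0])
    pooled = [sum(w * frame[j] for w, frame in zip(weights, frames)) for j in range(dim)]
    return pooled, weights

def total_loss(losses, lambdas):
    """Weighted sum of the individual loss terms, as in L_total."""
    return sum(lambdas[name] * value for name, value in losses.items())
```

With a zero score vector every frame receives equal weight, so the pooled embedding reduces to the plain temporal mean; a learned score vector lets the encoder emphasize frames that carry depression-related cues.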

II. Flow-Matching TTS with FiLM:

  • Extends Matcha-TTS with FiLM modulation.
  • Conditioned on (phoneme sequence, speaker embedding, depression control embedding $c_{dep}$).
  • FiLM layers: at decoder block $i$, a channel-wise scale and shift are generated from $c_{dep}$: $\hat{h}_i = \gamma_i \odot h_i + \beta_i$.
  • Loss: $L_{dur}$ (duration MSE), $L_{prior}$ (prior), $L_{fm}$ (CFM decoder); total $L = L_{dur} + \lambda_p L_{prior} + L_{fm}$.
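The FiLM modulation described above amounts to a per-channel affine transform whose parameters depend on the control embedding. The sketch below shows that operation on nested lists; the linear maps that produce the scale and shift, and the names `linear` and `film`, are assumptions for illustration rather than details taken from the paper.

```python
def linear(vec, weight, bias):
    """Plain affine map W v + b."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]

def film(hidden, c_dep, w_gamma, b_gamma, w_beta, b_beta):
    """Apply h_hat = gamma * h + beta channel-wise.

    `hidden` is a list of time steps, each a list of channel activations.
    gamma and beta are computed once from c_dep and shared across time steps.
    """
    gamma = linear(c_dep, w_gamma, b_gamma)
    beta = linear(c_dep, w_beta, b_beta)
    return [[g * h + b for g, h, b in zip(gamma, step, beta)] for step in hidden]
```

Because the scale and shift depend only on the control embedding, changing that embedding alters every decoder activation while the phoneme and speaker inputs stay fixed.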

III. Prototype-Based Severity Mapping:

  • Calculates subject-level depression prototypes $p_k$ for each PHQ-8 bin as the $l_2$-normalized mean of subject embeddings.
  • At inference, the desired score $s$ is mapped to prototypes via $\alpha(s) = \mathrm{clip}((s-12)/12, -1, 1)$, then $c_{dep}^{(infer)}(s) = \mathrm{slerp}(p_i, p_{i+1}; \tau(s))$.
  • Yields smoothly interpolable, interpretable depression control over the synthesized speech continuum.
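The prototype mapping can be sketched as follows. The prototype computation, the clipping function, and spherical interpolation follow the formulas above. How the bin index $i$ and the interpolation weight $\tau(s)$ are derived from $\alpha(s)$ is not specified here, so `control_embedding` uses an assumed rule (rescaling $\alpha$ to a fractional position along the ordered prototypes) purely for illustration.

```python
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def prototype(embeddings):
    """l2-normalized mean of the subject embeddings in one severity bin."""
    dim = len(embeddings[0])
    mean = [sum(e[j] for e in embeddings) / len(embeddings) for j in range(dim)]
    return l2_normalize(mean)

def alpha(score):
    """alpha(s) = clip((s - 12) / 12, -1, 1)."""
    return max(-1.0, min(1.0, (score - 12) / 12))

def slerp(p, q, t):
    """Spherical interpolation between unit vectors p and q (not antipodal)."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(p, q))))
    omega = math.acos(dot)
    if omega < 1e-8:  # nearly identical vectors: nothing to interpolate
        return list(p)
    s = math.sin(omega)
    wp, wq = math.sin((1 - t) * omega) / s, math.sin(t * omega) / s
    return [wp * a + wq * b for a, b in zip(p, q)]

def control_embedding(score, prototypes):
    """ASSUMED rule: map alpha(s) in [-1, 1] to a position along the prototypes.

    The integer part selects the pair (p_i, p_{i+1}); the fractional part
    plays the role of tau(s). The source does not define this mapping.
    """
    position = (alpha(score) + 1) / 2 * (len(prototypes) - 1)
    i = min(int(position), len(prototypes) - 2)
    return slerp(prototypes[i], prototypes[i + 1], position - i)
```

Interpolating on the sphere rather than linearly keeps the control embedding at unit norm, which matches the normalization applied to the prototypes.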

CDoA Procedure:

  1. Map PHQ-8 score → $c_{dep}$ via prototype+SLERP.
  2. Randomly sample a benign transcript.
  3. Run DepFlow TTS conditioned on (phonemes, speaker ID, $c_{dep}$) → waveform.

By construction, this yields speech acoustically resembling a depressed subject uttering semantically positive/neutral content, simulating “camouflaged” depression.
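The three steps of the procedure can be strung together as below. This is only a skeleton under stated assumptions: `to_control` and `synthesize` are hypothetical callables standing in for the prototype mapping and the DepFlow TTS model, and phonemization is assumed to happen inside `synthesize`.

```python
import random

def cdoa_sample(phq8_score, speaker_id, benign_bank, to_control, synthesize, rng):
    """Generate one camouflaged-depression sample.

    `to_control` maps a PHQ-8 score to a control embedding and `synthesize`
    stands in for the TTS model; both are hypothetical placeholders.
    """
    c_dep = to_control(phq8_score)                        # step 1: severity -> control embedding
    transcript = rng.choice(benign_bank)                  # step 2: benign transcript
    waveform = synthesize(transcript, speaker_id, c_dep)  # step 3: conditioned synthesis
    return {"text": transcript, "speaker": speaker_id,
            "phq8": phq8_score, "audio": waveform}
```

Passing the models in as callables keeps the sampling logic testable with stubs, independent of any particular TTS backend.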

4. Evaluation Protocols and Detection Robustness

CDoA-augmented data is evaluated using three representative depression detection architectures on the DAIC-WOZ test set:

  • DepAudioNet (CNN+RNN on log-Mel)
  • NUSD (TDNN with speaker-disentangling)
  • HAREN-CTC (self-supervised hierarchical on raw waveforms)

Baselines include: no augmentation, FrAUG or SpecAugment, Mixup, and Instruct-TTS CDoA with generic voice-cloning.

Performance, measured as subject-level macro-F1, demonstrates consistent improvement with DepFlow-based CDoA:

  • DepAudioNet: 0.482 → 0.526 (+9% relative)
  • NUSD: 0.514 → 0.577 (+12% relative)
  • HAREN-CTC: 0.525 → 0.551 (+5% relative)
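The percentages quoted above are relative gains over the unaugmented baseline, which can be checked directly from the macro-F1 scores. The helper name `relative_gain` is introduced here for illustration.

```python
# (baseline macro-F1, macro-F1 with DepFlow-based CDoA), taken from the list above.
RESULTS = {
    "DepAudioNet": (0.482, 0.526),
    "NUSD": (0.514, 0.577),
    "HAREN-CTC": (0.525, 0.551),
}

def relative_gain(baseline, augmented):
    """Relative improvement in percent: (augmented - baseline) / baseline * 100."""
    return (augmented - baseline) / baseline * 100

for name, (baseline, augmented) in RESULTS.items():
    print(f"{name}: +{relative_gain(baseline, augmented):.0f}%")
```

In absolute terms the gains are 0.044, 0.063, and 0.026 macro-F1 points respectively.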

DepFlow CDoA outperforms all baselines, including generic TTS augmentations, indicating that depression-aware acoustic conditioning specifically yields systematic robustness gains. Improvements in sensitivity/specificity balance further show these models become less dependent on sentiment cues and better at detecting depression with semantic camouflage (Li et al., 1 Jan 2026).

5. Implications for Training, Evaluation, and Conversational Systems

CDoA enables robustness training by teaching classifiers to disregard misleading sentiment signals and rely instead on invariant acoustic biomarkers. DepFlow’s controllability over both severity and surface semantics supports systematic stress-testing of detectors, such as varying depression severity at fixed content and vice versa. This controllable synthesis paradigm is directly applicable for simulation-based evaluation, particularly valuable where real-world clinical coverage is sparse or subject to ethical constraints.

Additionally, DepFlow allows the generation of clinical-style utterances at arbitrary severity levels for conversational systems like virtual interviewers or therapy bots, providing a pathway to address data scarcity and privacy concerns while maintaining clinical plausibility and diversity in training data.

6. Summary and Future Directions

CDoA operationalizes the deliberate dissociation of acoustic and semantic features in depression data augmentation. By leveraging disentangled depression embeddings, flow-matching speech synthesis with FiLM, and prototype-based severity control, CDoA introduces a rich set of underrepresented yet clinically plausible samples essential for robust depression detection. Empirical results validate the strategy of breaking dataset-induced semantic shortcuts, supporting the use of fine-grained, controllable augmentation mechanisms for mental health applications. Future directions include broader application of disentangled TTS in psychiatric and affective computing, expanded coverage of underrepresented mental health states, and integration into real-world clinical and conversational deployment scenarios (Li et al., 1 Jan 2026).
