
SmoothCLAP: Soft-Target Enhanced CLAP

Updated 25 January 2026
  • SmoothCLAP is a soft-target enhanced variant of CLAP that diffuses supervisory signals across semantically related samples to capture fuzzy emotional boundaries.
  • It integrates intra-modal and paralinguistic similarities using a wav2vec2.0-large audio encoder and a BERT-base text encoder for improved embedding quality.
  • Empirical evaluations demonstrate improved zero-shot performance on multiple emotion benchmarks, showcasing robust cross-lingual transfer and nuanced affective representation.

SmoothCLAP is a soft-target enhanced variant of Contrastive Language–Audio Pretraining (CLAP) designed to address the inherent ambiguity and graded boundaries of human emotions in affective computing tasks. By diffusing supervisory signal among semantically related and paralinguistically similar samples, SmoothCLAP produces embeddings that better respect psychological emotion spaces and achieves improved zero-shot performance across diverse benchmarks (Jing et al., 18 Jan 2026).

1. Foundations: CLAP and Its Limitations

Contrastive Language–Audio Pretraining (CLAP) models aim to map paired audio and text samples into a shared latent space by jointly optimizing an audio encoder f_a and a text encoder f_t. The embedded representations are:

  • Audio: z_i^a = \mathrm{proj}_A\bigl(f_a(A_i)\bigr) \in \mathbb{R}^d
  • Text: z_i^t = \mathrm{proj}_T\bigl(f_t(T_i)\bigr) \in \mathbb{R}^d

Pairwise similarity is computed via scaled cosine similarity:

s_{ij} = \frac{z_i^a \cdot z_j^t}{\tau}

where \tau is a learnable temperature.

Supervision relies on a symmetric InfoNCE objective over a batch of B samples:

\mathcal{L}_\mathrm{CLAP} = -\frac{1}{2B} \sum_{i=1}^B \left[\log\frac{\exp(s_{ii})}{\sum_{j=1}^B \exp(s_{ij})} + \log\frac{\exp(s_{ii})}{\sum_{j=1}^B \exp(s_{ji})}\right]

This enforces rigid one-to-one alignment of audio–text pairs and treats all non-matching pairs as equally negative.
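The objective can be sketched in pure Python. This is a minimal illustration, not the authors' implementation; the B×B similarity matrix S is a hypothetical input standing in for the precomputed s_ij of a batch:

```python
import math

def symmetric_infonce(S):
    """Symmetric InfoNCE over a B x B similarity matrix S = [s_ij].

    Averages the audio-to-text (row-wise softmax) and text-to-audio
    (column-wise softmax) cross-entropy terms, as in CLAP.
    """
    B = len(S)
    loss = 0.0
    for i in range(B):
        row = [S[i][j] for j in range(B)]  # s_i. : audio i vs. all texts
        col = [S[j][i] for j in range(B)]  # s_.i : all audios vs. text i
        log_p_a2t = S[i][i] - math.log(sum(math.exp(s) for s in row))
        log_p_t2a = S[i][i] - math.log(sum(math.exp(s) for s in col))
        loss -= log_p_a2t + log_p_t2a
    return loss / (2 * B)
```

With uniform similarities the loss equals log B (chance level); it decreases as the diagonal entries dominate their rows and columns.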

For affective computing, such hard supervision is mismatched to the fuzzy category structure of emotions (e.g., "fear" and "disgust" overlap more than "fear" and "happiness"), as emotional states are not cleanly separable.

2. Motivations for Soft-Target Supervision

SmoothCLAP introduces soft targets to remedy CLAP’s disregard of intra-modal similarity and rich paralinguistic structure:

  • Intra-modal similarity: Emotionally neighboring samples often share acoustic or semantic features that should partially overlap in supervision.
  • Paralinguistic features: Computationally derived vocal cues (pitch, intensity, jitter, etc.) encode nuanced affective information.

Instead of enforcing all non-matching samples as strictly negative, SmoothCLAP diffuses some probability mass to samples identified as semantically or paralinguistically close in “emotion space”.

3. Architecture and Training Objective

SmoothCLAP consists of parallel audio and text branches:

  • Audio branch: wav2vec2.0-large encoder (finetuned on Speech Emotion Recognition) is used for both global and frame-level (“local”) feature extraction. Parameters are frozen during training.
  • Text branch: BERT-base encoder, trainable, with a fully-connected projection.

Soft Target Construction

Within each minibatch of B samples, intra-modal similarity is computed:

Audio-to-audio (a2a): For mean-pooled local features \bar\ell^a_i:

q^{a2a}_{ij} = \frac{\exp\bigl( \bar\ell^a_i \cdot \bar\ell^a_j / \tau_{a2a} \bigr)}{ \sum_{k=1}^B \exp\bigl( \bar\ell^a_i \cdot \bar\ell^a_k / \tau_{a2a} \bigr) }

Text-to-text (t2t): With \ell_2-normalized embeddings e^t_i:

q^{t2t}_{ij} = \frac{ \exp\bigl( e^t_i \cdot e^t_j / \tau_{t2t} \bigr) }{ \sum_{k=1}^B \exp\bigl( e^t_i \cdot e^t_k / \tau_{t2t} \bigr) }

Mixture: a coefficient \gamma controls mixing:

q_{ij} = (1-\gamma)\, q^{a2a}_{ij} + \gamma\, q^{t2t}_{ij}

Fusion with one-hot targets (fusion factor \beta):

y_{ij} = (1-\beta)\, \delta_{ij} + \beta\, q_{ij}
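The soft-target construction can be sketched as follows. The matrices A and T are hypothetical inputs standing in for the precomputed intra-modal dot products (\bar\ell^a_i \cdot \bar\ell^a_j and e^t_i \cdot e^t_j); this is an illustration under those assumptions, not the paper's code:

```python
import math

def softmax(scores, tau):
    """Temperature-scaled softmax over a list of raw similarity scores."""
    exps = [math.exp(s / tau) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def soft_targets(A, T, gamma=0.5, beta=0.1, tau_a2a=1.0, tau_t2t=1.0):
    """Soft targets y_ij from intra-modal similarities.

    A: B x B audio-to-audio dot products (mean-pooled local features)
    T: B x B text-to-text dot products (L2-normalized embeddings)
    """
    B = len(A)
    Y = []
    for i in range(B):
        q_a2a = softmax(A[i], tau_a2a)  # row-wise softmax -> q^{a2a}_i
        q_t2t = softmax(T[i], tau_t2t)  # row-wise softmax -> q^{t2t}_i
        q = [(1 - gamma) * a + gamma * t for a, t in zip(q_a2a, q_t2t)]
        # Fuse with the one-hot target delta_ij
        Y.append([(1 - beta) * (1.0 if i == j else 0.0) + beta * qj
                  for j, qj in enumerate(q)])
    return Y
```

Each row of the result is a valid probability distribution that keeps at least 1-\beta of its mass on the matched pair and diffuses the rest to intra-modally similar samples.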

Objective

Predicted cross-modal distributions p^{a2t}_{ij} and p^{t2a}_{ij} are computed from the final projected and normalized embeddings. A symmetric KL-divergence loss is used:

\mathcal{L}_{\mathrm{soft}} = \frac{1}{2B} \sum_{i=1}^B \left\{ \mathrm{KL}(y_i \| p^{a2t}_i) + \mathrm{KL}(p^{a2t}_i \| y_i) + \mathrm{KL}(y_i \| p^{t2a}_i) + \mathrm{KL}(p^{t2a}_i \| y_i) \right\}

Setting \beta = 0 recovers the original InfoNCE objective.
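A minimal sketch of this objective, assuming the target rows Y and predicted cross-modal distribution rows P_a2t, P_t2a (rows of the softmaxed similarity matrices) are already computed; these names are illustrative, not from the paper:

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions with matching support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def soft_loss(Y, P_a2t, P_t2a):
    """Symmetric KL objective over soft targets Y and the predicted
    audio-to-text and text-to-audio distributions, averaged over 2B terms."""
    B = len(Y)
    total = 0.0
    for y, pa, pt in zip(Y, P_a2t, P_t2a):
        total += kl(y, pa) + kl(pa, y) + kl(y, pt) + kl(pt, y)
    return total / (2 * B)
```

The loss is zero exactly when the predicted distributions match the soft targets, and both KL directions stay finite because softmax outputs and fused targets (for \beta > 0) are strictly positive.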

Hyperparameters: Grid search found optimal values of \gamma \approx 0.5 and \beta \approx 0.1. Text-encoder learning rate: 1\times10^{-5}; projection and temperature learning rate: 1\times10^{-3}; batch size 32, 10 epochs, Adam optimizer.

4. Inference Pipeline

Inference with SmoothCLAP is identical to CLAP:

  1. Audio input is embedded via f_a and \mathrm{proj}_A.
  2. Candidate text labels are embedded via f_t and \mathrm{proj}_T.
  3. Cosine similarity z^a \cdot z^t_c is computed for each candidate label c.
  4. The label with highest score is selected, or retrieval is thresholded.

No computation of intra-modal similarity is performed at test time; the forward path is unchanged.
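Assuming the embeddings are already extracted, the pipeline reduces to a cosine-similarity argmax over candidate labels. A sketch under that assumption (`zero_shot_classify` is a hypothetical helper name; the embeddings stand in for proj_A(f_a(A)) and proj_T(f_t(T_c))):

```python
import math

def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def zero_shot_classify(audio_emb, label_embs, labels):
    """Return the candidate label whose text embedding has the highest
    cosine similarity with the audio embedding, plus that similarity."""
    za = l2_normalize(audio_emb)
    best_label, best_sim = None, -float("inf")
    for emb, label in zip(label_embs, labels):
        zt = l2_normalize(emb)
        sim = sum(a * t for a, t in zip(za, zt))
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label, best_sim
```

For retrieval rather than classification, the returned similarity can instead be compared against a threshold, as in step 4 above.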

5. Experimental Evaluation

Datasets: Training uses MSP-Podcast v1.9 (45,619 utterances, 10 emotion categories). Zero-shot evaluation spans eight corpora in English and German, including IEMOCAP, RAVDESS, CREMA-D, TESS, FAU-Aibo (2/5 class), ALC (intoxication), and SLD (likability).

Baselines: CLAP, Pengi (an audio language model), ParaCLAP.

Main Results (Unweighted Average Recall, UAR):

Dataset             CLAP   Pengi   ParaCLAP   SmoothCLAP
IEMOCAP (4cl/en)    .353   .345    .600       .606
RAVDESS (8cl/en)    .199   .148    .228       .175
CREMA-D (6cl/en)    .230   .245    .177       .266
TESS (7cl/en)       .232   .177    .170       .275
FAU-Aibo (2cl/de)   .500   .470    .526       .555
FAU-Aibo (5cl/de)   .211   .185    .197       .204
ALC (2cl/de)        .511   .473    .537       .541
SLD (2cl/de)        .472   .485    .507       .496

SmoothCLAP achieves best UAR on 5/8 tasks and second-best on 2/8, especially improving on IEMOCAP, CREMA-D, TESS, and FAU-Aibo (2-class). Cross-lingual transfer from English to German remains strong.

6. Ablation Studies

Local Feature Extractor:

Performance varies across local feature encoders. No single extractor excels universally, suggesting dataset-specific paralinguistic sensitivity.

LocFeat Encoder                IEMOCAP   RAVDESS   CREMA-D   TESS
wav2vec2.0-Emo (finetuned)     .606      .175      .266      .275
wav2vec2.0-Large (self-sup.)   .594      .173      .260      .368
WavLM-Large                    .595      .212      .201      .267
HuBERT-Large                   .574      .260      .259      .433

Impact of \gamma and \beta:

Grid search shows no monotonic trend for the mixing coefficient \gamma; the fusion factor \beta favors smaller values, indicating the need to retain strong hard-target signals.

7. Analysis and Future Directions

Soft targets facilitate improved learning by bridging fuzzy emotional categories—diffusing probability mass to semantically or paralinguistically neighboring samples enables a smoother embedding space, approximating human perceptual emotion similarity.

SmoothCLAP underperforms on 3/8 benchmarks, implying no approach is universally optimal for emotion recognition. Performance may depend on training data scale and diversity. Future improvements may involve adopting alternative similarity kernels (e.g., Gaussian over CP features), supporting multi-label or fuzzy annotation schemes, or end-to-end training of the local feature extractor.

SmoothCLAP demonstrates consistent zero-shot performance gains for affect-aware language–audio models by incorporating soft supervision rooted in intra-modal and paralinguistic similarities, producing more graded, psychologically plausible embeddings (Jing et al., 18 Jan 2026).
