SmoothCLAP: Soft-Target Enhanced CLAP
- SmoothCLAP is a soft-target enhanced variant of CLAP that diffuses supervisory signals across semantically related samples to capture fuzzy emotional boundaries.
- It integrates intra-modal and paralinguistic similarities using a wav2vec2.0-large audio encoder and a BERT-base text encoder for improved embedding quality.
- Empirical evaluations demonstrate improved zero-shot performance on multiple emotion benchmarks, showcasing robust cross-lingual transfer and nuanced affective representation.
SmoothCLAP is a soft-target enhanced variant of Contrastive Language–Audio Pretraining (CLAP) designed to address the inherent ambiguity and graded boundaries of human emotions in affective computing tasks. By diffusing supervisory signal among semantically related and paralinguistically similar samples, SmoothCLAP produces embeddings that better respect psychological emotion spaces and achieves improved zero-shot performance across diverse benchmarks (Jing et al., 18 Jan 2026).
1. Foundations: CLAP and Its Limitations
Contrastive Language–Audio Pretraining (CLAP) models aim to map paired audio and text samples into a shared latent space by jointly optimizing an audio encoder $f_a$ and a text encoder $f_t$. The embedded representations are:
- Audio: $a_i = f_a(x_i^{a}) \in \mathbb{R}^d$
- Text: $t_i = f_t(x_i^{t}) \in \mathbb{R}^d$

Pairwise similarity is computed via scaled cosine similarity:

$$S_{ij} = \frac{a_i^{\top} t_j}{\tau \,\lVert a_i \rVert \,\lVert t_j \rVert},$$

where the temperature $\tau$ is learnable.

Supervision relies on a symmetric InfoNCE objective over a batch of $N$ samples:

$$\mathcal{L}_{\mathrm{InfoNCE}} = -\frac{1}{2N} \sum_{i=1}^{N} \left[ \log \frac{\exp(S_{ii})}{\sum_{j=1}^{N} \exp(S_{ij})} + \log \frac{\exp(S_{ii})}{\sum_{j=1}^{N} \exp(S_{ji})} \right].$$

This enforces rigid one-to-one alignment of audio–text pairs and treats all non-matching pairs as equally negative.
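As a minimal sketch of the symmetric InfoNCE objective above (NumPy, with an illustrative fixed temperature rather than the learnable one):

```python
import numpy as np

def symmetric_infonce(audio_emb, text_emb, tau=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    audio_emb, text_emb: (N, d) arrays; row i of each is a matched pair.
    tau: temperature (learnable in CLAP; fixed here for illustration).
    """
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    s = a @ t.T / tau  # scaled cosine similarity matrix S
    # Log-softmax over rows (audio -> text) and columns (text -> audio).
    log_p_a2t = s - np.log(np.exp(s).sum(axis=1, keepdims=True))
    log_p_t2a = s - np.log(np.exp(s).sum(axis=0, keepdims=True))
    n = len(s)
    # Matched pairs sit on the diagonal; average both directions.
    return -(np.trace(log_p_a2t) + np.trace(log_p_t2a)) / (2 * n)
```

With perfectly aligned pairs the diagonal dominates and the loss approaches zero; misaligned pairings drive it up.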
For affective computing, such hard supervision is mismatched to the fuzzy category structure of emotions (e.g., "fear" and "disgust" overlap more than "fear" and "happiness"), since emotional states are not cleanly separable.
2. Motivations for Soft-Target Supervision
SmoothCLAP introduces soft targets to remedy CLAP’s disregard of intra-modal similarity and rich paralinguistic structure:
- Intra-modal similarity: Emotionally neighboring samples often share acoustic or semantic features that should partially overlap in supervision.
- Paralinguistic features: Computationally derived vocal cues (pitch, intensity, jitter, etc.) encode nuanced affective information.
Instead of enforcing all non-matching samples as strictly negative, SmoothCLAP diffuses some probability mass to samples identified as semantically or paralinguistically close in “emotion space”.
3. Architecture and Training Objective
SmoothCLAP consists of parallel audio and text branches:
- Audio branch: A wav2vec2.0-large encoder, fine-tuned for speech emotion recognition, is used for both global and frame-level ("local") feature extraction; its parameters are frozen during training.
- Text branch: BERT-base encoder, trainable, with a fully-connected projection.
Soft Target Construction
Within each minibatch of $N$ samples, intra-modal similarity is computed:
- Audio-to-audio (a2a): For mean-pooled local features $\bar{a}_i$: $S^{a2a}_{ij} = \cos(\bar{a}_i, \bar{a}_j)$
- Text-to-text (t2t): With $\ell_2$-normalized embeddings $t_i$: $S^{t2t}_{ij} = t_i^{\top} t_j$
- Mixture: A weight $\lambda$ controls mixing: $S^{\mathrm{intra}} = \lambda\, S^{a2a} + (1-\lambda)\, S^{t2t}$
- Fusion with one-hot targets ($\alpha$ the fusion factor): $\tilde{y}_{ij} = (1-\alpha)\, y^{\mathrm{one\text{-}hot}}_{ij} + \alpha\, \operatorname{softmax}_j\!\left(S^{\mathrm{intra}}_{ij}\right)$
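The soft-target construction can be sketched as follows (NumPy; $\lambda$, $\alpha$, and the softmax temperature are illustrative values, not the paper's tuned settings):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def soft_targets(local_audio, text_emb, lam=0.5, alpha=0.1, tau=0.07):
    """Build fused soft targets for one minibatch (illustrative sketch).

    local_audio: (N, T, d) frame-level audio features; mean-pooled for a2a.
    text_emb:    (N, d) text embeddings, l2-normalized for t2t.
    lam:   mixing weight between a2a and t2t similarities.
    alpha: fusion factor between soft and one-hot targets.
    """
    a = local_audio.mean(axis=1)                     # mean-pool local features
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    s_a2a = a @ a.T                                  # audio-to-audio cosine
    s_t2t = t @ t.T                                  # text-to-text cosine
    s_intra = lam * s_a2a + (1 - lam) * s_t2t        # mixed intra-modal sim
    soft = softmax(s_intra / tau, axis=1)            # row-normalized distribution
    onehot = np.eye(len(t))
    return (1 - alpha) * onehot + alpha * soft       # fused target distribution
```

Each row of the result is a valid probability distribution; setting `alpha=0` collapses it back to the one-hot targets.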
Objective
Predicted cross-modal distributions $p^{a2t}$ and $p^{t2a}$ are computed from the final projected and normalized embeddings via row-wise and column-wise softmax over $S$. A symmetric KL divergence loss is used:

$$\mathcal{L} = \frac{1}{2}\left[ \mathrm{KL}\!\left(\tilde{y} \,\middle\|\, p^{a2t}\right) + \mathrm{KL}\!\left(\tilde{y} \,\middle\|\, p^{t2a}\right) \right].$$

Setting $\alpha = 0$ recovers the original InfoNCE objective.
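A minimal sketch of this symmetric KL objective (NumPy; the KL direction, target-vs-prediction, is assumed here, as is the epsilon for numerical stability):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def kl(p, q, eps=1e-12):
    """Row-wise KL(p || q), averaged over the batch."""
    return np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1))

def smooth_clap_loss(sim, targets):
    """Symmetric KL loss between predicted cross-modal distributions
    and (soft) targets. sim: (N, N) scaled similarity matrix;
    targets: (N, N) target distributions, one row per audio sample."""
    p_a2t = softmax(sim, axis=1)       # audio -> text, softmax over rows
    p_t2a = softmax(sim, axis=0).T     # text -> audio, softmax over columns
    return 0.5 * (kl(targets, p_a2t) + kl(targets, p_t2a))
```

With one-hot targets this reduces to the cross-entropy form of InfoNCE (up to a constant), matching the $\alpha = 0$ special case.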
Hyperparameters: The mixing weight $\lambda$ and fusion factor $\alpha$ were selected by grid search; the text encoder, projection layer, and temperature used separately tuned learning rates. Training used batch size 32, 10 epochs, and the Adam optimizer.
4. Inference Pipeline
Inference with SmoothCLAP is identical to CLAP:
- Audio input is embedded via the audio encoder $f_a$ and its projection head.
- Candidate text labels are embedded via the text encoder $f_t$ and its projection head.
- Cosine similarity between the projected embeddings is computed.
- The label with highest score is selected, or retrieval is thresholded.
No computation of intra-modal similarity is performed at test time; the forward path is unchanged.
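The zero-shot inference path above can be sketched in a few lines (NumPy; encoder outputs are assumed to be precomputed embedding vectors):

```python
import numpy as np

def zero_shot_classify(audio_emb, label_embs, labels):
    """Zero-shot label selection, identical to CLAP inference:
    cosine similarity against candidate text-label embeddings,
    highest score wins. No intra-modal similarity at test time.

    audio_emb:  (d,) embedding of the input audio.
    label_embs: (K, d) embeddings of the K candidate text labels.
    labels:     list of K label strings.
    """
    a = audio_emb / np.linalg.norm(audio_emb)
    t = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    scores = t @ a                       # cosine similarity per label
    return labels[int(np.argmax(scores))], scores
```

For retrieval instead of classification, the same `scores` vector can be thresholded rather than arg-maxed.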
5. Experimental Evaluation
Datasets: Training uses MSP-Podcast v1.9 (45,619 utterances, 10 emotion categories). Zero-shot evaluation spans eight tasks across seven corpora in English and German: IEMOCAP, RAVDESS, CREMA-D, TESS, FAU-Aibo (2- and 5-class), ALC (intoxication), and SLD (likability).
Baselines: CLAP, Pengi (an audio language model), ParaCLAP.
Main Results (Unweighted Average Recall, UAR):
| Dataset | CLAP | Pengi | ParaCLAP | SmoothCLAP |
|---|---|---|---|---|
| IEMOCAP (4cl/en) | .353 | .345 | .600 | .606 |
| RAVDESS (8cl/en) | .199 | .148 | .228 | .175 |
| CREMA-D (6cl/en) | .230 | .245 | .177 | .266 |
| TESS (7cl/en) | .232 | .177 | .170 | .275 |
| FAU-Aibo (2cl/de) | .500 | .470 | .526 | .555 |
| FAU-Aibo (5cl/de) | .211 | .185 | .197 | .204 |
| ALC (2cl/de) | .511 | .473 | .537 | .541 |
| SLD (2cl/de) | .472 | .485 | .507 | .496 |
SmoothCLAP achieves best UAR on 5/8 tasks and second-best on 2/8, especially improving on IEMOCAP, CREMA-D, TESS, and FAU-Aibo (2-class). Cross-lingual transfer from English to German remains strong.
6. Ablation Studies
Local Feature Extractor:
Performance varies across local feature encoders. No single extractor excels universally, suggesting dataset-specific paralinguistic sensitivity.
| Local Feature Encoder | IEMOCAP | RAVDESS | CREMA-D | TESS |
|---|---|---|---|---|
| wav2vec2.0-Emo (finetuned) | .606 | .175 | .266 | .275 |
| wav2vec2.0-Large (self-sup.) | .594 | .173 | .260 | .368 |
| WavLM-Large | .595 | .212 | .201 | .267 |
| HuBERT-Large | .574 | .260 | .259 | .433 |
Impact of $\lambda$ and $\alpha$:
Grid search shows no monotonic trend for the mixing weight $\lambda$; the fusion factor $\alpha$ favors smaller values, indicating the need to retain a strong hard-target signal.
7. Analysis and Future Directions
Soft targets facilitate improved learning by bridging fuzzy emotional categories—diffusing probability mass to semantically or paralinguistically neighboring samples enables a smoother embedding space, approximating human perceptual emotion similarity.
SmoothCLAP underperforms on 3/8 benchmarks, implying no approach is universally optimal for emotion recognition. Performance may depend on training data scale and diversity. Future improvements may involve adopting alternative similarity kernels (e.g., Gaussian over CP features), supporting multi-label or fuzzy annotation schemes, or end-to-end training of the local feature extractor.
SmoothCLAP demonstrates consistent zero-shot performance gains for affect-aware language–audio models by incorporating soft supervision rooted in intra-modal and paralinguistic similarities, producing more graded, psychologically plausible embeddings (Jing et al., 18 Jan 2026).