Human-CLAP: Human-Centric Audio-Language Modeling
- Human-CLAP is a suite of models that integrates human perception and paralinguistic expertise to refine audio–language alignment.
- It uses regression-based fine-tuning and weighted contrastive losses to closely match human subjective ratings, substantially improving SRCC metrics.
- It incorporates expert prompt-engineering for paralinguistic tasks, enhancing both model interpretability and performance on applications like emotion recognition.
Human-CLAP refers to a suite of models and methodologies for contrastive language-audio pretraining that explicitly incorporate human perception or paralinguistic knowledge to improve alignment between audio–language representations and subjective human judgments. The paradigm addresses two primary limitations of previous CLAP approaches: the weak correlation of CLAPScore with human subjective ratings in text-to-audio evaluation, and the lack of systematic frameworks for modeling rich computational paralinguistic (CP) phenomena from raw audio. The field encompasses both regression-based fine-tuning on human-annotated data to correct metric misalignment (Takano et al., 30 Jun 2025), and prompt-engineering strategies that enrich training with paralinguistic expert knowledge (Jing et al., 11 Jun 2024).
1. Foundations of Contrastive Language–Audio Pretraining and Limitations
Standard CLAP models are based on contrastive learning between paired audio and language representations. Each modality (audio, text) is mapped to a high-dimensional embedding by encoders (e.g., HTS-AT or wav2vec 2.0-large for audio; BERT or RoBERTa for text), with similarity measured as the cosine between embeddings. CLAPScore, frequently used as a metric for text–audio relevance in generative tasks, is computed as:

$$\mathrm{CLAPScore}(x, c) = \frac{E_a(x) \cdot E_t(c)}{\lVert E_a(x) \rVert \, \lVert E_t(c) \rVert},$$

where $E_a$ and $E_t$ denote the audio and text encoders, $x$ is the audio clip, and $c$ is the caption.
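As a minimal illustration, the sketch below computes this score from precomputed embeddings; the random vectors merely stand in for actual encoder outputs.

```python
# Minimal sketch of CLAPScore as the cosine similarity between a CLAP audio
# embedding and a CLAP text embedding. The random vectors below are stand-ins
# for the outputs of any CLAP checkpoint exposing audio/text embeddings.
import numpy as np

def clap_score(audio_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Cosine similarity between L2-normalised audio and text embeddings."""
    a = audio_emb / np.linalg.norm(audio_emb)
    t = text_emb / np.linalg.norm(text_emb)
    return float(np.dot(a, t))

rng = np.random.default_rng(0)
print(clap_score(rng.normal(size=512), rng.normal(size=512)))
```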
However, empirical evaluation revealed a low Spearman’s rank correlation coefficient (SRCC ≈ 0.26–0.29) between CLAPScore predictions (from both LAION CLAP and MS CLAP) and human ratings of text–audio relevance on diverse audio–text pairs, indicating that CLAPScore poorly tracks human perceptual similarity (Takano et al., 30 Jun 2025). A plausible implication is that standard contrastive objectives do not explicitly calibrate model similarity to subjective human assessments of meaning or paralinguistic function.
2. Human-Perception-Based Fine-Tuning: Human-CLAP
To address the misalignment of CLAPScore with human perception, Human-CLAP introduces a fine-tuning protocol that directly incorporates human subjective scores:
- Existing CLAP encoders (audio: HTS-AT; text: RoBERTa) are fixed or adapted, and audio–text pairs are scored by human annotators on a 0–10 scale for relevance.
- The predicted similarity for each pair $(x_i, c_i)$ is defined as the cosine between paired embeddings:

  $$\hat{s}_i = \frac{E_a(x_i) \cdot E_t(c_i)}{\lVert E_a(x_i) \rVert \, \lVert E_t(c_i) \rVert}$$
- Regression objectives (MSE and MAE) align $\hat{s}_i$ with the rescaled human scores $s_i$:

  $$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N}\sum_{i=1}^{N} (\hat{s}_i - s_i)^2, \qquad \mathcal{L}_{\mathrm{MAE}} = \frac{1}{N}\sum_{i=1}^{N} \lvert \hat{s}_i - s_i \rvert$$
- A weighted symmetric cross-entropy loss (wSCE) further emphasizes contrastive learning on pairs rated highly by humans, weighting each pair's symmetric cross-entropy term by its human score:

  $$\mathcal{L}_{\mathrm{wSCE}} = -\frac{1}{2}\sum_{i=1}^{N} w_i \left[ \log \frac{\exp(\hat{s}_{ii}/\tau)}{\sum_{j} \exp(\hat{s}_{ij}/\tau)} + \log \frac{\exp(\hat{s}_{ii}/\tau)}{\sum_{j} \exp(\hat{s}_{ji}/\tau)} \right]$$

  where $\hat{s}_{ij}$ is the cosine similarity between audio $i$ and text $j$, $\tau$ is a temperature, and the weights $w_i$ increase with the human rating of pair $i$.
Combined, the final training loss takes the form

$$\mathcal{L} = \lambda_{\mathrm{reg}} \mathcal{L}_{\mathrm{reg}} + \lambda_{\mathrm{wSCE}} \mathcal{L}_{\mathrm{wSCE}},$$

where $\mathcal{L}_{\mathrm{reg}}$ is the MSE or MAE term. Selection of the loss weights and hyperparameters (AdamW optimizer, learning rate, batch size 8, 50 epochs) is guided by empirical performance (Takano et al., 30 Jun 2025).
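A hedged sketch of such a combined objective is given below; the MAE variant, the weighting scheme, and the default λ values are illustrative assumptions rather than the exact implementation of Takano et al.

```python
# Sketch of a Human-CLAP-style fine-tuning objective: a regression term (MAE
# shown here) pulling paired cosine similarities toward rescaled human scores,
# plus a human-score-weighted symmetric cross-entropy. Tensor names, the
# weighting scheme, and lambda defaults are illustrative assumptions.
import torch
import torch.nn.functional as F

def human_clap_loss(audio_emb, text_emb, human_scores,
                    tau: float = 0.07, lam_reg: float = 1.0, lam_wsce: float = 1.0):
    """audio_emb, text_emb: (B, D); human_scores: (B,) rescaled to [0, 1]."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    sim = a @ t.T                                 # (B, B) cosine similarities

    # Regression term: diagonal (paired) similarities vs. human ratings.
    reg = F.l1_loss(sim.diagonal(), human_scores)

    # Weighted symmetric cross-entropy: highly rated pairs get larger weight.
    logits = sim / tau
    targets = torch.arange(a.size(0), device=a.device)
    ce_a2t = F.cross_entropy(logits, targets, reduction="none")
    ce_t2a = F.cross_entropy(logits.T, targets, reduction="none")
    w = human_scores / human_scores.sum().clamp(min=1e-8)
    wsce = (w * (ce_a2t + ce_t2a)).sum() / 2

    return lam_reg * reg + lam_wsce * wsce
```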
3. Data Collection, Annotation, and Evaluation Protocol
Human-CLAP leverages human judgments collected on both natural and synthesized audio–text pairs:
- Source: AudioCaps dataset and text-to-audio (TTA) model outputs (AudioLDM, AudioLDM 2, Tango, Tango 2).
- Procedure: Each pair rated 0 (no match) to 10 (perfect) by multiple Prolific annotators, with screening for careless responses via anchor pairs.
- Scale and Distribution: 2,000 natural and 4,000 synthesized pairs; ≈6,800 total ratings; ≈4 annotators per pair after filtering.
- Evaluation: Models are assessed by SRCC, LCC (Pearson), and mean squared error (MSE) between CLAPScore and human subjective scores on a held-out test set of 2,405 pairs (Takano et al., 30 Jun 2025).
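A minimal sketch of this evaluation, assuming predicted and human scores are already available as arrays (the values below are dummies):

```python
# Evaluation protocol sketch: rank correlation (SRCC), linear correlation
# (LCC), and MSE between predicted CLAPScores and human ratings on a
# held-out test set. The example arrays are dummy values.
import numpy as np
from scipy.stats import spearmanr, pearsonr

def evaluate_metric(pred_scores: np.ndarray, human_scores: np.ndarray) -> dict:
    srcc, _ = spearmanr(pred_scores, human_scores)
    lcc, _ = pearsonr(pred_scores, human_scores)
    mse = float(np.mean((pred_scores - human_scores) ** 2))
    return {"SRCC": srcc, "LCC": lcc, "MSE": mse}

pred = np.array([0.31, 0.72, 0.55, 0.10, 0.88])
human = np.array([0.40, 0.80, 0.50, 0.20, 0.90])
print(evaluate_metric(pred, human))
```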
4. Experimental Results and Analysis
Empirical studies demonstrate systematic improvements in alignment with human ratings:
| Model | SRCC | LCC | MSE |
|---|---|---|---|
| LAION CLAP | 0.259 | 0.277 | 0.212 |
| MS CLAP | 0.278 | 0.296 | 0.078 |
| Human-CLAP (wSCE only) | 0.355 | 0.380 | 0.077 |
| Human-CLAP (MSE only) | 0.503 | 0.536 | 0.050 |
| Human-CLAP (MAE only) | 0.512 | 0.544 | 0.049 |
| Human-CLAP (wSCE+MSE) | 0.502 | 0.531 | 0.058 |
| Human-CLAP (wSCE+MAE) | 0.506 | 0.539 | 0.053 |
For synthesized audio, Human-CLAP (wSCE+MAE) achieved SRCC 0.588 (vs. 0.316 for LAION CLAP), and for natural audio 0.345 (vs. 0.192). This suggests that regression-based fine-tuning nearly doubles the metric correlation with human perception in both domains. Score-range ablation confirmed gains across low- and high-relevance subsets (SRCC 0.31–0.36, compared to 0.14–0.21 for baseline) (Takano et al., 30 Jun 2025).
5. Paralinguistic Human-CLAP: Model and Training Innovations
Extending the Human-CLAP approach to computational paralinguistic (CP) tasks, ParaCLAP introduces structured techniques for generating rich, human-interpretable audio–text pairs (Jing et al., 11 Jun 2024):
- Audio encoder: wav2vec 2.0-large (12-layer transformer), pre-fine-tuned on MSP-Podcast.
- Text encoder: BERT base (uncased).
- Projection heads: Modality-specific MLPs mapping to a shared 768-dimensional embedding space.
- Contrastive loss: Symmetric cross-entropy (InfoNCE).
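A minimal sketch of this dual-encoder layout follows, assuming generic Hugging Face checkpoints (facebook/wav2vec2-large, bert-base-uncased), mean pooling over audio frames, and [CLS] pooling for text; the actual ParaCLAP encoders, pooling, and projection depths may differ.

```python
# Hedged sketch of a ParaCLAP-style dual encoder: wav2vec 2.0 audio encoder
# and BERT text encoder, each followed by an MLP projection into a shared
# 768-d space, trained with symmetric cross-entropy (InfoNCE).
# Checkpoint names and pooling choices are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import Wav2Vec2Model, BertModel

class ParaCLAPSketch(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.audio_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large")
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
        self.audio_proj = nn.Sequential(nn.Linear(1024, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.text_proj = nn.Sequential(nn.Linear(768, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, waveforms, input_ids, attention_mask):
        # Mean-pool audio frames; take [CLS] for text.
        a = self.audio_encoder(waveforms).last_hidden_state.mean(dim=1)
        t = self.text_encoder(input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        return F.normalize(self.audio_proj(a), dim=-1), F.normalize(self.text_proj(t), dim=-1)

def infonce(audio_emb, text_emb, tau: float = 0.07):
    logits = audio_emb @ text_emb.T / tau
    targets = torch.arange(audio_emb.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```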
A major innovation is the two-fold "templating" for query generation:
- (A) Transform categorical/dimensional CP labels into natural-language prompts (e.g., “speaker is happy”, “arousal is high”).
- (B) Use eGeMAPS-extracted expert features (mean/std of pitch, intensity, jitter, shimmer, duration), binned and described textually (e.g., “pitch is high”). During training, up to a fixed number of these atomic queries (one or five in the evaluated “rand” variants) are randomly concatenated for diversity; at inference, only label-based queries are used (see the sketch below).
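The query-templating procedure can be illustrated with the following sketch; the bin edges, feature normalisation, and phrasing are assumptions for illustration rather than the exact ParaCLAP recipe.

```python
# Illustrative sketch of the two-fold query templating: (A) label-to-prompt
# conversion and (B) binned eGeMAPS-style features rendered as short phrases
# and randomly concatenated. Bin edges, feature names, and phrasing are
# assumed for illustration.
import random

def label_query(emotion: str) -> str:
    return f"speaker is {emotion}"                     # (A) categorical label prompt

def feature_queries(features: dict, n_bins: int = 3) -> list:
    """(B) Turn numeric features (assumed pre-normalised to [0, 1]) into
    atomic textual descriptions such as 'pitch is high'."""
    levels = ["low", "medium", "high"]
    out = []
    for name, value in features.items():
        level = levels[min(int(value * n_bins), n_bins - 1)]
        out.append(f"{name} is {level}")
    return out

def training_query(emotion: str, features: dict, max_atoms: int = 5) -> str:
    atoms = [label_query(emotion)]
    pool = feature_queries(features)
    atoms += random.sample(pool, k=random.randint(0, min(max_atoms, len(pool))))
    return ", ".join(atoms)

print(training_query("happy", {"pitch": 0.9, "intensity": 0.4, "jitter": 0.1}))
```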
6. Downstream Performance and Comparative Analysis
ParaCLAP, a human-focused CLAP for CP, was evaluated zero-shot on seven speech datasets covering emotion, affect, and speaker-trait recognition (with FAU Aibo scored in both 2-class and 5-class settings). Unweighted Average Recall (UAR) served as the metric:
| Dataset | CLAP | Pengi | ParaCLAP (no-emo) | ParaCLAP (only-emo) | ParaCLAP (rand=1) | ParaCLAP (rand=5) |
|---|---|---|---|---|---|---|
| IEMOCAP | 0.353 | 0.345 | 0.309 | 0.567 | 0.307 | 0.560 |
| RAVDESS | 0.199 | 0.148 | 0.170 | 0.302 | 0.116 | 0.234 |
| CREMA-D | 0.230 | 0.245 | 0.201 | 0.332 | 0.202 | 0.291 |
| TESS | 0.232 | 0.177 | 0.212 | 0.484 | 0.219 | 0.389 |
| FAU Aibo(2) | 0.500 | 0.470 | 0.538 | 0.535 | 0.468 | 0.604 |
| FAU Aibo(5) | 0.211 | 0.185 | 0.225 | 0.216 | 0.216 | 0.232 |
| ALC | 0.511 | 0.473 | 0.490 | 0.512 | 0.501 | 0.503 |
| SLD | 0.472 | 0.485 | 0.472 | 0.554 | 0.443 | 0.507 |
ParaCLAP outperforms baseline CLAP and the Pengi model in most tasks, with "only-emo" queries excelling on typical emotion classification and feature-augmented prompts further benefiting more complex or diagnostic paralinguistic sets (e.g., FAU Aibo, ALC) (Jing et al., 11 Jun 2024).
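For reference, a minimal sketch of this zero-shot protocol and the UAR metric, assuming precomputed, L2-normalised embeddings and hypothetical class prompts:

```python
# Zero-shot evaluation sketch: each class label is rendered as a text query,
# every audio embedding is compared to the query embeddings, and the
# highest-scoring class is predicted; UAR is macro-averaged recall.
# The random embeddings below are placeholders for trained encoder outputs.
import numpy as np
from sklearn.metrics import recall_score

def zero_shot_predict(audio_embs: np.ndarray, class_text_embs: np.ndarray) -> np.ndarray:
    """audio_embs: (N, D); class_text_embs: (C, D); both L2-normalised."""
    sims = audio_embs @ class_text_embs.T          # (N, C) cosine similarities
    return sims.argmax(axis=1)

def uar(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # Unweighted Average Recall = recall macro-averaged over classes.
    return recall_score(y_true, y_pred, average="macro")

rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 8)); audio /= np.linalg.norm(audio, axis=1, keepdims=True)
classes = rng.normal(size=(2, 8)); classes /= np.linalg.norm(classes, axis=1, keepdims=True)
print(uar(np.array([0, 1, 0, 1]), zero_shot_predict(audio, classes)))
```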
7. Significance, Limitations, and Prospects
Human-CLAP demonstrates that incorporating human-perception-based supervision and prompt-engineered paralinguistic queries improves both the accuracy and interpretability of audio–language models. By directly aligning similarity metrics with human ratings (through regression and weighted contrastive losses), Human-CLAP bridges a persistent gap between model predictions and subjective relevance judgments, surpassing baseline correlation by over 0.25 SRCC (Takano et al., 30 Jun 2025).
Key limitations include limited data scale (≈6,800 rated pairs), restricted dynamic range at low similarity in pure regression, and data imbalance. Future research directions include:
- Expanding human-annotated evaluation corpora, especially in low-relevance regions.
- Adaptive weighting or curriculum learning to refine negative sampling and distributional calibration.
- Scaling prompt engineering through instruction-tuned LLMs to cover broader paralinguistic, identity, and health attributes (Jing et al., 11 Jun 2024).
- Integrating Human-CLAP within generative or diagnostic frameworks for speech, source separation, and captioning.
Overall, Human-CLAP establishes a blueprint for aligning audio–language models to human perception, enabling robust, generalizable, and interpretable performance on both general text-to-audio tasks and nuanced computational paralinguistics (Jing et al., 11 Jun 2024, Takano et al., 30 Jun 2025).