
Human-CLAP: Human-Centric Audio-Language Modeling

Updated 15 December 2025
  • Human-CLAP is a suite of models that integrates human perception and paralinguistic expertise to refine audio–language alignment.
  • It uses regression-based fine-tuning and weighted contrastive losses to closely match human subjective ratings, substantially improving SRCC metrics.
  • It incorporates expert prompt-engineering for paralinguistic tasks, enhancing both model interpretability and performance on applications like emotion recognition.

Human-CLAP refers to a suite of models and methodologies for contrastive language-audio pretraining that explicitly incorporate human perception or paralinguistic knowledge to improve alignment between audio–language representations and subjective human judgments. The paradigm addresses two primary limitations of previous CLAP approaches: the weak correlation of CLAPScore with human subjective ratings in text-to-audio evaluation, and the lack of systematic frameworks for modeling rich computational paralinguistic (CP) phenomena from raw audio. The field encompasses both regression-based fine-tuning on human-annotated data to correct metric misalignment (Takano et al., 30 Jun 2025), and prompt-engineering strategies that enrich training with paralinguistic expert knowledge (Jing et al., 11 Jun 2024).

1. Foundations of Contrastive Language–Audio Pretraining and Limitations

Standard CLAP models are based on contrastive learning between paired audio and language representations. Each modality (audio, text) is mapped to a high-dimensional embedding by encoders (e.g., HTS-AT, wav2vec 2.0-large for audio; BERT or RoBERTa for text), with similarity measured using the cosine between embeddings. CLAPScore, frequently used as a metric for text–audio relevance in generative tasks, is computed as:

$$\textsf{CLAPScore} = \max\!\left( \frac{e^{\text{audio}} \cdot e^{\text{text}}}{\|e^{\text{audio}}\|\,\|e^{\text{text}}\|},\ 0 \right)$$
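
A minimal sketch of this computation in PyTorch, assuming the encoder outputs are already available as batched tensors (the tensor names and dimensions are illustrative, not taken from either CLAP implementation):

```python
import torch
import torch.nn.functional as F

def clap_score(e_audio: torch.Tensor, e_text: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between paired audio/text embeddings, clipped at zero.

    e_audio, e_text: (batch, dim) embeddings from the audio and text encoders.
    Returns a (batch,) tensor of CLAPScore values in [0, 1].
    """
    cos = F.cosine_similarity(e_audio, e_text, dim=-1)
    return torch.clamp(cos, min=0.0)

# Example with random embeddings standing in for real encoder outputs.
scores = clap_score(torch.randn(4, 512), torch.randn(4, 512))
```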

However, empirical evaluation revealed a low Spearman’s rank correlation coefficient (SRCC ≈ 0.26–0.29) between CLAPScore predictions (from both LAION CLAP and MS CLAP) and human ratings of text–audio relevance on diverse audio–text pairs, indicating that CLAPScore poorly tracks human perceptual similarity (Takano et al., 30 Jun 2025). A plausible implication is that standard contrastive objectives do not explicitly calibrate model similarity to subjective human assessments of meaning or paralinguistic function.

2. Human-Perception-Based Fine-Tuning: Human-CLAP

To address the misalignment of CLAPScore with human perception, Human-CLAP introduces a fine-tuning protocol that directly incorporates human subjective scores:

  • Existing CLAP encoders (audio: HTS-AT; text: RoBERTa) are fixed or adapted, and audio–text pairs are scored by human annotators on a 0–10 scale for relevance.
  • The predicted similarity $y_i$ for each pair is defined as the cosine between paired embeddings:

$$y_i = \frac{e_i^{\text{text}} \cdot e_i^{\text{audio}}}{\|e_i^{\text{text}}\|\,\|e_i^{\text{audio}}\|}$$

  • Regression objectives (MSE and MAE) align $y_i$ with the rescaled human scores $a_i$:

$$L_{\text{MSE}} = \frac{1}{N}\sum_{i=1}^N (a_i - y_i)^2, \qquad L_{\text{MAE}} = \frac{1}{N}\sum_{i=1}^N |a_i - y_i|$$
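
A compact sketch of these regression objectives, assuming the 0–10 human ratings are rescaled by dividing by 10 (the exact rescaling convention is an assumption; the source only states that scores are rescaled):

```python
import torch
import torch.nn.functional as F

def regression_losses(e_text: torch.Tensor, e_audio: torch.Tensor,
                      human_scores: torch.Tensor):
    """MSE and MAE between predicted similarity y_i and rescaled ratings a_i.

    e_text, e_audio: (N, dim) paired embeddings.
    human_scores:    (N,) relevance ratings on the original 0-10 scale.
    """
    y = F.cosine_similarity(e_text, e_audio, dim=-1)  # y_i
    a = human_scores / 10.0                           # a_i (assumed rescaling)
    return F.mse_loss(y, a), F.l1_loss(y, a)
```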

  • A weighted symmetric cross-entropy loss (wSCE) further emphasizes contrastive learning on pairs rated highly by humans:

$$L_{\text{wSCE}} = -\frac{1}{2N} \sum_{i=1}^N a_i \Biggl[ \log\frac{\exp\bigl((e_i^{\text{text}} \cdot e_i^{\text{audio}})/\tau\bigr)}{\sum_j \exp\bigl((e_i^{\text{text}} \cdot e_j^{\text{audio}})/\tau\bigr)} + \log\frac{\exp\bigl((e_i^{\text{audio}} \cdot e_i^{\text{text}})/\tau\bigr)}{\sum_j \exp\bigl((e_i^{\text{audio}} \cdot e_j^{\text{text}})/\tau\bigr)} \Biggr]$$
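
A sketch of this weighted symmetric cross-entropy term, assuming L2-normalized embeddings and per-pair human weights $a_i$; the temperature value below is an illustrative default, not one reported in the paper:

```python
import torch
import torch.nn.functional as F

def weighted_sce_loss(e_text: torch.Tensor, e_audio: torch.Tensor,
                      a: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Human-weighted symmetric cross-entropy over in-batch negatives.

    e_text, e_audio: (N, dim) paired embeddings (normalized inside).
    a:               (N,) human-derived weights for the positive pairs.
    """
    t = F.normalize(e_text, dim=-1)
    s = F.normalize(e_audio, dim=-1)
    logits = t @ s.T / tau                                # text-to-audio similarities
    targets = torch.arange(len(a), device=logits.device)  # matching pairs on the diagonal
    ce_t2a = F.cross_entropy(logits, targets, reduction="none")
    ce_a2t = F.cross_entropy(logits.T, targets, reduction="none")
    return (a * (ce_t2a + ce_a2t)).mean() / 2.0           # -1/(2N) * sum_i a_i [...]
```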

Combined, the final training loss is of the form

$$L = \lambda_1\,L_{\text{wSCE}} + \lambda_2\,L_{\text{MSE}} + \lambda_3\,L_{\text{MAE}}$$

Selection of the weights ($\lambda_1 = 0.1$, $\lambda_2 = 1$, $\lambda_3 = 1$) and hyperparameters (AdamW optimizer, learning rate $1\times10^{-5}$, batch size 8, 50 epochs) is guided by empirical performance (Takano et al., 30 Jun 2025).
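
Putting the pieces together, a hedged sketch of the fine-tuning loop with the reported weights and optimizer settings, reusing the loss helpers above; the `model` object (with `encode_audio`/`encode_text` methods) and the data loader are placeholders, not an actual CLAP API:

```python
import torch

def finetune_human_clap(model, loader, epochs: int = 50, lr: float = 1e-5,
                        lambdas=(0.1, 1.0, 1.0)):
    """Fine-tune CLAP encoders against human ratings (batch size 8 in the paper).

    `loader` is assumed to yield (audio, text, scores) batches with 0-10 ratings.
    """
    lam_wsce, lam_mse, lam_mae = lambdas
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for audio, text, scores in loader:
            e_audio = model.encode_audio(audio)
            e_text = model.encode_text(text)
            l_mse, l_mae = regression_losses(e_text, e_audio, scores)
            l_wsce = weighted_sce_loss(e_text, e_audio, scores / 10.0)
            loss = lam_wsce * l_wsce + lam_mse * l_mse + lam_mae * l_mae
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```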

3. Data Collection, Annotation, and Evaluation Protocol

Human-CLAP leverages human judgments collected on both natural and synthesized audio–text pairs:

  • Source: AudioCaps dataset and text-to-audio (TTA) model outputs (AudioLDM, AudioLDM 2, Tango, Tango 2).
  • Procedure: Each pair rated 0 (no match) to 10 (perfect) by multiple Prolific annotators, with screening for careless responses via anchor pairs.
  • Scale and Distribution: 2,000 natural and 4,000 synthesized pairs; ≈6,800 total ratings; ≈4 annotators per pair after filtering.
  • Evaluation: Models are assessed by SRCC, LCC (Pearson), and mean squared error (MSE) between CLAPScore and human subjective scores on a held-out test set of 2,405 pairs (Takano et al., 30 Jun 2025).
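
A brief sketch of how these agreement metrics can be computed with NumPy and SciPy, given model scores and human ratings for the held-out pairs (both assumed to be on the same rescaled scale for the MSE term):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate_against_humans(model_scores, human_scores) -> dict:
    """SRCC, LCC, and MSE between predicted and human relevance scores."""
    model_scores = np.asarray(model_scores, dtype=float)
    human_scores = np.asarray(human_scores, dtype=float)
    srcc, _ = spearmanr(model_scores, human_scores)
    lcc, _ = pearsonr(model_scores, human_scores)
    mse = float(np.mean((model_scores - human_scores) ** 2))
    return {"SRCC": srcc, "LCC": lcc, "MSE": mse}
```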

4. Experimental Results and Analysis

Empirical studies demonstrate systematic improvements in alignment with human ratings:

| Model | SRCC | LCC | MSE |
| --- | --- | --- | --- |
| LAION CLAP | 0.259 | 0.277 | 0.212 |
| MS CLAP | 0.278 | 0.296 | 0.078 |
| Human-CLAP (wSCE only) | 0.355 | 0.380 | 0.077 |
| Human-CLAP (MSE only) | 0.503 | 0.536 | 0.050 |
| Human-CLAP (MAE only) | 0.512 | 0.544 | 0.049 |
| Human-CLAP (wSCE+MSE) | 0.502 | 0.531 | 0.058 |
| Human-CLAP (wSCE+MAE) | 0.506 | 0.539 | 0.053 |

For synthesized audio, Human-CLAP (wSCE+MAE) achieved SRCC 0.588 (vs. 0.316 for LAION CLAP), and for natural audio 0.345 (vs. 0.192). This suggests that regression-based fine-tuning nearly doubles the metric correlation with human perception in both domains. Score-range ablation confirmed gains across low- and high-relevance subsets (SRCC 0.31–0.36, compared to 0.14–0.21 for baseline) (Takano et al., 30 Jun 2025).

5. Paralinguistic Human-CLAP: Model and Training Innovations

Extending the Human-CLAP approach to computational paralinguistic (CP) tasks, ParaCLAP introduces structured techniques for generating rich, human-interpretable audio–text pairs (Jing et al., 11 Jun 2024):

  • Audio encoder: wav2vec 2.0-large (12-layer transformer), pre-fine-tuned on MSP-Podcast.
  • Text encoder: BERT base (uncased).
  • Projection heads: Modality-specific MLPs mapping to a shared 768-dimensional embedding space.
  • Contrastive loss: Symmetric cross-entropy (InfoNCE).
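
A minimal sketch of the projection heads and symmetric InfoNCE objective listed above; the MLP depth, hidden width, and temperature are assumptions, and the encoder backbones (wav2vec 2.0-large, 1024-d; BERT base, 768-d) are left abstract:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Modality-specific MLP mapping encoder features into the shared 768-d space.
    A single hidden layer is an illustrative assumption."""
    def __init__(self, in_dim: int, out_dim: int = 768):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU(),
                                 nn.Linear(out_dim, out_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)

def symmetric_infonce(z_audio: torch.Tensor, z_text: torch.Tensor,
                      tau: float = 0.07) -> torch.Tensor:
    """Symmetric cross-entropy (InfoNCE) over in-batch audio-text pairs."""
    logits = z_audio @ z_text.T / tau
    targets = torch.arange(z_audio.size(0), device=z_audio.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# Example: project pooled wav2vec 2.0-large and BERT base features, then score a batch.
audio_head, text_head = ProjectionHead(1024), ProjectionHead(768)
loss = symmetric_infonce(audio_head(torch.randn(8, 1024)),
                         text_head(torch.randn(8, 768)))
```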

A major innovation is the two-fold "templating" for query generation:

  • (A) Transform categorical/dimensional CP labels into natural-language prompts (e.g., “speaker is happy”, “arousal is high”).
  • (B) Use eGeMAPS-extracted expert features (mean/std pitch, intensity, jitter, shimmer, duration), binned and described textually (e.g., “pitch is high”). During training, up to $n$ atomic queries are randomly concatenated for diversity; at inference, only the label-based queries are used.
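
A hedged sketch of this query-generation scheme; the bin boundaries, feature names, and prompt templates below are illustrative stand-ins for the eGeMAPS-based procedure, not the exact templates used in ParaCLAP:

```python
import random

def label_query(emotion: str, arousal: str) -> str:
    # (A) Turn categorical/dimensional CP labels into a natural-language prompt.
    return f"speaker is {emotion}, arousal is {arousal}"

def feature_queries(features: dict, bins=("low", "medium", "high")) -> list:
    # (B) Describe binned expert features textually; `features` maps a feature
    # name to a value already normalized to [0, 1] (uniform binning is assumed).
    out = []
    for name, value in features.items():
        idx = min(int(value * len(bins)), len(bins) - 1)
        out.append(f"{name} is {bins[idx]}")
    return out

def training_query(emotion: str, arousal: str, features: dict, n: int = 5) -> str:
    # Randomly concatenate up to n atomic queries for one training example;
    # at inference only the label-based query would be used.
    atoms = [label_query(emotion, arousal)] + feature_queries(features)
    k = random.randint(1, min(n, len(atoms)))
    return ", ".join(random.sample(atoms, k))

print(training_query("happy", "high", {"pitch": 0.9, "intensity": 0.4, "jitter": 0.1}))
```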

6. Downstream Performance and Comparative Analysis

ParaCLAP, a human-focused CLAP for CP, was evaluated in a zero-shot regime on seven speech datasets across various emotion, affect, and speaker trait recognition tasks. Unweighted Average Recall (UAR) served as the metric:

| Dataset | CLAP | Pengi | ParaCLAP (no-emo) | ParaCLAP (only-emo) | ParaCLAP (rand=1) | ParaCLAP (rand=5) |
| --- | --- | --- | --- | --- | --- | --- |
| IEMOCAP | 0.353 | 0.345 | 0.309 | 0.567 | 0.307 | 0.560 |
| RAVDESS | 0.199 | 0.148 | 0.170 | 0.302 | 0.116 | 0.234 |
| CREMA-D | 0.230 | 0.245 | 0.201 | 0.332 | 0.202 | 0.291 |
| TESS | 0.232 | 0.177 | 0.212 | 0.484 | 0.219 | 0.389 |
| FAU Aibo (2) | 0.500 | 0.470 | 0.538 | 0.535 | 0.468 | 0.604 |
| FAU Aibo (5) | 0.211 | 0.185 | 0.225 | 0.216 | 0.216 | 0.232 |
| ALC | 0.511 | 0.473 | 0.490 | 0.512 | 0.501 | 0.503 |
| SLD | 0.472 | 0.485 | 0.472 | 0.554 | 0.443 | 0.507 |

ParaCLAP outperforms baseline CLAP and the Pengi model in most tasks, with "only-emo" queries excelling on typical emotion classification and feature-augmented prompts further benefiting more complex or diagnostic paralinguistic sets (e.g., FAU Aibo, ALC) (Jing et al., 11 Jun 2024).

7. Significance, Limitations, and Prospects

Human-CLAP demonstrates that incorporating human-perception-based supervision and prompt-engineered paralinguistic queries improves both the accuracy and interpretability of audio–language models. By directly aligning similarity metrics with human ratings (through regression and weighted contrastive losses), Human-CLAP bridges a persistent gap between model predictions and subjective relevance judgments, surpassing baseline correlation by over 0.25 SRCC (Takano et al., 30 Jun 2025).

Key limitations include limited data scale (≈6,800 rated pairs), restricted dynamic range at low similarity in pure regression, and data imbalance. Future research directions include:

  • Expanding human-annotated evaluation corpora, especially in low-relevance regions.
  • Adaptive weighting or curriculum learning to refine negative sampling and distributional calibration.
  • Scaling prompt engineering through instruction-tuned LLMs to cover broader paralinguistic, identity, and health attributes (Jing et al., 11 Jun 2024).
  • Integrating Human-CLAP within generative or diagnostic frameworks for speech, source separation, and captioning.

Overall, Human-CLAP establishes a blueprint for aligning audio–language models with human perception, enabling robust, generalizable, and interpretable performance on both general text-to-audio tasks and nuanced computational paralinguistics (Jing et al., 11 Jun 2024, Takano et al., 30 Jun 2025).
