CLAPScore-Based Architecture Overview

Updated 7 January 2026
  • The CLAPScore-based architecture leverages contrastive language-audio pretraining to embed audio and text into a joint space, enabling similarity-based alignment using cosine similarity.
  • Standardized Preference Optimization (SPO) standardizes listener ratings to reduce bias, combining regression and contrastive ranking to improve correlations with human judgments.
  • Listener screening is integrated to exclude outlier annotations, enhancing model robustness and achieving superior evaluation metrics on tasks like the XACLE Challenge.

A CLAPScore-based architecture refers to systems that use CLAP (Contrastive Language-Audio Pretraining) embedding models to quantify alignment between audio and text through a similarity-based metric, with further optimization via techniques such as Standardized Preference Optimization (SPO). These architectures target tasks where automatic evaluation metrics for cross-modal semantic alignment are required—typically in settings like the XACLE Challenge—where human ground-truth preferences are noisy, inconsistent, or otherwise insufficient for direct supervised learning. Augmenting CLAPScore predictors with preference optimization strategies allows for more direct modeling of relative human judgment, enhanced robustness to annotator bias, and improved correlation with human evaluators.

1. CLAPScore Foundations and Motivations

CLAPScore computes an alignment measure for audio–text pairs by embedding each modality into a joint space using CLAP encoders and then calculating cosine similarity, scaled to fit a human-interpretable range (commonly [0, 10]). In XACLE and similar evaluation tasks, each sample is annotated by multiple human raters with notable variability in offset, scale, and extremity of rating. Training regression models on these unnormalized scores risks overfitting to individual raters' subjective tendencies rather than capturing consensus alignment. Accordingly, alignment prediction systems grounded in CLAPScore benefit from reframing the target as a relative preference—predicting which of a set of candidates for a given prompt is superior—rather than estimating absolute MOS (Mean Opinion Score) (Takano et al., 6 Jan 2026).
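As a concrete reference, the following is a minimal sketch of the CLAPScore computation described above, assuming precomputed CLAP audio and text embeddings. The linear rescaling of cosine similarity to [0, 10] is an illustrative choice; the exact mapping used in practice may differ.

```python
import numpy as np

def clapscore(audio_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Cosine similarity between CLAP audio and text embeddings, rescaled to [0, 10].

    Both inputs are assumed to be 1-D embeddings produced by a CLAP-style
    audio encoder and text encoder; the rescaling below is illustrative.
    """
    a = audio_emb / np.linalg.norm(audio_emb)
    t = text_emb / np.linalg.norm(text_emb)
    cos_sim = float(np.dot(a, t))           # cosine similarity in [-1, 1]
    return (cos_sim + 1.0) / 2.0 * 10.0     # linear rescale to [0, 10]
```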

2. Standardized Preference Optimization (SPO) in CLAPScore Architectures

Standardized Preference Optimization (SPO), within the context of CLAPScore architectures, encompasses two principal steps:

  1. Standardization: Given a sample rated by listener $\ell$ with raw score $x$ on a $0$–$10$ scale, compute the listener parameters $\mu_\ell$ (mean) and $\sigma_\ell$ (standard deviation) across all of $\ell$'s ratings. Define the standardized score:

$$x_{\text{SPO}} = \frac{x - \mu_\ell}{\sigma_\ell}$$

This transformation projects all ratings by a listener onto a zero-mean, unit-variance axis, eliminating idiosyncratic offset and scale.

  2. Optimization Objective: For a candidate audio–text pair, the model predicts an alignment score $\hat{y}$ (from the scaled CLAP cosine similarity). This score is itself standardized using the global training-set mean and variance to produce $\hat{y}_{\text{SPO}}$. The joint loss is:

$$L = L_{\text{reg}}(x_{\text{SPO}}, \hat{y}_{\text{SPO}}) + \lambda L_{\text{con}}(x_{\text{SPO}}, \hat{y}_{\text{SPO}})$$

$L_{\text{reg}}$ is the mean squared error (MSE) loss, $L_{\text{con}}$ is a contrastive (InfoNCE-style) loss enforcing correct preference ranking within each prompt group, and $\lambda$ is a tunable weighting coefficient (Takano et al., 6 Jan 2026).

This paradigm reinterprets alignment prediction as a relative ranking problem in standardized listener space, steering learning toward consistent preference modeling rather than absolute score estimation.
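The sketch below illustrates the two SPO steps under the definitions above, assuming PyTorch tensors and a simple flat data layout. Treating the top-rated candidate in each prompt group as the positive for the InfoNCE-style term, and the temperature value, are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def standardize_per_listener(ratings: torch.Tensor, listener_ids: torch.Tensor) -> torch.Tensor:
    """x_SPO = (x - mu_l) / sigma_l, computed per listener over all of that
    listener's ratings (assumes each listener has several ratings)."""
    out = torch.empty_like(ratings, dtype=torch.float)
    for l in listener_ids.unique():
        mask = listener_ids == l
        mu = ratings[mask].float().mean()
        sigma = ratings[mask].float().std().clamp_min(1e-6)
        out[mask] = (ratings[mask].float() - mu) / sigma
    return out

def spo_loss(y_hat_std: torch.Tensor, x_spo: torch.Tensor, group_ids: torch.Tensor,
             lam: float = 1.0, temperature: float = 0.1) -> torch.Tensor:
    """L = L_reg + lambda * L_con, where y_hat_std is the model prediction
    standardized with the global training-set mean and variance."""
    l_reg = F.mse_loss(y_hat_std, x_spo)                      # regression term
    l_con, n_groups = y_hat_std.new_zeros(()), 0
    for g in group_ids.unique():
        m = group_ids == g
        if m.sum() < 2:
            continue
        target = x_spo[m].argmax().unsqueeze(0)               # top-rated candidate as positive
        logits = (y_hat_std[m] / temperature).unsqueeze(0)    # rank candidates within the group
        l_con = l_con + F.cross_entropy(logits, target)
        n_groups += 1
    if n_groups > 0:
        l_con = l_con / n_groups
    return l_reg + lam * l_con
```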

3. Integration of Listener Screening

To further mitigate the effect of unreliable human ratings, the architecture incorporates a listener screening phase that discards annotations from listeners identified as outliers. In SPO-CLAPScore, the protocol defines an "NG-score" as any rating that lacks proximate agreement (within a tolerance $\tau$) with any other listener's rating for the same prompt. Listeners whose proportion of NG-scores exceeds a threshold $r$ are excluded outright; the recommended values are $\tau = 5$ and $r = 0.2$. This removal precedes SPO normalization and further reduces spurious variance and annotation noise in the training targets (Takano et al., 6 Jan 2026).
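A minimal sketch of this screening rule, assuming ratings arrive as (listener, prompt, score) triples; the data layout and function name are illustrative, while tau = 5 and r = 0.2 follow the values above.

```python
from collections import defaultdict

def screen_listeners(ratings, tau=5.0, r=0.2):
    """Return the set of listener IDs to exclude.

    `ratings` is an iterable of (listener_id, prompt_id, score) triples. A rating
    counts as "NG" when no other listener's rating for the same prompt lies
    within `tau`; listeners whose NG rate exceeds `r` are excluded.
    """
    by_prompt = defaultdict(list)
    for listener, prompt, score in ratings:
        by_prompt[prompt].append((listener, score))

    ng_counts, totals = defaultdict(int), defaultdict(int)
    for entries in by_prompt.values():
        for listener, score in entries:
            totals[listener] += 1
            agrees = any(other != listener and abs(score - s) <= tau
                         for other, s in entries)
            if not agrees:
                ng_counts[listener] += 1

    return {l for l in totals if ng_counts[l] / totals[l] > r}
```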

4. Training Procedures and Model Architecture

The backbone of the architecture is a CLAP-derived audio encoder (fine-tuned; e.g., M2D-CLAP 2025) and a frozen BERT-base text encoder. Audio and text samples are encoded into a shared embedding space, and their cosine similarity forms the basis of the predicted alignment score. Training combines three elements:

  • Regression to standardized targets (SPO-normalized scores)
  • Contrastive ranking within candidate sets (triplet or triplet-group based, adopting InfoNCE as in [Saeki et al., Interspeech 2022])
  • Ensemble prediction: Multiple models are trained (with/without warmup, various seeds, and combinations of screening/contrastive loss), and averaged for inference

Optimization proceeds via Adam with warmup to a peak learning rate (1e-4), followed by decay over 50 epochs (Takano et al., 6 Jan 2026).
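A minimal sketch of this optimization recipe in PyTorch; the warmup length, total step count, and linear decay shape are assumptions, since the section specifies only Adam, a 1e-4 peak learning rate, and decay over 50 epochs.

```python
import torch

def build_optimizer(model: torch.nn.Module, peak_lr: float = 1e-4,
                    warmup_steps: int = 1000, total_steps: int = 50_000):
    """Adam with linear warmup to peak_lr followed by linear decay.

    warmup_steps and total_steps are illustrative; in practice total_steps
    would correspond to 50 epochs over the training set.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=peak_lr)

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)                  # linear warmup
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return max(0.0, 1.0 - progress)                         # linear decay to 0

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```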

5. Empirical Results and Comparative Performance

On the XACLE Challenge, SPO-CLAPScore achieved the following test set metrics:

Method          SRCC     LCC      KTAU     MSE
Baseline        0.3345   0.3420   0.229    4.811
SPO-CLAPScore   0.6142   0.6542   0.4407   2.985

(SRCC: Spearman rank correlation coefficient; LCC: linear correlation coefficient; KTAU: Kendall's tau; MSE: mean squared error.)

In ablation studies, inclusion of SPO yielded a +0.0959 absolute improvement in SRCC (test set) and nearly halved the MSE compared to non-SPO models. This demonstrates that SPO, particularly when combined with listener screening, constitutes a substantial advance over both vanilla CLAPScore and naive regression approaches to MOS prediction (Takano et al., 6 Jan 2026).

6. Limitations and Directions for Extension

SPO requires sufficiently many ratings per listener/prompt group for $\mu_\ell$ and $\sigma_\ell$ estimation to be meaningful. The method assumes (approximate) Gaussian behavior among listener scores and may require quantile or nonlinear normalization if the rating distribution is strongly multimodal or heavy-tailed. A further limitation is its reliance on fixed listener pools; scenarios with high annotator turnover or sparse per-listener data may weaken normalization robustness. Extensions include integrating higher-order preference constraints (e.g., triplet or listwise ranking), online estimation of listener embeddings, and adaptation to other multimodal MOS tasks such as speech synthesis or voice conversion (Takano et al., 6 Jan 2026).

7. Relationship to Broader Preference Optimization Literature

The SPO-CLAPScore framework is closely related to broader Standardized Preference Optimization and Self-Play Preference Optimization paradigms developed in reinforcement learning and vision–language planning. All share the conceptual move from absolute scalar regression toward relative preference modeling, frequently involving normalization or aggregation to counteract preference noise, intransitivity, and annotator inconsistency (Swamy et al., 2024, Liang et al., 28 Feb 2025). In each domain, these approaches have demonstrated superior stability and sample efficiency in the presence of noisy or heterogeneous human feedback. For instance, self-play and structured preference optimization frameworks generalize the standardization and relative ranking principle beyond MOS prediction to complex, long-horizon task planning and RL settings (Swamy et al., 2024, Liang et al., 28 Feb 2025).

In summary, CLAPScore-based architectures augmented with SPO represent a robust methodological advance for modeling human-like preference in audio–language alignment and can be generalized, with necessary adaptation, to a broader set of multimodal and sequential preference learning domains.
