RefCLAPScore: Multimodal Evaluation Metric
- RefCLAPScore is an evaluation metric that integrates cross-modal grounding with reference similarity to assess audio and image outputs.
- It computes a candidate's compatibility and its best reference match, fusing these scores with the harmonic mean for balanced evaluation.
- The metric is applied in tasks like audio source separation and captioning, showing improved alignment with human ratings over reference-free methods.
RefCLAPScore is an evaluation metric designed to measure the quality of multimodal outputs (specifically, audio or image content paired with natural language queries or captions) in a reference-augmented, cross-modal embedding space. Rooted in the contrastive learning paradigm exemplified by CLAP (Contrastive Language-Audio Pretraining) and CLIP (Contrastive Language-Image Pretraining), RefCLAPScore integrates both modality-to-text grounding and similarity to available references to provide a composite measure of output fidelity and relevance. It extends pure reference-free cosine approaches (such as CLAPScore and CLIPScore) by harmonically combining a candidate's cross-modal compatibility score with its best similarity to any reference, providing a more holistic evaluation signal in applications like language-queried audio source separation, audio captioning, and image captioning.
1. Mathematical Formulation and Core Definition
Given a candidate output (e.g., separated audio or generated caption) and one or more reference ground-truths, RefCLAPScore is computed in three steps:
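- Modality Encoding: Both the candidate and the reference are mapped into a shared, $d$-dimensional embedding space using CLAP encoders for audio–text tasks or CLIP encoders for image–text tasks:
  - For audio–text: $e_a = f_a(a) \in \mathbb{R}^d$ and $e_t = f_t(t) \in \mathbb{R}^d$, where $f_a$ and $f_t$ are the audio and text encoders.
- Cosine Similarity: Each embedding is L2-normalized. The score between $e_a$ and $e_t$ is:
  $$\mathrm{CLAPScore}(a, t) = \cos(e_a, e_t) = \frac{e_a^\top e_t}{\lVert e_a \rVert_2 \, \lVert e_t \rVert_2}$$
- Reference-Augmented Fusion: In settings with multiple reference texts (e.g., image captioning), the highest similarity with any reference is computed, and candidate/reference compatibilities are aggregated using the harmonic mean. Generalizing from CLIPScore (Hessel et al., 2021):
  - Let $s_c$ be the CLAPScore (candidate vs. audio/image), and let $s_r = \max_{r \in R} \cos(e_c, e_r)$ be the best similarity between the candidate embedding $e_c$ and the embedding $e_r$ of any reference in the set $R$.
  - The final composite is:
    $$\mathrm{RefCLAPScore} = \frac{2 \, s_c \, s_r}{s_c + s_r}$$
  - This ensures both cross-modal grounding and reference consistency.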
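The three steps above can be sketched on precomputed embeddings as follows. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the embeddings are assumed to come from whatever CLAP or CLIP checkpoint is in use, and clipping negative similarities to zero is carried over from RefCLIPScore (Hessel et al., 2021) as an assumption.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit L2 norm."""
    x = np.asarray(x, dtype=np.float64)
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)

def ref_clap_score(cand_emb, input_emb, ref_embs):
    """Harmonic mean of cross-modal and best-reference cosine similarity.

    cand_emb:  (d,)   embedding of the candidate output
    input_emb: (d,)   embedding of the paired audio, image, or query
    ref_embs:  (k, d) embeddings of the k available references
    """
    c = l2_normalize(cand_emb)
    x = l2_normalize(input_emb)
    r = l2_normalize(np.atleast_2d(ref_embs))
    # Cross-modal score; clipping at zero is an assumption (see lead-in).
    s_c = max(float(c @ x), 0.0)
    # Best cosine similarity between the candidate and any reference.
    s_r = max(float(np.max(r @ c)), 0.0)
    if s_c + s_r == 0.0:
        return 0.0  # avoid 0/0 when both similarities are clipped
    return 2.0 * s_c * s_r / (s_c + s_r)
```

Because the references enter only through a maximum, adding further references can raise the score but never lower it.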
2. Underlying Model Architecture and Pretraining Objective
RefCLAPScore relies fundamentally on the architecture and training regimes of CLAP or CLIP:
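- Audio/Visual Encoder ($f_a$ / $f_v$): Deep convolutional or transformer-based stack, ingesting log-mel spectrograms (audio) or pixels (images), with global pooling, linear projection, and L2 normalization.
- Text Encoder ($f_t$): BERT-style transformer over tokenized or byte-pair-encoded text.
- Contrastive Loss (InfoNCE): For a batch of $N$ matched pairs with embeddings $e_{a,i}$ and $e_{t,i}$:
  $$\mathcal{L}_{a \to t} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\big(\beta \, \mathrm{sim}(e_{a,i}, e_{t,i})\big)}{\sum_{j=1}^{N} \exp\big(\beta \, \mathrm{sim}(e_{a,i}, e_{t,j})\big)}$$
  where similarity is cosine, and $\beta$ is a (learned or fixed) inverse-temperature parameter. $\mathcal{L}_{t \to a}$ is defined symmetrically by swapping the roles of audio and text.
- Total Loss:
  $$\mathcal{L} = \tfrac{1}{2}\left(\mathcal{L}_{a \to t} + \mathcal{L}_{t \to a}\right)$$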
- Pretrained Checkpoints: Evaluation typically uses fixed, publicly-released CLAP or CLIP weights; no downstream fine-tuning for scoring is performed.
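For intuition about the pretraining objective, here is a small NumPy sketch of a symmetric InfoNCE loss over a batch of matched audio–text embeddings. It illustrates the standard contrastive formulation rather than the training code of any particular CLAP or CLIP release; the function names and the default `inv_temperature=100.0` are assumptions made for this example.

```python
import numpy as np

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - np.max(logits, axis=axis, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=axis, keepdims=True))

def symmetric_info_nce(audio_emb, text_emb, inv_temperature=100.0):
    """Symmetric contrastive loss over N matched pairs.

    audio_emb, text_emb: (N, d) arrays; row i of each is a positive pair.
    inv_temperature: illustrative fixed value; it may also be learned.
    """
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = inv_temperature * (a @ t.T)  # (N, N) scaled cosine similarities
    idx = np.arange(logits.shape[0])
    # Audio-to-text: each row is a softmax over all texts in the batch.
    loss_a2t = -np.mean(log_softmax(logits, axis=1)[idx, idx])
    # Text-to-audio: each column is a softmax over all audio clips.
    loss_t2a = -np.mean(log_softmax(logits, axis=0)[idx, idx])
    return 0.5 * (loss_a2t + loss_t2a)
```

At scoring time none of this is executed; only the frozen encoders produced by such training are used.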
3. Algorithmic Workflow and Feature Extraction
The end-to-end RefCLAPScore computation includes:
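- Waveform/Image and Text Preprocessing:
  - Audio: Resample (e.g., 16 kHz), windowing (25 ms, 10 ms hop), mel-spectrogram (typically 64 or 128 bins), per-frequency normalization.
  - Text: Tokenization (BPE or WordPiece), max length enforcement and padding/truncation.
- Embedding Computation and Normalization:
  $$\hat{e}_a = \frac{f_a(a)}{\lVert f_a(a) \rVert_2}, \qquad \hat{e}_t = \frac{f_t(t)}{\lVert f_t(t) \rVert_2}$$
  where $f_a$ and $f_t$ denote the audio and text encoders.
- Scoring:
  $$s_c = \hat{e}_a^\top \hat{e}_t, \qquad s_r = \max_{r \in R} \hat{e}_c^\top \hat{e}_r, \qquad \mathrm{RefCLAPScore} = \frac{2 \, s_c \, s_r}{s_c + s_r}$$
  where $\hat{e}_c$ is the normalized candidate embedding and $\hat{e}_r$ is the normalized embedding of reference $r$ in the reference set $R$.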
- Sliding Window Aggregation (optional, especially in captioning): To address instability from window truncation, embeddings may be averaged over temporally-overlapping windows before normalization (SLIDE-CLAP).
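The optional sliding-window step can be sketched as below: window-level embeddings are averaged first and the mean is normalized afterwards. This is a hedged illustration only. `slide_embedding` and `toy_embed` are hypothetical names, `toy_embed` is a stand-in that is not a real audio encoder, and the window and hop lengths are arbitrary choices for the demo.

```python
import numpy as np

def slide_embedding(waveform, embed_fn, window, hop):
    """Average embeddings of overlapping windows, then L2-normalize.

    waveform: 1-D array of samples
    embed_fn: callable mapping a window of samples to a (d,) embedding
    window, hop: window length and hop size, both in samples
    """
    waveform = np.asarray(waveform, dtype=np.float64)
    n = len(waveform)
    # Enough windows to cover the signal; the last one is zero-padded.
    num = 1 + int(np.ceil(max(n - window, 0) / hop))
    embs = []
    for i in range(num):
        chunk = waveform[i * hop : i * hop + window]
        chunk = np.pad(chunk, (0, window - len(chunk)))
        embs.append(embed_fn(chunk))
    mean = np.mean(embs, axis=0)
    return mean / np.linalg.norm(mean)

def toy_embed(chunk):
    """Hypothetical stand-in for an audio encoder (NOT a real model)."""
    return np.array([chunk.mean(), chunk.std(), 1.0])

sr = 16_000
audio = np.random.default_rng(0).standard_normal(3 * sr)  # 3 s of noise
emb = slide_embedding(audio, toy_embed, window=sr, hop=sr // 2)
```

The cost grows linearly with the number of windows, since each window needs its own encoder forward pass.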
4. Benchmarking, Empirical Results, and Comparisons
RefCLAPScore has been evaluated in multiple contexts:
- Audio-Text Alignment (DCASE 2024 Task 9): Used as primary/auxiliary metric for language-queried audio source separation. Table 1 presents the following system-level results (Xiao et al., 2024):
| Method | SDR | SI-SDR | CLAPScore | RefCLAPScore |
|---|---|---|---|---|
| Baseline | 5.71 | 3.86 | 0.24 | 0.25 |
| Baseline-Augm. | 5.94 | 4.19 | 0.24 | 0.25 |
| AudioSep | 8.19 | 6.68 | 0.26 | 0.27 |
- Caption Evaluation (BRACE Benchmark): Maximum F1 on the main sub-benchmarks reaches approximately 70 for the best LAION-CLAP variant in SLIDE mode; hallucination discrimination scores are higher but never exceed 89 F1 (Guo et al., 11 Dec 2025).
- Correlation with SDR and Human Ratings: Correlation (Pearson $r$) with SDR for audio separation use cases is modest:
  - CLAPScore vs. SDR: 0.27; RefCLAPScore vs. SDR: 0.226 (Xiao et al., 2024).
  - On image captioning, RefCLIPScore consistently outperforms pure CLIPScore and leading reference-based metrics in correlation with human judgments on literal description tasks (Hessel et al., 2021).
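For readers reproducing this kind of analysis, the Pearson coefficient between per-sample metric scores and a second quantity such as SDR can be computed as in the sketch below. The function name is hypothetical, and the inputs are assumed to be equal-length per-sample arrays; no values from the cited studies are used.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    xc = x - x.mean()  # center both variables
    yc = y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))
```

A call such as `pearson_r(metric_scores, sdr_values)`, with both argument names being placeholders for the user's own arrays, returns a value in $[-1, 1]$.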
5. Contextualization, Implementation Practices, and Limitations
RefCLAPScore unifies two semantic axes: content grounding in the input modality and overlap with one or more reference outputs. Its usability and performance are subject to several practical considerations:
- Advantages:
  - Reference-augmented: Captures both modality fidelity and textual fluency/coverage.
  - Generalizability: Architecture-agnostic with respect to type of content (audio, image).
  - Computational efficiency: Single forward pass per encoder per pair; minimal preprocessing overhead; public models available.
- Limitations:
  - Moderate alignment with classic distortion-based or n-gram metrics (e.g., SDR, CIDEr); RefCLAPScore is complementary to them, not a substitute.
  - Sensitivity: Does not robustly capture subtle acoustic details or syntactic violations; scores may be insensitive to grammar when keywords match.
  - Windowing: Random cropping or truncation makes scores unstable; sliding aggregation remedies this at increased computational cost.
  - Pretraining Domain Bias: Performance may degrade for out-of-distribution content or rare modalities not present in pretraining data.
- Best Practices:
  - Report both cross-modal and reference-only metrics in experimental studies.
  - Apply sliding window aggregation for longer/higher-variance samples (i.e., SLIDE-CLAP).
  - For datasets with multiple references, compute the maximum similarity over references for each candidate.
6. Applications and Extensions
RefCLAPScore underpins evaluation in a variety of multimodal generation and retrieval settings:
- Language-Queried Audio Source Separation: Allows for direct semantic evaluation of system outputs; the reference-free CLAPScore component does not require ground-truth references, which is critical for real-world data (Xiao et al., 2024).
- Foley Sound Synthesis: Used to replace inefficient post-generation filtering; latent-CLAP loss aligns synthesized content toward high-scoring RefCLAPScore regions, obviating the need for threshold-based sample selection (Karchkhadze et al., 2024).
- Automatic Audio Captioning: Adopted for pairwise preference testing and quality measurement in benchmarks with and without high-quality reference captions (Guo et al., 11 Dec 2025).
- Image Captioning: RefCLIPScore, the direct image counterpart, has established state-of-the-art correlations with expert and crowd judgments on literal datasets when references are supplied (Hessel et al., 2021).
7. Relationship to Related Metrics and Ongoing Directions
RefCLAPScore is structurally analogous to RefCLIPScore for image–text; both fuse cross-modal compatibility and reference overlap through the harmonic mean, penalizing candidates that are deficient in either aspect. Reference-free metrics such as pure CLAPScore or CLIPScore offer broader applicability but lower fine-grained discrimination, especially in high-quality or adversarial/comparative settings.
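To make the penalty concrete, consider made-up values (not drawn from any benchmark) for an unbalanced candidate with cross-modal score $s_c = 0.8$ and reference score $s_r = 0.2$; the symbols are chosen here purely for illustration. Comparing the harmonic mean $H$ with the arithmetic mean $A$:

$$
H = \frac{2 \cdot 0.8 \cdot 0.2}{0.8 + 0.2} = 0.32, \qquad A = \frac{0.8 + 0.2}{2} = 0.5
$$

A balanced candidate with $s_c = s_r = 0.5$ has the same arithmetic mean but $H = 0.5$, so harmonic fusion ranks it above the unbalanced one.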
Limitations exposed by BRACE (e.g., plateauing F1 on human–human or subtle corruption discrimination) motivate further research into embedding architectures, domain coverage, and calibration. A plausible implication is that future iterations may integrate more granular, context-sensitive, or hierarchical fusion approaches to overcome current weaknesses in syntactic and fine-grained quality assessment.