
Context-Dependent Evaluation Metrics

Updated 6 November 2025
  • Context-Dependent Evaluation Metrics are quantitative measures that incorporate situational factors, such as discourse history and task-specific conditions, to better reflect human judgment.
  • They employ composite signals, dynamic weighting, and context-infused approaches to assess coherence, relevance, and semantic similarity more effectively.
  • Empirical evidence shows that using context-aware metrics significantly improves alignment with human evaluations in tasks like dialogue, translation, and narrative generation.

Context-dependent evaluation metrics are quantitative measures designed to assess the quality or appropriateness of models or system outputs by explicitly considering contextual information—such as discourse history, input document, conversation, prior labels, or application-specific conditions—instead of evaluating in isolation. These metrics are formulated to align more closely with human judgment and task requirements, which are inherently context-sensitive. They address the limitations of classical, context-agnostic metrics that often fail to capture phenomena like coherence, contextual fit, or the influence of surrounding data on meaning, relevance, or utility.

1. Fundamental Principles of Context-Dependency in Evaluation

Context-dependent evaluation metrics are motivated by the observation that system performance or output quality cannot be adequately measured without accounting for extrinsic situations, histories, or inputs that affect interpretation or task success. In language tasks, for example, naturalness, coherence, and appropriateness are not solely local properties of an utterance or sentence, but depend on their relationship to adjoining discourse or input prompts. Similarly, in similarity measurement, the perceived distance between objects can shift depending on the distribution and properties of the dataset in which they are embedded (1304.1084).

A central design aspect is that these metrics must adjust their computation or weighting based on the context, with the aim of reflecting distinctions and dependencies that are salient to humans or crucial for the end application. This can involve dynamic attribute weighting, two-stage computation (e.g., similarity plus contextual fit), or the use of composite signals like those derived from neural models trained with context-aware objectives.

2. Methodological Approaches to Context-Dependent Metrics

2.1 Composite and Augmented Metrics

Several paradigms have been proposed for constructing context-dependent metrics:

  • Composite Metrics: Combine signals for meaning preservation and contextual compatibility. CtxSimFit for stylistic text rewriting is a prototypical example:

\text{CtxSimFit} = \alpha \cdot \mathrm{BERTSCORE}(S, X) + (1 - \alpha) \cdot \mathrm{NSP}(C, X)

where S is the original sentence, X the rewrite, C the preceding context, BERTSCORE quantifies semantic similarity, and NSP (Next Sentence Prediction) measures contextual cohesiveness (Yerukola et al., 2023). A minimal code sketch follows this list.

  • Context-Infused Variants: Standard metrics are adapted by concatenating or prepending context to the input before computing similarity or alignment, e.g., sim(C + I, X). Examples include ROUGE-Ctx, SBERT-Ctx, and related BERTScore variants.
  • Perturbation or Attribution-Based Metrics: In document-level machine translation, metrics like CXMI (Conditional Cross-Mutual Information) use entropy differences with and without context to quantitatively assess the sensitivity and utility of context (Mohammed et al., 2 Feb 2024).
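
For concreteness, the composite CtxSimFit score above can be sketched in a few lines. The snippet below is a minimal illustration, assuming the `bert-score` package for the BERTScore term, BERT's next-sentence-prediction head for the NSP term, and an arbitrary α = 0.5; it is not the authors' reference implementation.

```python
# Minimal CtxSimFit sketch: alpha * BERTScore(S, X) + (1 - alpha) * NSP(C, X).
# Model choices and the alpha value are illustrative assumptions, not the
# authors' configuration.
import torch
from bert_score import score as bertscore
from transformers import BertForNextSentencePrediction, BertTokenizer

_tok = BertTokenizer.from_pretrained("bert-base-uncased")
_nsp = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

def nsp_prob(context: str, rewrite: str) -> float:
    """P(rewrite is a plausible continuation of context) from BERT's NSP head."""
    enc = _tok(context, rewrite, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = _nsp(**enc).logits
    # Class 0 of the NSP head corresponds to "is the next sentence".
    return torch.softmax(logits, dim=-1)[0, 0].item()

def ctx_sim_fit(source: str, rewrite: str, context: str, alpha: float = 0.5) -> float:
    """Composite score: semantic similarity to the source plus contextual fit."""
    _, _, f1 = bertscore([rewrite], [source], lang="en")  # BERTScore F1
    return alpha * f1.item() + (1 - alpha) * nsp_prob(context, rewrite)
```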

2.2 Context-Driven Metric Families

Other approaches formalize the context-dependence structurally:

  • Contextual Weighting in Similarity: Attribute-weighted dissimilarity, where the importance of each attribute is a function of its distribution in the dataset, e.g.,

d(c_i, c_k) = \sum_{j=1}^{n} h(p_j) \cdot e(C_{ij}, C_{kj})

where h(p) is typically an entropy-motivated function that decreases for rare or uniform attributes (1304.1084); a code sketch follows this list.

  • Contextual Topic Coherence: Contextualized Topic Coherence (CTC) metrics use LLMs or masked language models to evaluate whether top-N topic words cohere in real sentence contexts, outperforming co-occurrence-based metrics (Rahimi et al., 2023).
  • Rank Aggregation and Calibration: MetaMetrics learns a combination of multiple existing metrics, calibrated per context using human judgment as supervision, to maximize alignment with user preferences in specific environments (Winata et al., 3 Oct 2024).
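
To make the attribute-weighted dissimilarity concrete, the following sketch computes per-attribute weights from a categorical dataset and applies them to a simple mismatch indicator. The specific weighting function h used here (an inverse-entropy form) is an illustrative assumption rather than the exact function from (1304.1084).

```python
# Sketch of context-weighted dissimilarity over categorical attributes:
# each attribute's mismatch e(x_j, y_j) is scaled by a weight h(p_j) derived
# from that attribute's value distribution in the dataset. The exact form of
# h below is an illustrative assumption.
import math
from collections import Counter
from typing import List, Sequence

def attribute_weights(data: Sequence[Sequence[str]]) -> List[float]:
    """One weight per attribute column, computed from its value distribution."""
    weights = []
    for j in range(len(data[0])):
        counts = Counter(row[j] for row in data)
        total = sum(counts.values())
        entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
        # Down-weight attributes whose values are near-uniform (high entropy);
        # the cited paper motivates h(p) from entropy but may use another form.
        weights.append(1.0 / (1.0 + entropy))
    return weights

def dissimilarity(x: Sequence[str], y: Sequence[str], weights: Sequence[float]) -> float:
    """d(x, y) = sum_j h(p_j) * e(x_j, y_j), with e as a mismatch indicator."""
    return sum(w * float(xj != yj) for w, xj, yj in zip(weights, x, y))

# The same pair of objects can be closer or farther apart depending on the
# dataset from which the attribute weights are computed.
data = [["red", "small"], ["red", "large"], ["blue", "small"], ["red", "small"]]
print(dissimilarity(["red", "small"], ["blue", "large"], attribute_weights(data)))
```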

3. Domains of Application

3.1 Natural Language Generation and Stylistic Rewriting

In tasks such as formality transfer, sentiment manipulation, and detoxification, classic metrics like ROUGE or SBERT were found to align poorly with human preferences (ρ = 0.09–0.23). Incorporating context, as with CtxSimFit (ρ = 0.7–0.9), dramatically improves correlation with perceived fit and naturalness (Yerukola et al., 2023). Human raters strongly prefer outputs evaluated and generated in context, particularly in multi-turn dialogues and document-level tasks.

3.2 Dialogue and Open-Ended Generation

BLEU and similar n-gram-based metrics fail in conversational settings due to the diversity of valid responses. Contextualized metrics, for instance using BERT embeddings for both referenced and unreferenced relevance, demonstrate much higher correlation with human judgments (Spearman 0.45 vs 0.23) (Ghazarian et al., 2019). This trend holds broadly in dialogue, question answering, and validation of LLMs as judges (Xu et al., 19 Mar 2025).

3.3 Machine Translation and Retrieval-Augmented Generation

In document-level MT, both contrastive evaluation (accuracy on ambiguous items) and context utilization (measured with perturbation or attribution) are necessary. Attribution-based analysis demonstrates which context tokens actually contribute to resolving discourse phenomena—complementing (but not replacing) accuracy-based evaluation (Mohammed et al., 2 Feb 2024). New extraction pipelines (CTXPRO) and diagnostic suites also provide context-dependent subsets for focused evaluation (Wicks et al., 2023).
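
As a rough illustration of the perturbation-based context-utilization measure discussed in Section 2.1, a CXMI-style probe can be approximated by comparing a model's loss on the reference translation with and without the context prepended to the source. The sketch below assumes a generic Hugging Face seq2seq model and simplifies the published formulation (it ignores target-side context and token-level scores); it is not the authors' implementation.

```python
# Rough CXMI-style probe: difference in reference NLL with vs. without context.
# Assumes a Hugging Face seq2seq MT model; this is a simplified stand-in for
# the published CXMI formulation.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

def reference_nll(model, tokenizer, source: str, reference: str) -> float:
    """Mean per-token negative log-likelihood of the reference given the source."""
    inputs = tokenizer(source, return_tensors="pt", truncation=True)
    labels = tokenizer(text_target=reference, return_tensors="pt", truncation=True).input_ids
    with torch.no_grad():
        loss = model(**inputs, labels=labels).loss
    return loss.item()

def cxmi(model, tokenizer, context: str, source: str, reference: str) -> float:
    """Positive values mean the prepended context makes the reference more likely."""
    without_ctx = reference_nll(model, tokenizer, source, reference)
    with_ctx = reference_nll(model, tokenizer, context + " " + source, reference)
    return without_ctx - with_ctx
```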

Retrieval-augmented generation contexts are now measured for completeness and redundancy through question-based coverage metrics, e.g.,

\mathrm{Cov}(Z) = \frac{|\{\, q \in Q \mid \max_{p \in Z} G(p, q, I_g) \geq \eta \,\}|}{|Q|}

where Z is the set of retrieved passages, Q is a set of required sub-questions derived from a gold summary, G is a grading function that scores whether a passage answers a sub-question under instruction I_g, and η is an acceptance threshold (Ju et al., 24 Jun 2025).
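
A minimal sketch of this coverage computation is given below; the grading function G is left abstract (in the cited work it is an LLM judge operating under an instruction I_g), so the callable passed in here is a stand-in assumption.

```python
# Sketch of question-based coverage: a sub-question counts as covered if some
# retrieved passage is judged to answer it with score >= eta. The grader is
# abstract here; the cited work uses an LLM judge.
from typing import Callable, Sequence

def coverage(passages: Sequence[str],
             sub_questions: Sequence[str],
             grade: Callable[[str, str], float],
             eta: float = 0.5) -> float:
    """Cov(Z): fraction of sub-questions answerable from the retrieved passages."""
    if not sub_questions:
        return 0.0
    covered = sum(
        1 for q in sub_questions
        if max(grade(p, q) for p in passages) >= eta
    )
    return covered / len(sub_questions)
```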

4. Experimental Performance and Human Alignment

The inclusion of context in metrics substantially raises their alignment with human evaluations across modalities:

| Metric Type | Typical Human Corr. (Spearman’s ρ) | Robustness/Significance |
| --- | --- | --- |
| Sentence-level | 0.09–0.3 | Weak/not significant (Yerukola et al., 2023) |
| Contextual | 0.4–0.7 | Significant improvement (Yerukola et al., 2023; Agrawal et al., 13 Mar 2024) |
| Composite (e.g., CtxSimFit) | 0.7–0.9 | Highly significant (Yerukola et al., 2023) |

These results are consistent across stylistic rewriting, chat translation (Context-MQM outperforms non-contextual metrics), topic modeling, open-domain dialogue, and compositional vision-language tasks (Agrawal et al., 13 Mar 2024, Rahimi et al., 2023, Kasaei et al., 25 Sep 2025).

Experiments also show that context inclusion is most beneficial where ambiguity or contextual referent resolution is required (e.g., pronouns, ellipsis, topic coherence), and can be counterproductive or neutral when context is irrelevant or the context signal is of poor quality.
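
For reference, the correlations summarized in the table above are computed as Spearman's ρ between per-output metric scores and human ratings. A minimal sketch with hypothetical numbers, assuming SciPy:

```python
# Hypothetical illustration of how metric-human alignment is quantified:
# Spearman's rank correlation between per-output metric scores and human ratings.
from scipy.stats import spearmanr

metric_scores = [0.71, 0.35, 0.88, 0.52, 0.60]  # hypothetical automatic metric values
human_ratings = [4, 2, 5, 3, 3]                 # hypothetical human quality judgments
rho, p_value = spearmanr(metric_scores, human_ratings)
print(f"Spearman's rho = {rho:.2f} (p = {p_value:.3f})")
```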

5. Limitations of Context-Agnostic Metrics and Challenges

Traditional metrics are often context-ignorant and thereby fail in several dimensions:

  • Failure to detect incoherence or ambiguity introduced by ignoring context, especially for generative or adaptive NLG settings (Yerukola et al., 2023).
  • Limited discrimination power on context-sensitive phenomena in MT or QA where a shift in context alters the correct outcome (Wicks et al., 2023).
  • Neglect of user or practitioner-specific requirements; context-dependent utility and acceptability are not captured (Shrivastava et al., 14 Apr 2025).
  • Biases and adverse behaviors such as over- or under-generation of detail with changing context length, not reflected in baseline metrics (An et al., 2023).
  • Non-transferability across domains due to reliance on surface overlap or static weighting (Sai et al., 2020).

A plausible implication is that metrics must undergo context-aware validation and benchmarking, as performance ranking can vary by system, task, and dataset (Wei, 2019).

6. Implications for Metric Design and Best Practices

Context-dependent evaluation has foundational implications:

  • Metric design must align with desired human and stakeholder judgments, often requiring composite or calibratable models (e.g., MetaMetrics (Winata et al., 3 Oct 2024), DICE framework (Shrivastava et al., 14 Apr 2025)).
  • Evaluation suites should include diverse, context-rich, and annotated datasets (e.g., CTXPRO, discourse-rich test sets (Wicks et al., 2023, Mohammed et al., 2 Feb 2024)) to make effects of context measurable and meaningful.
  • Interpretability and diagnostic power are increased: attribution, perturbation, and coverage metrics enable fine-grained error analysis.
  • Conditional and hierarchical evaluation criteria (e.g., ContextualJudgeBench) are critical for tasks involving multi-step or practitioner-aligned priorities (Xu et al., 19 Mar 2025).

Best practices for validation include collecting system outputs spanning the expected context-dependent spectrum, leveraging direct human assessment, applying statistical analyses at both segment and system level, and reporting context conditions for each evaluation (Wei, 2019).

7. Open Challenges and Future Directions

  • Scaling and transferability: Many context-dependent metrics are currently task- or application-specific and may require reengineering for new environments.
  • Efficiency: LLM-based or attribution-based metrics can be computationally intensive.
  • Data requirements: Context-rich annotated datasets are still rare for many phenomena and languages.
  • Pluralism and subjectivity: With increasing stakeholder participation (cf. DICE), context-specific weighting and aggregation may yield divergent metric priorities (Shrivastava et al., 14 Apr 2025).
  • Composite calibration: Meta-metrics like MetaMetrics provide a principled way to optimally combine and calibrate context-dependent signals per task and domain (Winata et al., 3 Oct 2024).

A plausible implication is that as models and systems become ever more context-adaptive, evaluators will need to combine advances in context-aware metric design, automated calibration, and practitioner-specific criteria to achieve reliability and human alignment. This suggests the ongoing need for research into both methodology and infrastructure for context-rich, dynamic, and transparent evaluation.
