Unsupervised Evaluation Metrics for Dialogue
- The paper introduces unsupervised evaluation metrics for dialogue that overcome the limitations of word-overlap metrics through context-aware and causal formulations.
- It compares classical methods like BLEU and ROUGE with advanced hybrid approaches such as RUBER and CausalScore, demonstrating varying alignments with human judgments.
- The review emphasizes future directions including modular frameworks, integration of causal discovery, and robust, language-agnostic evaluation techniques.
Unsupervised evaluation metrics for dialogue are a class of automatic measures that assess the quality, relevance, or appropriateness of system-generated responses in both open-domain and task-oriented dialogue systems without relying on supervised human labels or task-completion signals. These metrics have evolved from simple word overlap and embedding-based similarity scores to more sophisticated context-aware, reference-free, and causality-driven formulations. The field is driven by the challenge that dialogue response generation is inherently a one-to-many problem, where myriad appropriate responses exist for a given context, rendering surface-level similarity metrics insufficiently reflective of human judgment.
1. Classical Word-Overlap and Embedding-Based Metrics
Early unsupervised evaluation in dialogue repurposed metrics from machine translation and summarization, notably BLEU, METEOR, and ROUGE. These approaches compute lexical similarity between generated and reference responses using n-gram precision (with a brevity penalty for BLEU), explicit token alignment with synonym and paraphrase matching (METEOR), or the length of the longest common subsequence (ROUGE-L). Embedding-based metrics introduced a layer of semantic approximation by leveraging static or pretrained word vector representations (a code sketch follows the list):
- Greedy Matching: Each word in the reference is matched to the most similar word in the candidate using cosine similarity, with symmetry achieved by averaging in both directions.
- Embedding Average: Mean of word embeddings per sentence, with response similarity as their cosine similarity.
- Vector Extrema: Per-dimension extrema (max or min in absolute value) across sentence word vectors, capturing the most salient words.
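A minimal sketch of these three similarity scores, assuming pre-tokenized inputs and a hypothetical `emb` dict mapping tokens to NumPy word vectors (e.g., loaded from Word2Vec or GloVe):

```python
import numpy as np

def _vectors(tokens, emb):
    """Stack the embedding vectors available for a token list."""
    return np.array([emb[t] for t in tokens if t in emb])

def _cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def greedy_matching(reference, candidate, emb):
    """Average best cosine match per word, symmetrized over both directions."""
    def one_way(src, tgt):
        return np.mean([max(_cosine(v, w) for w in tgt) for v in src])
    r, c = _vectors(reference, emb), _vectors(candidate, emb)
    return 0.5 * (one_way(r, c) + one_way(c, r))

def embedding_average(reference, candidate, emb):
    """Cosine similarity between the mean word vectors of the two sentences."""
    return _cosine(_vectors(reference, emb).mean(axis=0),
                   _vectors(candidate, emb).mean(axis=0))

def vector_extrema(reference, candidate, emb):
    """Cosine similarity between per-dimension extrema (largest absolute value)."""
    def extrema(vs):
        mx, mn = vs.max(axis=0), vs.min(axis=0)
        return np.where(np.abs(mx) >= np.abs(mn), mx, mn)
    return _cosine(extrema(_vectors(reference, emb)),
                   extrema(_vectors(candidate, emb)))
```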
Formally, BLEU-N is given by

$$\text{BLEU-N} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right),$$

where the n-gram precision is

$$p_n = \frac{\sum_{g \in \text{n-grams}(\hat{r})} \min\big(\text{Count}_{\hat{r}}(g),\ \text{Count}_{r}(g)\big)}{\sum_{g \in \text{n-grams}(\hat{r})} \text{Count}_{\hat{r}}(g)},$$

$\text{Count}_{\cdot}(g)$ counts occurrences of n-gram $g$ in the candidate $\hat{r}$ or reference $r$, $w_n$ are (typically uniform) weights, and $\text{BP} = \min\!\big(1,\ e^{1 - |r|/|\hat{r}|}\big)$ is the brevity penalty.
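For concreteness, a sentence-level BLEU-N sketch implementing the formula above (smoothing and corpus-level aggregation, which standard toolkits add, are omitted here):

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu_n(reference, candidate, max_n=4):
    """Sentence-level BLEU-N with uniform weights and a brevity penalty."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
        overlap = sum(min(count, ref[g]) for g, count in cand.items())
        total = sum(cand.values())
        if total == 0 or overlap == 0:
            return 0.0  # unsmoothed: any empty precision zeroes the score
        log_precisions.append(math.log(overlap / total))
    brevity_penalty = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)
```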
While these metrics offered computational efficiency and some positive signal in tasks with highly constrained outputs, empirical studies demonstrated poor and often inconsistent correlation with human judgments in non-task-oriented, open-domain settings; for example, correlations were weak on chitchat-style Twitter data and effectively null on technical Ubuntu dialogues (Liu et al., 2016).
Summary Table: Classical Metrics
| Metric | Approach | Key Limitation |
|---|---|---|
| BLEU, ROUGE | N-gram overlap | Penalizes legitimate paraphrase |
| METEOR | Word alignment w/ synonyms | Word-order sensitivity; ignores dialogue context |
| Embedding Avg. | Semantic vector similarity | Overlooks pragmatic relevance |
2. Context Awareness and Hybrid Approaches
Recognition of context-blindness and oversensitivity to surface similarity in classical metrics led to more context-sensitive formulations such as RUBER (Tao et al., 2017). RUBER introduced a hybrid metric comprising:
- Referenced sub-metric: Cosine similarity between max–min pooled word embeddings of generated and reference replies.
- Unreferenced sub-metric: A neural network scores the appropriateness of a reply with respect to the preceding query (context), trained with negative sampling and requiring no human annotation.
The two scores are normalized to (0,1) and heuristically combined (e.g., min, mean), balancing faithfulness to the reference with generic contextual relevance. Empirical results showed that unreferenced components aligned more strongly with human judgment than reference-based components, with combination strategies outperforming either alone.
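A schematic of RUBER's hybrid scoring, assuming `emb` is a token-to-vector dict and `unreferenced_model` is a hypothetical trained scorer mapping a (query, reply) pair to a probability; RUBER trains this scorer with negative sampling (randomly paired replies as negatives), and the cosine-to-(0,1) rescaling below is a simplification of the paper's normalization:

```python
import numpy as np

def pool_max_min(tokens, emb):
    """Sentence vector: concatenation of per-dimension max and min pooling."""
    vs = np.array([emb[t] for t in tokens if t in emb])
    return np.concatenate([vs.max(axis=0), vs.min(axis=0)])

def referenced_score(reference, reply, emb):
    """Cosine similarity between pooled reference and generated-reply vectors."""
    a, b = pool_max_min(reference, emb), pool_max_min(reply, emb)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def ruber_score(query, reference, reply, emb, unreferenced_model, strategy="mean"):
    """Heuristically combine the normalized referenced and unreferenced scores."""
    s_ref = (referenced_score(reference, reply, emb) + 1) / 2   # rescale cosine to (0, 1)
    s_unref = unreferenced_model(query, reply)                  # assumed to lie in (0, 1)
    return min(s_ref, s_unref) if strategy == "min" else (s_ref + s_unref) / 2
```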
3. Reference-Free, Retrieval-Inspired, and Causality-Based Metrics
Recent advances further reduce dependence on references and instead emphasize contextual, interactional, and causal relationships:
- USR (Mehri et al., 2020): Decomposes evaluation into sub-metrics for understandability, naturalness, context maintenance, interestingness, and knowledge use, all computed using pretrained language models (e.g., RoBERTa) without ground-truth references. A regression layer combines the sub-metrics, aligning well with human ratings at both the system and turn level.
- FED (Mehri et al., 2020): Utilizes a pretrained conversational model (DialoGPT) to score dialogue turns by the log-likelihood of "positive" versus "negative" follow-up utterances, capturing fine-grained dialogue properties with reasonable correlation to human judgments (a scoring sketch appears at the end of this subsection).
- CausalScore (Feng et al., 25 Jun 2024): Introduces a formal causal strength measure between dialogue history and response using classifier-based (un)conditional independence tests, averaging probabilities over candidate causal utterances. Empirically, CausalScore surpasses both reference-based (BLEU, ROUGE, etc.) and LLM judgment-based metrics in measuring response relevance, indicating that causal dependencies better capture human notions of contextual appropriateness.
Formula Example: CausalScore. For a dialogue history and a candidate response, classifiers estimate the probability of unconditional and conditional dependence between the response and each identified causal utterance; CausalScore averages these probabilities over the set of candidate causal utterances.
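A schematic rendering of this average, with notation assumed here rather than taken verbatim from the paper: let $\mathcal{S}$ be the set of candidate causal utterances $u_i$ identified in the history $h$, $r$ the response, and $P_{\mathrm{u}}$, $P_{\mathrm{c}}$ the classifier-estimated probabilities of unconditional and conditional dependence (the latter given a conditioning set $Z_i$ of other history utterances):

$$\text{CausalScore}(h, r) \;=\; \frac{1}{|\mathcal{S}|} \sum_{u_i \in \mathcal{S}} \frac{P_{\mathrm{u}}(u_i, r) + P_{\mathrm{c}}(u_i, r \mid Z_i)}{2}$$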
CGDIALOG+ (Feng et al., 25 Jun 2024), a dataset annotated with utterance-level causal relations, is used to train and validate these classifiers, supporting robust learning of dialog causality.
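Returning to FED, a simplified follow-up-likelihood sketch using DialoGPT via the Hugging Face transformers library; the positive and negative follow-ups below are illustrative placeholders, whereas FED uses curated follow-up sets per dialogue quality:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-large")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-large").eval()

def follow_up_loglik(history: str, follow_up: str) -> float:
    """Log-likelihood of a follow-up utterance given the dialogue history."""
    ctx = tokenizer.encode(history + tokenizer.eos_token, return_tensors="pt")
    tgt = tokenizer.encode(follow_up + tokenizer.eos_token, return_tensors="pt")
    input_ids = torch.cat([ctx, tgt], dim=-1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Each position predicts the next token; keep only the follow-up positions.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = input_ids[0, 1:]
    token_scores = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    return token_scores[ctx.shape[1] - 1:].sum().item()

def fed_style_score(history: str, positive: str, negative: str) -> float:
    """Higher when the positive follow-up is more likely than the negative one."""
    return follow_up_loglik(history, positive) - follow_up_loglik(history, negative)

# Example with placeholder follow-ups (DialoGPT separates turns with its EOS token):
score = fed_style_score(
    "How was the concert last night?<|endoftext|>It was amazing, the band played for three hours!",
    positive="Wow, that sounds great!",
    negative="That doesn't make any sense.",
)
```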
4. Multidimensional and Configurable Evaluation
There is increasing recognition that “overall quality” is an amalgam of distinct properties (e.g., fluency, specificity, empathy, coherence). USL-H (Phy et al., 2020) operationalizes this by hierarchically composing:
- Understandability (via valid utterance prediction),
- Sensibleness (via context/response coherence),
- Likability (task-specific: e.g., specificity, empathy).
The three sub-scores are combined hierarchically with tunable weights and modular sub-metrics, yielding a plug-and-play framework whose subcomponents can be swapped as tasks demand. Fine-grained decomposition enables interpretability and adaptation to different dialogue requirements.
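One schematic way to write such a hierarchical composition (an illustrative form with assumed weights, not necessarily the exact configuration in the USL-H paper), in which lower-level qualities gate the higher-level ones:

$$\text{USL-H}(c, r) \;=\; \frac{\alpha\, U(r) + \beta\, U(r)\, S(c, r) + \gamma\, U(r)\, S(c, r)\, L(c, r)}{\alpha + \beta + \gamma}$$

Here $U$, $S$, and $L$ are the understandability, sensibleness, and likability sub-scores, $c$ and $r$ the context and response, and $\alpha, \beta, \gamma$ tunable weights.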
Complementary approaches, such as FineD-Eval (Zhang et al., 2022), ensemble coherence, likability, and topic depth; MME-CRS (Zhang et al., 2022) fuses five sub-metrics (fluency, relevance, topic coherence, engagement, specificity) with adaptive, correlation-based rescaling; ABC-Eval (Finch et al., 2022) applies dimensional, binary labels for turn-level behaviors in human evaluation, suggesting a framework for future unsupervised metrics to replicate.
5. Task-Oriented Dialogue and Multi-Level Evaluation
In closed-domain, task-oriented scenarios, response variability is lower. Metrics such as METEOR demonstrate stronger, though not universally high, correlation with human judgment, especially where multiple gold references exist (Sharma et al., 2017). Nevertheless, the community observes that current corpora are often “solved” by simple models, suggesting the need for more challenging datasets.
Newer frameworks such as TD-EVAL (Acikgoz et al., 28 Apr 2025) specifically address the dual nature of TOD evaluation by uniting fine-grained, turn-level rating (covering conversation cohesion, backend knowledge consistency, and policy compliance—often using LLM-based judges) with holistic, pairwise dialogue-level comparisons in an “arena” setup. Elo ratings are adapted to summarize model performance across head-to-head competitions, efficiently surfacing intermediate and overall errors.
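As an illustration of the arena-style aggregation, a standard Elo update over pairwise dialogue-level judgments (the K-factor and initial rating below are conventional defaults, not values specified by TD-EVAL):

```python
from collections import defaultdict

def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Standard Elo update; score_a is 1.0 for a win, 0.5 for a tie, 0.0 for a loss."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

def rank_systems(pairwise_results, initial: float = 1000.0):
    """pairwise_results: iterable of (system_a, system_b, score_a) comparisons."""
    ratings = defaultdict(lambda: initial)
    for a, b, score_a in pairwise_results:
        ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], score_a)
    return dict(sorted(ratings.items(), key=lambda kv: -kv[1]))
```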
6. Limitations, Alignment Challenges, and Future Directions
Empirical studies consistently show that classical metrics poorly reflect human assessment in open-domain dialogue, particularly in the presence of diverse valid outputs or required technical accuracy (Liu et al., 2016). Deficiencies include insensitivity to paraphrase, context neglect, inability to weight salient content, and overreliance on token-level similarity.
Recent metrics leveraging context, unsupervised neural representations, causality, or multitask frameworks demonstrate stronger human correlation, but are not without limitations. These include potential failure to generalize across domains, need for carefully curated or annotated data (e.g., for context-targeted classifier training), vulnerability to “gaming,” and incomplete coverage of conversational facets (such as consistency, inquisitiveness, or error recovery).
Future research is marked by several trends:
- Integration of causal discovery into reference-free evaluation,
- Modular, multi-dimensional frameworks (allowing pluggable submetrics for task adaptation),
- Robustness across languages and paraphrastic responses (e.g., via LLM ensemble prompting (Mendonça et al., 2023), multilingual benchmarks (Zhang et al., 2023)),
- Direct automation of fine-grained human behavior annotations, and
- Development of challenging new corpora that drive progress in both metric and system advancement.
Metrics that blend interpretability, context-sensitivity, multidimensionality, and valid causal abstraction are positioned to offer more faithful and actionable proxies for human judgment in dialogue system evaluation.