MTJ-Bench-ir: Irrelevant Follow-Ups in Dialogue Eval
- The paper introduces a framework that employs fixed negative follow-up prompts to assess dialogue quality by capturing off-topic or unsatisfactory responses.
- It scores dialog turns with both log-likelihood scoring (FULL) and learning-to-rank retrieval models to detect semantic irrelevance, showing high empirical correlation with human ratings.
- The approach provides actionable insights for improving conversational agents by identifying systematic dialogue failures and guiding model calibration.
Irrelevant Follow-Up (MTJ-Bench-ir) in Open-Domain Dialogue Evaluation
Irrelevant Follow-Up (MTJ-Bench-ir) refers to the evaluation of open-domain conversational agents through the detection and scoring of “irrelevant” or semantically off-topic follow-up utterances. The paradigm operationalizes dialog quality by measuring the likelihood that a generic LLM would continue a given dialog turn with one of several fixed negative prompts (“Not really relevant here.”, “You’re really confusing.”), which act as proxies for human dissatisfaction or conversational breakdown. Approaches in this domain encompass both direct log-likelihood scoring (FULL) and learning-to-rank retrieval models in which irrelevant, paraphrased, or confounded questions are generated and used as negative evaluation candidates. The framework distinguishes itself from previous reference-free metrics by focusing explicitly on irrelevance detection via fixed utterances, and it shows strong empirical correlation with human ratings.
1. Background and Definition of Irrelevant Follow-Up Prompts
Irrelevant follow-up prompts serve as canonical negative feedback utterances that are likely to be generated in response to unsatisfactory system turns. In automated dialog evaluation, Mehri & Eskenazi’s FED metric originally introduced a curated set of 63 follow-ups capturing 16 dimensions of interaction quality, including relevance and engagement. De Bruyn et al. refined this approach by selecting five top-scoring prompts for irrelevance detection: “Not really relevant here.”, “You’re really confusing.”, “You’re really boring.”, “What are you trying to say?”, and “You don’t seem interested.” (De Bruyn et al., 2022). These utterances are short, context-independent, and effectively capture instances where a system response diverges from the dialog topic or fails semantically. Selection is based on statistical correlation (Spearman’s ρ) with human overall quality ratings at both turn and dialog levels.
2. FULL: Follow-Up Log-Likelihood Scoring Framework
The Follow-Up Log-Likelihood (FULL) evaluation framework computes the probability that a pretrained LLM $M$ (specifically BlenderBot-400M) will continue a conversation with each of the fixed follow-up prompts given the current context. Formally, given a dialog history $h$ and a candidate system response $r$, the context $c = (h, r)$ is concatenated and passed to $M$. For each follow-up $f_i$, the model assigns a log-likelihood

$$\ell_i = \log P_M(f_i \mid c) = \sum_{t} \log P_M(f_{i,t} \mid c, f_{i,<t}).$$

The overall FULL score is then:

$$\mathrm{FULL}(c) = -\frac{1}{n} \sum_{i=1}^{n} \ell_i.$$

A higher $\mathrm{FULL}(c)$ indicates that none of the negative follow-ups is likely under the context, signifying a well-formed, relevant system turn. Scoring requires $n$ separate forward passes, one per follow-up prompt.
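The scoring loop can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a helper `token_logprobs(context, continuation)` that returns per-token log-probabilities from the underlying LM (BlenderBot-400M in the paper); the stub model below is purely illustrative.

```python
import math
from typing import Callable, List

# The five fixed negative follow-up prompts from De Bruyn et al. (2022).
FOLLOW_UPS = [
    "Not really relevant here.",
    "You're really confusing.",
    "You're really boring.",
    "What are you trying to say?",
    "You don't seem interested.",
]

def full_score(context: str,
               token_logprobs: Callable[[str, str], List[float]],
               follow_ups: List[str] = FOLLOW_UPS) -> float:
    """Negative mean log-likelihood of the fixed follow-ups given the context.

    Higher is better: after a well-formed system turn, every negative
    follow-up should be unlikely, so each log P_M(f_i | c) is very negative.
    One forward pass is needed per follow-up prompt.
    """
    lls = [sum(token_logprobs(context, f)) for f in follow_ups]
    return -sum(lls) / len(lls)

# Illustrative stub standing in for a real LM: every continuation token
# gets probability 0.5 regardless of context. A real scorer would query
# the LM for the conditional log-probability of each token.
def stub_logprobs(context: str, continuation: str) -> List[float]:
    return [math.log(0.5)] * len(continuation.split())
```

With a real model, `token_logprobs` would concatenate the context and follow-up, run one forward pass, and read off the log-probabilities at the follow-up token positions.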
3. Construction and Analysis of Irrelevant Candidates in Retrieval-Based Frameworks
Retrieval-based dialog evaluation frameworks generate “invalid” candidates to simulate various linguistic and semantic confounders. In "Learning to Retrieve Engaging Follow-Up Queries," the Follow-up Query Bank (FQ-Bank) dataset leverages six confounder classes to generate negatives: paraphrase confounders, irrelevant-entity/partial-entity substitution, irrelevant-context shifts, ASR error simulations (homophone substitution), random unrelated questions, and duplication of prior dialog history (Richardson et al., 2023). The construction employs algorithmic paraphrasing (BART), named entity replacement (spaCy/WikiData), and context mixing, yielding a collective distribution as summarized in the table below:
| Confounder Type | Generation Algorithm | Train Share (%) |
|---|---|---|
| Paraphrase | BART paraphrasing of the gold follow-up | 8.7 |
| Irrelevant-Entity/Partial | Entity replacement from WikiData | 26.0 |
| Irrelevant-Context | Template with matching entity type | 14.6 |
| ASR-Error | Homophone substitution for entity | 24.2 |
| Random-Question | Arbitrary OR-QuAC sample | 11.5 |
| Duplication | Repeat earlier question from history | 15.0 |
This process ensures diverse irrelevance, ranging from factual incoherence (irrelevant-context) to syntactic confusion (ASR errors).
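Two of the simpler confounder classes (duplication and random-question) can be sketched directly; the function names here are hypothetical, and the real FQ-Bank pipeline additionally uses BART paraphrasing, spaCy/WikiData entity replacement, and homophone substitution, which are omitted.

```python
import random

def duplication_confounder(history: list[str]) -> str:
    """Duplication confounder: repeat a question that already appeared
    earlier in the dialog history, verbatim."""
    questions = [turn for turn in history if turn.endswith("?")]
    return random.choice(questions)

def random_question_confounder(question_pool: list[str]) -> str:
    """Random-question confounder: sample an arbitrary, unrelated question
    (drawn from OR-QuAC in the paper)."""
    return random.choice(question_pool)
```

Both produce fluent, well-formed questions, which is the point: the ranker must reject them on relevance grounds, not surface quality.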
4. Experimental Protocols and Evaluation Metrics
Open-domain dialog evaluation through irrelevant follow-ups relies on benchmark datasets (e.g., the FED benchmark: 372 turn-level and 124 dialog-level annotated contexts) (De Bruyn et al., 2022). Human annotators rate overall conversational quality, providing a gold standard for metric calibration. Baselines for comparison include reference-free metrics such as QuestEval, MAUDE, DEB, GRADE, DynaEval, USR, USL-H, DialoRPT, HolisticEval, PredictiveEngage, original FED, and FlowScore.
For retrieval models (Richardson et al., 2023), the evaluation utilizes Mean Reciprocal Rank (MRR) and Recall@k (Hit@k), defined as:

$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}, \qquad \mathrm{Hit@}k = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \mathbf{1}\!\left[\mathrm{rank}_i \le k\right],$$

where $\mathrm{rank}_i$ is the 1-based position of the gold follow-up among the ranked candidates for query $i$.
Supervised transformer-based models (BERT-base, RoBERTa-base) are optimized using binary cross-entropy loss over positive/negative follow-ups, achieving high MRR (0.805–0.808) and Hit@1 of approximately 68% on test splits. Unsupervised baselines using GloVe or SBERT embeddings yield substantially lower performance.
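The two retrieval metrics are straightforward to compute once each query's gold rank is known; a minimal sketch, taking a list of 1-based gold ranks:

```python
def mrr(ranks: list[int]) -> float:
    """Mean Reciprocal Rank over 1-based gold ranks, one per query."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hit_at_k(ranks: list[int], k: int) -> float:
    """Hit@k (Recall@k): fraction of queries whose gold follow-up
    appears within the top k ranked candidates."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Example: four queries whose gold follow-ups rank 1st, 2nd, 5th, and 1st.
example_ranks = [1, 2, 5, 1]
```

For `example_ranks`, MRR is (1 + 1/2 + 1/5 + 1)/4 = 0.675 and Hit@1 is 2/4 = 0.5, mirroring how the MRR and Hit@1 figures above are aggregated over the test split.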
5. Results and Sensitivity to Irrelevance
Irrelevant follow-up prompts yield high empirical Spearman’s ρ with human dialog quality:
- “Not really relevant here.”: $\rho = 0.48$ (turn level), $\rho = 0.65$ (dialog level)
- “That’s not really relevant here.”: $\rho = 0.45$ (turn level), $\rho = 0.70$ (dialog level)
The aggregate FULL score provides the highest observed correlations among all evaluated metrics: $\rho = 0.51$ at the turn level and $\rho = 0.69$ at the dialog level (De Bruyn et al., 2022). This confirms that off-topic system utterances are efficiently captured via the probability assigned to irrelevance-oriented prompts. Retrieval models similarly demonstrate a strong ability to reject explicit confounders (duplication, ASR, random), but “irrelevant-context” negatives remain challenging because their surface-level semantic coherence masks factual irrelevance.
There is a marked asymmetry: negative feedback prompts (“irrelevant”, “confusing”) correlate more robustly with human ratings than positive ones such as “Great talking to you.”
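The Spearman correlations reported throughout this section can be reproduced with a short rank-correlation routine; the sketch below uses the classic formula and, as a simplifying assumption, handles no ties (real metric scores and Likert ratings usually require tie correction, e.g. via `scipy.stats.spearmanr`).

```python
def spearman_rho(xs: list[float], ys: list[float]) -> float:
    """Spearman's rho without tie correction:
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)),
    where d_i is the difference between the ranks of xs[i] and ys[i]."""
    def ranks(vals: list[float]) -> list[int]:
        # 1-based rank of each value within its own list.
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

In the evaluation protocol, `xs` would be the metric scores (e.g. FULL) over annotated contexts and `ys` the corresponding human quality ratings.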
6. Limitations and Future Directions
Limitations stem from domain specificity and model calibration. Generic dialog models may over-penalize valid but rare or highly technical turns, while certain pragmatic cues (“You don’t seem interested.”) may misfire for fact-based responses. FQ-Bank is synthesized and may not fully represent natural assistant usage; further confounder modes such as timeliness or user preferences are not included (Richardson et al., 2023).
Suggested future improvements include domain-specific adaptation of irrelevant follow-up sets, integration of knowledge-base grounding for entity verification, and annotation of real-world assistant data for enhanced model robustness. The core insight persists: scoring the likelihood of an “irrelevant” prompt provides a strong, unsupervised signal for dialog quality detection, outperforming more complex graph- or contrast-based evaluation strategies. A plausible implication is that further refinement in the selection and contextualization of negative follow-ups could yield even higher alignment with human conversational quality assessment.