Papers

Topics

Authors

Recent

View all

Assistant

AI Research Assistant

Well-researched responses based on relevant abstracts and paper content.

Custom Instructions Pro

Preferences or requirements that you'd like Emergent Mind to consider when generating responses.

Gemini 2.5 Flash

Gemini 2.5 Flash 73 tok/s

Gemini 2.5 Pro 51 tok/s Pro

GPT-5 Medium 31 tok/s Pro

GPT-5 High 32 tok/s Pro

GPT-4o 103 tok/s Pro

Kimi K2 218 tok/s Pro

GPT OSS 120B 460 tok/s Pro

Claude Sonnet 4.5 35 tok/s Pro

2000 character limit reached

Direct-Scoring NLG Evaluators Can Use Pairwise Comparisons Too (2509.05440v1)

Published 5 Sep 2025 in cs.CL, cs.AI, and cs.LG

Abstract: As large-LLMs have been increasingly used as automatic raters for evaluating free-form content, including document summarization, dialog, and story generation, work has been dedicated to evaluating such models by measuring their correlations with human judgment. For \textit{sample-level} performance, methods which operate by using pairwise comparisons between machine-generated text perform well but often lack the ability to assign absolute scores to individual summaries, an ability crucial for use cases that require thresholding. In this work, we propose a direct-scoring method which uses synthetic summaries to act as pairwise machine rankings at test time. We show that our method performs comparably to state-of-the-art pairwise evaluators in terms of axis-averaged sample-level correlations on the SummEval (\textbf{+0.03}), TopicalChat (\textbf{-0.03}), and HANNA (\textbf{+0.05}) meta-evaluation benchmarks, and release the synthetic in-context summaries as data to facilitate future work.