Ask, Don't Judge: Binary Questions for Interpretable LLM Evaluation
This presentation introduces BinEval, a novel framework that transforms opaque large language model evaluation into transparent, actionable feedback by decomposing holistic ratings into atomic binary questions. Rather than delivering inscrutable scores, BinEval generates precise yes-or-no queries for each quality dimension—consistency, coherence, fluency, relevance—and aggregates their answers into interpretable diagnostics. The approach not only matches or outperforms existing evaluators in human alignment but also enables iterative prompt optimization by surfacing exactly where and why outputs fail, making model improvement systematic rather than speculative.Script
Evaluating language model outputs today is like grading an essay with a single number and no rubric. You get a score, but no understanding of what went wrong or how to fix it.
BinEval solves this by decomposing evaluation into atomic binary questions. Instead of asking if a summary is good, it asks: Are all named entities accurate? Is there evidence of hallucination? Does each sentence follow logically? Each question targets a single checkable property, and the verdicts aggregate into transparent per-dimension scores.
The method works by meta-prompting an Large Language Model to generate questions tailored to the task, then answering each with a verdict and explanation. The result is not just a score, but a diagnostic map showing exactly which criteria passed and which failed.
On the SummEval benchmark, BinEval achieves a consistency correlation of 0.655 with human annotators, outperforming holistic evaluators that compress errors into single ratings. This advantage is most pronounced for factual consistency, where decomposed checks catch subtle hallucinations that black-box judges miss entirely.
BinEval's transparency enables a second breakthrough: iterative prompt optimization. When questions reveal patterns of failure, a note-taker extracts lessons and rewrites the prompt. Over just two iterations, this self-update loop lifts evaluator-human correlation by up to 0.136, turning vague feedback into concrete refinement.
BinEval proves that interpretability and performance are not trade-offs but complements. By asking precise questions instead of rendering vague judgments, we gain both the diagnostic clarity to fix what is broken and the empirical rigor to trust what works. Learn more and create your own research videos at EmergentMind.com.