This paper introduces EvalAgent, a framework designed to automatically discover implicit evaluation criteria for LLM outputs on complex writing tasks (Wadhwa et al., 21 Apr 2025). The core problem addressed is that standard evaluation often relies on criteria explicitly stated in the prompt (e.g., "write an academic talk") or very obvious unstated ones (e.g., "be coherent"), missing the nuanced, task-specific qualities that define high-quality writing (e.g., an academic talk should have a compelling opening, clear research questions, and a takeaway).
EvalAgent aims to uncover these implicit criteria—unstated but desired properties specific to the task—by leveraging expert knowledge available on the web. The framework operates in four steps (sketched in code after the list):
- Query Generator: Given a user prompt, an LLM generates conceptual search queries (e.g., "how to draft an academic talk," "how to write an engaging talk") designed to retrieve instructional web documents relevant to the type of writing requested, not just the topic.
- Expert Retriever: For each query, it searches the web, retrieves URLs, and filters them based on expertise and relevance to the original prompt using an LLM scorer. It then extracts answers to the query from the top-ranked filtered documents (e.g., university websites, expert blogs) and summarizes these answers into a query-specific list of criteria.
- Criteria Generator: It aggregates the criteria lists generated for all queries, synthesizes them into a unified list, and rewrites them to be specific evaluation points aligned with the original user prompt (e.g., "the response should focus on big picture questions").
- Ranking: The generated criteria are ranked based on their relevance to the user prompt using an LLM.
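A minimal Python sketch of this four-stage pipeline is given below. The `llm` and `web_search` callables are hypothetical stand-ins for whatever model and search API are available, and the prompt strings are illustrative rather than the paper's actual prompts.

```python
from typing import Callable, List

# Hypothetical interfaces: `llm` maps a prompt string to generated text;
# `web_search` maps a query to a list of retrieved page texts.
LLM = Callable[[str], str]
WebSearch = Callable[[str], List[str]]


def generate_queries(llm: LLM, user_prompt: str, n: int = 5) -> List[str]:
    """Step 1: conceptual search queries about the *type* of writing, not the topic."""
    out = llm(
        f"Given this writing task:\n{user_prompt}\n"
        f"Write {n} web search queries asking how to write this kind of document well, "
        "one per line."
    )
    return [q.strip() for q in out.splitlines() if q.strip()][:n]


def retrieve_expert_advice(llm: LLM, search: WebSearch, query: str,
                           user_prompt: str, top_k: int = 3) -> List[str]:
    """Step 2: retrieve pages, keep those scored as expert and relevant,
    then extract query-specific criteria from the survivors."""
    scored = []
    for page in search(query):
        score = llm(
            f"Rate 1-5 how much this page reads like expert advice relevant to the task "
            f"'{user_prompt}'. Reply with a single digit.\n\n{page[:2000]}"
        )
        try:
            scored.append((int(score.strip()[0]), page))
        except (ValueError, IndexError):
            continue
    kept = [p for s, p in sorted(scored, reverse=True)[:top_k] if s >= 3]
    criteria = []
    for page in kept:
        answer = llm(
            f"From the advice below, list concrete criteria answering '{query}', "
            f"one per line.\n\n{page[:4000]}"
        )
        criteria.extend(c.strip("- ").strip() for c in answer.splitlines() if c.strip())
    return criteria


def synthesize_and_rank(llm: LLM, user_prompt: str,
                        all_criteria: List[str], top_n: int = 10) -> List[str]:
    """Steps 3-4: merge per-query criteria into prompt-specific evaluation points,
    then rank them by relevance to the user prompt."""
    merged = llm(
        f"Combine and deduplicate these criteria, rewriting each as a specific "
        f"evaluation point for the task '{user_prompt}', one per line:\n"
        + "\n".join(all_criteria)
    )
    unified = [c.strip("- ").strip() for c in merged.splitlines() if c.strip()]
    ranked = llm(
        f"Rank these criteria by relevance to the task '{user_prompt}', most relevant "
        "first, one per line:\n" + "\n".join(unified)
    )
    return [c.strip("- ").strip() for c in ranked.splitlines() if c.strip()][:top_n]


def eval_agent(llm: LLM, search: WebSearch, user_prompt: str) -> List[str]:
    criteria = []
    for query in generate_queries(llm, user_prompt):
        criteria.extend(retrieve_expert_advice(llm, search, query, user_prompt))
    return synthesize_and_rank(llm, user_prompt, criteria)
```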
The paper proposes that ideal criteria should possess three properties (a code sketch of these metrics follows the list):
- Specificity (S): Using precise, less common terms (measured by Normalized Inverse Word Frequency).
- Implicitness (I): Not directly overlapping with the words in the original prompt (measured by 1 - Word Overlap).
- Actionability (A): Enabling tangible improvements when used to guide response revision (measured by the success rate of revising an initial response to satisfy the criterion).
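The sketch below illustrates how these three properties could be computed. The NIWF normalization (token corpus frequencies scaled by the most frequent token, inverted, and averaged) and the revision/judging prompts for actionability are plausible stand-ins rather than the paper's exact definitions; `llm` and `corpus_freq` are assumed inputs.

```python
import re
from typing import Callable, Dict, List

LLM = Callable[[str], str]  # hypothetical text-in/text-out LLM interface


def tokens(text: str) -> List[str]:
    return re.findall(r"[a-z']+", text.lower())


def specificity(criterion: str, corpus_freq: Dict[str, int]) -> float:
    """Specificity (S): rarer wording scores higher. One plausible reading of
    Normalized Inverse Word Frequency; the paper's exact normalization may differ."""
    max_freq = max(corpus_freq.values())
    toks = tokens(criterion)
    if not toks:
        return 0.0
    scores = [1.0 - corpus_freq.get(t, 1) / max_freq for t in toks]
    return sum(scores) / len(scores)


def implicitness(criterion: str, prompt: str) -> float:
    """Implicitness (I): 1 - word overlap between the criterion and the prompt."""
    crit, prom = set(tokens(criterion)), set(tokens(prompt))
    if not crit:
        return 0.0
    return 1.0 - len(crit & prom) / len(crit)


def actionability(llm: LLM, prompt: str, criteria: List[str]) -> float:
    """Actionability (A): fraction of criteria that a targeted revision can satisfy.
    The revision and judging prompts here are assumptions, not the paper's."""
    initial = llm(prompt)
    satisfied = 0
    for criterion in criteria:
        revised = llm(
            f"Revise the response below so it satisfies this criterion: "
            f"{criterion}\n\n{initial}"
        )
        verdict = llm(
            f"Does this response satisfy the criterion '{criterion}'? "
            f"Answer yes or no.\n\n{revised}"
        )
        satisfied += verdict.strip().lower().startswith("yes")
    return satisfied / len(criteria) if criteria else 0.0
```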
Experiments were conducted across nine datasets, including a newly collected dataset called Ask-then-Critique, where users evaluated LLM responses to their own prompts. EvalAgent's generated criteria (EA-Web) were compared against baselines such as Instruction Decomposition (ID, criteria derived explicitly from the prompt) and LLM-prompted criteria (LLM, plus LLM-n, which generates a larger pool of criteria and ranks them).
Key findings include:
- EvalAgent criteria exhibit higher specificity and implicitness scores compared to LLM-generated criteria.
- Human evaluations rated EvalAgent criteria as less obvious than LLM-generated ones while maintaining high utility.
- EvalAgent criteria demonstrated higher actionability: they led to larger improvements when used to guide response refinement and surfaced requirements that initial model outputs often failed to meet.
- Combining EvalAgent criteria with LLM-generated criteria (EA-Full) resulted in higher recall of human-written criteria compared to purely LLM-based methods generating a similar number of criteria.
The main contributions are the EvalAgent framework itself, the introduction of metrics to evaluate criteria quality (Specificity, Implicitness, Actionability), and the demonstration that mining web-based expert advice allows for the scalable generation of nuanced, actionable, and human-aligned evaluation criteria for LLMs.