
EvalGen: Mixed-Initiative LLM Evaluation

Updated 3 November 2025
  • EvalGen is a mixed-initiative system that blends human judgment with LLM assistance to iteratively refine evaluation criteria in open-ended tasks.
  • It employs a guided workflow in which users grade outputs and refine both natural-language criteria and code- or LLM-based assertions, informed by alignment metrics such as coverage and false failure rate (FFR).
  • The system enhances trust and reliability by transparently reporting alignment metrics and accommodating evolving, context-dependent criteria through continual feedback.

EvalGen is a mixed-initiative system and interface for the human-centered, iterative alignment of automated evaluation criteria and assertion implementations for LLM output pipelines. Designed as an extension to the ChainForge tool suite, EvalGen assists practitioners in prototyping, selecting, and validating both natural-language evaluation criteria and executable assertions (code-based checks or LLM-based grading prompts) across diverse, open-ended task domains. The system is motivated by the challenge of reliable, scalable assessment of generative model outputs, where human judgment is decisive but direct human grading is costly and difficult to scale. EvalGen's key innovation is to foreground the subjectivity, drift, and context-dependence of evaluation criteria, offering not only automated assistance but also a transparent, feedback-driven calibration process for assertion selection.

1. Motivation and Context

EvalGen was introduced to address the limitations inherent in LLM-assisted evaluation of LLM outputs, especially in domains where existing automated grading (whether by LLMs themselves or by fixed assertion code) fails to reflect user intent or human preferences. Automated validators, when uncritically adopted, inherit the subjectivity, prompt-sensitivity, and misalignment issues of the underlying models. The need for validated, user-aligned evaluators is acute in open-ended tasks, creative domains, and custom application pipelines, where the operational semantics of correctness cannot be defined in advance, or may mutate as users encounter more outputs. EvalGen operationalizes a mixed-initiative paradigm: users and LLMs co-create and iteratively refine evaluation flows, leveraging user grading for calibration and diagnostic feedback.

2. Architecture and Mixed-Initiative Workflow

The EvalGen workflow is structured as a set of interactive, wizard-driven panels with asynchronous back-end assertion generation. Core steps include:

  1. Initiation: Users activate EvalGen from a prompt node in ChainForge, choosing whether to infer criteria using an LLM, manually enter criteria, or begin by grading outputs.
  2. Criteria Specification and Refinement: LLMs suggest candidate criteria mapped to the user's task. Users may edit, delete, or supplement these criteria, selecting either code-based or LLM-based graders for each.
  3. Human Grading of Outputs: Users grade a sampled subset of generated outputs as “good” or “bad.” Grading informs both the definition of criteria and the calibration of assertion implementations.
  4. Assertion Generation and Selection: LLMs synthesize multiple candidate implementations for each criterion. Each is tested on the graded outputs and ranked by how well its failure/success predictions align with user grades.
  5. Alignment Feedback: Confusion matrices, coverage statistics, and false failure rates for each implementation are surfaced to users. The selected assertions are those maximizing alignment under constraints.
  6. Reporting and Iteration: Users inspect ungraded outputs with assigned labels, review report cards with aggregate alignment metrics, and may iterate further by adjusting criteria or grading more outputs.

All steps are designed to be non-linear and allow users to “drift” criteria as new cases or edge conditions emerge.
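
The fan-out from criteria to candidate implementations (steps 2 and 4 above) can be pictured as a small data model. The following is a minimal sketch, not EvalGen's actual ChainForge integration; `llm_complete` is a hypothetical stand-in for whatever completion API is in use:

```python
from dataclasses import dataclass, field

def llm_complete(prompt: str) -> str:
    """Hypothetical stand-in for an LLM API call; replace with a real client."""
    return f"<candidate generated from prompt: {prompt[:60]}...>"

@dataclass
class Criterion:
    """One natural-language evaluation criterion, e.g. 'the reply is concise'."""
    description: str
    grader_type: str                                # "code" or "llm"
    candidates: list[str] = field(default_factory=list)

def generate_candidates(criterion: Criterion, n: int = 3) -> None:
    """Fan one criterion out into n candidate implementations (step 4)."""
    for _ in range(n):
        if criterion.grader_type == "code":
            # Code-based grader: ask for an executable check function.
            prompt = ("Write a Python function check(output: str) -> bool that "
                      f"returns True iff the output satisfies: {criterion.description}")
        else:
            # LLM-based grader: ask for a yes/no judge prompt instead.
            prompt = ("Write a yes/no grading prompt for an LLM judge that "
                      f"checks: {criterion.description}")
        criterion.candidates.append(llm_complete(prompt))
```

Each candidate is then executed against the user-graded subset and retained or pruned according to the alignment metrics defined in the next section.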

3. Technical Details and Alignment Algorithms

EvalGen formalizes assertion selection with alignment metrics computed over user-graded outputs: coverage, false failure rate (FFR), and an F1-style harmonic mean of the two. For a set of assertions $F$, binary user labels $y_i$ (1 = “good,” 0 = “bad”), and outputs $e_i$:

  • Coverage (True Negative Rate):

$$\text{Coverage}(F) = \frac{\sum_{i} \mathbb{I}\left[y_i = 0 \land \left(\exists f \in F : f(e_i) = 0\right)\right]}{\sum_{i} \mathbb{I}\left[y_i = 0\right]}$$

  • False Failure Rate (FFR):

$$\text{FFR}(F) = \frac{\sum_{i} \mathbb{I}\left[y_i = 1 \land \left(\exists f \in F : f(e_i) = 0\right)\right]}{\sum_{i} \mathbb{I}\left[y_i = 1\right]}$$

  • Alignment Score (harmonic mean):

$$\text{Alignment}(F) = 2 \cdot \frac{\text{Coverage}(F) \cdot \left(1 - \text{FFR}(F)\right)}{\text{Coverage}(F) + \left(1 - \text{FFR}(F)\right)}$$

The per-criterion selection process maximizes alignment subject to minimum coverage and FFR constraints. Assertion generation leverages LLMs for synthesizing both code snippets and grader prompts, which are dynamically scored and pruned based on runtime/semantic errors and discriminatory power over the graded subset.
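
Under these definitions, the metrics reduce to a few lines over binary grades and assertion outputs (1 = pass/good, 0 = fail/bad). The following is a minimal sketch of the scoring and constrained selection, not the paper's exact implementation; the threshold values `min_cov`, `max_ffr`, and the subset-size cap `k` are illustrative assumptions:

```python
from itertools import combinations

def fails(assertions, example) -> bool:
    """True if any assertion in the set flags the example as failing (returns 0)."""
    return any(f(example) == 0 for f in assertions)

def coverage(assertions, examples, labels) -> float:
    """Fraction of human-graded 'bad' outputs (y=0) caught by the assertion set."""
    bad = [e for e, y in zip(examples, labels) if y == 0]
    return sum(fails(assertions, e) for e in bad) / len(bad) if bad else 0.0

def ffr(assertions, examples, labels) -> float:
    """False failure rate: fraction of 'good' outputs (y=1) wrongly flagged."""
    good = [e for e, y in zip(examples, labels) if y == 1]
    return sum(fails(assertions, e) for e in good) / len(good) if good else 0.0

def alignment(assertions, examples, labels) -> float:
    """Harmonic mean of coverage and (1 - FFR)."""
    c = coverage(assertions, examples, labels)
    g = 1.0 - ffr(assertions, examples, labels)
    return 2 * c * g / (c + g) if (c + g) > 0 else 0.0

def select(candidates, examples, labels, min_cov=0.5, max_ffr=0.25, k=3):
    """Pick the assertion set maximizing alignment subject to the constraints.

    Brute-force over subsets up to size k, which is feasible for the small
    per-criterion candidate pools involved; returns (None, -1.0) if no
    subset satisfies the constraints.
    """
    best, best_score = None, -1.0
    for r in range(1, k + 1):
        for subset in combinations(candidates, r):
            if coverage(subset, examples, labels) < min_cov:
                continue
            if ffr(subset, examples, labels) > max_ffr:
                continue
            score = alignment(subset, examples, labels)
            if score > best_score:
                best, best_score = subset, score
    return best, best_score

# Toy usage with two hand-written assertions over string outputs:
a1 = lambda e: int(len(e) <= 80)          # passes iff the output is short
a2 = lambda e: int("sorry" not in e)      # passes iff no apology boilerplate
outs = ["ok", "sorry, I cannot help with that", "x" * 200]
ys = [1, 0, 0]
best, score = select([a1, a2], outs, ys)  # the pair covers both bad outputs
```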

Sampling policies for grading (random, highest/lowest confidence, alternating) are used to maximize informativeness and representation of true alignment. Alternating sampling is shown to yield better assertion calibration relative to purely random policies.
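
The alternating policy can be approximated as below, assuming each output carries a confidence score (for instance, the fraction of candidate assertions it passes); this sketch follows the paper's description only loosely:

```python
def alternating_sample(outputs, confidence, n):
    """Alternate between the most- and least-confident ungraded outputs.

    `confidence` maps an output to a score in [0, 1]. Surfacing both extremes
    shows the user likely-good and likely-bad cases, which calibrates
    assertions better than uniform random sampling.
    """
    ranked = sorted(outputs, key=confidence)  # ascending confidence
    lo, hi, picked = 0, len(ranked) - 1, []
    take_high = True
    while len(picked) < min(n, len(ranked)) and lo <= hi:
        if take_high:
            picked.append(ranked[hi]); hi -= 1
        else:
            picked.append(ranked[lo]); lo += 1
        take_high = not take_high
    return picked
```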

4. Criteria Drift and Its Procedural Implications

A distinctive phenomenon observed is criteria drift—the iterative, sometimes circular evolution of evaluation criteria as users grade outputs. Users often cannot specify comprehensive, objective criteria a priori; their own standards shift as they encounter new edge cases and see actual model behaviors. Grading cases both requires and revises operational criteria, sometimes even leading to contradictory or evolving labeling within the same task session. Criteria depend on the output distribution: as users encounter model peculiarities or failures, their conceptualization of “good” vs. “bad” adapts. This challenges any fixed, up-front evaluation and motivates EvalGen’s flexible, continual-validation approach.

5. Qualitative Study and Comparative Evaluation

EvalGen’s design and efficacy were assessed through a qualitative study with industry practitioners (n=9), which revealed several salient patterns:

  • Users valued assisted suggestion for both criteria and assertions, especially for overcoming “writer’s block.”
  • Grading typically occurred both before and after criteria selection—users iterated between eyeballing outputs, refining criteria, and inspecting assertion alignment report cards.
  • Code-based assertions were more trusted and more likely to be edited by users; LLM-based graders, while essential for subjective tasks, required calibration before users trusted them, owing to their opacity.
  • Iteration and drift were ubiquitous: no user’s initial criteria matched their final operational labels.
  • Alignment reporting (visualization, confusion matrices, result tables) was critical for earning user trust.

Baseline comparisons against SPADE, a fully automated assertion-selection tool, showed that EvalGen produced smaller assertion sets with equal or better alignment, as measured by coverage and FFR.

6. Broader Implications for LLM Evaluation Tools

EvalGen’s mixed-initiative methodology highlights several essential principles for next-generation evaluation assistants:

  • Iterative Human-Led Calibration: Evaluation must accommodate drifting criteria and subjective revision; tools should facilitate seamless alternation between grading and criteria specification.
  • Transparency and Trust: Surfacing detailed alignment metrics, disagreement, and assertion performance is necessary for adoption, especially in production or collaborative contexts.
  • Support for Heterogeneous Assertion Types: Both code and LLM-based assertions must be accommodated and independently aligned/calibrated.
  • Alignment over Minimization: Focusing on maximizing criterion-alignment with operator labels, rather than just assertion minimization, is crucial for practical utility.
  • Awareness of Subjectivity: Accepting that “ground truth” in evaluation is often contextual and social, not fixed, is vital for tool and workflow design.

Summary Table: EvalGen System Features

| Feature | Description | Significance |
|---|---|---|
| Mixed-initiative workflow | Human+LLM co-design of criteria, grading, and assertion selection | Addresses the subjective, iterative nature of evaluation |
| Assertion calibration | Alignment formulas (coverage, FFR, harmonic mean) | Formalizes criteria-implementation agreement |
| Criteria drift handling | Interactive revision; output-driven criteria mutation | Supports context-sensitive task domains |
| Transparency/reporting | Confusion matrices, report cards, interactive result tables | Critical for trust and operational deployment |
| Baseline comparison | Smaller, better-aligned assertion sets than SPADE (fully automated) | Demonstrates efficacy of the human-in-the-loop paradigm |

Conclusion

EvalGen embodies a robust, transparent, and iterative framework for LLM output evaluation, centering human judgment while leveraging automated and LLM-generated assertion techniques. Its alignment-focused, mixed-initiative pipeline accommodates the subjectivity and criteria drift inevitable in practical applications, enabling both rigorous offline calibration and scalable deployment of user-aligned validators in diverse generative AI pipelines (Shankar et al., 18 Apr 2024).

References

  1. Shankar, S., Zamfirescu-Pereira, J. D., Hartmann, B., Parameswaran, A. G., and Arawjo, I. “Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences.” arXiv:2404.12272, 18 Apr 2024.