MAJ-EVAL: Multi-Agent Judge Evaluation
- MAJ-EVAL is a multi-dimensional evaluation framework that aligns AI outputs with human judgments by orchestrating persona-driven LLM debates.
- It systematically extracts evaluator personas from domain texts, enabling scoring across dimensions like fluency, factual consistency, and appropriateness.
- Its structured process—comprising independent scoring, in-group debates, and aggregation—yields higher correlation with human ratings compared to traditional methods.
The MAJ-EVAL framework is a Multi-Agent-as-Judge evaluation paradigm designed to align automated scoring of outputs from LLM-based and other AI systems with human multi-dimensional evaluation practices. It specifically targets scenarios requiring evaluation across diverse stakeholder perspectives, with a protocol that builds multiple detailed evaluator personas from relevant domain documents, instantiates LLM agents with these personas, and orchestrates in-group multi-agent debates culminating in aggregated, multi-dimensional assessment scores (Chen et al., 28 Jul 2025).
1. Formal Problem Setting and Motivation
MAJ-EVAL is motivated by the observation that real-world evaluation tasks often involve collaboration and subjective judgment across multiple facets or “dimensions” of quality (e.g., fluency, factual consistency, domain appropriateness), which historically require input from diverse human stakeholders. Let $\mathcal{X}$ denote the set of candidate outputs to be evaluated (e.g., answer pairs, summaries), and $\mathcal{D} = \{d_1, \dots, d_m\}$ the set of evaluation dimensions. The “gold standard” human evaluation is defined as a mapping

$$h: \mathcal{X} \rightarrow \mathbb{R}^m$$

that assigns a vector of scores $h(x) = (h_1(x), \dots, h_m(x))$ per output. MAJ-EVAL seeks to approximate $h$ by leveraging persona-driven LLM agents whose orchestrated debate yields a final score $\hat{h}(x) \in \mathbb{R}^m$, with the goal of improving alignment with human multi-dimensional judgment beyond what pre-existing metrics or agentic protocols accomplish (Chen et al., 28 Jul 2025).
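To make the formal setting concrete, a minimal sketch (with hypothetical names throughout; none are taken from the paper's implementation) might type the candidates, dimensions, and the two score mappings as follows:

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical names throughout; this only mirrors the formal objects above.

@dataclass
class Candidate:
    text: str                                   # e.g., a generated QA pair or summary

DIMENSIONS: List[str] = ["fluency", "factual_consistency", "domain_appropriateness"]
ScoreVector = Dict[str, float]                  # one score per evaluation dimension

def human_eval(x: Candidate) -> ScoreVector:
    """Gold-standard mapping h: one human-assigned score per dimension."""
    raise NotImplementedError

def maj_eval(x: Candidate) -> ScoreVector:
    """MAJ-EVAL's approximation of h via persona-driven multi-agent debate."""
    raise NotImplementedError
```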
2. Persona Extraction and Construction
The core innovation in MAJ-EVAL is its approach to systematically extracting and constructing persona agents. The process begins by ingesting domain-relevant text corpora (such as research papers or qualitative studies). For each document, a prompt-driven extraction via the LLM yields tuples

$$(s_i, r_i, E_i),$$

where $s_i$ is a stakeholder name (e.g., "teachers"), $r_i$ a description, and $E_i = \{(d_j, e_j)\}$ pairs each evaluative dimension $d_j$ with concrete supporting evidence $e_j$ from the domain source.

Stakeholder names are clustered by semantic similarity to form groups $G_1, \dots, G_K$, after which dimension-evidence sets are merged within each group. Each dimension-evidence pair is expanded into a persona via LLM prompting, ensuring inclusion of demographic, specialty, psychological, and relational attributes. The resulting set of personas covers the key stakeholder perspectives required for robust, multi-dimensional evaluation.
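As an illustration of this pipeline (not the paper's implementation), the sketch below assumes a generic `call_llm` prompting function and an `embed` sentence-embedding function, and greedily clusters stakeholder names by cosine similarity before merging their dimension-evidence sets:

```python
from typing import Dict, List, Tuple
import numpy as np

StakeholderTuple = Tuple[str, str, Dict[str, str]]   # (name, description, {dimension: evidence})

def extract_tuples(document: str, call_llm) -> List[StakeholderTuple]:
    """Prompt an LLM to list stakeholders, descriptions, and dimension-evidence pairs."""
    prompt = (
        "From the following domain document, list each stakeholder as "
        "(name, description, {evaluation dimension: supporting evidence}):\n\n"
        + document
    )
    return call_llm(prompt)                       # assumed to return parsed tuples

def cluster_stakeholders(tuples: List[StakeholderTuple], embed,
                         threshold: float = 0.8) -> List[dict]:
    """Greedily group stakeholder names whose embeddings are cosine-similar."""
    groups: List[dict] = []
    for name, description, evidence in tuples:
        vec = np.asarray(embed(name), dtype=float)
        for group in groups:
            cos = float(vec @ group["centroid"]) / (
                np.linalg.norm(vec) * np.linalg.norm(group["centroid"]))
            if cos >= threshold:                  # merge into an existing group
                group["members"].append((name, description))
                group["evidence"].update(evidence)   # merge dimension-evidence sets
                break
        else:                                     # no similar group found: start a new one
            groups.append({"centroid": vec,
                           "members": [(name, description)],
                           "evidence": dict(evidence)})
    return groups
```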
3. Agent Instantiation and Debate Protocol
Each constructed persona is transformed into a distinct agent by embedding the persona’s attributes into a system-level LLM prompt. The protocol for evaluation unfolds in three phases:
- Phase 1 (Independent Evaluation): Each agent $a_k$ produces initial scores $s_k^{(0)}(x, d)$ for all dimensions $d \in \mathcal{D}$ on the candidate $x$.
- Phase 2 (Free-form In-Group Debate): Agents in the same stakeholder group engage in a coordinated debate (up to $T$ rounds), exchanging rationales and critiques. The debate continues until all agents in the group emit a termination token (“NO MORE COMMENTS”).
- Phase 3 (Aggregation): Final group feedback and quantitative ratings are computed by aggregating post-debate agent scores.
This debate mechanism is structured to enable deliberation and rational conflict resolution within groups, refining initial judgments via multi-turn communication.
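A minimal sketch of this three-phase flow, assuming each persona agent exposes hypothetical `score`, `comment`, and `rescore` methods that wrap persona-conditioned LLM calls, is:

```python
from statistics import mean
from typing import Dict, List

TERMINATION_TOKEN = "NO MORE COMMENTS"   # termination signal from the protocol above

def run_group_debate(agents: list, candidate: str, dimensions: List[str],
                     max_rounds: int = 3) -> Dict[str, float]:
    # Phase 1: each persona agent scores the candidate independently; these
    # initial ratings seed each agent's stance in the debate.
    scores = [agent.score(candidate, dimensions) for agent in agents]

    # Phase 2: free-form in-group debate, bounded by max_rounds, ending early
    # once every agent in the group emits the termination token.
    transcript: List[str] = []
    for _ in range(max_rounds):
        utterances = [agent.comment(candidate, transcript) for agent in agents]
        transcript.extend(utterances)
        if all(TERMINATION_TOKEN in u for u in utterances):
            break

    # Phase 3: post-debate rescoring, aggregated within the group by unweighted mean.
    scores = [agent.rescore(candidate, transcript, dimensions) for agent in agents]
    return {d: mean(s[d] for s in scores) for d in dimensions}
```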
4. Scoring, Aggregation, and Output
MAJ-EVAL’s outputs are quantitative ratings per dimension for each candidate output. Individual agent judgments are first aggregated within stakeholder groups:

$$\bar{s}_g(x, d) = \frac{1}{|G_g|} \sum_{a_k \in G_g} s_k(x, d),$$

where $s_k(x, d)$ is agent $a_k$'s post-debate score and $|G_g|$ is the group size. Cross-group aggregation yields the final score per dimension:

$$\hat{h}_d(x) = \frac{1}{K} \sum_{g=1}^{K} \bar{s}_g(x, d),$$

with unweighted means across groups as the primary aggregation method. The process enforces structured, multi-dimensional, and stakeholder-aware scoring, facilitating granular analysis of system outputs along axes mirroring real human evaluation.
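The two-stage aggregation maps directly onto a few lines of code; the sketch below is a simplified illustration that assumes post-debate agent scores have already been collected per stakeholder group (names are illustrative):

```python
from statistics import mean
from typing import Dict, List

def aggregate(group_scores: List[List[Dict[str, float]]],
              dimensions: List[str]) -> Dict[str, float]:
    # group_scores[g][k] is agent k's post-debate score vector in group g.
    group_means = [
        {d: mean(agent[d] for agent in group) for d in dimensions}
        for group in group_scores
    ]
    # Unweighted mean across groups, per dimension.
    return {d: mean(g[d] for g in group_means) for d in dimensions}
```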
5. Experimental Evaluation and Quantitative Results
MAJ-EVAL was empirically validated on StorySparkQA (children’s QA generation; evaluation dimensions: Grammar, Answer Relevancy, Contextual Consistency, Educational Appropriateness) and MSLR-Cochrane (medical multi-document summarization; dimensions: Fluency, PIO Consistency, Effect Direction, Evidence Strength). Gold-standard human ratings on these axes were available for both benchmarks.
Baselines included conventional automated metrics (ROUGE-L, BERTScore), single-LLM protocols (G-Eval with GPT-4, Claude-3.7, Qwen-3-235B), and a multi-agent baseline (ChatEval). Alignment with human scores was measured via Spearman’s ρ, Kendall’s τ, and Pearson’s r; representative Spearman correlations with human ratings on the StorySparkQA dimensions are shown below.
| Method | Grammar (GC) | Answer Relevancy (AR) | Contextual Consistency (CC) | Educational Appropriateness (EA) |
|---|---|---|---|---|
| ROUGE-L | .32 | .12 | –.08 | .08 |
| BERTScore | –.04 | .02 | –.07 | –.01 |
| G-Eval (Claude) | .20 | .39 | .19 | .20 |
| ChatEval (Claude) | .24 | .31 | .34 | .16 |
| MAJ-Eval (Claude) | .33 | .45 | .33 | .40 |
| MAJ-Eval (Qwen) | .27 | .43 | .27 | .33 |
MAJ-EVAL consistently yielded stronger Spearman correlation with human raters, particularly on domain-dependent dimensions such as Educational Appropriateness and Effect Direction. Ablation studies established that the in-group debate and detailed persona construction contributed improvements of 5–10 points in ρ.
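For context on how such alignment figures are computed, the correlation measures above can be obtained with standard statistics tooling; the sketch below uses dummy score vectors, not the paper’s data:

```python
import numpy as np
from scipy.stats import spearmanr, kendalltau, pearsonr

# Dummy placeholders: human ratings and automated scores for one dimension.
human  = np.array([3.0, 4.5, 2.0, 5.0, 3.5])
system = np.array([2.8, 4.0, 2.5, 4.8, 3.0])

rho, _ = spearmanr(human, system)
tau, _ = kendalltau(human, system)
r, _   = pearsonr(human, system)
print(f"Spearman rho={rho:.2f}  Kendall tau={tau:.2f}  Pearson r={r:.2f}")
```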
6. Applicability, Generalizability, and Limitations
MAJ-EVAL is applicable to any natural language generation (NLG) evaluation context where stakeholder perspectives can be extracted from domain documents. Supported applications include QA systems for specific demographics, summarization for expert and lay audiences, and dialogue evaluation with user and operator concerns.
The framework is contingent on the availability and representativeness of high-quality, up-to-date qualitative studies, as these underpin persona fidelity. Computational cost per evaluation (0.4 USD/task) exceeds that of single-LLM methods but remains practical for offline use. A plausible implication is that MAJ-EVAL may be less suitable for high-throughput, real-time system evaluation.
7. Extensions and Future Directions
Potential improvements to MAJ-EVAL include integrating human-collected rationales to fine-tune evaluator agents with reinforcement learning, exploring adaptive weighting in cross-group aggregation, and extending compatibility to more compact LLMs for efficiency. These directions target improving both alignment with nuanced human judgment and cost-effectiveness in large-scale evaluation settings. The framework’s generalizability suggests long-term potential as an infrastructure for robust, stakeholder-centric agent evaluation across AI domains (Chen et al., 28 Jul 2025).