Time To Impeach LLM-as-a-Judge: Programs are the Future of Evaluation (2506.10403v1)

Published 12 Jun 2025 in cs.LG and cs.AI

Abstract: LLMs are widely used to evaluate the quality of LLM generations and responses, but this leads to significant challenges: high API costs, uncertain reliability, inflexible pipelines, and inherent biases. To address these, we introduce PAJAMA (Program-As-a-Judge for Automated Model Assessment), a new alternative that uses LLMs to synthesize executable judging programs instead of directly scoring responses. These synthesized programs can be stored and run locally, costing orders of magnitude less while providing interpretable and auditable judging logic that can be easily adapted. Program-based judges mitigate biases, improving judgment consistency by 15.83% and reducing biased responses by 23.7% on average compared to a Qwen2.5-14B-based LLM-as-a-judge. When program judgments are distilled into a model, PAJAMA outperforms LLM-as-a-judge on the challenging CHAT-HARD subset of RewardBench, outperforming metrics by 2.19% on Prometheus and 8.67% on the JudgeLM dataset, all at three orders of magnitude lower cost.

Summary

  • The paper demonstrates that synthesizing evaluation programs with weak supervision reduces costs and biases while matching LLM-as-a-judge performance.
  • It presents PAJAMA, where executable Python programs use clear criteria like semantic similarity and lexical diversity to evaluate LLM outputs.
  • Experimental results show a 15.83% increase in evaluation consistency and a 23.7% decrease in biased responses compared to traditional evaluation methods.

Evaluation of LLMs Through Programmatic Judges

The paper "Time To Impeach LLM-as-a-Judge: Programs are the Future of Evaluation" presents an innovative approach for evaluating LLMs that deviates from traditional LLM-as-a-judge methods. The authors propose PAJAMA, an alternative evaluation system leveraging synthesized programs to overcome the limitations of directly using LLMs as judges. This approach introduces program-generated judging logic to offer more consistent and cost-effective evaluations while addressing inherent biases associated with LLM-as-a-judge systems.

LLM-as-a-Judge Limitations and PAJAMA's Proposition

LLM-as-a-judge systems rely on state-of-the-art LLMs to score generative models' outputs, which poses several challenges: prohibitive API cost, unreliable judgments, inflexible evaluation pipelines, and biases inherited from the models' training data. To address these challenges, PAJAMA uses LLMs to synthesize judging programs once, rather than querying an LLM for every judgment. These synthesized programs are executable, interpretable pieces of code that can be stored and reused locally, drastically reducing API costs. Furthermore, the programs offer transparency by encoding explicit judging criteria that can be audited and refined to reduce bias.

Experiments and Findings

The experiments show compelling results in favor of PAJAMA's methodology. By applying the synthesized judging programs and aggregating their outputs with a weak supervision framework, PAJAMA achieves notable improvements in bias reduction, cost efficiency, and consistency relative to LLM-as-a-judge systems.

  1. Evaluation Efficacy: PAJAMA demonstrates significant reductions in cost while achieving competitive performance with respect to preference label accuracy on datasets such as Prometheus, JudgeLM, and PandaLM. Moreover, PAJAMA’s reward models outperform LLM-as-a-judge models in challenging evaluation subsets like CHAT-HARD.
  2. Bias Mitigation: PAJAMA improves the consistency of evaluations and reduces biased response selection across various bias types. Compared to LLM-as-a-judge models such as Qwen2.5-14B, PAJAMA increases consistency by 15.83% and reduces the biased-answer win rate by 23.7% (a sketch of one way such consistency can be measured follows this list).
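The summary does not spell out how consistency is computed; a common formulation, sketched below, swaps the order of the two candidate responses and counts how often the judge's underlying preference stays the same. The judge_fn signature here is a hypothetical placeholder for any pairwise judge, program-based or LLM-based, and may differ from the paper's exact metric.

```python
# Standalone sketch of an order-swap consistency check (an assumption, not the
# paper's exact metric). judge_fn(prompt, first, second) is assumed to return
# 0 if the first response is preferred and 1 if the second is preferred.
from typing import Callable, Sequence, Tuple


def consistency_rate(
    judge_fn: Callable[[str, str, str], int],
    examples: Sequence[Tuple[str, str, str]],
) -> float:
    """Fraction of examples judged the same way under both response orders."""
    consistent = 0
    for prompt, resp_a, resp_b in examples:
        first = judge_fn(prompt, resp_a, resp_b)   # 0 = prefer A, 1 = prefer B
        second = judge_fn(prompt, resp_b, resp_a)  # same pair, order swapped
        consistent += int(first == 1 - second)     # same underlying preference
    return consistent / len(examples) if examples else 0.0
```

A judge free of position bias keeps the same underlying preference when the responses are swapped, so a higher rate indicates more consistent judging.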

Technical Components

Central to PAJAMA are synthesized Python programs containing specific evaluation criteria. Through criteria such as semantic similarity, lexical diversity, readability metrics, and bias detection, the authors create diverse judging logic capable of efficient, scalable assessment of LLM outputs. Each synthesized program functions independently and can be audited for biases or inaccuracies—an advancement in LLM-evaluation transparency.
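The paper's synthesized programs are not reproduced in this summary, but a minimal sketch of what one might look like is shown below. The judge interface (a prompt, two candidate responses, a reference answer) and the specific criterion implementations (token-overlap as a proxy for semantic similarity, type-token ratio for lexical diversity, average sentence length for readability) are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of a PAJAMA-style judging program. The interface and the
# criterion implementations below are assumptions for illustration only.
import re


def _tokens(text: str) -> list[str]:
    return re.findall(r"[a-z0-9']+", text.lower())


def _similarity(response: str, reference: str) -> float:
    """Jaccard token overlap as a cheap stand-in for semantic similarity."""
    a, b = set(_tokens(response)), set(_tokens(reference))
    return len(a & b) / len(a | b) if a | b else 0.0


def _lexical_diversity(response: str) -> float:
    """Type-token ratio: distinct tokens over total tokens."""
    toks = _tokens(response)
    return len(set(toks)) / len(toks) if toks else 0.0


def _readability(response: str) -> float:
    """Crude readability proxy: shorter average sentence length scores higher."""
    sentences = [s for s in re.split(r"[.!?]+", response) if s.strip()]
    if not sentences:
        return 0.0
    avg_len = sum(len(_tokens(s)) for s in sentences) / len(sentences)
    return 1.0 / (1.0 + avg_len / 20.0)


def judge(prompt: str, response_a: str, response_b: str, reference: str) -> int:
    """Return 0 if response_a is preferred, 1 if response_b is preferred."""
    def score(r: str) -> float:
        return (0.5 * _similarity(r, reference)
                + 0.3 * _lexical_diversity(r)
                + 0.2 * _readability(r))
    return 0 if score(response_a) >= score(response_b) else 1
```

Because the criteria and weights are explicit code, a reviewer can inspect exactly why one response was preferred and adjust the logic without re-prompting an LLM.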

The programs are further combined using weak supervision techniques, where outputs from multiple synthesized judges are aggregated to form a robust collective decision signal. This ensemble approach amplifies the efficacy of the evaluation system by integrating complementary judging signals beyond simple majority voting.
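As a rough illustration of aggregation beyond simple majority voting, the sketch below weights each program judge by its agreement with the per-example consensus before combining votes. This is a simplified stand-in for the weak-supervision label model described in the paper; the function names and the weighting scheme are assumptions, not the authors' implementation.

```python
# Minimal sketch of aggregating multiple program judges (assumed scheme, not
# the paper's label model). Votes are 0 (prefer response A) or 1 (prefer B).
from collections import Counter


def majority_vote(votes: list[int]) -> int:
    """Baseline: the most common preference (ties broken by insertion order)."""
    return Counter(votes).most_common(1)[0][0]


def estimate_weights(vote_matrix: list[list[int]]) -> list[float]:
    """Weight each judge by how often it agrees with the per-example consensus."""
    n_judges = len(vote_matrix[0])
    agreements = [0] * n_judges
    for votes in vote_matrix:
        consensus = majority_vote(votes)
        for j, v in enumerate(votes):
            agreements[j] += int(v == consensus)
    return [a / len(vote_matrix) for a in agreements]


def weighted_vote(votes: list[int], weights: list[float]) -> int:
    """Aggregate one example's votes using the estimated judge weights."""
    score = sum(w if v == 1 else -w for v, w in zip(votes, weights))
    return 1 if score > 0 else 0


# Example: 3 program judges voting on 4 pairwise comparisons.
vote_matrix = [[1, 1, 0], [0, 0, 0], [1, 0, 1], [1, 1, 1]]
weights = estimate_weights(vote_matrix)
labels = [weighted_vote(v, weights) for v in vote_matrix]
```

The key idea is that less reliable judges get down-weighted automatically, so the aggregate label can be more accurate than a plain majority over noisy program judges.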

Practical and Theoretical Implications

PAJAMA provides a promising alternative to costly LLM-as-a-judge paradigms, addressing not only financial but also reliability and fairness concerns inherent in traditional LLM evaluation systems. By synthesizing executable programs as judges, practitioners gain flexibility and transparency in LLM assessments, ensuring better alignment with domain-specific criteria. This allows more precise tweaking and improvement over time, potentially setting a new standard for evaluation practices.

Future Directions

The success of PAJAMA marks potential future directions in automated LLM evaluation, notably in effective program synthesis methodologies, richer criteria formulation, and enhanced aggregation frameworks. Developing more sophisticated program generation methods could further improve precision and encourage broader adoption across domains requiring nuanced judgments.

This paper lays foundational work for widening the scope of synthetic judging programs within intelligent systems evaluation, suggesting a pathway toward more scalable, interpretable, and bias-resilient automated assessments in AI models.
