Human-Centered Design Recommendations for LLM-as-a-Judge (2407.03479v1)

Published 3 Jul 2024 in cs.HC

Abstract: Traditional reference-based metrics, such as BLEU and ROUGE, are less effective for assessing outputs from LLMs that produce highly creative or superior-quality text, or in situations where reference outputs are unavailable. While human evaluation remains an option, it is costly and difficult to scale. Recent work using LLMs as evaluators (LLM-as-a-judge) is promising, but trust and reliability remain a significant concern. Integrating human input is crucial to ensure that the evaluation criteria are aligned with human intent and that evaluations are robust and consistent. This paper presents a user study of a design exploration called EvaluLLM, which enables users to leverage LLMs as customizable judges, promoting human involvement to balance trust and cost-saving potential with caution. Through interviews with eight domain experts, we identified the need for assistance in developing effective evaluation criteria that align the LLM-as-a-judge with practitioners' preferences and expectations. We offer findings and design recommendations for optimizing human-assisted LLM-as-a-judge systems.
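
The core mechanism the abstract describes, prompting an LLM to judge candidate outputs against user-defined criteria, can be sketched in a few lines. The snippet below is an illustrative approximation, not the paper's EvaluLLM implementation: `call_llm` is a hypothetical stand-in for any chat-completion API, and the prompt wording, position-swap consistency check, and answer parsing are assumptions made for the sketch.

```python
# Minimal sketch of criterion-driven, pairwise LLM-as-a-judge evaluation.
# `call_llm` is a hypothetical placeholder for any text-in/text-out LLM call.
from typing import Callable

JUDGE_TEMPLATE = """You are evaluating two candidate responses to the same task.

Criterion (defined by the user): {criterion}

Task: {task}

Response A:
{response_a}

Response B:
{response_b}

Which response better satisfies the criterion? Answer with exactly "A" or "B"."""


def judge_pair(call_llm: Callable[[str], str], task: str, criterion: str,
               response_a: str, response_b: str) -> str:
    """Return 'A', 'B', or 'inconsistent' after swapping positions to reduce order bias."""
    first = call_llm(JUDGE_TEMPLATE.format(
        criterion=criterion, task=task,
        response_a=response_a, response_b=response_b)).strip().upper()
    # Ask again with the candidates swapped; a stable verdict should flip with them.
    second = call_llm(JUDGE_TEMPLATE.format(
        criterion=criterion, task=task,
        response_a=response_b, response_b=response_a)).strip().upper()
    if first.startswith("A") and second.startswith("B"):
        return "A"
    if first.startswith("B") and second.startswith("A"):
        return "B"
    return "inconsistent"
```

The position swap is one simple (assumed) way to surface the robustness and consistency concerns the paper raises: verdicts that change when the candidates are reordered are flagged rather than trusted.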

Authors (8)
  1. Qian Pan (9 papers)
  2. Zahra Ashktorab (15 papers)
  3. Michael Desmond (10 papers)
  4. Martin Santillan Cooper (6 papers)
  5. James Johnson (8 papers)
  6. Rahul Nair (26 papers)
  7. Elizabeth Daly (16 papers)
  8. Werner Geyer (20 papers)
Citations (5)