EvalLM: Interactive Evaluation of Large Language Model Prompts on User-Defined Criteria

(2309.13633)
Published Sep 24, 2023 in cs.HC, cs.AI, and cs.CL

Abstract

By simply composing prompts, developers can prototype novel generative applications with Large Language Models (LLMs). To refine prototypes into products, however, developers must iteratively revise prompts by evaluating outputs to diagnose weaknesses. Formative interviews (N=8) revealed that developers invest significant effort in manually evaluating outputs as they assess context-specific and subjective criteria. We present EvalLM, an interactive system for iteratively refining prompts by evaluating multiple outputs on user-defined criteria. By describing criteria in natural language, users can employ the system's LLM-based evaluator to get an overview of where prompts excel or fail, and improve these based on the evaluator's feedback. A comparative study (N=12) showed that EvalLM, when compared to manual evaluation, helped participants compose more diverse criteria, examine twice as many outputs, and reach satisfactory prompts with 59% fewer revisions. Beyond prompts, our work can be extended to augment model evaluation and alignment in specific application contexts.
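The abstract describes an LLM-based evaluator that compares outputs from multiple prompts against criteria the user defines in natural language. As a rough illustration of that idea (not the authors' implementation; `build_judge_prompt`, `parse_verdict`, and the JSON response schema are all assumptions for this sketch), one could compose a judging prompt per criterion pair-comparison and parse the judge model's structured verdict:

```python
import json


def build_judge_prompt(task, criteria, output_a, output_b):
    """Compose a judging prompt that asks an LLM to pick the better of two
    outputs on each user-defined criterion. Names and format are illustrative."""
    crit_lines = "\n".join(f"- {c['name']}: {c['description']}" for c in criteria)
    return (
        f"Task: {task}\n\n"
        f"Evaluate the two outputs below on each criterion.\n"
        f"Criteria:\n{crit_lines}\n\n"
        f"Output A:\n{output_a}\n\n"
        f"Output B:\n{output_b}\n\n"
        'Respond as JSON: {"<criterion name>": '
        '{"winner": "A" | "B" | "tie", "explanation": "..."}}'
    )


def parse_verdict(llm_response):
    """Parse the judge's JSON verdict into {criterion name: winner}."""
    data = json.loads(llm_response)
    return {name: v["winner"] for name, v in data.items()}
```

The actual LLM call is omitted; in use, `build_judge_prompt(...)` would be sent to a judge model and its reply fed to `parse_verdict`, giving a per-criterion overview of where each prompt excels or fails, as the abstract describes.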


