EvalLM: Interactive Evaluation of Large Language Model Prompts on User-Defined Criteria (2309.13633v2)
Abstract: By simply composing prompts, developers can prototype novel generative applications with LLMs. To refine prototypes into products, however, developers must iteratively revise prompts by evaluating outputs to diagnose weaknesses. Formative interviews (N=8) revealed that developers invest significant effort in manually evaluating outputs as they assess context-specific and subjective criteria. We present EvalLM, an interactive system for iteratively refining prompts by evaluating multiple outputs on user-defined criteria. By describing criteria in natural language, users can employ the system's LLM-based evaluator to get an overview of where prompts excel or fail, and improve their prompts based on the evaluator's feedback. A comparative study (N=12) showed that EvalLM, compared to manual evaluation, helped participants compose more diverse criteria, examine twice as many outputs, and reach satisfactory prompts with 59% fewer revisions. Beyond prompts, our work can be extended to augment model evaluation and alignment in specific application contexts.
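For illustration only, the sketch below shows the general pattern the abstract describes: an LLM evaluator that compares two prompt outputs against a criterion the user writes in natural language. The function name, prompt wording, model choice, and use of the OpenAI chat API are assumptions, not EvalLM's actual implementation.

```python
# Minimal sketch (not EvalLM's implementation): ask an LLM evaluator which of
# two prompt outputs better satisfies a user-defined, natural-language criterion.
from openai import OpenAI  # assumes the openai>=1.0 client library

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def evaluate_pair(instruction: str, output_a: str, output_b: str,
                  criterion_name: str, criterion_description: str) -> str:
    """Return the evaluator's explanation and verdict for one criterion."""
    prompt = (
        f"Task instruction:\n{instruction}\n\n"
        f"Output A:\n{output_a}\n\n"
        f"Output B:\n{output_b}\n\n"
        f"Criterion '{criterion_name}': {criterion_description}\n\n"
        "Explain which output better satisfies the criterion, then give a "
        "final verdict on its own line: 'A', 'B', or 'Tie'."
    )
    response = client.chat.completions.create(
        model="gpt-4",  # hypothetical choice; any capable chat model works
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic-ish judgments for comparison
    )
    return response.choices[0].message.content


# Example: comparing outputs of two prompt versions on a subjective criterion.
verdict = evaluate_pair(
    instruction="Summarize the article for a lay audience.",
    output_a="...",  # output of prompt version 1
    output_b="...",  # output of prompt version 2
    criterion_name="Approachability",
    criterion_description="Avoids jargon and explains technical terms simply.",
)
print(verdict)
```

Aggregating such per-criterion verdicts across many samples is what would give the overview of where a prompt excels or fails; the aggregation and feedback loop are omitted here.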
Authors: Tae Soo Kim, Yoonjoo Lee, Jamin Shin, Young-Ho Kim, Juho Kim