Split and Merge: Aligning Position Biases in LLM-based Evaluators (2310.01432v3)

Published 29 Sep 2023 in cs.CL and cs.AI

Abstract: LLMs have shown promise as automated evaluators for assessing the quality of answers generated by AI systems. However, these LLM-based evaluators exhibit position bias, or inconsistency, when used to evaluate candidate answers in pairwise comparisons: they favor either the first or the second answer regardless of content. To address this limitation, we propose PORTIA, an alignment-based system designed to mimic human comparison strategies and calibrate position bias in a lightweight yet effective manner. Specifically, PORTIA splits the answers into multiple segments, aligns similar content across candidate answers, and then merges them back into a single prompt for evaluation by LLMs. We conducted extensive experiments with six diverse LLMs to evaluate 11,520 answer pairs. Our results show that PORTIA markedly enhances the consistency rates for all models and comparison forms tested, achieving an average relative improvement of 47.46%. Remarkably, PORTIA enables less advanced GPT models to achieve 88% agreement with the state-of-the-art GPT-4 model at just 10% of the cost. Furthermore, it rectifies around 80% of the position-bias instances within the GPT-4 model, elevating its consistency rate to as high as 98%. Subsequent human evaluations indicate that the PORTIA-enhanced GPT-3.5 model can even surpass standalone GPT-4 in alignment with human evaluators. These findings highlight PORTIA's ability to correct position bias, improve LLM consistency, and boost performance while remaining cost-efficient, a valuable step toward a more reliable and scalable use of LLMs for automated evaluation across diverse applications.
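
To make the split-and-merge idea and the position-bias check concrete, here is a minimal Python sketch. It is not the authors' implementation: the equal-length segment splitting, the prompt wording, and the `judge` callable are simplifying assumptions for illustration; PORTIA additionally aligns segments by content before merging them into the evaluation prompt.

```python
# Minimal sketch of the split-and-merge evaluation prompt and an
# order-swap consistency check (illustrative only, not the paper's code).
from typing import Callable, List


def split_into_segments(answer: str, k: int = 3) -> List[str]:
    """Split an answer into k roughly equal-length segments.
    PORTIA aligns segments by content; fixed-length splitting is a
    simplification used here for illustration."""
    step = max(1, len(answer) // k)
    segments = [answer[i * step:(i + 1) * step] for i in range(k - 1)]
    segments.append(answer[(k - 1) * step:])
    return segments


def merge_prompt(question: str, answer_a: str, answer_b: str, k: int = 3) -> str:
    """Interleave corresponding segments of the two candidate answers so the
    evaluator LLM compares matching parts side by side in one prompt."""
    segs_a = split_into_segments(answer_a, k)
    segs_b = split_into_segments(answer_b, k)
    parts = [f"Question:\n{question}\n"]
    for i, (sa, sb) in enumerate(zip(segs_a, segs_b), start=1):
        parts.append(f"--- Part {i} ---\nAssistant A:\n{sa}\nAssistant B:\n{sb}\n")
    parts.append("Which assistant gave the better overall answer? Reply 'A' or 'B'.")
    return "\n".join(parts)


def is_consistent(judge: Callable[[str], str], question: str,
                  answer_a: str, answer_b: str) -> bool:
    """Position-bias check: the verdict should not follow the slot when the
    candidate answers are presented in the opposite order."""
    v1 = judge(merge_prompt(question, answer_a, answer_b))
    v2 = judge(merge_prompt(question, answer_b, answer_a))
    # A consistent judge prefers the same underlying answer in both orderings.
    return (v1, v2) in {("A", "B"), ("B", "A")}
```

A judge whose verdict flips with the presentation order exhibits exactly the position bias PORTIA targets; presenting aligned segments side by side is what raises the consistency rates reported above.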

Authors (7)
  1. Zongjie Li (29 papers)
  2. Chaozheng Wang (28 papers)
  3. Pingchuan Ma (90 papers)
  4. Daoyuan Wu (39 papers)
  5. Shuai Wang (466 papers)
  6. Cuiyun Gao (97 papers)
  7. Yang Liu (2253 papers)
Citations (30)