Evaluating Open-QA Evaluation (2305.12421v4)

Published 21 May 2023 in cs.CL and cs.AI

Abstract: This study focuses on the evaluation of the Open Question Answering (Open-QA) task, which can directly estimate the factuality of LLMs. Current automatic evaluation methods have shown limitations, indicating that human evaluation still remains the most reliable approach. We introduce a new task, Evaluating QA Evaluation (QA-Eval), and the corresponding dataset EVOUNA, designed to assess the accuracy of AI-generated answers in relation to standard answers within Open-QA. Our evaluation of these methods utilizes human-annotated results to measure their performance. Specifically, the work investigates methods that show high correlation with human evaluations, deeming them more reliable. We also discuss the pitfalls of current methods and ways to improve LLM-based evaluators. We believe this new QA-Eval task and the corresponding EVOUNA dataset will facilitate the development of more effective automatic evaluation tools and prove valuable for future research in this area. All resources are available at https://github.com/wangcunxiang/QA-Eval under the Apache-2.0 License.
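For intuition, the sketch below illustrates the QA-Eval setting described in the abstract: an automatic evaluator (here, a naive lexical-match rule standing in for an LLM-based judge) decides whether each model answer matches the gold answer, and its verdicts are then scored for agreement with human labels. This is a minimal illustration only; the field names and toy data are assumptions, not the EVOUNA format or the paper's evaluation script.

```python
# Sketch of the QA-Eval idea: score an automatic evaluator against human judgments.
# The dictionary keys and toy examples below are illustrative assumptions.

def exact_match(prediction: str, gold: str) -> bool:
    """Naive automatic evaluator: does the model answer contain the gold answer?"""
    return gold.strip().lower() in prediction.strip().lower()

def evaluator_human_agreement(examples: list[dict]) -> float:
    """Fraction of examples where the automatic evaluator agrees with the human label."""
    agree = sum(
        exact_match(ex["model_answer"], ex["gold_answer"]) == ex["human_label"]
        for ex in examples
    )
    return agree / len(examples)

if __name__ == "__main__":
    toy_examples = [
        {"gold_answer": "Paris",
         "model_answer": "The capital of France is Paris.",
         "human_label": True},
        {"gold_answer": "1969",
         "model_answer": "Apollo 11 landed on the Moon in July 1969.",
         "human_label": True},
        {"gold_answer": "Leo Tolstoy",
         "model_answer": "It was written by Fyodor Dostoevsky.",
         "human_label": False},
    ]
    print(f"Evaluator-human agreement: {evaluator_human_agreement(toy_examples):.2f}")
```

Under this framing, an evaluation method that reaches higher agreement with the human annotations would be considered more reliable, which is the comparison the QA-Eval task formalizes.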

Authors (10)
  1. Cunxiang Wang (31 papers)
  2. Sirui Cheng (3 papers)
  3. Qipeng Guo (72 papers)
  4. Yuanhao Yue (9 papers)
  5. Bowen Ding (12 papers)
  6. Zhikun Xu (15 papers)
  7. Yidong Wang (43 papers)
  8. Xiangkun Hu (19 papers)
  9. Zheng Zhang (488 papers)
  10. Yue Zhang (620 papers)
Citations (21)