PEDANTS: Cheap but Effective and Interpretable Answer Equivalence (2402.11161v5)

Published 17 Feb 2024 in cs.CL and cs.AI

Abstract: Question answering (QA) can only make progress if we know whether an answer is correct, but current answer correctness (AC) metrics struggle with verbose, free-form answers from LLMs. There are two challenges with current short-form QA evaluations: a lack of diverse styles of evaluation data and an over-reliance on expensive and slow LLMs. LLM-based scorers correlate better with humans, but this expensive task has only been tested on limited QA datasets. We rectify these issues by providing rubrics and datasets for evaluating machine QA adopted from the Trivia community. We also propose an efficient and interpretable QA evaluation that is more stable than exact match and neural methods (BERTScore).
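To make the evaluation problem concrete, the sketch below contrasts two simple answer-correctness checks: exact match and token-overlap F1. This is not the PEDANTS metric itself, only a minimal illustration of why exact match breaks down on verbose, free-form answers while a softer matcher still credits the right content; the example strings and helper names are hypothetical.

```python
# Illustrative sketch only: a rule-based answer-correctness (AC) check
# contrasting exact match with token-overlap F1. It is NOT the PEDANTS
# metric from the paper; it just shows the kind of comparison an AC
# metric makes between a reference answer and a verbose model answer.
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> bool:
    """Strict equality after normalization."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    """F1 over the multiset of shared tokens (SQuAD-style)."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# A verbose free-form answer fails exact match but still gets partial
# credit from token F1 because it contains the reference answer.
pred = "The capital of France is Paris, of course."
ref = "Paris"
print(exact_match(pred, ref))          # False
print(round(token_f1(pred, ref), 2))   # 0.25
```

The gap between the two scores on this kind of example is exactly the instability the abstract attributes to exact match, and the motivation for metrics that judge answer equivalence rather than string identity.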

Authors (5)
  1. Zongxia Li (14 papers)
  2. Ishani Mondal (23 papers)
  3. Yijun Liang (5 papers)
  4. Huy Nghiem (9 papers)
  5. Jordan Lee Boyd-Graber (13 papers)
Citations (8)