Is your benchmark truly adversarial? AdvScore: Evaluating Human-Grounded Adversarialness (2406.16342v2)

Published 24 Jun 2024 in cs.CL

Abstract: Adversarial datasets should ensure AI robustness that matches human performance. However, as models evolve, datasets can become obsolete. Thus, adversarial datasets should be periodically updated based on their degradation in adversarialness. Given the lack of a standardized metric for measuring adversarialness, we propose AdvScore, a human-grounded evaluation metric. AdvScore assesses a dataset's true adversarialness by capturing models' and humans' varying abilities, while also identifying poor examples. AdvScore then motivates a new dataset creation pipeline for realistic and high-quality adversarial samples, enabling us to collect an adversarial question answering (QA) dataset, AdvQA. We apply AdvScore using 9,347 human responses and predictions from ten LLMs to track model improvement over five years (from 2020 to 2024). AdvScore assesses whether adversarial datasets remain suitable for model evaluation, measures model improvements, and provides guidance for better alignment with human capabilities.
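
To make the idea of "human-grounded adversarialness" concrete, here is a minimal sketch of a simplified proxy, not the paper's actual AdvScore formulation: it treats an item as adversarial when humans answer it correctly more often than models do, and aggregates per-item gaps between human and model accuracy. The function and argument names (`toy_adversarialness`, `human_correct`, `model_correct`) are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only; the paper's AdvScore is defined from human and
# model response data with richer ability modeling than this accuracy gap.
from typing import Dict, List


def toy_adversarialness(
    human_correct: Dict[str, List[bool]],   # per-item human correctness records
    model_correct: Dict[str, List[bool]],   # per-item model correctness records
) -> Dict[str, float]:
    """Per-item proxy score: human accuracy minus model accuracy."""
    scores = {}
    for item_id in human_correct:
        h = human_correct[item_id]
        m = model_correct.get(item_id, [])
        if not h or not m:
            continue  # skip items lacking responses from either humans or models
        human_acc = sum(h) / len(h)
        model_acc = sum(m) / len(m)
        scores[item_id] = human_acc - model_acc
    return scores


if __name__ == "__main__":
    humans = {"q1": [True, True, False], "q2": [True, False, False]}
    models = {"q1": [False, False], "q2": [True, True]}
    # q1 looks adversarial (humans beat models); q2 does not.
    print(toy_adversarialness(humans, models))
```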

Authors (5)
  1. Yoo Yeon Sung
  2. Eve Fleisig
  3. Ishani Mondal
  4. Jordan Lee Boyd-Graber
  5. Maharshi Gor