- The paper shows a significant divergence (over 20%) between benchmark gold answers and human plausibility ratings, highlighting design ambiguities.
- The paper uses qualitative analysis to expose ambiguous wording and semantic mismatches in MCQs that affect model evaluations.
- The paper demonstrates that LLMs such as GPT-4 and LLaMA-2 show performance drops on problematic questions, motivating more refined evaluation metrics.
Analysis of Plausibly Problematic Questions in Multiple-Choice Benchmarks for Commonsense Reasoning
In the paper "Plausibly Problematic Questions in Multiple-Choice Benchmarks for Commonsense Reasoning," the researchers examine gaps in the design and evaluation of multiple-choice question (MCQ) benchmarks for commonsense reasoning. The paper scrutinizes whether benchmark gold answers actually align with human judgments of plausibility.
Core Findings
- Divergence in Gold Answers and Human Plausibility:
  - The study finds that benchmark gold answers diverge from the option human annotators rate most plausible in over 20% of the sampled questions. This misalignment suggests that some questions are inherently ambiguous or semantically mismatched with their answer choices.
- Qualitative Analysis:
  - A qualitative examination surfaces recurring problems: ambiguous wording, semantic mismatches between questions and choices, and incoherent answer options, all of which undermine reliable model evaluation.
- LLM Performance:
  - Experiments with several LLMs, including GPT-4 and LLaMA-2, show notable performance drops and greater variance on these problematic questions. The paper argues that such items introduce noise into evaluations, reducing the reliability of reported performance metrics.
- Human vs. Model Performance:
  - On non-problematic questions, human annotators usually matched the gold label; on the problematic subset, LLMs displayed considerably more variance.
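The headline divergence figure can be made concrete with a small sketch. This is not the paper's code; the data format (a `gold` index plus per-choice mean human plausibility ratings) is an assumption for illustration. It computes the fraction of questions where the choice humans rate most plausible differs from the benchmark's gold answer.

```python
# Sketch (not the paper's code): estimating the share of questions whose
# benchmark gold answer differs from the option humans rate most plausible.
# The item schema ('gold', 'plausibility') is assumed for illustration.

def divergence_rate(items):
    """items: list of dicts with 'gold' (index of the gold choice) and
    'plausibility' (mean human rating per choice, in choice order)."""
    diverging = 0
    for item in items:
        ratings = item["plausibility"]
        # Index of the choice with the highest mean human rating.
        top_human = max(range(len(ratings)), key=ratings.__getitem__)
        if top_human != item["gold"]:
            diverging += 1
    return diverging / len(items)

sample = [
    {"gold": 0, "plausibility": [4.6, 2.1, 1.3]},  # gold matches humans
    {"gold": 1, "plausibility": [3.9, 3.2, 1.0]},  # humans prefer choice 0
    {"gold": 2, "plausibility": [1.2, 1.8, 4.4]},  # gold matches humans
    {"gold": 0, "plausibility": [2.0, 2.0, 4.1]},  # humans prefer choice 2
]
print(divergence_rate(sample))  # 0.5 on this toy sample
```

On this toy sample, two of four items diverge, giving a rate of 0.5; the paper reports a rate above 0.2 on its sampled benchmarks.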
Implications
- Evaluation: The findings motivate evaluation metrics that account for questions with multiple plausible interpretations, which could better align benchmark gold answers with human plausibility judgments.
- Dataset Construction: The authors recommend that dataset creators systematically check questions for multiple valid interpretations. Collecting a more diverse set of annotations and offering an "inapplicable" option can improve question clarity and relevance.
- Performance Metrics: The disparity in how LLMs handle problematic questions underscores the need for models that better capture nuanced commonsense reasoning.
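One way such a metric could look, sketched under the assumption that per-choice human plausibility ratings are available: instead of 0/1 gold-match accuracy, credit each model answer by how plausible humans found the chosen option. The function name and data layout here are hypothetical, not from the paper.

```python
# Sketch of a plausibility-weighted scoring rule (hypothetical, not the
# paper's metric): soften strict gold-match accuracy by crediting each
# prediction with the normalized human plausibility of the chosen option.

def soft_accuracy(predictions, items):
    """predictions: chosen choice index per question.
    items: dicts with 'plausibility' (mean human rating per choice)."""
    total = 0.0
    for pred, item in zip(predictions, items):
        ratings = item["plausibility"]
        # Full credit (1.0) for picking a top-rated choice, partial otherwise.
        total += ratings[pred] / max(ratings)
    return total / len(predictions)

items = [
    {"plausibility": [4.0, 2.0, 1.0]},  # choice 0 clearly most plausible
    {"plausibility": [3.0, 3.0, 1.5]},  # choices 0 and 1 tied for top
]
print(soft_accuracy([0, 1], items))  # 1.0: both picks are (tied-)top-rated
```

Under strict gold-match accuracy, a model picking a tied-but-non-gold choice would score zero; this rule instead reflects that humans found the answer fully plausible, which is the kind of adjustment the paper's findings argue for.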
Future Directions
The paper suggests that future research integrate plausibility ratings into the dataset creation process. Exploring culturally contextual evaluations could further strengthen commonsense reasoning assessments, since what counts as plausible can vary significantly across populations.
Overall, this analysis offers a thorough critique of existing commonsense reasoning benchmarks and actionable guidance for designing and evaluating future datasets. As AI systems continue to evolve, addressing these limitations is essential for advancing natural language understanding.