Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can't Answer? (2410.15512v1)

Published 20 Oct 2024 in cs.CL

Abstract: Question answering (QA)-producing correct answers for input questions-is popular, but we test a reverse question answering (RQA) task: given an input answer, generate a question with that answer. Past work tests QA and RQA separately, but we test them jointly, comparing their difficulty, aiding benchmark design, and assessing reasoning consistency. 16 LLMs run QA and RQA with trivia questions/answers, showing: 1) Versus QA, LLMs are much less accurate in RQA for numerical answers, but slightly more accurate in RQA for textual answers; 2) LLMs often answer their own invalid questions from RQA accurately in QA, so RQA errors are not from knowledge gaps alone; 3) RQA errors correlate with question difficulty and inversely correlate with answer frequencies in the Dolma corpus; and 4) LLMs struggle to give valid multi-hop questions. By finding question and answer types yielding RQA errors, we suggest improvements for LLM RQA reasoning.

Insights on Reverse Question Answering: An Analysis

The paper "Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can't Answer?" examines the reverse question answering (RQA) task, evaluating whether LLMs can generate a valid question for a given answer. It distinguishes RQA from traditional question answering (QA) by emphasizing reasoning strategies and consistency across answer types, shedding light on the abductive and deductive reasoning capabilities of LLMs.

Key Contributions

The authors evaluate 16 LLMs across four distinct answer domains: numerical answers (Numbers and Number+Text) and textual answers (Easy and Hard Facts). A dataset of 3443 trivia question-answer pairs is used to test QA and RQA jointly under a carefully designed evaluation setup; a minimal sketch of this joint setup appears below. A central focus of the paper is to uncover disparities in LLM performance between the two tasks and to identify the underlying issues that influence models' ability to reason and self-verify.
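
The sketch below illustrates the joint QA/RQA setup, assuming a generic `ask_llm` callable standing in for any chat-model API; the prompt wording, function names, and exact grading are illustrative simplifications, not the paper's actual harness or templates.

```python
from typing import Callable, Iterable, List, Tuple

# `ask_llm` is an assumed stand-in for any chat-model API call: prompt in, text out.
AskLLM = Callable[[str], str]

def qa(ask_llm: AskLLM, question: str) -> str:
    """Forward task (QA): produce an answer for an input question."""
    return ask_llm(f"Answer this trivia question concisely:\n{question}")

def rqa(ask_llm: AskLLM, answer: str) -> str:
    """Reverse task (RQA): produce a question whose answer is the given answer."""
    return ask_llm(f"Write a trivia question whose answer is exactly: {answer}")

def run_both(ask_llm: AskLLM, pairs: Iterable[Tuple[str, str]]) -> List[dict]:
    """Run QA and RQA on each (question, gold_answer) trivia pair."""
    results = []
    for question, gold_answer in pairs:
        results.append({
            "gold_answer": gold_answer,
            "qa_answer": qa(ask_llm, question),         # graded against gold_answer
            "rqa_question": rqa(ask_llm, gold_answer),  # graded separately for validity
        })
    return results

if __name__ == "__main__":
    # Toy stand-in model, just to show the control flow end to end.
    fake_llm = lambda p: "Paris" if "Answer this" in p else "What is the capital of France?"
    print(run_both(fake_llm, [("What is the capital of France?", "Paris")]))
```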

Findings

  1. Performance Disparities:
    • Numerical Answers: LLMs are far less accurate in RQA than in QA when answers are numerical, with accuracy gaps exceeding 0.80 for some models. This highlights a substantial limitation in abductive reasoning over numerical data.
    • Textual Answers: Conversely, models perform slightly better in RQA than QA for textual answers, suggesting a domain-specific advantage in generating questions from known entities.
  2. Logical Consistency:
    • When RQA and QA are chained, models frequently answer their own invalid RQA-generated questions correctly in QA (a minimal sketch of this chained check follows the list). This indicates that many RQA failures stem not purely from knowledge deficiencies but from improper question formulation.
  3. Error Correlation:
    • RQA errors correlate with question difficulty and inversely correlate with answer frequency in the Dolma corpus. In particular, RQA mistakes in the Number+Text category arise more often for rare entities, while overly complex multi-hop questions often lead to errors in the Numbers domain.
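
The following sketch shows the RQA-then-QA consistency chain referenced in finding 2. As above, `ask_llm` is an assumed stand-in for a chat-model API, and the prompts and string normalization are illustrative rather than the paper's grading procedure.

```python
from typing import Callable

def normalize(text: str) -> str:
    """Crude normalization so 'Paris' and ' paris.' compare equal."""
    return text.strip().strip(".").lower()

def rqa_then_qa(ask_llm: Callable[[str], str], gold_answer: str) -> dict:
    """Generate a question for gold_answer, then answer it with the same model."""
    generated_q = ask_llm(
        f"Write a trivia question whose answer is exactly: {gold_answer}"
    )
    self_answer = ask_llm(f"Answer this trivia question concisely:\n{generated_q}")
    return {
        "rqa_question": generated_q,
        "self_answer": self_answer,
        # "Consistent" means the model's own QA pass recovers the answer it was
        # asked to target. The paper's point is that this often succeeds even when
        # the generated question is invalid, so RQA errors are not explained by
        # knowledge gaps alone.
        "consistent": normalize(self_answer) == normalize(gold_answer),
    }
```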

Implications and Future Work

The implications of these findings are twofold. Theoretically, the paper challenges assumptions about LLM reasoning capabilities, highlighting weaknesses in numerical domains in particular. Practically, it recommends calibrated model tuning and training-data modifications that mitigate the bias towards generating overly complex, invalid questions.

For future research, the analysis suggests creating more abductive reasoning benchmarks and improving self-verification mechanisms to augment LLM robustness in generating valid, answer-verifiable questions. This dual focus on reasoning and consistency promises to enhance the utility of LLMs in educational, brainstorming, and exam generation contexts.

In conclusion, this paper provides a critical examination of the RQA task, offering insight into the domain-specific abilities and limitations of current LLMs. By pinpointing areas of weakness, it sets the stage for future advances in the logical coherence and reasoning processes of LLMs.

Authors (6)
  1. Nishant Balepur (14 papers)
  2. Feng Gu (29 papers)
  3. Abhilasha Ravichander (33 papers)
  4. Shi Feng (95 papers)
  5. Jordan Boyd-Graber (68 papers)
  6. Rachel Rudinger (46 papers)