None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks (2502.12896v4)

Published 18 Feb 2025 in cs.CL

Abstract: In LLM evaluations, reasoning is often distinguished from recall/memorization by performing numerical variations to math-oriented questions. Here we introduce a general variation method for multiple-choice questions that completely dissociates the correct answer from previously seen tokens or concepts, requiring LLMs to understand and reason (rather than memorizing) in order to answer correctly. Using this method, we evaluate state-of-the-art proprietary and open-source LLMs on two datasets available in English and Spanish: the public MMLU benchmark and the private UNED-Access 2024 dataset. Results show that all models experience remarkable accuracy drops under our proposed variation, with an average loss of 57% on MMLU and 50% on UNED-Access 2024, ranging from 10% to 93% across models. Notably, the most accurate model in our experimentation (OpenAI-o3-mini) is not the most robust (DeepSeek-R1-70B), suggesting that the best models in standard evaluations may not be the ones with better reasoning capabilities. Also, we see larger accuracy drops in public (vs private) datasets and questions posed in their original language (vs a manual translation), which are signs of contamination and also point to a relevant role of recall/memorization in current LLMs' answers.

Authors (3)
  1. Eva Sánchez Salido
  2. Julio Gonzalo
  3. Guillermo Marco

Summary

  • The paper introduces a novel evaluation technique that replaces the correct answer with 'None of the others' to distinguish genuine reasoning from memorization.
  • It demonstrates significant performance drops (10% to 93%) in LLMs, highlighting a reliance on memorized cues over reasoning.
  • The study uses multiple datasets to reveal contamination effects and shows that improved reasoning is not strictly tied to model size.

Evaluating Reasoning vs. Memorization in LLMs: Insights from Anchor Questions

The paper "None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks" presents an innovative methodology to assess the reasoning capabilities of LLMs in the context of multiple-choice question answering. The research addresses a critical aspect of AI evaluation by attempting to differentiate true reasoning from mere memorization, a challenge that persists despite the acclaimed advances of state-of-the-art LLMs.

Methodological Framework

The authors propose a novel evaluation technique in which they substitute the correct answer in multiple-choice questions with "None of the other answers," prompting models to engage in genuine reasoning. This alteration increases the difficulty of the task by removing the direct lexical or semantic cues that models might have memorized from training data. Importantly, the method requires models to rule out every incorrect option in order to arrive at "None of the others" as the correct choice, thus emphasizing reasoning over surface-level memorization.
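As an illustration, the variation can be applied mechanically by overwriting the gold option of each benchmark item. The sketch below assumes a simple dictionary representation with `question`, `options`, and `answer_idx` fields; these names are illustrative and not taken from the paper's materials.

```python
def none_of_the_others_variant(item, placeholder="None of the other answers"):
    """Replace the correct option with a 'none of the others' placeholder.

    Assumes `item` is a dict with 'question', 'options' (list of str) and
    'answer_idx' (index of the correct option); this schema is illustrative,
    not taken from the paper.
    """
    options = list(item["options"])
    # The original correct text disappears entirely, so no memorized token
    # or concept can point to the right choice.
    options[item["answer_idx"]] = placeholder
    return {"question": item["question"],
            "options": options,
            "answer_idx": item["answer_idx"]}  # the placeholder is now the gold answer


item = {"question": "Which planet is closest to the Sun?",
        "options": ["Venus", "Mercury", "Earth", "Mars"],
        "answer_idx": 1}
print(none_of_the_others_variant(item))
```

Under this transformation, a model can only answer correctly by ruling out every distractor, which is exactly the reasoning behavior the paper aims to isolate.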

Experimental Setup

The efficacy of this technique is validated on two datasets: the public MMLU benchmark and the private UNED-Access 2024 dataset. The models evaluated include proprietary systems such as OpenAI-o3-mini and Claude-3.5-Sonnet, as well as open-source models such as Llama-3 variants. The paper reports Cohen's Kappa alongside accuracy to account for chance-level performance, given the variability in the number of choices per question.
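For intuition, a kappa-style correction for guessing can be computed when the number of options varies across questions. The sketch below is a minimal illustration of that idea under the stated assumptions, not the paper's exact scoring code.

```python
def chance_corrected_accuracy(predictions, gold, n_options):
    """Kappa-style correction for guessing with a variable number of options.

    predictions, gold: lists of chosen / correct option indices.
    n_options: number of choices offered for each question.
    A minimal sketch of the idea, not the paper's exact formulation.
    """
    assert len(predictions) == len(gold) == len(n_options)
    p_observed = sum(p == g for p, g in zip(predictions, gold)) / len(gold)
    # Expected accuracy of a uniform random guesser, averaged over questions.
    p_chance = sum(1.0 / k for k in n_options) / len(n_options)
    return (p_observed - p_chance) / (1.0 - p_chance)


# Mixed 4-option and 3-option questions in the same benchmark.
print(chance_corrected_accuracy([1, 2, 0], [1, 2, 1], [4, 4, 3]))
```

A score of 0 corresponds to random guessing and 1 to perfect agreement, which makes results comparable across questions with different numbers of options.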

Key Findings

  1. Reasoning vs. Memorization: When evaluated under the "none of the others" condition, all models experienced significant decreases in performance, with accuracy drops ranging from 10% to 93% (one way of computing such drops is sketched after this list). This stark decline underscores current LLMs' reliance on memorization and confirms that memorization alone is inadequate once lexical cues are removed.
  2. Contamination Effects: The paper highlights contamination as a factor influencing model performance. Publicly available datasets like MMLU showed larger performance drops, consistent with exposure during pre-training, whereas the private UNED-Access 2024 dataset proved somewhat more resilient, reinforcing contamination as a partial explanation for results on public benchmarks.
  3. Translation Biases: The paper evaluated bilingual versions of the datasets and found that translated questions, which are unlikely to appear verbatim in pre-training data, showed smaller performance drops than questions posed in their original language. This suggests that LLMs' accuracy may be inflated when questions are presented in the language in which they originally circulated.
  4. Robustness and Model Size: Interestingly, robustness to the "none of the others" alteration was not strictly correlated with model size. Some mid-sized models, like DeepSeek-R1-70B, demonstrated lower performance drops compared to larger models, indicating that reasoning abilities may depend on factors other than scale.
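As referenced in the first finding, the reported losses can be read as drops relative to a model's original accuracy; the sketch below illustrates that interpretation, which is an assumption rather than a formula quoted from the paper.

```python
def relative_drop(acc_original, acc_variant):
    """Relative accuracy loss under the 'none of the others' variation.

    Assumes the reported losses are expressed relative to the original
    accuracy; this interpretation is not quoted from the paper.
    """
    return 100.0 * (acc_original - acc_variant) / acc_original


# A model falling from 80% to 36% accuracy loses 55% of its original performance.
print(f"{relative_drop(0.80, 0.36):.0f}% drop")
```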

Implications and Future Directions

These findings have significant implications for the development and evaluation of LLMs. The revelation that existing benchmarks might overestimate reasoning abilities due to memorization signals a need for revised evaluation criteria. Anchor questions like "none of the others" could serve as a crucial tool for exposing reasoning deficiencies and prompting the design of LLMs capable of deeper cognitive processing.

Future research directions may focus on refining these evaluation techniques and integrating them into broader assessment frameworks to ensure comprehensive measurement of LLM reasoning capabilities. Such efforts would align with advancing AI technologies that transcend rote memorization and exhibit authentic understanding and inference capacities. This work advocates for a paradigm shift in AI evaluation, emphasizing sophisticated reasoning metrics to drive the next wave of AI innovation.