LINGOLY-TOO: Disentangling Reasoning from Knowledge with Templatised Orthographic Obfuscation (2503.02972v5)

Published 4 Mar 2025 in cs.CL and cs.AI

Abstract: The expanding knowledge and memorisation capacity of frontier LLMs allows them to solve many reasoning tasks directly by exploiting prior knowledge, leading to inflated estimates of their reasoning abilities. We introduce LINGOLY-TOO, a challenging reasoning benchmark grounded in natural language and designed to counteract the effect of non-reasoning abilities on reasoning estimates. Using linguistically informed rulesets, we permute reasoning problems written in real languages to generate numerous question variations. These permutations preserve the intrinsic reasoning steps required for each solution while reducing the likelihood that problems are directly solvable with models' knowledge. Experiments and analyses show that models can circumvent reasoning and answer from prior knowledge. On a metric that rewards consistent reasoning, all models perform poorly and exhibit high variance across question permutations, indicating that the reasoning faculty of LLMs remains brittle. Overall, results on the benchmark reflect the recent progress of Inference-Time Compute (ITC) models but suggest ample room for further improvement. The benchmark is a step towards better measurement of reasoning abilities of LLMs and offers a cautionary tale on the importance of disentangling reasoning abilities from models' internalised knowledge when developing reasoning benchmarks.

Authors (9)
  1. Jude Khouja (3 papers)
  2. Karolina Korgul (2 papers)
  3. Simi Hellsten (2 papers)
  4. Lingyi Yang (8 papers)
  5. Harry Mayne (6 papers)
  6. Ryan Kearns (3 papers)
  7. Andrew Bean (3 papers)
  8. Adam Mahdi (27 papers)
  9. Vlad Neacsu (1 paper)

Summary

Disentangling Memorisation from Reasoning with Linguistic Templatisation and Orthographic Obfuscation

The evaluation of reasoning capabilities in LLMs is often compromised by data contamination: benchmark problems, or close variants of them, appear in training data, so models can answer from memory rather than by reasoning. The paper introduces a framework to address this problem, focusing on accurately assessing reasoning in LLMs. Building on linguistic reasoning problems, the authors develop the LINGOLY-TOO benchmark, which uses templatisation and orthographic obfuscation to mitigate memorisation effects in LLM reasoning assessment.

Traditional reasoning benchmarks often fall short because of data contamination: when subsets of the evaluation data appear in training, measured reasoning ability is inflated. To address this, the authors generate linguistic problems whose orthography is obfuscated, reducing the chance that a model has encountered the exact problem before. The LINGOLY-TOO benchmark comprises 27,325 questions. The core obfuscation strategy creates new orthographies that retain the linguistic logic of the original problem while systematically substituting graphemes. This yields many variants of each question and ensures that models cannot simply leverage memorised knowledge from training data.
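As a rough illustration of the obfuscation idea (a minimal sketch, not the authors' actual linguistically informed ruleset; the grapheme inventory, mapping, and toy vocabulary below are hypothetical), the following Python snippet applies a consistent grapheme-to-grapheme substitution to a small set of words, changing their surface forms while preserving their internal structure:

```python
import random

def make_obfuscation_map(graphemes, seed=0):
    """Build a one-to-one substitution between graphemes.

    A fixed seed gives a reproducible permutation; different seeds
    give different obfuscated 'orthographies' of the same problem.
    """
    rng = random.Random(seed)
    shuffled = graphemes[:]
    rng.shuffle(shuffled)
    return dict(zip(graphemes, shuffled))

def obfuscate(word, mapping):
    """Rewrite a word grapheme by grapheme using the substitution map."""
    return "".join(mapping.get(ch, ch) for ch in word)

# Hypothetical toy example: a small grapheme inventory and vocabulary.
graphemes = list("aeioumnptk")
vocabulary = ["mina", "patu", "kane"]

mapping = make_obfuscation_map(graphemes, seed=42)
print([obfuscate(w, mapping) for w in vocabulary])
```

Because the same mapping is applied to every form in a problem, morpheme boundaries and correspondences between forms are preserved, so the reasoning steps required to solve the problem remain intact even though the surface strings no longer match anything seen in training.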

Experiments with eleven state-of-the-art LLMs, including Claude 3.7 Sonnet, GPT-4o, and DeepSeek R1, show that models struggle with the obfuscated problems despite strong performance on other reasoning tasks. A consistent drop in accuracy on obfuscated relative to original problems was observed across models, supporting the hypothesis that memorisation inflated earlier assessments. Accuracy also varied substantially across permutations of the same problem, suggesting that current LLMs, despite their advancements, lack robust reasoning capabilities.
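The summary does not spell out the metric that rewards consistent reasoning, but one simple way to operationalise it (a sketch under the assumption that each problem has several obfuscated permutations with exact-match grading; the data layout is hypothetical) is to credit a problem only if every permutation is answered correctly:

```python
from collections import defaultdict

def consistency_score(results):
    """results: list of (problem_id, permutation_id, is_correct) tuples.

    Returns the fraction of problems for which *every* permutation
    was answered correctly, so inconsistent answers across
    obfuscations earn no credit.
    """
    by_problem = defaultdict(list)
    for problem_id, _perm_id, is_correct in results:
        by_problem[problem_id].append(is_correct)
    solved = sum(all(flags) for flags in by_problem.values())
    return solved / len(by_problem) if by_problem else 0.0

# Hypothetical usage: "p1" is solved in all permutations, "p2" in
# only some, so the consistency score is 0.5.
results = [
    ("p1", 0, True), ("p1", 1, True),
    ("p2", 0, True), ("p2", 1, False),
]
print(consistency_score(results))  # 0.5
```

A metric of this kind penalises models that succeed on a familiar surface form but fail on an equivalent obfuscated one, which is exactly the gap the benchmark is designed to expose.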

The implications of LINGOLY-TOO are twofold. Practically, it provides a measure that separates genuine reasoning capability from memorisation, which matters for deploying LLMs in applications that demand critical thinking. Theoretically, it advances understanding of LLM reasoning processes, highlighting their dependence on training data rather than on a capacity for genuine deduction. Future work could focus on improving models' ability to apply fundamental reasoning principles dynamically, irrespective of data familiarity.

Looking forward, this research underscores the need for evaluation frameworks to evolve alongside LLM capabilities. As models become increasingly embedded in analytical roles across sectors, the demand for authentic, context-independent reasoning will grow. Rigorous benchmarking, as embodied by LINGOLY-TOO, will be pivotal in driving progress toward genuinely autonomous reasoning systems.