LINGOLY-TOO: Disentangling Reasoning from Knowledge with Templatised Orthographic Obfuscation (2503.02972v5)
Abstract: The expanding knowledge and memorisation capacity of frontier large language models (LLMs) allows them to solve many reasoning tasks directly by exploiting prior knowledge, leading to inflated estimates of their reasoning abilities. We introduce LINGOLY-TOO, a challenging reasoning benchmark grounded in natural language and designed to counteract the effect of non-reasoning abilities on reasoning estimates. Using linguistically informed rulesets, we permute reasoning problems written in real languages to generate numerous question variations. These permutations preserve the intrinsic reasoning steps required for each solution while reducing the likelihood that problems can be solved directly from models' knowledge. Experiments and analyses show that models can circumvent reasoning and answer from prior knowledge. On a metric that rewards consistent reasoning, all models perform poorly and exhibit high variance across question permutations, indicating that LLMs' reasoning faculty remains brittle. Overall, results on the benchmark reflect the recent progress of Inference-Time Compute (ITC) models but suggest ample room for further improvement. The benchmark is a step towards better measurement of the reasoning abilities of LLMs and offers a cautionary tale on the importance of disentangling reasoning abilities from models' internalised knowledge when developing reasoning benchmarks.
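To make the permutation idea concrete, the sketch below shows how a consistent grapheme-substitution ruleset could be applied to the language data in a problem to produce an obfuscated variant. The ruleset, helper function, and example words here are hypothetical illustrations under assumed behaviour (consistent longest-match substitution), not the paper's actual linguistically informed rulesets.

```python
# Minimal sketch of rule-based orthographic obfuscation.
# The ruleset below is hypothetical and purely illustrative; it is NOT the
# paper's actual linguistically informed ruleset.

def obfuscate(text: str, ruleset: dict[str, str]) -> str:
    """Apply a grapheme-substitution ruleset consistently across a string.

    Consistency is the key property: the same source grapheme always maps to
    the same target grapheme, so the internal structure of the language data
    (and hence the reasoning steps required) is preserved, while surface forms
    no longer match anything a model may have memorised.
    """
    # Try longer graphemes first so multi-character units (e.g. "sh")
    # are replaced before their single-character substrings.
    keys = sorted(ruleset, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for k in keys:
            if text.startswith(k, i):
                out.append(ruleset[k])
                i += len(k)
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    # Hypothetical ruleset and example words, for illustration only.
    ruleset = {"sh": "x", "a": "e", "e": "i", "k": "q"}
    words = ["kasha", "kashe"]
    print([obfuscate(w, ruleset) for w in words])
    # -> ['qexe', 'qexi']
```

In a setting like this, only the problem's unknown-language material would be rewritten; the English instructions and glosses would be left untouched, so the reasoning task itself is unchanged while memorised vocabulary no longer applies.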
- Jude Khouja
- Karolina Korgul
- Simi Hellsten
- Lingyi Yang
- Harry Mayne
- Ryan Kearns
- Andrew Bean
- Adam Mahdi
- Vlad Neacsu