Do Large Language Models Have Number Sense?
This presentation examines whether large language models possess genuine number sense—the flexible, structure-sensitive reasoning about numbers emphasized in mathematics education—or merely achieve high arithmetic accuracy through procedural fluency. Using the SenseMath benchmark, researchers tested models across three dimensions: executing mathematical shortcuts, judging when shortcuts apply, and generating problems that admit shortcuts. The findings reveal a striking gap: models can apply shortcuts when explicitly prompted but systematically fail to judge applicability and rarely generate structurally valid problems, demonstrating procedural competence without conceptual understanding.

Script
Large language models now solve arithmetic problems at near-human accuracy, but here's the catch: high performance doesn't guarantee mathematical understanding. When a student sees 98 times 37, do they laboriously multiply, or do they recognize 98 as 100 minus 2 and compute 100 times 37 minus 2 times 37 instead? This paper asks whether models possess that kind of flexible number sense, or just brute-force calculation.
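Written out, that shortcut is just the distributive law; this worked line is an illustrative example, not an item drawn from the benchmark:

```latex
98 \times 37 = (100 - 2) \times 37 = 100 \times 37 - 2 \times 37 = 3700 - 74 = 3626
```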
To probe this question rigorously, the researchers built SenseMath, a benchmark designed around matched problem variants. Each item appears in three forms that look similar on the surface but differ in whether a shortcut genuinely applies. This controlled design lets them measure whether models select strategies based on mathematical structure or just surface patterns, and it tests capability at three levels of cognitive demand.
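As a concrete picture of that design, here is a minimal sketch of what one matched triple could look like; the schema, field names, and example expressions are assumptions for illustration, not SenseMath's actual format:

```python
from dataclasses import dataclass

@dataclass
class MatchedItem:
    # Hypothetical schema for one problem variant, not SenseMath's real format.
    expression: str         # the arithmetic problem as shown to the model
    shortcut_applies: bool  # ground truth: does a shortcut genuinely help?
    note: str               # why it does or doesn't apply

# Three surface-similar variants that differ in underlying structure.
variants = [
    MatchedItem("98 * 37", True, "98 = 100 - 2, so distribute over the round anchor"),
    MatchedItem("102 * 37", True, "102 = 100 + 2, same trick with addition"),
    MatchedItem("61 * 37", False, "no nearby round anchor makes rewriting cheaper"),
]

for v in variants:
    print(f"{v.expression:>8}  shortcut applies: {v.shortcut_applies}  ({v.note})")
```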
First, can models actually use shortcuts when the structure is there?
Under standard chain-of-thought prompting, models rarely exploit shortcuts, even when those shortcuts would drastically simplify the problem. But when given an explicit prompt emphasizing mathematical intuition, shortcut usage jumps from under 40 percent to over 80 percent for the strongest models, with accuracy gains of up to 15 percentage points. The capability is latent, not absent. Smaller 8-billion-parameter models, however, show no such benefit, suggesting a threshold in model scale or training for acquiring these strategies. The effect strengthens as the number of digits grows, exactly when shortcuts matter most.
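A minimal sketch of the two prompting conditions contrasted above; the prompt wording and the `query_model` stand-in are assumptions for illustration, not the paper's actual prompts:

```python
# Two prompting conditions (hypothetical wording; `query_model` is a
# placeholder for whatever inference client is used).

COT_PROMPT = (
    "Solve the following problem. Think step by step, then state the answer.\n"
    "Problem: {problem}"
)

INTUITION_PROMPT = (
    "Solve the following problem. Before computing, check for structure such as "
    "round-number anchors or distributive patterns that permit a shortcut, and "
    "use the shortcut if one genuinely applies.\n"
    "Problem: {problem}"
)

def query_model(prompt: str) -> str:
    raise NotImplementedError("plug in a model client here")

def solve(problem: str, encourage_shortcuts: bool) -> str:
    template = INTUITION_PROMPT if encourage_shortcuts else COT_PROMPT
    return query_model(template.format(problem=problem))
```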
The dissociation becomes stark when models must judge or create rather than just execute. Asked whether a shortcut applies to a new problem, models overwhelmingly overgeneralize, saying yes regardless of actual structure. They can identify a shortcut in a worked solution, but cannot prospectively evaluate applicability. Generation is even harder: models produce superficially plausible items with round numbers and proper formatting, but almost never construct problems where a shortcut genuinely works. They mimic the surface without grasping the underlying mathematical relationships.
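To see how that overgeneralization would surface in scoring, here is a sketch of an Analyze-style evaluation against matched variants like those above; the `judge` interface and metric choices are assumptions, not the benchmark's published scoring code:

```python
def score_judgments(items, judge):
    # `judge` maps an expression to a yes/no applicability prediction;
    # `items` carry ground-truth `shortcut_applies` labels.
    tp = fp = tn = fn = 0
    for item in items:
        predicted = judge(item.expression)
        if predicted and item.shortcut_applies:
            tp += 1
        elif predicted and not item.shortcut_applies:
            fp += 1
        elif not predicted and not item.shortcut_applies:
            tn += 1
        else:
            fn += 1
    return {
        "accuracy": (tp + tn) / len(items),
        "false_positive_rate": fp / max(fp + tn, 1),
    }

# A judge that always answers yes gets perfect recall on shortcut items but a
# false-positive rate of 1.0: the overgeneralization pattern described above.
always_yes = lambda expression: True
```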
These results trace a hierarchy: Apply is easier than Analyze, which is easier than Create. This gradient mirrors developmental psychology, where children master procedures before understanding when and why those procedures work. For large language models deployed as math tutors or problem generators, this gap is not academic. A system that confidently applies shortcuts in the wrong context, or generates practice problems that look right but lack valid structure, undermines trust and learning.
High arithmetic accuracy, it turns out, can coexist with shallow number sense. Models have learned the moves, but not yet the game. To learn more about this research and create your own videos exploring the frontiers of AI, visit EmergentMind.com.