- The paper presents the OMEGA benchmark to systematically evaluate LLMs' out-of-distribution generalization in mathematical reasoning.
- It utilizes 40 programmatically generated templates across six math domains with rigorous symbolic, numerical, and graphical solution verification.
- Empirical findings reveal sharp accuracy declines with complexity and limited reinforcement learning gains on compositional and transformative generalization tasks.
OMEGA: Evaluating the Generalization Limits of LLMs in Mathematical Reasoning
The OMEGA benchmark provides a systematic framework for evaluating the out-of-distribution (OOD) generalization capabilities of LLMs in mathematical reasoning. The benchmark is motivated by the observation that, despite recent advances, state-of-the-art LLMs often rely on a narrow set of familiar strategies and struggle with problems requiring novel or creative approaches. OMEGA is designed to probe three axes of generalization—exploratory, compositional, and transformative—each corresponding to a distinct aspect of mathematical creativity and flexibility.
Benchmark Design and Methodology
OMEGA is constructed from 40 programmatically generated problem templates spanning six mathematical domains: arithmetic, algebra, combinatorics, number theory, geometry, and logic/puzzles. Each template is parameterized to allow precise control over problem complexity and the reasoning strategies required. Solutions are verified using symbolic, numerical, or graphical methods, ensuring correctness and scalability.
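To make the generation-and-verification pipeline concrete, here is a minimal sketch in Python. The linear-equation template, the complexity knob, and the SymPy-based checker are illustrative assumptions, not OMEGA's actual templates or verifier API.

```python
# Illustrative sketch only: a hypothetical parameterized template plus a
# symbolic verifier, standing in for OMEGA's template-driven generation.
import random
import sympy as sp

def generate_linear_problem(complexity, seed=None):
    """Instantiate one problem; `complexity` widens the coefficient range,
    mimicking the benchmark's controlled-difficulty parameterization."""
    rng = random.Random(seed)
    a = rng.randint(2, 2 + complexity)
    b = rng.randint(-5 * complexity, 5 * complexity)
    c = rng.randint(-5 * complexity, 5 * complexity)
    question = f"Solve for x: {a}*x + {b} = {c}"
    gold = sp.Rational(c - b, a)  # exact ground-truth answer
    return question, gold

def verify(candidate, gold):
    """Symbolic check: parse the model's answer and compare it exactly."""
    try:
        return sp.simplify(sp.sympify(candidate) - gold) == 0
    except (sp.SympifyError, TypeError):
        return False

question, gold = generate_linear_problem(complexity=3, seed=0)
print(question)
print(verify(str(gold), gold))  # True: the gold answer verifies against itself
```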
The three axes of generalization are defined as follows:
- Exploratory Generalization: Tests whether models can extend a single reasoning strategy to more complex instances within the same problem family.
- Compositional Generalization: Assesses the ability to integrate multiple reasoning skills, previously learned in isolation, to solve novel problems requiring their combination.
- Transformative Generalization: Evaluates whether models can abandon familiar but ineffective strategies in favor of qualitatively different, more efficient approaches.
For each axis, OMEGA provides matched training and test sets, enabling controlled studies of generalization behavior. The benchmark is calibrated to Olympiad-level complexity, with many problems serving as sub-components of advanced mathematical tasks.
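As an illustration of what a matched split on the exploratory axis might look like, the sketch below holds out strictly harder instances of the same problem family; the field names and threshold are assumptions, not OMEGA's actual data schema.

```python
# Illustrative data model and split for the exploratory axis: train on
# lower-complexity instances of a family, test on strictly harder ones.
from dataclasses import dataclass

@dataclass
class Problem:
    family: str      # e.g. "combinatorics.grid_paths" (hypothetical name)
    complexity: int  # template parameter controlling difficulty
    question: str
    answer: str

def exploratory_split(problems, train_max_complexity):
    """Matched train/test sets: same families, higher complexity held out."""
    train = [p for p in problems if p.complexity <= train_max_complexity]
    test = [p for p in problems if p.complexity > train_max_complexity]
    return train, test
```

A compositional split would instead hold out problems whose solutions require combining two families that appear only separately in training, and a transformative split would hold out variants where the familiar family strategy no longer works.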
Empirical Findings
1. Problem Complexity and Chain-of-Thought Behavior
Across all evaluated models (DeepSeek-R1, Claude-3.7-Sonnet, OpenAI-o3-mini, OpenAI-o4-mini), exact-match accuracy declines sharply as problem complexity increases (a per-complexity tabulation of this metric is sketched after the list below). This degradation is not attributable to context-length limitations; rather, it reflects the increased cognitive load and the compounding of errors in longer reasoning chains. Chain-of-Thought (CoT) analysis reveals that:
- Models often discover correct solutions early but expend excessive tokens on verification, leading to inefficiency and destabilization of correct reasoning.
- Overthinking and self-correction mechanisms can cause models to abandon correct solution paths, resulting in error spirals.
- On high-complexity problems, models exhibit a reluctance to perform tedious computations, preferring heuristic or conjectural reasoning even when their arithmetic accuracy remains high.
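The per-complexity tabulation referenced above can be computed as follows; the string normalization used for exact match here is an illustrative assumption, not the paper's exact matching rule.

```python
# Exact-match accuracy bucketed by complexity level.
from collections import defaultdict

def accuracy_by_complexity(records):
    """records: iterable of (complexity, prediction, gold) triples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for complexity, pred, gold in records:
        totals[complexity] += 1
        hits[complexity] += int(pred.strip().lower() == gold.strip().lower())
    return {c: hits[c] / totals[c] for c in sorted(totals)}

# Toy example of the reported pattern: accuracy falls as complexity rises.
demo = [(1, "42", "42"), (1, "7", "7"), (3, "x = 2", "x = 3"), (3, "5", "5")]
print(accuracy_by_complexity(demo))  # {1: 1.0, 3: 0.5}
```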
2. Reinforcement Learning (RL) and Generalization
Fine-tuning Qwen-series models with RL using the GRPO algorithm (whose group-relative advantage step is sketched after the list below) yields notable improvements in exploratory generalization, particularly on in-distribution and moderately OOD tasks. However:
- Exploratory Generalization: RL boosts accuracy on harder instances within known domains, with gains more pronounced on in-distribution tasks.
- Compositional Generalization: RL leads to strong performance on isolated skills but fails to reliably transfer these gains to composed tasks requiring skill integration. Even when models master individual skills, they struggle to synthesize them for novel problems.
- Transformative Generalization: RL provides negligible improvement on tasks requiring a shift to new solution paradigms. In some cases, RL fine-tuning entrenches brittle heuristics, reducing performance on OOD tasks that demand creative reframing.
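The group-relative advantage step referenced above is the core of GRPO: several completions are sampled per prompt, and each completion's reward is normalized against its group's mean and standard deviation. The sketch below shows only that step, with a binary verification reward as an assumption; the clipped policy-gradient loss and KL penalty are omitted.

```python
# Group-relative advantages as used in GRPO-style RL fine-tuning.
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """rewards: tensor of shape (num_prompts, group_size), e.g. 1.0 when the
    sampled solution's final answer verifies and 0.0 otherwise."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],   # 2 of 4 samples correct
                        [0.0, 0.0, 0.0, 1.0]])  # 1 of 4 samples correct
print(group_relative_advantages(rewards))
```

Because advantages are measured relative to what the model already samples, this style of fine-tuning tends to sharpen strategies that are already reachable, which is arguably consistent with the reported weakness on tasks that demand qualitatively new approaches.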
3. Inference-Time Compute and Scaling
Increasing inference-time compute (e.g., Pass@k with up to 32 candidates) improves performance on low-complexity problems but offers diminishing returns as complexity increases. For certain combinatorial tasks, performance drops to zero at higher complexity levels, despite the problem remaining within the models' context window. This indicates that brute-force scaling of inference-time compute is insufficient to overcome fundamental reasoning limitations.
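For reference, Pass@k numbers of this kind are usually computed with the standard unbiased estimator from n sampled candidates per problem; whether OMEGA uses exactly this estimator is an assumption of the sketch below.

```python
# Unbiased pass@k estimator: with n samples per problem of which c verify,
# the probability that at least one of k drawn samples is correct.
from math import comb

def pass_at_k(n, c, k):
    """n = candidates drawn, c = candidates that verify, k = evaluation budget."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 32 candidates but only 2 correct ones, even pass@8 stays well below 1.
print(round(pass_at_k(n=32, c=2, k=8), 3))  # ~0.444
```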
Key Numerical Results
- RL fine-tuning increases in-distribution accuracy by up to 61 percentage points in some domains (e.g., Logic Zebra), but OOD gains are typically lower and highly variable.
- On compositional generalization, models achieve >69% accuracy on individual skills after RL, yet show little to no improvement on composed tasks, with gains often limited to +6% or less.
- For transformative generalization, OOD accuracy remains near zero after RL, with the only exception being minor gains on simple variants where conventional strategies still apply. In some cases, RL reduces OOD performance by up to 30 percentage points, highlighting the risk of overfitting to familiar patterns.
Implications and Future Directions
OMEGA exposes critical gaps in the reasoning capabilities of current LLMs, particularly in their ability to generalize compositionally and transformatively. The findings suggest that:
- RL and increased compute can amplify proficiency within known domains but do not foster the creative leaps required for true mathematical innovation.
- Current models lack mechanisms for meta-reasoning, such as detecting when a default strategy fails and actively searching for alternative approaches.
- Short-term fixes, such as targeted data augmentation or synthetic scaffolding, may obscure deeper structural weaknesses in model reasoning.
To advance toward human-level mathematical problem-solving, future research should explore:
- Curriculum Scaffolding: Dynamically ordering tasks to gradually introduce compositional and transformative challenges (a toy stage scheduler is sketched after this list).
- Meta-Reasoning Controllers: Developing mechanisms that can recognize failure modes and trigger exploration of new solution strategies.
- Architectural Innovations: Incorporating inductive biases or modular structures that facilitate skill integration and flexible strategy switching.
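As one way to picture the curriculum-scaffolding idea, the toy scheduler below advances through (axis, complexity) stages only after a rolling accuracy threshold is cleared; the stage list and threshold are illustrative assumptions, not a method proposed in the paper.

```python
# Toy curriculum scheduler: stay on a stage until it is roughly mastered.
STAGES = [("exploratory", 1), ("exploratory", 2),
          ("compositional", 2), ("transformative", 2)]

def next_stage_index(stage_idx, rolling_accuracy, threshold=0.7):
    """Advance to the next (axis, complexity) stage only once the rolling
    accuracy on the current stage clears the threshold."""
    if rolling_accuracy >= threshold and stage_idx + 1 < len(STAGES):
        return stage_idx + 1
    return stage_idx

print(STAGES[next_stage_index(0, rolling_accuracy=0.85)])  # ('exploratory', 2)
```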
Theoretical and Practical Significance
OMEGA provides a reproducible testbed that can generate an effectively unlimited supply of problems for diagnosing the generalization limits of LLMs in mathematical reasoning. By isolating and quantifying fine-grained failures, it enables targeted research into the mechanisms underlying compositionality and creativity. The benchmark's design principles—template-driven generation, controlled complexity, and broad domain coverage—set a new standard for evaluating reasoning models beyond mechanical proficiency.
In practical terms, OMEGA highlights the need for smarter scaling strategies and more principled approaches to model training and evaluation. The results caution against over-reliance on brute-force scaling or RL fine-tuning as panaceas for reasoning limitations. Instead, they point toward the necessity of fundamentally new methods for equipping models with robust, efficient, and creative mathematical reasoning capabilities.
Speculation on Future Developments
As LLMs continue to evolve, benchmarks like OMEGA will be essential for tracking progress toward genuine mathematical creativity. Future models may incorporate explicit meta-reasoning modules, curriculum learning frameworks, or hybrid neuro-symbolic architectures to address the compositional and transformative gaps identified by OMEGA. The benchmark's extensibility and fine-grained control make it a valuable resource for both empirical evaluation and theoretical analysis of reasoning in AI systems.