OMEGA Benchmark
- OMEGA Benchmark is a suite designed to systematically evaluate the mathematical reasoning capabilities of large language models, focusing on their ability to generalize and exhibit creativity.
- The benchmark defines and tests three axes of generalization—exploratory, compositional, and transformative—across six mathematical domains to diagnose model limitations beyond simple proficiency.
- Key findings show state-of-the-art LLMs struggle with integrating skills and generating novel solution strategies, indicating that scaling and fine-tuning primarily enhance mechanical proficiency rather than mathematical creativity.
The OMEGA Benchmark is a suite of systematically designed evaluations that probe the mathematical reasoning capabilities of LLMs, with a focus on diagnosing and quantifying their ability to generalize and exhibit creativity in mathematical problem-solving. Drawing on principles from cognitive science, particularly Boden’s typology of creativity, OMEGA delineates three distinct axes of out-of-distribution generalization—exploratory, compositional, and transformative—and offers a unified experimental framework across six core mathematical domains. The benchmark exposes persistent limitations of state-of-the-art LLMs in moving beyond mechanical proficiency toward genuinely novel mathematical reasoning, while providing a foundation for targeted model development.
1. Benchmark Design and Motivation
OMEGA was created to address the empirical observation that even the highest-performing LLMs, such as DeepSeek-R1 and the Qwen-series, display sharp performance drops and noticeably narrow solution strategies when encountering problems beyond the scope of their training distribution. The benchmark was carefully constructed to differentiate between various forms of generalization—applying known strategies at increased complexity, integrating previously isolated skills, or formulating new solution methods—by generating programmatically controlled training and test pairs.
The design aims to isolate whether models can:
- Apply a learned skill to more challenging variants (exploratory generalization),
- Integrate multiple learned skills into a coherent solution (compositional generalization), or
- Abandon well-trodden strategies in favor of truly novel ones (transformative generalization).
Problems are sourced from templated generators covering geometry, number theory, algebra, combinatorics, logic, and puzzles, with all solutions verified by symbolic, numerical, or graphical (e.g., OpenCV for geometry) methods to guarantee correctness. Problem templates are parameterized by complexity vectors with explicit complexity scoring functions, enabling controlled partitioning between training and evaluation sets.
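The released generators are not reproduced here; the sketch below uses a hypothetical GCD-chain template to illustrate the general pattern of a parameterized template with an explicit complexity score and a controlled train/evaluation partition.

```python
import math
import random
from dataclasses import dataclass
from functools import reduce

@dataclass
class Problem:
    question: str
    answer: int
    complexity: int

def gcd_chain_template(num_terms: int, max_value: int) -> Problem:
    """Hypothetical template: compute the GCD of a list of random integers.

    The complexity vector here is (num_terms, max_value); the scoring
    function simply counts the pairwise GCD reductions required.
    """
    values = [random.randint(2, max_value) for _ in range(num_terms)]
    answer = reduce(math.gcd, values)
    question = f"Compute gcd({', '.join(map(str, values))})."
    complexity = num_terms - 1  # one reduction step per additional term
    return Problem(question, answer, complexity)

# Controlled partition: train on low-complexity instances and evaluate on
# strictly higher-complexity ones (the exploratory-generalization split).
train_set = [gcd_chain_template(n, 100) for n in range(2, 5) for _ in range(50)]
eval_set = [gcd_chain_template(n, 100) for n in range(6, 9) for _ in range(50)]
assert max(p.complexity for p in train_set) < min(p.complexity for p in eval_set)
```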
2. Generalization Axes and Problem Typology
OMEGA operationalizes Boden’s creative typology into three experimentally orthogonal generalization axes:
- Exploratory Generalization: Evaluates whether LLMs can generalize a previously learned solution within a domain to higher complexity. For example, after training on counting rectangles in octagons, does performance extrapolate to dodecagons? The evaluation set contains strictly more complex instances than the training set.
- Compositional Generalization: Assesses if LLMs can integrate skills learned in isolation. For example, if trained separately on finding polynomial roots and extracting GCDs, can a model solve problems requiring their combined use in new mathematical compositions? Test instances require skill synthesis absent from training distribution.
- Transformative Generalization: Probes whether LLMs can recognize and deploy fundamentally new strategies, especially when familiar tactics become ineffective. For example, LLMs may be trained on problems amenable to exhaustive search but evaluated on instances where symmetry or an algebraic insight renders exhaustive enumeration infeasible, requiring an innovative solution.
Within these axes, templates are designed to ensure no leakage between train and test regimes, and ground-truth solutions are constructed by programmatic means—eliminating label noise and making large-scale benchmarking feasible.
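As an illustration of the compositional setting described above (polynomial root-finding combined with GCD extraction), the sketch below shows how such a composed instance could be generated and verified programmatically; the specific polynomial and the use of sympy are illustrative assumptions, not taken from the benchmark code.

```python
import sympy as sp

x = sp.symbols("x")

# Skill A (learned in isolation): find the integer roots of a polynomial.
# The polynomial below factors as (x - 12)(x - 18), so its roots are 12 and 18.
poly = x**2 - 30 * x + 216
roots = sp.solve(sp.Eq(poly, 0), x)            # [12, 18]

# Skill B (learned in isolation): compute the GCD of two integers.
# The composed task asks for the GCD of the polynomial's roots.
answer = sp.gcd(int(roots[0]), int(roots[1]))  # gcd(12, 18) = 6

print(f"Roots: {roots}, composed answer (GCD of roots): {answer}")
```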
3. Mathematical Domains and Methodology
OMEGA’s problem space includes six broad mathematical domains:
- Arithmetic: GCD, prime factorization, matrix rank determination.
- Algebra: Equation solving, polynomial roots, function intersection, area calculations.
- Combinatorics: Letter arrangement enumeration, substring matching, derangements.
- Number Theory: Digital sums, modular arithmetic, prime decomposition.
- Geometry: Counting subfigures (e.g., rectangles), polygons, tangencies, symmetry group actions.
- Logic and Puzzles: Pathfinding on grids, logic games, pattern deduction.
Each domain’s template is parameterized (e.g., by polygon size, polynomial degree, matrix order), with a complexity measure (e.g., number of required reasoning steps). All instances are synthesized via code—symbolic for algebra/number theory, numeric for arithmetic, and computer-vision for some combinatorial/geometry problems (e.g., cv2.approxPolyDP for polygon recognition).
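The benchmark's actual vision pipeline is not shown in the source; the following is a minimal sketch of how cv2.approxPolyDP can recover a polygon's side count from a synthetically rendered binary image, the kind of check the geometry generators rely on.

```python
import cv2
import numpy as np

# Render a filled regular hexagon into a blank binary image.
img = np.zeros((256, 256), dtype=np.uint8)
center, radius, sides = (128, 128), 100, 6
angles = np.linspace(0, 2 * np.pi, sides, endpoint=False)
pts = np.stack([center[0] + radius * np.cos(angles),
                center[1] + radius * np.sin(angles)], axis=1).astype(np.int32)
cv2.fillPoly(img, [pts], 255)

# Recover the vertex count: extract the outer contour, simplify it with
# approxPolyDP, and count the points in the approximation.
contours, _ = cv2.findContours(img, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
contour = max(contours, key=cv2.contourArea)
epsilon = 0.02 * cv2.arcLength(contour, True)      # tolerance: 2% of perimeter
approx = cv2.approxPolyDP(contour, epsilon, True)  # closed curve
print(f"Detected polygon with {len(approx)} sides")  # expected: 6
```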
4. Experimental Evaluation and Findings
The OMEGA benchmark was used to evaluate the reasoning performance of several top-tier LLMs, including DeepSeek-R1, Claude 3.7 Sonnet, OpenAI o3-mini, o4-mini, and the Qwen-series. The methodology involved training on a controlled subset and evaluating on the OOD regime defined by each generalization axis.
Key findings include:
- Sharp accuracy drop with complexity: For all models, as the complexity parameter increases (e.g., number of reasoning steps, combinatorial branching), exact-match accuracy drops from near perfect to near zero. This holds even for models that excel on standard Olympiad benchmarks.
- Overthinking and spiral errors: Chain-of-thought traces typically reveal “overthinking to error” (models revise correct initial answers to wrong ones) and recursive spiraling without convergence, especially as complexity grows.
- Computation scaling limits: While increasing sampling at inference time (pass@k metrics) somewhat improves performance at moderate complexity, it fails to counteract reasoning breakdown at high complexity, indicating a fundamental limitation not addressable by brute-force ensemble approaches (a sketch of the standard pass@k estimator follows this list).
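The evaluation harness is not described in detail in the source; pass@k numbers in such studies are typically computed with the unbiased estimator of Chen et al. (2021), sketched below under the assumption that OMEGA follows this standard convention.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples drawn per problem, c: number of correct samples,
    k: inference budget. Returns the probability that at least one of k
    samples drawn without replacement from the n is correct.
    """
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable product.
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem, 15 correct.
print(pass_at_k(200, 15, 1))   # ≈ 0.075
print(pass_at_k(200, 15, 16))  # higher, but still bounded by reasoning quality
```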
For the Qwen-series, RL fine-tuning with the GRPO algorithm (its advantage computation is sketched after this list) yielded:
- Remarkable improvements in exploratory generalization: RL training on logic and arithmetic templates dramatically lifted accuracy at both in-distribution and out-of-distribution complexity levels (e.g., up to +61% in ID and +53% in OOD for Logic Zebra).
- Limited compositional gains: RL fine-tuning provided benefit on skill integration only when the training composition was semantically similar to evaluation; otherwise, integration of disparate skills remained a bottleneck.
- Minimal transformative benefit: RL improved in-distribution proficiency, but left performance on transformative-generalization (novel strategy) tasks at zero.
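GRPO is not re-derived in the source; the sketch below shows the group-relative advantage computation at the core of the algorithm, with a hypothetical exact-match reward, to make concrete what signal the RL fine-tuning optimizes.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO advantage: normalize each sampled response's reward against the
    mean and standard deviation of its own group (samples for the same prompt),
    removing the need for a learned value/critic network."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical group of 8 sampled answers to one problem, scored by exact match.
group_answers = ["42", "41", "42", "7", "42", "42", "13", "42"]
gold = "42"
rewards = np.array([1.0 if a == gold else 0.0 for a in group_answers])
advantages = group_relative_advantages(rewards)
# Correct samples receive positive advantage, incorrect ones negative; these
# advantages weight the policy-gradient update for each response's tokens.
print(advantages.round(3))
```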
5. Implications and Future Perspectives
OMEGA’s results highlight that:
- LLMs excel at mechanical proficiency (interpolative and low-complexity generalization) but exhibit entrenched brittleness once required to integrate multiple skills or depart from scripted solution strategies.
- Fine-tuning and scaling amplify proficiency, not creativity: While RL boosts accuracy for more complex interpolations, it does not substantially improve compositional synergy or foster new problem-solving approaches.
- Entrenchment risk: RL can further entrench suboptimal strategies, decreasing flexibility and generalization to transformative tasks, a phenomenon evident in some ablation results.
- Skill modularity and meta-reasoning are open challenges: There is a clear gap between LLMs' skill acquisition and their ability to dynamically combine, adapt, and generalize those skills as humans would.
A plausible implication is that true mathematical creativity—mirroring the exploratory, compositional, and especially the transformative aspects identified by Boden—will require new architectures, compositional curricula, and potentially meta-cognitive controllers that can recognize failure of current strategies and actively switch to new paradigms.
6. Illustrative Problem Templates and Complexity Metrics
OMEGA’s architecture enables granular analysis by directly controlling problem complexity and solution method. For example:
- Polygon rectangle counting (exploratory): increasing the number of polygon sides tests whether the counting logic learned on smaller polygons generalizes to larger ones.
- Composite skill task (compositional): a single problem chains separately trained skills, such as finding a polynomial's roots and then extracting the GCD of those roots.
- Transformative task (matrix rank): recognizing a matrix as a sum of two rank-1 matrices (so its rank is at most 2, regardless of its dimension) requires breaking from rote calculation and identifying the underlying structure.
Each problem template includes a fully specified solution function (symbolic, numeric, or visual) and a complexity score, enabling precise experimental design and error attribution.
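To make the rank example concrete, the following sketch (not taken from the benchmark code) builds a matrix as a sum of two rank-1 outer products and confirms numerically that its rank stays at 2 regardless of size, the structural insight the transformative task rewards over rote row reduction.

```python
import numpy as np

def structured_matrix(n: int, seed: int = 0) -> np.ndarray:
    """Build an n x n matrix M = u v^T + w z^T, i.e., a sum of two rank-1 matrices."""
    rng = np.random.default_rng(seed)
    u, v, w, z = (rng.integers(1, 10, size=n).astype(float) for _ in range(4))
    return np.outer(u, v) + np.outer(w, z)

# Row-reducing a 500 x 500 matrix by hand is hopeless, but the structure
# immediately bounds the rank: rank(u v^T + w z^T) <= 2 for any n.
for n in (4, 50, 500):
    print(n, np.linalg.matrix_rank(structured_matrix(n)))  # prints 2 (generically)
```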
Summary Table: OMEGA Axes and LLM Performance
| Axis | Evaluation Focus | RL/Tuning Impact | Typical LLM Outcome |
|---|---|---|---|
| Exploratory | Complexity within known skills | Substantial improvement | Accuracy increases, saturates |
| Compositional | Integration of distinct skills | Marginal improvement | Skill use, little integration |
| Transformative | Adoption of new strategies | No/little improvement | Accuracy remains near zero |
OMEGA provides a rigorous, extensible platform for diagnosing model deficits, quantifying mathematical generalization, and inspiring further advances in genuinely creative AI reasoning for mathematics. It sets a new standard for what it means for LLMs to "reason outside the box," and supplies the granular structure necessary for future research in curriculum design, architectural modularity, and meta-reasoning in mathematical AI.
Resource:
OMEGA benchmark code and data are available at https://github.com/sunblaze-ucb/math_ood. Reference: Sun, Yiyou, et al. "OMEGA: Can LLMs Reason Outside the Box in Math? Evaluating Exploratory, Compositional, and Transformative Generalization." (2025).