OMEGA Benchmark

Updated 1 July 2025
  • OMEGA Benchmark is a suite designed to systematically evaluate the mathematical reasoning capabilities of large language models, focusing on their ability to generalize and exhibit creativity.
  • The benchmark defines and tests three axes of generalization—exploratory, compositional, and transformative—across six mathematical domains to diagnose model limitations beyond simple proficiency.
  • Key findings show state-of-the-art LLMs struggle with integrating skills and generating novel solution strategies, indicating that scaling and fine-tuning primarily enhance mechanical proficiency rather than mathematical creativity.

The OMEGA Benchmark is a suite of systematically designed evaluations that probe the mathematical reasoning capabilities of LLMs, with a focus on diagnosing and quantifying their ability to generalize and exhibit creativity in mathematical problem-solving. Drawing on principles from cognitive science, particularly Boden’s typology of creativity, OMEGA delineates three distinct axes of out-of-distribution generalization—exploratory, compositional, and transformative—and offers a unified experimental framework across six core mathematical domains. The benchmark exposes persistent limitations of state-of-the-art LLMs in moving beyond mechanical proficiency toward genuinely novel mathematical reasoning, while providing a foundation for targeted model development.

1. Benchmark Design and Motivation

OMEGA was created to address the empirical observation that even the highest-performing LLMs, such as DeepSeek-R1 and the Qwen series, display sharp performance drops and noticeably narrow strategy repertoires when encountering problems outside their training distribution. The benchmark was carefully constructed to differentiate between forms of generalization (applying known strategies at increased complexity, integrating previously isolated skills, or formulating new solution methods) by generating programmatically controlled training and test pairs.

The design aims to isolate whether models can:

  • Apply a learned skill to more challenging variants (exploratory generalization),
  • Integrate multiple learned skills into a coherent solution (compositional generalization), or
  • Abandon well-trodden strategies in favor of truly novel ones (transformative generalization).

Problems are sourced from templated generators covering geometry, number theory, algebra, combinatorics, logic, and puzzles, with all solutions verified by symbolic, numerical, or graphical (e.g., OpenCV for geometry) methods to guarantee correctness. Problem templates are parameterized by complexity vectors $\theta$ with explicit complexity scoring functions $\delta(\theta)$, enabling controlled partitioning between training and evaluation sets.
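
As a concrete illustration, the sketch below shows what a parameterized template with an explicit complexity score might look like. The class name, parameter, and scoring rule are hypothetical stand-ins, not OMEGA's actual code.

```python
import math
import random
from dataclasses import dataclass

# Hypothetical sketch of a parameterized problem template with a complexity score;
# the names (GcdTemplate, delta, sample) are illustrative, not the benchmark's API.
@dataclass
class GcdTemplate:
    n_digits: int  # complexity parameter theta: number of digits in each operand

    def delta(self) -> int:
        # Complexity score delta(theta); here, simply the digit count.
        return self.n_digits

    def sample(self, rng: random.Random) -> dict:
        lo, hi = 10 ** (self.n_digits - 1), 10 ** self.n_digits - 1
        a, b = rng.randint(lo, hi), rng.randint(lo, hi)
        return {
            "question": f"Compute gcd({a}, {b}).",
            "answer": str(math.gcd(a, b)),  # ground truth computed programmatically
            "complexity": self.delta(),
        }

rng = random.Random(0)
print(GcdTemplate(n_digits=3).sample(rng))
```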

2. Generalization Axes and Problem Typology

OMEGA operationalizes Boden’s creative typology into three experimentally orthogonal generalization axes:

  1. Exploratory Generalization: Evaluates whether LLMs can generalize a previously learned solution within a domain to higher complexity. For example, after training on counting rectangles in octagons, does performance extrapolate to dodecagons? The evaluation set contains strictly more complex instances than the training set ($\delta_{\text{test}} > \delta_{\text{train}}$).
  2. Compositional Generalization: Assesses whether LLMs can integrate skills learned in isolation. For example, if trained separately on finding polynomial roots and extracting GCDs, can a model solve problems requiring their combined use in new mathematical compositions? Test instances require skill synthesis absent from the training distribution.
  3. Transformative Generalization: Probes whether LLMs can recognize and deploy fundamentally new strategies, especially when familiar tactics become ineffective. For example, LLMs may be trained on problems amenable to exhaustive search but evaluated on instances where symmetry or an algebraic insight renders exhaustive enumeration infeasible, requiring an innovative solution.

Within these axes, templates are designed to ensure no leakage between train and test regimes, and ground-truth solutions are constructed by programmatic means—eliminating label noise and making large-scale benchmarking feasible.
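
A minimal sketch of the complexity-controlled partition behind these regimes is given below; the instance format and the training cutoff are illustrative assumptions rather than the benchmark's implementation.

```python
import math
import random

# Exploratory-generalization split: train only on low-complexity instances and
# evaluate strictly above the cutoff (delta_test > delta_train). Instances are
# (complexity_score, question, answer) triples from an illustrative generator.
def make_instance(rng: random.Random, n_digits: int) -> tuple:
    lo, hi = 10 ** (n_digits - 1), 10 ** n_digits - 1
    a, b = rng.randint(lo, hi), rng.randint(lo, hi)
    return (n_digits, f"Compute gcd({a}, {b}).", str(math.gcd(a, b)))

rng = random.Random(0)
instances = [make_instance(rng, d) for d in range(2, 8) for _ in range(20)]

TRAIN_MAX_DELTA = 4
train_set = [x for x in instances if x[0] <= TRAIN_MAX_DELTA]
test_set = [x for x in instances if x[0] > TRAIN_MAX_DELTA]
assert all(x[0] > TRAIN_MAX_DELTA for x in test_set)  # test set holds only strictly harder instances
```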

3. Mathematical Domains and Methodology

OMEGA’s problem space includes six broad mathematical domains:

  • Arithmetic: GCD, prime factorization, matrix rank determination.
  • Algebra: Equation solving, polynomial roots, function intersection, area calculations.
  • Combinatorics: Letter arrangement enumeration, substring matching, derangements.
  • Number Theory: Digital sums, modular arithmetic, prime decomposition.
  • Geometry: Counting subfigures (e.g., rectangles), polygons, tangencies, symmetry group actions.
  • Logic and Puzzles: Pathfinding on grids, logic games, pattern deduction.

Each domain’s template is parameterized (e.g., by polygon size, polynomial degree, matrix order), with a complexity measure $\delta(\theta)$ (e.g., the number of required reasoning steps). All instances are synthesized via code: symbolic for algebra/number theory, numeric for arithmetic, and computer-vision-based for some combinatorics/geometry problems (e.g., cv2.approxPolyDP for polygon recognition).
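
For the vision-verified geometry templates, the general idea of recovering a figure's structure with cv2.approxPolyDP can be sketched as follows; the synthetic hexagon and the 2% epsilon threshold are illustrative choices, not values taken from the benchmark.

```python
import cv2
import numpy as np

# Draw a filled regular hexagon, then recover its vertex count from the contour
# with cv2.approxPolyDP. This mirrors the idea of checking geometric ground truth
# graphically rather than symbolically.
img = np.zeros((400, 400), dtype=np.uint8)
angles = np.linspace(0, 2 * np.pi, 6, endpoint=False)
pts = np.stack([200 + 150 * np.cos(angles),
                200 + 150 * np.sin(angles)], axis=1).astype(np.int32)
cv2.fillPoly(img, [pts], 255)

contours, _ = cv2.findContours(img, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
perimeter = cv2.arcLength(contours[0], True)
approx = cv2.approxPolyDP(contours[0], 0.02 * perimeter, True)
print("detected vertices:", len(approx))  # expected: 6
```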

4. Experimental Evaluation and Findings

The OMEGA benchmark was used to evaluate the reasoning performance of several top-tier LLMs, including DeepSeek-R1, Claude 3.7 Sonnet, OpenAI o3-mini, o4-mini, and the Qwen-series. The methodology involved training on a controlled subset and evaluating on the OOD regime defined by each generalization axis.

Key findings include:

  • Sharp accuracy drop with complexity: For all models, as the complexity parameter increases (e.g., number of reasoning steps, combinatorial branching), exact-match accuracy drops from near perfect to near zero. This holds even for models that excel on standard Olympiad benchmarks.
  • Overthinking and spiral errors: Chain-of-thought traces typically reveal “overthinking to error” (models revise correct initial answers to wrong ones) and recursive spiraling without convergence, especially as complexity grows.
  • Computation scaling limits: While increasing sampling at inference time (pass@k metrics; see the estimator sketch below) somewhat improves performance at moderate complexity, it fails to counteract reasoning breakdown at high complexity, indicating a fundamental limitation not addressable by brute-force ensemble approaches.
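
The pass@k values referenced above are typically computed with the standard unbiased estimator popularized by code-generation benchmarks; the sketch below shows that estimator with made-up sample counts, and OMEGA's exact metric code may differ.

```python
from math import comb

# Unbiased pass@k estimator: given n sampled solutions of which c are correct,
# estimate the probability that at least one of k random draws is correct.
def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 samples per problem, 3 of them correct.
print(round(pass_at_k(n=16, c=3, k=4), 3))
```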

For the Qwen-series, RL fine-tuning (using the GRPO algorithm, sketched after this list) yielded:

  • Remarkable improvements in exploratory generalization: RL training on logic and arithmetic templates dramatically lifted accuracy at both in-distribution and out-of-distribution complexity levels (e.g., up to +61% ID and +53% OOD for Logic Zebra).
  • Limited compositional gains: RL fine-tuning provided benefit on skill integration only when the training composition was semantically similar to evaluation; otherwise, integration of disparate skills remained a bottleneck.
  • Minimal transformative benefit: RL improved in-distribution proficiency, but left performance on transformative-generalization (novel strategy) tasks at zero.
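
GRPO's central ingredient is a group-relative advantage computed from verifiable rewards over several sampled responses per problem; the minimal sketch below illustrates only that normalization step (reward design, clipping, and the actual policy update are omitted) and is not the fine-tuning code used in these experiments.

```python
import numpy as np

# Group-relative advantage: sample a group of responses for one prompt, score each
# with a verifiable reward (e.g., exact match against the programmatic ground truth),
# and normalize rewards within the group so correct answers receive positive advantage.
def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 6 sampled answers to one problem, 2 of which exactly match the ground truth.
rewards = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 0.0])
print(group_relative_advantages(rewards))
```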

5. Implications and Future Perspectives

OMEGA’s results highlight that:

  • LLMs excel at mechanical proficiency (interpolative and low-complexity generalization) but exhibit entrenched brittleness once required to integrate multiple skills or depart from scripted solution strategies.
  • Fine-tuning and scaling amplify proficiency, not creativity: While RL boosts accuracy for more complex interpolations, it does not substantially improve compositional synergy or foster new problem-solving approaches.
  • Entrenchment risk: RL can further entrench suboptimal strategies, decreasing flexibility and generalization to transformative tasks, a phenomenon evident in some ablation results.
  • Skill modularity and meta-reasoning are open challenges: There is a clear gap between LLMs' skill acquisition and their ability to dynamically combine, adapt, and generalize those skills as humans would.

A plausible implication is that true mathematical creativity—mirroring the exploratory, compositional, and especially the transformative aspects identified by Boden—will require new architectures, compositional curricula, and potentially meta-cognitive controllers that can recognize failure of current strategies and actively switch to new paradigms.

6. Illustrative Problem Templates and Complexity Metrics

OMEGA’s architecture enables granular analysis by directly controlling problem complexity and solution method. For example:

  • Polygon rectangle counting (exploratory): count the rectangles in a regular $n$-gon. Increasing $n$ tests the ability to generalize the counting logic to higher complexity.

  • Composite skill task (compositional): given $f(x) = x^2 - 6x + 8$ and $g(x) = x^2 - 5x + 6$, compute $\gcd(f(x), g(x))$ and then find the integer roots (worked through in the sketch after this list).

  • Transformative task (matrix rank): determine the rank of the parity matrix
    $$E_n = [e_{ij}], \qquad e_{ij} = \begin{cases} 1, & i + j \text{ even} \\ 0, & \text{otherwise}. \end{cases}$$
    Recognizing $E_n$ as the sum of two rank-1 matrices (so $\mathrm{rank}(E_n) = 2$ for $n \geq 2$) requires breaking from rote calculation and identifying an underlying structure.

Each problem template includes a fully specified solution function (symbolic, numeric, or visual) and a complexity score, enabling precise experimental design and error attribution.
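
Both examples can be checked directly with off-the-shelf libraries; the sketch below verifies the polynomial gcd/root composition with SymPy and the rank-2 structure of $E_n$ with NumPy, as an illustrative check rather than the benchmark's own solution functions.

```python
import numpy as np
import sympy as sp

# Compositional example: gcd of the two quadratics, then its integer roots.
x = sp.symbols("x")
f, g = x**2 - 6*x + 8, x**2 - 5*x + 6
d = sp.gcd(f, g)                  # x - 2
print(d, sp.solve(d, x))          # integer root: 2

# Transformative example: the parity matrix E_n has rank 2 for n >= 2.
n = 7
E = np.fromfunction(lambda i, j: ((i + j) % 2 == 0).astype(float), (n, n))
print(np.linalg.matrix_rank(E))   # 2
```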


Summary Table: OMEGA Axes and LLM Performance

| Axis | Evaluation Focus | RL/Tuning Impact | Typical LLM Outcome |
| --- | --- | --- | --- |
| Exploratory | Complexity within known skills | Substantial improvement | Accuracy increases, then saturates |
| Compositional | Integration of distinct skills | Marginal improvement | Skills used, little integration |
| Transformative | Adoption of new strategies | No/little improvement | Accuracy remains near zero |

OMEGA provides a rigorous, extensible platform for diagnosing model deficits, quantifying mathematical generalization, and inspiring further advances in genuinely creative AI reasoning for mathematics. It sets a new standard for what it means for LLMs to "reason outside the box," and supplies the granular structure necessary for future research in curriculum design, architectural modularity, and meta-reasoning in mathematical AI.

Resource:

OMEGA benchmark code and data are available at https://github.com/sunblaze-ucb/math_ood. Reference: Sun, Yiyou, et al. "OMEGA: Can LLMs Reason Outside the Box in Math? Evaluating Exploratory, Compositional, and Transformative Generalization." (2025).