OMEGA: OOD Math Problems Evaluation

Updated 30 June 2025
  • OMEGA is a benchmark that systematically evaluates large language models using controlled, programmatically generated math problems covering exploratory, compositional, and transformative generalization.
  • It employs principled train-test splits and automatic verification methods to rigorously quantify limits in mathematical reasoning.
  • Empirical findings reveal significant performance drops in compositional and transformative tasks, underscoring the need for new training paradigms.

OMEGA (Out-of-Distribution Math Problems Evaluation) is a systematic benchmark for quantifying and diagnosing the limits of LLMs in mathematical reasoning beyond conventional in-distribution generalization. OMEGA introduces controlled, programmatically generated math problems and curated train-test splits to test three explicit axes of out-of-distribution (OOD) generalization: exploratory, compositional, and transformative. The benchmark is grounded in ideas from creativity studies and is designed to expose whether LLMs can extend, compose, or shift reasoning strategies when faced with unfamiliar or escalated mathematical challenges.

1. Controlled Benchmark Structure and Purpose

OMEGA is constructed to move beyond canonical math evaluation sets by providing programmatic, template-driven generation of problems spanning arithmetic, algebra, combinatorics, number theory, geometry, logic, and puzzles. Unlike most benchmarks, which mix problem types or rely on ad hoc OOD divisions, OMEGA assigns problems to train and test splits based on principled generalization criteria:

  • Exploratory Generalization: Test items increase complexity within a problem template that was seen in training (e.g., higher-order combinatorial counts, larger polytopes).
  • Compositional Generalization: Test samples require a coherent integration of two or more distinct skills that were only encountered in isolation during training.
  • Transformative Generalization: Test items are structurally crafted such that solution via previously learned strategies is ineffective or infeasible, mandating the discovery of new, often unconventional, approaches.

Problems, answers, and reasoning chains are created by combining parameterized templates with solution verification routines (symbolic, numerical, or graphical), ensuring that the train-test splits are both rigorous and automatically checkable.
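To make this template-plus-verifier design concrete, the sketch below builds a toy rectangle-counting template with a programmatic ground truth and an exploratory-style split by complexity. The class names, template, and split parameters are illustrative assumptions, not code from the OMEGA release.

```python
# Illustrative sketch only: a toy parameterized template with a programmatic
# ground truth and an exploratory-style train/test split by complexity.
# Names and parameters are assumptions, not the OMEGA release's code.
from dataclasses import dataclass
from math import comb

@dataclass
class ProblemInstance:
    prompt: str
    answer: int
    complexity: int  # parameter used to assign train vs. test splits

def rectangle_count_template(n: int) -> ProblemInstance:
    """Count rectangles whose four vertices are vertices of a regular n-gon (n even).

    Such rectangles correspond to pairs of the n/2 long diagonals through the
    center, so the verifiable ground truth is C(n/2, 2).
    """
    return ProblemInstance(
        prompt=f"How many rectangles have all four vertices among the vertices of a regular {n}-gon?",
        answer=comb(n // 2, 2),
        complexity=n,
    )

def exploratory_split(train_max_n: int = 8, test_max_n: int = 12):
    """Training uses small n; test items use strictly larger n (same template)."""
    train = [rectangle_count_template(n) for n in range(4, train_max_n + 1, 2)]
    test = [rectangle_count_template(n) for n in range(train_max_n + 2, test_max_n + 1, 2)]
    return train, test

train, test = exploratory_split()
assert all(t.complexity > max(p.complexity for p in train) for t in test)
```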

2. Three Axes of Out-of-Distribution Generalization

The generalization axes are defined as follows:

  1. Exploratory Generalization
    • Measures the ability to apply familiar strategies to more complex, but structurally similar, problems.
    • Example: Training on rectangle-counting in polygons up to n = 8 and testing on polygons with n = 12 vertices.
    • Isolates whether scaling up complexity within a learned domain is handled robustly.
  2. Compositional Generalization
    • Assesses whether models can integrate distinct, previously isolated reasoning skills to solve new composite tasks.
    • Example: Training separately on (A) polynomial root finding and (B) GCD computation, then testing on problems requiring GCD of polynomials followed by root finding.
    • Evaluates if true synthesis of skills can emerge from isolated exposure.
  3. Transformative Generalization
    • Tests for qualitative shifts in problem-solving—models must abandon rote tactics and invent new solutions.
    • Example: Training on routine polynomial equations but testing with a structure that mandates new substitutions (e.g., leveraging symmetry via x = t + a/t).
    • Targeted at uncovering the capacity for “creative” leaps beyond procedural patterns.

All three axes are defined operationally, with test cohorts specifically constructed so success cannot be achieved by mere interpolation or extension of the training distribution.
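To ground the compositional and transformative examples above, the following sketch uses sympy with polynomials of my own choosing (illustrative, not drawn from the benchmark): it first chains the two isolated skills from the compositional example, then shows the symmetry substitution from the transformative example collapsing a palindromic quartic into a quadratic.

```python
# Hedged illustration with made-up polynomials (sympy); not benchmark items.
from sympy import symbols, expand, gcd, solve

x, t = symbols('x t')

# Compositional: skill A (polynomial GCD) followed by skill B (root finding).
p = expand((x - 2) * (x + 1) * (x - 5))   # x**3 - 6*x**2 + 3*x + 10
q = expand((x - 2) * (x + 1) * (x + 3))   # x**3 + 2*x**2 - 5*x - 6
g = gcd(p, q)                             # x**2 - x - 2
print(solve(g, x))                        # [-1, 2]

# Transformative: the palindromic quartic x^4 - 5x^3 + 8x^2 - 5x + 1 = 0,
# divided by x^2, becomes (x^2 + 1/x^2) - 5(x + 1/x) + 8 = 0; with t = x + 1/x
# this is t^2 - 5t + 6 = 0, a quadratic instead of a quartic.
print(sorted(solve(t**2 - 5*t + 6, t)))            # [2, 3]
print(solve(x**4 - 5*x**3 + 8*x**2 - 5*x + 1, x))  # includes x = 1
```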

3. Problem Generation and Automatic Verification

OMEGA employs over 40 formal problem-generation templates encompassing a wide and diverse range of mathematical content. Each problem is instantiated with random parameters under domain constraints, and solutions are validated with computer algebra, numeric computation, or geometric reasoning algorithms:

  • Exploratory: Parameters (e.g., n in an n-gon, the degree of equations) are varied to increase difficulty for test items.
  • Compositional: Templates are assembled to combine skill requirements, ensuring that test items cannot be solved by naively applying a single learned tactic.
  • Transformative: Intractable cases (where usual algorithms fail or become computationally prohibitive) are programmatically detected, and “shortcut” approaches are embedded in the solution logic.

This programmatic approach enables precise, large-scale studies of generalization and avoids contamination from pretraining on public math datasets.
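A minimal sketch of what such an automatic verifier can look like is shown below, assuming a sympy-based symbolic check with a numeric fallback; the function names and interface are assumptions, not the OMEGA implementation.

```python
# Assumed sketch of an automatic verifier (not the OMEGA implementation):
# parse a candidate answer, then check it against the generated ground truth
# symbolically, falling back to agreement at random sample points.
import random
from sympy import sympify, simplify, Symbol

x = Symbol('x')

def verify_symbolic(candidate: str, reference: str) -> bool:
    """Exact check: the difference of the two expressions simplifies to zero."""
    return simplify(sympify(candidate) - sympify(reference)) == 0

def verify_numeric(candidate: str, reference: str,
                   trials: int = 20, tol: float = 1e-9) -> bool:
    """Looser check: the expressions agree at random sample points."""
    c, r = sympify(candidate), sympify(reference)
    for _ in range(trials):
        v = random.uniform(-5, 5)
        if abs(complex(c.subs(x, v)) - complex(r.subs(x, v))) > tol:
            return False
    return True

print(verify_symbolic("(x + 1)**2", "x**2 + 2*x + 1"))   # True
print(verify_numeric("sin(x)**2 + cos(x)**2", "1"))      # True
```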

4. Empirical Findings: LLM Performance Across the Three Axes

Comprehensive evaluation of state-of-the-art LLMs, including DeepSeek-R1 and the Qwen-series, reveals a systematic erosion of performance as OOD complexity increases:

  • Exploratory Generalization: Models transfer partially, but performance degrades as complexity grows; the benchmark's finer granularity reveals larger drops at high complexity than general math benchmarks had previously indicated.
  • Compositional Generalization: While fine-tuning and reinforcement learning improve accuracy on the isolated skills, models struggle to integrate those skills to solve test cases requiring both, indicating the absence of robust skill synthesis.
  • Transformative Generalization: Models frequently fail to discover or adopt new, more efficient strategies when standard methods are ineffective, with little or no improvement even after exposure to challenging or richer data.

Analysis of chain-of-thought traces shows that, for harder (especially OOD) problems, correct intermediate steps are often followed by incorrect “overthinking” that overwrites them, and models increasingly skip systematic computation in favor of guessing as complexity rises.

5. Effects of Reinforcement Learning and Fine-Tuning

Fine-tuning (including GRPO-based reinforcement learning) on Qwen-series models yields significant improvements for exploratory tasks and isolated skill targets. However, these gains stall for compositional and transformative settings:

  • Exploratory: Gains are substantial up to moderate complexity but saturate at a ceiling determined by the complexity gap from seen data.
  • Compositional: Only marginal improvement is observed in the integration of isolated skills; most skill pairings show no systematic gain.
  • Transformative: Performance on emergent-solution tasks remains largely unaffected or even declines, confirming the limitation of current training paradigms for fostering genuine creative reasoning.

This suggests that robust OOD creative behavior is not a simple consequence of more data or curriculum-based reinforcement, possibly necessitating fundamentally new training or architectural strategies.
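For context on how verifiable rewards plug into such fine-tuning, the sketch below shows a generic GRPO-style outcome reward with group-normalized advantages; it is a hedged illustration of the general recipe, not the paper's training code, and the verifier interface (a plain string comparison) is an assumption for brevity.

```python
# Hedged sketch, not the paper's training code: a generic GRPO-style setup on
# verifiable math reduces to a binary outcome reward from the benchmark's
# verifier plus group-normalized advantages over samples for the same prompt.
from statistics import mean, pstdev

def outcome_reward(model_answer: str, reference_answer: str) -> float:
    """1.0 if the sampled answer verifies against the ground truth, else 0.0."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style group-relative advantages: (r - mean) / std within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # avoid division by zero for uniform groups
    return [(r - mu) / sigma for r in rewards]

# Four sampled completions for one prompt, two of which verify as correct.
rewards = [outcome_reward(a, "6") for a in ["6", "8", "6", "7"]]
print(group_advantages(rewards))  # [1.0, -1.0, 1.0, -1.0]
```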

6. Implications for Benchmark Development and Mathematical Creativity

OMEGA sets out a blueprint for future OOD math benchmarks and evaluation protocols:

  • Emphasizing contamination-free, template-driven, and programmatically verified test sets to guarantee precise OOD assessment.
  • Structuring test splits according to explicit generalization axes, not only to diagnose aggregate model performance but to isolate and quantify the boundaries of LLM creativity.
  • Exposing fundamental gaps in compositional and transformative reasoning, which remain unresolved even in frontier models and transfer learning regimes.
  • Providing a scalable, customizable foundation that is robust to future advances in both LLM scale and training procedure.

A plausible implication is that, going forward, true advancements in LLM mathematical creativity will require directed architectural innovations, training regimes, or meta-reasoning mechanisms that target the compositional and transformative axes directly, rather than simply scaling data or reinforcement learning on routine skills.

7. Summary Table: OMEGA Generalization Axes and Model Performance

Axis | Test Set Construction | OOD Performance | Effect of RL/Fine-tuning
Exploratory | Same skill, higher complexity | Moderate degradation, growing with complexity; ceiling observed | Significant gains up to a complexity threshold
Compositional | Novel skill combinations | Severe degradation | Isolated gains; little or no composition
Transformative | Efficient new solution required | Sharpest drop | No meaningful improvement

References:

  • OMEGA: Can LLMs Reason Outside the Box in Math? Evaluating Exploratory, Compositional, and Transformative Generalization. arXiv:2506.18880.
  • Boden, M. A. (1998). Creativity and artificial intelligence.
  • OMEGA code and data release: https://github.com/sunblaze-ucb/math_ood