CHIMERA: Synthetic Reasoning Dataset for LLMs
- The paper introduces a fully automated LLM-driven pipeline that performs subject expansion, problem generation, and solution synthesis to create high-quality reasoning data.
- The dataset comprises 9,225 samples with extensive chain-of-thought trajectories (average solution length over 11K tokens) covering 8 disciplines.
- CHIMERA enables a 4B-parameter model to achieve performance comparable to models up to 235B parameters by addressing data cold-start and annotation bottlenecks.
CHIMERA is a compact, synthetic reasoning dataset designed to advance generalizable, cross-domain reasoning in LLMs. Comprising 9,225 samples, each with long-form chain-of-thought (CoT) trajectories, it addresses major data-centric challenges such as the cold-start problem in seed data, limited subject coverage in existing open datasets, and the annotation bottleneck for high-level reasoning. Its pipeline is fully automated, leveraging multiple state-of-the-art LLMs for generation, validation, and verification, with no human annotation. Despite its modest size, CHIMERA effectively boosts the reasoning capabilities of 4B-parameter models to levels close to models two orders of magnitude larger, as evidenced by strong benchmark results (Zhu et al., 1 Mar 2026).
1. Synthetic Dataset Construction Pipeline
CHIMERA’s generation process consists of a fully automated three-stage pipeline, with all stages handled by LLM APIs. The pipeline executes as follows:
Pipeline Stages
- Subject Expansion: Starting from 8 coarse-grained disciplines (Mathematics, Physics, Chemistry, Biology, Computer Science, Literature, History, Linguistics), gpt-5 expands the subjects into a combined total of 1,179 unique, fine-grained topics using prompt-based sampling.
- Problem Generation: For each topic, gpt-5 is prompted to generate PhD-level problems and reference answers. Each (question, answer) pair is string-deduplicated.
- Solution Synthesis: Qwen3-235B-A22B-Thinking-2507 generates detailed CoT solutions. Validation is performed by two independent LLMs (gpt-5, o4-mini), filtering for well-posedness and correctness.
Algorithmic Specification
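The paper formalizes the synthesis pipeline as pseudocode (Algorithm 1), chaining the three stages above: subject expansion, problem generation with string-level deduplication, and solution synthesis followed by validation.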
A two-level taxonomy is generated entirely by LLM prompts. No external clustering or manual curation is performed.
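The three stages can be sketched as a small driver loop. This is a minimal illustration rather than the authors' Algorithm 1: the callables `expand`, `generate`, and `solve` are hypothetical stand-ins for the LLM API calls, the `Sample` fields are assumed names, and whitespace stripping before deduplication is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Hypothetical wrappers around LLM API calls (names are assumptions).
ExpandFn = Callable[[str], list[str]]                      # subject -> topics
GenerateFn = Callable[[str, str], list[tuple[str, str]]]   # (subject, topic) -> (question, answer) pairs
SolveFn = Callable[[str], str]                             # question -> chain-of-thought solution


@dataclass(frozen=True)
class Sample:
    subject: str
    topic: str
    question: str
    answer: str
    solution: str


def build_dataset(subjects: Iterable[str], expand: ExpandFn,
                  generate: GenerateFn, solve: SolveFn) -> list[Sample]:
    seen: set[tuple[str, str]] = set()
    samples: list[Sample] = []
    for subject in subjects:
        # Stage 1: subject expansion; dict.fromkeys drops repeated topics, keeping order.
        for topic in dict.fromkeys(expand(subject)):
            # Stage 2: problem generation with exact-string deduplication.
            for question, answer in generate(subject, topic):
                key = (question.strip(), answer.strip())
                if key in seen:
                    continue
                seen.add(key)
                # Stage 3: solution synthesis for each surviving problem.
                samples.append(Sample(subject, topic, question, answer, solve(question)))
    return samples
```

Passing the model calls in as plain functions keeps the sketch independent of any particular API client and makes it easy to exercise with stubs.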
2. Dataset Structure and Characteristics
CHIMERA’s design rests on three central properties: rich chain-of-thought trajectories, broad and structured subject coverage, and rigorous, fully automated quality control.
Chain-of-Thought Depth
- Solution length: Mean 11,121 tokens per solution (substantially longer than GSM8K's 51.7, as listed in the comparison table).
- Prompt length: Average 211 words, supporting extended multi-step reasoning processes.
Domain Coverage
- Disciplines: 8 high-level fields.
- Topics: 1,179 fine-grained model-generated topics.
- Distribution: Mathematics constitutes 48.3% of the data.
Data Format
All records adhere to a standardized JSON schema.
Comparative Dataset Metrics
| Dataset | # Problems | # Subjects | Avg. prompt length | Avg. solution length |
|---|---|---|---|---|
| GSM8K | 7,473 | 1 | 45.1 | 51.7 |
| MATH | 7,500 | 1 | 33.0 | 89.5 |
| OpenScience | 315,579 | — | 76.1 | 1,296.8 |
| CHIMERA | 9,225 | 8 | 211.1 | 11,121.4 |
Quality control is enforced strictly through consensus of LLM validators and verifiers.
3. Automated Quality Assurance and Contamination Control
Validation and verification of all entries in CHIMERA are performed via an LLM-based cross-model protocol:
- Problem validity: Both gpt-5 and o4-mini verdicts of "VALID" are mandatory.
- Solution correctness: o4-mini serves as the answer extractor and judge, flagging "correct: yes/no."
- Sample retention: Only samples with dual VALID validation and correct CoT pass the filters; otherwise, they are discarded or reserved for RL training.
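The retention rule can be sketched as a small routing function. This is an illustrative sketch under assumptions: the validators and judge are hypothetical callables standing in for the gpt-5 and o4-mini calls, and the split between discarded samples and the RL pool (invalid problems are dropped, valid problems with an incorrect solution go to RL) is one reading of the description, not a confirmed detail.

```python
from typing import Callable, Sequence

Validator = Callable[[str, str], str]   # (question, answer) -> verdict string, e.g. "VALID"
Judge = Callable[[str, str], bool]      # (solution, reference answer) -> True if judged correct


def route_sample(question: str, answer: str, solution: str,
                 validators: Sequence[Validator], judge: Judge) -> str:
    """Return 'sft', 'rl_pool', or 'discard' for one generated sample."""
    # Every validator must return "VALID" for the problem to be kept at all.
    if not all(v(question, answer) == "VALID" for v in validators):
        return "discard"
    # Valid problem with a solution judged correct: usable for supervised fine-tuning.
    if judge(solution, answer):
        return "sft"
    # Assumption: valid but unsolved problems are held back for RL training.
    return "rl_pool"
```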
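Mathematically, writing $q$, $a$, and $s$ for a sample's problem, reference answer, and CoT solution, the filter can be written as:
$$\text{keep}(q, a, s) = \mathbb{1}\big[V_{\text{gpt-5}}(q, a) = \text{VALID}\big] \cdot \mathbb{1}\big[V_{\text{o4-mini}}(q, a) = \text{VALID}\big] \cdot \mathbb{1}\big[C_{\text{o4-mini}}(s, a) = \text{yes}\big]$$
where $V$ denotes a validator's verdict on the problem and $C$ the correctness judgment on the solution.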
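Contamination is assessed using Jaccard n-gram overlap scores:
$$J_n(A, B) = \frac{|G_n(A) \cap G_n(B)|}{|G_n(A) \cup G_n(B)|}$$
where $G_n(\cdot)$ extracts the set of distinct $n$-grams from a text. Overlap with major public benchmarks is essentially zero.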
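A Jaccard n-gram overlap score of this kind can be computed in a few lines. The sketch below assumes lowercased whitespace tokenization and a default of n = 3; neither choice is stated in the source, and both are for illustration only.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Distinct word-level n-grams (assumed tokenization: lowercase, split on whitespace)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_overlap(a: str, b: str, n: int = 3) -> float:
    """Size of the n-gram intersection divided by the size of the n-gram union."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    union = grams_a | grams_b
    # Texts shorter than n tokens have no n-grams; treat that case as zero overlap.
    return len(grams_a & grams_b) / len(union) if union else 0.0
```

A score near 0 for every pair of dataset problem and benchmark item indicates little shared surface text, while a score near 1 flags a near-duplicate.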
4. Model Post-Training and Empirical Performance
CHIMERA is applied to post-train Qwen3-4B-Thinking-2507 in two stages:
- A) Supervised Fine-Tuning (SFT):
- All samples with verified-correct solutions
- Batch size 256, learning rate , 5K steps to convergence
- B) Reinforcement Learning (CISPO protocol):
- One epoch mixing SFT set and unsolved problems
- Batch size 256, learning rate , 8 rollouts per prompt
- Reward signal from o4-mini
Evaluation settings: Decoding uses temperature 0.6, top-p 0.95, and top-k 20; multiple solutions are sampled per prompt, and pass@1 is reported.
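When several solutions are sampled per prompt, pass@1 is commonly estimated as the fraction of correct samples, averaged over prompts. The sketch below uses the standard unbiased pass@k estimator, which reduces to c/n for k = 1; the source does not state which estimator or how many samples were used, so both are assumptions here.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_1(results: list[list[bool]]) -> float:
    """Average pass@1 over prompts; each inner list holds per-sample correctness flags."""
    return sum(pass_at_k(len(r), sum(r), 1) for r in results) / len(results)
```

For example, a prompt with 2 correct answers out of 8 samples contributes 0.25 to the average.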
Benchmark Performance
| Model | Params | GPQA | AIME24 | AIME25 | AIME26 | HMMT25 | HMMTNov | HLE |
|---|---|---|---|---|---|---|---|---|
| Qwen3-4B-Thinking-2507 | 4B | 65.8 | 81.6 | 81.0 | 80.8 | 59.2 | 57.3 | 7.3 |
| + OpenScience SFT | 4B | 53.5 | 61.7 | 53.3 | 53.0 | 40.0 | 36.9 | 4.6 |
| + CHIMERA (SFT+RL) | 4B | 70.1 | 86.9 | 80.7 | 82.7 | 65.7 | 67.0 | 9.0 |
| DeepSeek-R1 | 671B | 81.0 | 91.4 | 87.5 | — | 79.4 | — | 17.7 |
| Qwen3-235B-A22B-Thinking | 235B | 81.1 | — | 92.3 | — | 83.9 | — | 18.2 |
Key observations:
- CHIMERA post-training (the SFT+RL row in the table) yields improvements of up to +9.7 percentage points over the base model on held-out reasoning tasks (HMMTNov: 57.3 to 67.0).
- SFT on OpenScience degrades performance, plausibly due to mismatched problem formats.
- CHIMERA enables a 4B-parameter model to narrow the gap to models of up to 235B parameters on advanced benchmarks, although the larger models still lead on every benchmark for which the table reports their scores.
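The per-benchmark gains can be checked directly from the table. The short script below copies the base and CHIMERA (SFT+RL) rows and computes the differences in percentage points; the variable names are illustrative.

```python
benchmarks = ["GPQA", "AIME24", "AIME25", "AIME26", "HMMT25", "HMMTNov", "HLE"]
base = [65.8, 81.6, 81.0, 80.8, 59.2, 57.3, 7.3]      # Qwen3-4B-Thinking-2507
chimera = [70.1, 86.9, 80.7, 82.7, 65.7, 67.0, 9.0]   # + CHIMERA (SFT+RL)

# Difference in percentage points, rounded to one decimal to match the table precision.
deltas = {name: round(c - b, 1) for name, b, c in zip(benchmarks, base, chimera)}
best = max(deltas, key=deltas.get)

for name, delta in deltas.items():
    print(f"{name:8s} {delta:+.1f}")
print("largest gain:", best)
```

The largest gain is on HMMTNov (+9.7), while AIME25 is the only benchmark with a slight decrease (-0.3).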
5. Analysis, Data Quality, and Future Prospects
Difficulty Gradient
The base Qwen3-4B model reaches 88% accuracy on other synthetic sets (DAPO-Math-17K, DeepMath-103K, OpenScience) but only 37.5% on CHIMERA, indicating that CHIMERA problems are substantially harder and leave considerable room for the model to learn.
Quality Assessment
Blind ranking by o4-mini and gemini-2.5-pro (200-sample comparison) shows LLM-generated problems in CHIMERA are, on average, superior to human-authored Humanity’s Last Exam samples in clarity, difficulty, and solvability.
Known Limitations
- Imbalanced subject representation: Mathematics constitutes nearly half of the dataset.
- Problem format: Exclusively free-response; no multiple-choice or multimodal content.
- Automated validation only: Minor error detection failures may persist, as no human validation is performed.
Planned Directions
Proposed enhancements include:
- Subject expansion to social sciences, engineering, and interdisciplinary areas
- Human-in-the-loop annotation for sparse subfields
- Inclusion of adversarial “hard negatives” for robustness
- Hierarchical curriculum structures
- Release of subject–topic embeddings to facilitate further research and cluster analysis
CHIMERA demonstrates that a compact synthetic dataset, coupled with rigorous LLM-driven filtering and long-form reasoning supervision, suffices to bootstrap robust reasoning capacities in mid-scale LLMs, yielding performance that rivals much larger contemporaries (Zhu et al., 1 Mar 2026).