CHIMERA: Synthetic Reasoning Dataset for LLMs

Updated 1 May 2026

The paper introduces a fully automated LLM-driven pipeline that performs subject expansion, problem generation, and solution synthesis to create high-quality reasoning data.
The dataset comprises 9,225 samples with extensive chain-of-thought trajectories (average solution length over 11K tokens) covering 8 disciplines.
CHIMERA enables a 4B-parameter model to achieve performance comparable to models up to 235B parameters by addressing data cold-start and annotation bottlenecks.

CHIMERA is a compact, synthetic reasoning dataset designed to advance generalizable, cross-domain reasoning in LLMs. Comprising 9,225 samples, each with long-form chain-of-thought (CoT) trajectories, it addresses major data-centric challenges such as the cold-start problem in seed data, limited subject coverage in existing open datasets, and the annotation bottleneck for high-level reasoning. Its pipeline is fully automated, leveraging multiple state-of-the-art LLMs for generation, validation, and verification, with no human annotation. Despite its modest size, CHIMERA effectively boosts the reasoning capabilities of 4B-parameter models to levels close to models two orders of magnitude larger, as evidenced by strong benchmark results (Zhu et al., 1 Mar 2026).

1. Synthetic Dataset Construction Pipeline

CHIMERA’s generation process consists of a fully automated three-stage pipeline, with all stages handled by LLM APIs. The pipeline executes as follows:

Pipeline Stages

Subject Expansion: Starting from 8 coarse-grained disciplines (Mathematics, Physics, Chemistry, Biology, Computer Science, Literature, History, Linguistics), gpt-5 expands the subjects into a total of 1,179 unique, fine-grained topics using prompt-based sampling.
Problem Generation: For each topic, gpt-5 is prompted to generate PhD-level problems and reference answers. The resulting (question, answer) pairs are deduplicated by exact string match.
Solution Synthesis: Qwen3-235B-A22B-Thinking-2507 generates detailed CoT solutions. Validation is performed by two independent LLMs (gpt-5, o4-mini), filtering for well-posedness and correctness.

Algorithmic Specification

The paper formalizes the synthesis pipeline as pseudocode (Algorithm 1), which follows the three stages listed above.

A two-level taxonomy is generated entirely by LLM prompts. No external clustering or manual curation is performed.
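
The three stages can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the paper's Algorithm 1: `expand_topics`, `generate_problem`, and `solve` are hypothetical callables standing in for the LLM API calls, and the `Problem` record is an invented container for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Problem:
    """Hypothetical record for one generated problem (not the paper's schema)."""
    subject: str
    topic: str
    question: str
    answer: str


def build_problems(
    subjects: Iterable[str],
    expand_topics: Callable[[str], list[str]],                # stage 1 stand-in
    generate_problem: Callable[[str, str], tuple[str, str]],  # stage 2 stand-in
) -> list[Problem]:
    """Expand subjects into topics, generate one problem per topic,
    and drop exact-string duplicate (question, answer) pairs."""
    seen: set[tuple[str, str]] = set()
    problems: list[Problem] = []
    for subject in subjects:
        for topic in expand_topics(subject):
            question, answer = generate_problem(subject, topic)
            if (question, answer) in seen:
                continue  # exact duplicate of an earlier pair
            seen.add((question, answer))
            problems.append(Problem(subject, topic, question, answer))
    return problems


def synthesize_solutions(
    problems: list[Problem],
    solve: Callable[[str], str],                              # stage 3 stand-in
) -> list[tuple[Problem, str]]:
    """Attach one chain-of-thought solution to each problem."""
    return [(p, solve(p.question)) for p in problems]
```

Passing the model calls in as plain callables keeps the sketch independent of any particular API client; validation and verification of the outputs are a separate step.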

2. Dataset Structure and Characteristics

CHIMERA’s architecture embodies three central properties: rich chain-of-thought trajectories, broad and structured subject coverage, and rigorous, fully-automated quality control.

Chain-of-Thought Depth

Solution length: Mean 11,121 tokens per solution, far longer than the 51.7 reported for GSM8K and the 89.5 reported for MATH in the comparison table below.
Prompt length: Average 211 words, supporting extended multi-step reasoning processes.

Domain Coverage

Disciplines: 8 high-level fields.
Topics: 1,179 fine-grained model-generated topics.
Distribution: Mathematics constitutes 48.3% of the data.

Data Format

All records adhere to a standardized JSON schema.

Comparative Dataset Metrics

Dataset	# Problems	# Subjects	Avg. Prompt Length	Avg. Solution Length
GSM8K	7,473	1	45.1	51.7
MATH	7,500	1	33.0	89.5
OpenScience	315,579	—	76.1	1,296.8
CHIMERA	9,225	8	211.1	11,121.4

Quality control is enforced strictly through consensus of LLM validators and verifiers.

3. Automated Quality Assurance and Contamination Control

Validation and verification of all entries in CHIMERA are performed via an LLM-based cross-model protocol:

Problem validity: Both gpt-5 and o4-mini verdicts of "VALID" are mandatory.
Solution correctness: o4-mini serves as the answer extractor and judge, flagging "correct: yes/no."
Sample retention: Only samples with dual VALID validation and correct CoT pass the filters; otherwise, they are discarded or reserved for RL training.

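Mathematically, the filter is: $\text{Keep sample} \iff V_{\mathrm{gpt5}}(q,a)=\text{VALID} \wedge V_{\mathrm{o4}}(q,a)=\text{VALID} \wedge C_{\mathrm{o4}}(q,a,r)=\text{yes},$ where $q$ is the question, $a$ the reference answer, $r$ the synthesized solution, $V$ a validator verdict, and $C$ the correctness judgment.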
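
The retention rule can be written as a small predicate. This is a sketch under assumptions: the validators and the judge are hypothetical callables that return the verdict strings quoted above, standing in for the gpt-5 and o4-mini calls.

```python
from typing import Callable, Sequence

# Hypothetical call signatures for the LLM-backed checks.
Validator = Callable[[str, str], str]   # (question, answer) -> verdict, e.g. "VALID"
Judge = Callable[[str, str, str], str]  # (question, answer, solution) -> "yes" or "no"


def keep_sample(
    question: str,
    answer: str,
    solution: str,
    validators: Sequence[Validator],
    judge: Judge,
) -> bool:
    """Keep a sample only if every validator says VALID and the judge says yes."""
    if any(v(question, answer) != "VALID" for v in validators):
        return False
    return judge(question, answer, solution) == "yes"
```

Checking the validators first avoids a judge call for problems that are already rejected as ill-posed.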

Contamination is assessed using Jaccard n-gram overlap scores: $\mathrm{Score}_n = \frac{1}{|\mathcal{T}|} \sum_{t_i\in\mathcal{T}} \max_{s\in\mathcal{S}} \frac{|G_n(t_i)\cap G_n(s)|}{|G_n(t_i)\cup G_n(s)|},$ where $G_n(\cdot)$ extracts distinct $n$-grams. Overlap with major public benchmarks is essentially zero.
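
A minimal sketch of the overlap score, assuming lowercased whitespace tokenization (the tokenizer is not specified in the text). The arguments `targets` and `sources` play the roles of $\mathcal{T}$ and $\mathcal{S}$ in the formula.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Distinct word n-grams of a text (assumed tokenization: lowercase, split on whitespace)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|, defined as 0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def contamination_score(targets: list[str], sources: list[str], n: int) -> float:
    """Average over targets of the best Jaccard n-gram overlap with any source."""
    if not targets:
        return 0.0
    source_grams = [ngrams(s, n) for s in sources]
    total = 0.0
    for t in targets:
        grams = ngrams(t, n)
        total += max((jaccard(grams, sg) for sg in source_grams), default=0.0)
    return total / len(targets)
```

A score of 1.0 means every target text has an identical n-gram set somewhere in the sources; a score near 0 means almost no shared n-grams.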

4. Model Post-Training and Empirical Performance

CHIMERA is applied to post-train Qwen3-4B-Thinking-2507 in two stages:

A) Supervised Fine-Tuning (SFT):
- All samples whose synthesized solution was judged correct ($y=1$)
- Batch size 256, learning rate $1 \times 10^{-5}$, $\sim$5K steps to convergence
B) Reinforcement Learning (CISPO protocol):
- One epoch mixing SFT set and unsolved problems
- Batch size 256, learning rate $1 \times 10^{-6}$, 8 rollouts per prompt
- Reward signal from o4-mini

Evaluation settings: Decoding at temperature 0.6, top-p 0.95, top-k 20, sampling $k$ solutions per prompt, reporting pass@$1$.
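
One common way to report pass@1 from $k$ samples per prompt is to average the per-prompt fraction of correct samples. The sketch below assumes that convention, which the text does not state explicitly; `correct_flags` is a hypothetical input holding one list of booleans per prompt.

```python
def pass_at_1(correct_flags: list[list[bool]]) -> float:
    """Mean over prompts of (correct samples / total samples) for that prompt."""
    per_prompt = [sum(flags) / len(flags) for flags in correct_flags if flags]
    if not per_prompt:
        return 0.0
    return sum(per_prompt) / len(per_prompt)
```

Averaging over several samples per prompt reduces the variance that a single stochastic decode at temperature 0.6 would introduce.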

Benchmark Performance

Model	Params	GPQA	AIME24	AIME25	AIME26	HMMT25	HMMTNov	HLE
Qwen3-4B-Thinking-2507	4B	65.8	81.6	81.0	80.8	59.2	57.3	7.3
+ OpenScience SFT	4B	53.5	61.7	53.3	53.0	40.0	36.9	4.6
+ CHIMERA (SFT+RL)	4B	70.1	86.9	80.7	82.7	65.7	67.0	9.0
DeepSeek-R1	671B	81.0	91.4	87.5	—	79.4	—	17.7
Qwen3-235B-A22B-Thinking	235B	81.1	—	92.3	—	83.9	—	18.2

Key observations:

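CHIMERA post-training yields improvements of up to +9.7 percentage points over the base model in the table above (HMMTNov: 57.3 to 67.0, for the SFT+RL row), while AIME25 is essentially unchanged (81.0 to 80.7).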
SFT on OpenScience degrades performance, plausibly due to mismatched problem formats.
CHIMERA enables a 4B-parameter model to approach, and in some cases match, the performance of models up to 235B parameters on advanced benchmarks.
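
The per-benchmark gains can be recomputed directly from the two 4B rows of the benchmark table. The numbers below are copied from the base and "+ CHIMERA (SFT+RL)" rows; the variable names are illustrative only.

```python
BENCHMARKS = ["GPQA", "AIME24", "AIME25", "AIME26", "HMMT25", "HMMTNov", "HLE"]
BASE = [65.8, 81.6, 81.0, 80.8, 59.2, 57.3, 7.3]      # Qwen3-4B-Thinking-2507
CHIMERA = [70.1, 86.9, 80.7, 82.7, 65.7, 67.0, 9.0]   # + CHIMERA (SFT+RL)

# Gain in percentage points, rounded to one decimal as in the table.
deltas = {
    name: round(after - before, 1)
    for name, before, after in zip(BENCHMARKS, BASE, CHIMERA)
}
largest_gain = max(deltas, key=deltas.get)

for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}")
```

The largest gain is on HMMTNov (+9.7), and AIME25 is the only benchmark with a small decrease (-0.3).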

5. Analysis, Data Quality, and Future Prospects

Difficulty Gradient

Base Qwen3-4B achieves 88% on other synthetic sets (DAPO-Math-17K, DeepMath-103K, OpenScience) but only 37.5% on CHIMERA, indicating that CHIMERA problems are substantially harder and leave considerable headroom for learning.

Quality Assessment

Blind ranking by o4-mini and gemini-2.5-pro (200-sample comparison) shows LLM-generated problems in CHIMERA are, on average, superior to human-authored Humanity’s Last Exam samples in clarity, difficulty, and solvability.

Known Limitations

Imbalanced subject representation: Mathematics constitutes nearly half of the dataset.
Problem format: Exclusively free-response; no multiple-choice or multimodal content.
Automated validation only: Because no human validation is performed, some errors may go undetected.

Planned Directions

Proposed enhancements include:

Subject expansion to social sciences, engineering, and interdisciplinary areas
Human-in-the-loop annotation for sparse subfields
Inclusion of adversarial “hard negatives” for robustness
Hierarchical curriculum structures
Release of subject–topic embeddings to facilitate further research and cluster analysis

CHIMERA demonstrates that a compact synthetic dataset, coupled with rigorous LLM-driven filtering and long-form reasoning supervision, suffices to bootstrap robust reasoning capacities in mid-scale LLMs, yielding performance that rivals much larger contemporaries (Zhu et al., 1 Mar 2026).

References (1)

Zhu et al. (2026). CHIMERA: Compact Synthetic Data for Generalizable LLM Reasoning.
