CHIMERA: Synthetic Reasoning Dataset for LLMs

Updated 1 May 2026
  • The paper introduces a fully automated LLM-driven pipeline that performs subject expansion, problem generation, and solution synthesis to create high-quality reasoning data.
  • The dataset comprises 9,225 samples with extensive chain-of-thought trajectories (average solution length over 11K tokens) covering 8 disciplines.
  • CHIMERA enables a 4B-parameter model to achieve performance comparable to models up to 235B parameters by addressing data cold-start and annotation bottlenecks.

CHIMERA is a compact, synthetic reasoning dataset designed to advance generalizable, cross-domain reasoning in LLMs. Comprising 9,225 samples, each with long-form chain-of-thought (CoT) trajectories, it addresses major data-centric challenges such as the cold-start problem in seed data, limited subject coverage in existing open datasets, and the annotation bottleneck for high-level reasoning. Its pipeline is fully automated, leveraging multiple state-of-the-art LLMs for generation, validation, and verification, with no human annotation. Despite its modest size, CHIMERA effectively boosts the reasoning capabilities of 4B-parameter models to levels close to models two orders of magnitude larger, as evidenced by strong benchmark results (Zhu et al., 1 Mar 2026).

1. Synthetic Dataset Construction Pipeline

CHIMERA’s generation process consists of a fully automated three-stage pipeline, with all stages handled by LLM APIs. The pipeline executes as follows:

Pipeline Stages

  1. Subject Expansion: Starting from 8 coarse-grained disciplines (Mathematics, Physics, Chemistry, Biology, Computer Science, Literature, History, Linguistics), gpt-5 expands the subjects into a total of 1,179 unique, fine-grained topics using prompt-based sampling.
  2. Problem Generation: For each topic, gpt-5 is prompted to generate PhD-level problems and reference answers. Each (question, answer) pair is string-deduplicated.
  3. Solution Synthesis: Qwen3-235B-A22B-Thinking-2507 generates detailed CoT solutions. Validation is performed by two independent LLMs (gpt-5, o4-mini), filtering for well-posedness and correctness.

Algorithmic Specification

The paper formalizes the synthesis pipeline as pseudocode (Algorithm 1), which chains the three stages above: subject expansion, problem generation with string deduplication, and solution synthesis with LLM-based validation.

A two-level taxonomy is generated entirely by LLM prompts. No external clustering or manual curation is performed.
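The three stages can be sketched as a single loop. This is a minimal illustration, not the paper's Algorithm 1: the function name `build_dataset`, the callable parameters standing in for LLM API calls, the record field names, and the choice to validate before synthesizing a solution are all assumptions made for the example.

```python
from __future__ import annotations

from typing import Callable, Iterable


def build_dataset(
    subjects: Iterable[str],
    expand_topics: Callable[[str], list[str]],                # stage 1 (hypothetical LLM call)
    generate_problem: Callable[[str, str], tuple[str, str]],  # stage 2 (hypothetical LLM call)
    synthesize_solution: Callable[[str], str],                # stage 3 (hypothetical LLM call)
    is_valid: Callable[[str, str], bool],                     # validator consensus
    is_correct: Callable[[str, str, str], bool],              # answer judge
) -> list[dict]:
    """Run subject expansion, problem generation, and solution synthesis."""
    seen: set[tuple[str, str]] = set()
    records: list[dict] = []
    for subject in subjects:
        for topic in expand_topics(subject):
            question, answer = generate_problem(subject, topic)
            # String-level deduplication of (question, answer) pairs.
            key = (question.strip(), answer.strip())
            if key in seen:
                continue
            seen.add(key)
            # Assumption: ill-posed problems are dropped before synthesis.
            if not is_valid(question, answer):
                continue
            solution = synthesize_solution(question)
            records.append({
                "subject": subject,
                "topic": topic,
                "question": question,
                "answer": answer,
                "solution": solution,
                "correct": is_correct(question, answer, solution),
            })
    return records
```

Passing the LLM calls in as plain callables keeps the control flow testable with stubs and leaves the choice of model client open.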

2. Dataset Structure and Characteristics

CHIMERA’s design rests on three central properties: rich chain-of-thought trajectories, broad and structured subject coverage, and rigorous, fully automated quality control.

Chain-of-Thought Depth

  • Solution length: Mean 11,121 tokens per solution, roughly two orders of magnitude longer than the GSM8K and MATH solution lengths listed in the comparison table (51.7 and 89.5).
  • Prompt length: Average 211 words, supporting extended multi-step reasoning processes.

Domain Coverage

  • Disciplines: 8 high-level fields.
  • Topics: 1,179 fine-grained model-generated topics.
  • Distribution: Mathematics constitutes 48.3% of the data.
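As a quick back-of-the-envelope check on these figures, the snippet below derives the approximate number of mathematics samples and the average number of problems per topic. The variable names are illustrative, and because 48.3% is itself a rounded share, the mathematics count is only approximate.

```python
total_samples = 9225   # dataset size reported in the source
num_topics = 1179      # fine-grained topics reported in the source
math_share = 0.483     # rounded share of mathematics samples

# Approximate count of mathematics samples (share is rounded, so this is an estimate).
math_samples = round(total_samples * math_share)

# Average number of retained problems per fine-grained topic.
avg_per_topic = total_samples / num_topics

print(f"~{math_samples} mathematics samples, ~{avg_per_topic:.2f} problems per topic")
```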

Data Format

All records adhere to a standardized JSON schema.

Comparative Dataset Metrics

| Dataset | #Problems | Subjects | Prompt Len | Solution Len |
|---|---|---|---|---|
| GSM8K | 7,473 | 1 | 45.1 | 51.7 |
| MATH | 7,500 | 1 | 33.0 | 89.5 |
| OpenScience | 315,579 | - | 76.1 | 1,296.8 |
| CHIMERA | 9,225 | 8 | 211.1 | 11,121.4 |

Quality control is enforced strictly through consensus of LLM validators and verifiers.

3. Automated Quality Assurance and Contamination Control

Validation and verification of all entries in CHIMERA are performed via an LLM-based cross-model protocol:

  • Problem validity: Both gpt-5 and o4-mini verdicts of "VALID" are mandatory.
  • Solution correctness: o4-mini serves as the answer extractor and judge, flagging "correct: yes/no."
  • Sample retention: Only samples with dual VALID validation and correct CoT pass the filters; otherwise, they are discarded or reserved for RL training.

Mathematically, the filter is:

$$\text{Keep sample} \iff V_{\mathrm{gpt5}}(q,a)=\text{VALID} \wedge V_{\mathrm{o4}}(q,a)=\text{VALID} \wedge C_{\mathrm{o4}}(q,a,r)=\text{yes}.$$
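The retention rule translates directly into a small routing function. This is a sketch under stated assumptions: the `Verdicts` container and the `route` function are hypothetical names, and sending valid-but-unsolved problems to an RL pool (rather than discarding them) is one reading of "discarded or reserved for RL training", not a confirmed detail.

```python
from dataclasses import dataclass


@dataclass
class Verdicts:
    gpt5_valid: str   # validator verdict from gpt-5, e.g. "VALID"
    o4_valid: str     # validator verdict from o4-mini, e.g. "VALID"
    o4_correct: str   # correctness flag from o4-mini: "yes" or "no"


def keep_sample(v: Verdicts) -> bool:
    """Conjunction of both validity verdicts and the correctness flag."""
    return (
        v.gpt5_valid == "VALID"
        and v.o4_valid == "VALID"
        and v.o4_correct == "yes"
    )


def route(v: Verdicts) -> str:
    """Return 'sft', 'rl_pool', or 'discard' for one sample."""
    if keep_sample(v):
        return "sft"
    # Assumption: well-posed problems whose solution was judged wrong
    # are kept as unsolved problems for RL; everything else is dropped.
    if v.gpt5_valid == "VALID" and v.o4_valid == "VALID":
        return "rl_pool"
    return "discard"
```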

Contamination is assessed using Jaccard n-gram overlap scores:

$$\mathrm{Score}_n = \frac{1}{|\mathcal{T}|} \sum_{t_i\in\mathcal{T}} \max_{s\in\mathcal{S}} \frac{|G_n(t_i)\cap G_n(s)|}{|G_n(t_i)\cup G_n(s)|},$$

where $G_n(\cdot)$ extracts distinct $n$-grams. Overlap with major public benchmarks is essentially zero.
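A minimal sketch of this score follows. The source does not specify tokenization or which collection plays the role of $\mathcal{T}$ versus $\mathcal{S}$, so lowercased whitespace tokenization and the parameter names `targets` and `sources` are assumptions made for illustration.

```python
def ngrams(text: str, n: int) -> set:
    """Distinct word n-grams of a text (assumed whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a ∩ b| / |a ∪ b|, defined as 0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def contamination_score(targets: list, sources: list, n: int) -> float:
    """Mean over targets of the best Jaccard n-gram overlap with any source."""
    if not targets:
        return 0.0
    source_grams = [ngrams(s, n) for s in sources]
    total = 0.0
    for t in targets:
        grams = ngrams(t, n)
        total += max((jaccard(grams, sg) for sg in source_grams), default=0.0)
    return total / len(targets)
```

A score near 0 means no target shares a meaningful fraction of its n-grams with any source text, while a score of 1 means every target has an n-gram-identical match.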

4. Model Post-Training and Empirical Performance

CHIMERA is applied to post-train Qwen3-4B-Thinking-2507 in two stages:

  • A) Supervised Fine-Tuning (SFT):
    • All samples with $y=1$
    • Batch size 256, learning rate $1 \times 10^{-5}$, $\sim$5K steps to convergence
  • B) Reinforcement Learning (CISPO protocol):
    • One epoch mixing SFT set and unsolved problems
    • Batch size 256, learning rate $1 \times 10^{-6}$, 8 rollouts per prompt
    • Reward signal from o4-mini

Evaluation settings: Decoding at temperature 0.6, top-p 0.95, top-k 20, sampling $k$ solutions per prompt and reporting pass@1.
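The source does not state how pass@1 is computed from the sampled solutions. One common choice is the unbiased pass@k estimator introduced with the Codex evaluation (Chen et al., 2021), which for k = 1 reduces to the fraction of correct samples per prompt. The sketch below assumes that estimator; the function and parameter names are illustrative.

```python
from math import comb


def pass_at_k(num_samples: int, num_correct: int, k: int) -> float:
    """Unbiased pass@k estimate for one prompt from num_samples draws."""
    if num_samples - num_correct < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(num_samples - num_correct, k) / comb(num_samples, k)


def mean_pass_at_1(results: list) -> float:
    """Average pass@1 over prompts; results[i] is a list of booleans."""
    scores = [pass_at_k(len(r), sum(r), 1) for r in results]
    return sum(scores) / len(scores)
```

For example, a prompt with 2 correct answers out of 8 samples contributes 0.25 to the average.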

Benchmark Performance

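| Model | Params | GPQA | AIME24 | AIME25 | AIME26 | HMMT25 | HMMTNov | HLE |
|---|---|---|---|---|---|---|---|---|
| Qwen3-4B-Thinking-2507 | 4B | 65.8 | 81.6 | 81.0 | 80.8 | 59.2 | 57.3 | 7.3 |
| + OpenScience SFT | 4B | 53.5 | 61.7 | 53.3 | 53.0 | 40.0 | 36.9 | 4.6 |
| + CHIMERA (SFT+RL) | 4B | 70.1 | 86.9 | 80.7 | 82.7 | 65.7 | 67.0 | 9.0 |
| DeepSeek-R1 | 671B | 81.0 | 91.4 | 87.5 | - | 79.4 | - | 17.7 |
| Qwen3-235B-A22B-Thinking | 235B | 81.1 | - | 92.3 | - | 83.9 | - | 18.2 |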

Key observations:

  • CHIMERA post-training (the SFT+RL row in the table) improves on the base model by up to +9.7 percentage points (HMMTNov, 57.3 to 67.0) on held-out reasoning tasks.
  • SFT on OpenScience degrades performance, plausibly due to mismatched problem formats.
  • CHIMERA enables a 4B-parameter model to approach, and in some cases match, the performance of models up to 235B parameters on advanced benchmarks.
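The per-benchmark gains of the CHIMERA row over the base model can be recomputed from the table. The scores below are copied from the table; the variable names are illustrative.

```python
BENCHMARKS = ["GPQA", "AIME24", "AIME25", "AIME26", "HMMT25", "HMMTNov", "HLE"]

# Scores copied from the benchmark table.
base = [65.8, 81.6, 81.0, 80.8, 59.2, 57.3, 7.3]      # Qwen3-4B-Thinking-2507
chimera = [70.1, 86.9, 80.7, 82.7, 65.7, 67.0, 9.0]   # + CHIMERA (SFT+RL)

# Gain in percentage points, rounded to the table's one-decimal precision.
deltas = {
    name: round(after - before, 1)
    for name, before, after in zip(BENCHMARKS, base, chimera)
}
best = max(deltas, key=deltas.get)

for name, delta in deltas.items():
    print(f"{name:8s} {delta:+.1f}")
print(f"largest gain: {best} ({deltas[best]:+.1f})")
```

The gains are positive on six of the seven benchmarks; AIME25 is essentially flat at -0.3 points.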

5. Analysis, Data Quality, and Future Prospects

Difficulty Gradient

Base Qwen3-4B achieves 88% on other synthetic sets (DAPO-Math-17K, DeepMath-103K, OpenScience) but only 37.5% on CHIMERA, indicating that CHIMERA's problems are intrinsically harder and leave substantial headroom for learning.

Quality Assessment

Blind ranking by o4-mini and gemini-2.5-pro (200-sample comparison) shows LLM-generated problems in CHIMERA are, on average, superior to human-authored Humanity’s Last Exam samples in clarity, difficulty, and solvability.

Known Limitations

  • Imbalanced subject representation: Mathematics constitutes nearly half of the dataset.
  • Problem format: Exclusively free-response; no multiple-choice or multimodal content.
  • Automated validation only: Because no human validation is performed, some errors may escape the LLM-based checks.

Planned Directions

Proposed enhancements include:

  • Subject expansion to social sciences, engineering, and interdisciplinary areas
  • Human-in-the-loop annotation for sparse subfields
  • Inclusion of adversarial “hard negatives” for robustness
  • Hierarchical curriculum structures
  • Release of subject–topic embeddings to facilitate further research and cluster analysis

CHIMERA demonstrates that a compact synthetic dataset, coupled with rigorous LLM-driven filtering and long-form reasoning supervision, suffices to bootstrap robust reasoning capacities in mid-scale LLMs, yielding performance that rivals much larger contemporaries (Zhu et al., 1 Mar 2026).
