OpenThinker3-7B Model

Updated 26 February 2026

OpenThinker3-7B is a 7-billion-parameter open-source reasoning model trained by supervised fine-tuning on a purpose-built reasoning dataset.
The model uses the unmodified Qwen-2.5-7B-Instruct transformer architecture (28 layers, hidden size 3584) and introduces no architectural changes; its gains come from data and training choices.
Benchmark results show a +12.4 percentage point average improvement over DeepSeek-R1-Distill-Qwen-7B, which the authors attribute in part to sampling 16 answers per question and to careful question filtering.

OpenThinker3-7B is a 7-billion-parameter open-source reasoning model, designed as a high-performing, fully auditable alternative to proprietary reasoning systems. Developed as part of the OpenThoughts initiative, it reports state-of-the-art results among open-data models of similar size on math, code, and science reasoning benchmarks, using only public data. The model is a supervised fine-tune of Qwen-2.5-7B-Instruct; its performance is driven by a carefully engineered dataset generation pipeline and by distillation from a larger teacher model. All artifacts (datasets, weights, prompts, and training code) are released under open licenses.

1. Model Architecture and Capacity

OpenThinker3-7B adopts the unmodified Qwen-2.5-7B-Instruct transformer architecture: 28 transformer layers with a hidden size of 3584, 28 attention heads using grouped-query attention (4 key-value heads), and rotary positional embeddings (RoPE). No architectural innovations such as Mixture-of-Experts or retrieval augmentation are introduced. All architectural hyperparameters are retained from the base model, so the reported gains come from data and training choices rather than model design (Guha et al., 4 Jun 2025).

Attribute	OpenThinker3-7B Value	Notes
Parameter count	7B	Inherited from Qwen-2.5-7B-Instruct
Layers	28	Transformer decoder blocks
Hidden size	3584	
Attention heads	28	Grouped-query attention, 4 key-value heads
RoPE	Yes	All layers

2. Dataset Construction and Data Pipeline

The OpenThoughts3-1.2M dataset, central to OpenThinker3-7B’s performance, consists of 1,200,000 reasoning examples: 850,000 math questions, 250,000 code questions, and 100,000 science questions. Each dataset item is a triplet containing a prompt, a chain-of-thought (CoT), and an answer.
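As a minimal sketch of the record layout and domain mix described above, the following Python snippet models one dataset item and computes each domain's share of the total. The class name, field names, and the `domain` field are illustrative assumptions and do not reflect the released dataset's actual schema; only the three counts come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReasoningExample:
    """One dataset item: prompt, chain-of-thought, and final answer.

    Field names are hypothetical; the released schema may differ.
    """
    prompt: str
    chain_of_thought: str
    answer: str
    domain: str  # "math", "code", or "science" (assumed label set)


# Domain counts as stated in the text.
DOMAIN_COUNTS = {"math": 850_000, "code": 250_000, "science": 100_000}


def domain_shares(counts):
    """Return each domain's fraction of the total number of examples."""
    total = sum(counts.values())
    return {domain: n / total for domain, n in counts.items()}


total = sum(DOMAIN_COUNTS.values())
shares = domain_shares(DOMAIN_COUNTS)
for domain, share in shares.items():
    print(f"{domain:8s} {DOMAIN_COUNTS[domain]:>9,d}  {share:6.1%}")
print(f"{'total':8s} {total:>9,d}")
```

Math accounts for roughly 71% of the examples, code for about 21%, and science for about 8%.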

The pipeline is tuned through more than 1,000 controlled ablation experiments, each run at a scale of 31.6k training examples, which inform the choice made at each stage:

Question sourcing: Selects top two data sources per domain rather than maximizing diversity. Math examples use OpenMath-2-Math; code uses StackExchange CodeGolf and OpenCodeReasoning; science uses Physics.SE and organic chemistry PDFs.
Filtering: Math and science questions undergo LLM-based "response-length" filtering using GPT-4.1-mini; code questions are filtered for difficulty by GPT-4o-mini based on ICPC rubric.
Deduplication and answer sampling: Math and science domains apply exact string deduplication; code domain skips deduplication. For all, sixteen distinct answers per question are sampled from the teacher for SFT augmentation.
Answer filtering: All samples from the teacher are used; empirical testing shows that strategies like LLM verification or unit tests do not improve performance.
Teacher selection: QwQ-32B, a 32B RL-trained reasoning model, is used for generating all CoT traces and answers.
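Decontamination: To prevent evaluation data leakage, a question is excluded if its Indel similarity to any evaluation benchmark question is at least 75% or if it shares any 13-gram with one, where

$\mathrm{indel\_sim}(s_1,s_2) = 100\times \frac{\mathrm{LCS}(s_1,s_2)}{\max(|s_1|,|s_2|)}\,,$

with $\mathrm{LCS}(s_1,s_2)$ denoting the length of the longest common subsequence of the two strings.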
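The decontamination rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's released code: the similarity follows the formula as written above, 13-grams are built from lowercased whitespace tokens (the actual tokenization is not specified in the text), and all function names are hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_sim(s1: str, s2: str) -> float:
    """Similarity in [0, 100], normalised by the longer string as in the formula above."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 100.0  # two empty strings are treated as identical
    return 100.0 * lcs_length(s1, s2) / longest


def ngrams(text: str, n: int = 13) -> set:
    """Set of n-grams over lowercased whitespace tokens (tokenization is an assumption)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(question, eval_questions, threshold=75.0, n=13) -> bool:
    """True if the question is too similar to, or shares an n-gram with, any eval question."""
    question_grams = ngrams(question, n)
    for eval_question in eval_questions:
        if indel_sim(question, eval_question) >= threshold:
            return True
        if question_grams & ngrams(eval_question, n):
            return True
    return False


print(indel_sim("abcd", "abxd"))  # LCS is "abd" (length 3) over max length 4
```

The quadratic-time LCS is adequate for a demonstration; comparing every candidate against every benchmark question at dataset scale would call for an optimized string-matching library.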

3. Distillation Strategy and Training Protocol

OpenThinker3-7B is trained via direct supervised fine-tuning of Qwen-2.5-7B-Instruct on the OpenThoughts3 dataset, using a standard cross-entropy loss over the teacher's generated completions (CoT and answer tokens). No reinforcement learning or value-head losses are involved. Classic knowledge distillation with temperature scaling was considered but was outperformed by vanilla cross-entropy for CoT distillation.
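To make the objective concrete, here is a small dependency-free Python sketch of token-level cross-entropy computed only over completion positions. Masking out the prompt tokens is one reading of "loss over the teacher's generated completions" and is an assumption here, as are the function name and the toy inputs; a real training run would use a deep-learning framework's batched loss.

```python
import math


def completion_cross_entropy(logits, targets, loss_mask):
    """Mean negative log-likelihood over positions where loss_mask is 1.

    logits:    list of per-position lists of vocabulary scores
    targets:   list of target token ids, one per position
    loss_mask: 1 for completion (CoT and answer) tokens, 0 for prompt tokens
    """
    total, count = 0.0, 0
    for scores, target, keep in zip(logits, targets, loss_mask):
        if not keep:
            continue  # prompt tokens do not contribute to the loss
        peak = max(scores)  # subtract the max for a numerically stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_z - scores[target]
        count += 1
    if count == 0:
        raise ValueError("loss_mask selects no positions")
    return total / count


# Toy example with a 3-token vocabulary: one prompt position, two completion positions.
toy_logits = [[5.0, -5.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
toy_targets = [1, 2, 0]
toy_mask = [0, 1, 1]
loss = completion_cross_entropy(toy_logits, toy_targets, toy_mask)
print(f"{loss:.4f}")  # uniform logits over 3 tokens give ln(3)
```

The poorly predicted prompt position is ignored, so the loss reflects only how well the student reproduces the teacher's tokens.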

Key training hyperparameters for the full-dataset run include the AdamW optimizer ($\beta_1=0.9$, $\beta_2=0.999$, no weight decay), cosine-decay learning rate scheduling with 10% linear warmup, a peak learning rate of $8 \times 10^{-5}$, batch size 512, five epochs, and greedy sequence packing. Training takes 25,000 A100-GPU hours, with an additional 22,000 H100-GPU hours used for data annotation via QwQ-32B (Guha et al., 4 Jun 2025).
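The learning rate schedule can be sketched as a plain Python function. The peak rate and the 10% warmup fraction come from the text; the total step count, the final learning rate of zero, and the exact warmup formula are assumptions chosen for illustration, since the text does not state them.

```python
import math

PEAK_LR = 8e-5          # peak learning rate stated in the text
WARMUP_FRACTION = 0.10  # 10% linear warmup stated in the text


def learning_rate(step, total_steps, peak_lr=PEAK_LR,
                  warmup_fraction=WARMUP_FRACTION, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr (assumed to be 0)."""
    warmup_steps = int(total_steps * warmup_fraction)
    if warmup_steps > 0 and step < warmup_steps:
        # Ramp linearly so the last warmup step reaches the peak.
        return peak_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 1000  # hypothetical; the real step count depends on packing and epochs
for step in (0, 99, 100, 550, 1000):
    print(step, f"{learning_rate(step, TOTAL_STEPS):.2e}")
```

With these settings the rate climbs during the first 100 steps, peaks at $8 \times 10^{-5}$, falls to half of that at the midpoint of the decay phase, and reaches zero at the final step.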

4. Benchmark Results and Comparative Performance

OpenThinker3-7B is evaluated with Evalchemy across 12 benchmarks, including advanced mathematical, coding, and science tasks (AIME 2025, LiveCodeBench 06/24-01/25, and GPQA Diamond). Reported metrics are mean accuracy with standard errors not exceeding 0.7 percentage points. The model achieves:

53.3% on AIME 2025
51.7% on LiveCodeBench 06/24-01/25
53.7% on GPQA Diamond

These results represent a +12.4 percentage point average improvement over DeepSeek-R1-Distill-Qwen-7B, and +2.1 points over the next-best open-data 8B model. Sampling multiple answers per question (16× augmentation) is identified as a driver of substantial downstream gains. Notably, supervised fine-tuning on QwQ-32B outputs outperforms using a nominally "stronger" teacher on held-out benchmarks (Guha et al., 4 Jun 2025).

Model	AIME 25	LCB 06/24–01/25	GPQA-D	Avg.
OpenThinker3-7B	53.3	51.7	53.7	55.3
DeepSeek-R1-Distill-Qwen-7B	39.7	30.7	24.6	24.0
Next-best open-data 8B model	50.7	44.3	52.9	52.9
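The reported metric, mean accuracy with a standard error, can be computed as in the sketch below. It assumes the standard error is taken over repeated evaluation runs of the same benchmark, which the text does not spell out; the run scores are hypothetical and are not values from the paper.

```python
import math
import statistics


def mean_and_standard_error(accuracies):
    """Return (mean, standard error of the mean) for per-run accuracies in percent."""
    n = len(accuracies)
    if n < 2:
        raise ValueError("need at least two runs to estimate a standard error")
    mean = statistics.fmean(accuracies)
    sem = statistics.stdev(accuracies) / math.sqrt(n)  # sample std dev / sqrt(n)
    return mean, sem


# Hypothetical accuracies (in percent) from five repeated runs on one benchmark.
runs = [52.5, 53.5, 53.0, 54.0, 52.0]
mean, sem = mean_and_standard_error(runs)
print(f"{mean:.1f} +/- {sem:.2f} percentage points")
```

A standard error below 0.7 points, as reported, means that differences of several points between models are unlikely to be run-to-run noise.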

5. Key Methodological Findings and Ablation Insights

Several outcomes from pipeline ablations and controlled experiments delineate data-centric best practices for SFT-based reasoning models:

Utilizing 16 sampled answers per question for SFT scaling produces large accuracy gains with minimal complexity.
Restricting data mixing to top 1–2 sources per domain, rather than maximizing diversity, improves overall dataset quality.
LLM-based question filtering strategies outperform classical text embedding or classifier (e.g., FastText) approaches.
Contrary to prior art, further answer filtering (using LLM verification, unit tests, or majority-vote) does not yield performance benefits; all generated answers are used directly.
The empirically optimal teacher for SFT is not necessarily the highest-scoring model on the target benchmarks; QwQ-32B produces better SFT results than the higher-scoring DeepSeek-R1 (Guha et al., 4 Jun 2025).
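The first and fourth findings, sampling 16 answers per question and keeping all of them, amount to a very simple data-expansion loop. The sketch below illustrates it with a stub in place of the teacher model; the function names, the record layout, and the stub are hypothetical, and a real pipeline would call QwQ-32B with stochastic sampling.

```python
from typing import Callable, Dict, List


def build_sft_examples(questions: List[str],
                       teacher: Callable[[str, int], str],
                       samples_per_question: int = 16) -> List[Dict[str, str]]:
    """Sample several teacher completions per question and keep every one.

    No verification or majority-vote filtering is applied, mirroring the
    finding that answer filtering did not improve downstream performance.
    """
    examples = []
    for question in questions:
        for sample_index in range(samples_per_question):
            completion = teacher(question, sample_index)
            examples.append({"prompt": question, "completion": completion})
    return examples


def stub_teacher(question: str, sample_index: int) -> str:
    """Stand-in for a sampled teacher response (chain-of-thought plus answer)."""
    return f"<reasoning sample {sample_index}> answer to: {question}"


questions = ["What is 7 * 8?", "Reverse a linked list."]
examples = build_sft_examples(questions, stub_teacher)
print(len(examples))  # 2 questions x 16 samples each
```

Because every sample is kept, the dataset size scales linearly with the number of samples per question and the only added cost is teacher inference.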

6. Limitations and Open Directions

The work leaves several questions open and has known limitations:

The effect of tailoring the data-generation recipe to each domain versus cross-domain averaging remains unexplored.
It is uncertain whether further scaling of data and model size leads to diminishing returns, especially as student model performance converges toward that of the teacher.
The interaction between dataset scale, answer diversity, and question diversity at larger scales is not fully understood.
All OpenThinker generations to date show a degradation in safety alignment; approaches to reinforce safety without impairing reasoning capability remain an open challenge.
The potential for reinforcement learning from human feedback (RLHF) or actor-critic distillation to further improve performance is a subject for future research (Guha et al., 4 Jun 2025).

7. Release, Reproducibility, and Community Impact

The full dataset (OpenThoughts3-1.2M), model weights, prompts, and training code for OpenThinker3-7B are publicly released at https://openthoughts.ai, enabling reproduction of the results and targeted interventions in the model and data recipes. This open release directly addresses the opacity of state-of-the-art reasoning research and provides a foundation for further benchmarking, scaling, and methodological work on open-source reasoning systems (Guha et al., 4 Jun 2025).

References

Guha et al. (4 Jun 2025). OpenThoughts: Data Recipes for Reasoning Models.
