
OpenThinker3-7B Model

Updated 26 February 2026
  • OpenThinker3-7B is a 7-billion-parameter open-source reasoning model, obtained by supervised fine-tuning of Qwen-2.5-7B-Instruct on the purpose-built OpenThoughts3-1.2M dataset.
  • The model keeps the Qwen-2.5-7B-Instruct transformer architecture (32 layers, hidden size 4096) unchanged; no architectural modifications are introduced.
  • Benchmark results show a +12.4 percentage point average improvement over DeepSeek-R1-Distill-Qwen-7B, which the authors attribute to 16-answer augmentation and careful data filtering.

OpenThinker3-7B is a 7-billion-parameter open-source reasoning model, designed as a high-performing, fully auditable alternative to proprietary reasoning systems. Developed as part of the OpenThoughts initiative, it reports state-of-the-art results among open-data models on leading math, code, and science reasoning benchmarks using only public data. The model is a supervised fine-tune of Qwen-2.5-7B-Instruct; its performance is driven by a purpose-built dataset generation pipeline and systematic distillation from a larger teacher model. All artifacts (datasets, weights, prompts, and training code) are released under open licenses.

1. Model Architecture and Capacity

OpenThinker3-7B adopts the unmodified Qwen-2.5-7B-Instruct transformer architecture, comprising 32 transformer layers, each with a hidden size of 4096, 32 self-attention heads, and rotary positional embeddings (RoPE). No architectural changes such as Mixture-of-Experts or retrieval augmentation are introduced. All architectural hyperparameters are retained from the base model for both training and inference, so the gains come from the training data rather than from the architecture (Guha et al., 4 Jun 2025).

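| Attribute       | OpenThinker3-7B Value | Notes       |
|-----------------|-----------------------|-------------|
| Parameter count | 7B                    |             |
| Layers          | 32                    | Transformer |
| Hidden size     | 4096                  |             |
| Attention heads | 32                    |             |
| RoPE            | Yes                   | All layers  |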

2. Dataset Construction and Data Pipeline

The OpenThoughts3-1.2M dataset, central to OpenThinker3-7B’s performance, consists of 1,200,000 reasoning examples: 850,000 math questions, 250,000 code questions, and 100,000 science questions. Each dataset item is a triplet containing a prompt, a chain-of-thought (CoT), and an answer.
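
The composition above can be made concrete with a minimal sketch. The class and field names below (`ReasoningExample`, `chain_of_thought`, and so on) are illustrative assumptions, not the released dataset schema; only the per-domain counts come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReasoningExample:
    """One dataset item: a (prompt, chain-of-thought, answer) triplet.

    Field names are hypothetical; the released schema may differ.
    """
    prompt: str
    chain_of_thought: str
    answer: str


# Per-domain question counts as stated in the text.
DOMAIN_COUNTS = {"math": 850_000, "code": 250_000, "science": 100_000}


def domain_fractions(counts: dict[str, int]) -> dict[str, float]:
    """Return each domain's share of the total number of examples."""
    total = sum(counts.values())
    return {domain: n / total for domain, n in counts.items()}
```

Under these counts, math accounts for roughly 71% of the data, code for about 21%, and science for about 8%.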

The pipeline is tuned through more than 1,000 controlled ablation experiments, each run at a dataset scale of 31.6k examples, which inform the choice made at each stage:

  • Question sourcing: Selects the top-performing one or two data sources per domain rather than maximizing diversity. Math uses OpenMath-2-Math; code uses StackExchange CodeGolf and OpenCodeReasoning; science uses Physics.SE and organic chemistry PDFs.
  • Filtering: Math and science questions undergo LLM-based "response-length" filtering using GPT-4.1-mini; code questions are filtered for difficulty by GPT-4o-mini based on an ICPC rubric.
  • Deduplication and answer sampling: The math and science domains apply exact string deduplication; the code domain skips deduplication. In all domains, sixteen distinct answers per question are sampled from the teacher for SFT augmentation.
  • Answer filtering: All samples from the teacher are used; empirical testing shows that strategies such as LLM verification or unit tests do not improve performance.
  • Teacher selection: QwQ-32B, a 32B RL-trained reasoning model, is used to generate all CoT traces and answers.
  • Decontamination: To prevent evaluation data leakage, a question is excluded if it reaches an Indel similarity of at least 75% with any evaluation benchmark question or shares any 13-gram with one, where

$$\mathrm{indel\_sim}(s_1, s_2) = 100 \times \frac{\mathrm{LCS}(s_1, s_2)}{\max(|s_1|, |s_2|)}.$$
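
The decontamination rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's released code: LCS is taken to be the length of the longest common subsequence over characters, 13-grams are assumed to be built from lowercased whitespace-separated tokens, and the function names are hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings (row-by-row DP)."""
    if len(a) < len(b):
        a, b = b, a  # keep the DP row as short as possible
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, bj in enumerate(b, start=1):
            if ch == bj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_sim(s1: str, s2: str) -> float:
    """Similarity as defined in the text: 100 * LCS / max(|s1|, |s2|)."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 100.0  # assumption: two empty strings count as identical
    return 100.0 * lcs_length(s1, s2) / longest


def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Set of n-grams over lowercased whitespace tokens (tokenization is assumed)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(question: str, benchmark_questions: list[str],
                    threshold: float = 75.0, n: int = 13) -> bool:
    """True if the question shares any n-gram with, or is >= threshold similar to,
    any benchmark question."""
    q_grams = ngrams(question, n)
    for bench in benchmark_questions:
        if q_grams & ngrams(bench, n):
            return True
        if indel_sim(question, bench) >= threshold:
            return True
    return False
```

The quadratic LCS computation is adequate for illustration; filtering a dataset of this size against many benchmarks would in practice call for an optimized string-matching library.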

3. Distillation Strategy and Training Protocol

OpenThinker3-7B is trained via direct supervised fine-tuning of Qwen-2.5-7B-Instruct on the OpenThoughts3 dataset, using a standard cross-entropy loss over the teacher's generated completions (CoT and answer tokens). No reinforcement learning or value-head losses are involved. Classic knowledge distillation with temperature scaling was considered, but it was outperformed by vanilla cross-entropy for CoT distillation.
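
For reference, the objective described above can be written as a token-level negative log-likelihood. The notation is an assumption introduced here for illustration: $x$ is the prompt, $y = (y_1, \dots, y_T)$ is the teacher's completion (CoT followed by the answer), and $p_\theta$ is the student model. The $1/T$ normalization is one common convention and is not confirmed by the source.

```latex
\mathcal{L}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta\left(y_t \mid x,\, y_{<t}\right)
```

Only completion tokens contribute to the sum; the prompt appears purely as conditioning context.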

Key training hyperparameters in the large-scale regime include the AdamW optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.999$, no weight decay), cosine learning-rate decay with a 10% linear warmup, a peak learning rate of $8 \times 10^{-5}$, a batch size of 512, five epochs, and greedy sequence packing. Training takes 25,000 A100 GPU-hours, with an additional 22,000 H100 GPU-hours spent on data annotation with QwQ-32B (Guha et al., 4 Jun 2025).
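
The learning-rate schedule and sequence packing can be sketched as follows. Only the peak learning rate and the 10% warmup fraction come from the text; decaying to zero, the first-fit-decreasing packing heuristic, the `max_len` parameter, and all function names are assumptions made for illustration.

```python
import math

PEAK_LR = 8e-5          # from the text
WARMUP_FRACTION = 0.10  # from the text


def lr_at_step(step: int, total_steps: int, peak_lr: float = PEAK_LR,
               warmup_fraction: float = WARMUP_FRACTION,
               min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr (assumed to be 0)."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def greedy_pack(lengths: list[int], max_len: int) -> list[list[int]]:
    """First-fit-decreasing packing: place each sequence (longest first) into the
    first pack with enough remaining room. Returns packs as lists of indices."""
    packs: list[list] = []  # each entry is [remaining_capacity, [indices]]
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for idx in order:
        n = lengths[idx]
        if n > max_len:
            raise ValueError(f"sequence {idx} has length {n} > max_len {max_len}")
        for pack in packs:
            if pack[0] >= n:
                pack[0] -= n
                pack[1].append(idx)
                break
        else:
            packs.append([max_len - n, [idx]])
    return [pack[1] for pack in packs]
```

For example, `greedy_pack([5, 3, 4, 2, 6], max_len=8)` groups the five sequences into three packs, each with a total length of at most 8 tokens.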

4. Benchmark Results and Comparative Performance

OpenThinker3-7B is evaluated with Evalchemy across 12 benchmarks, including advanced mathematical, coding, and science tasks (AIME 2025, LiveCodeBench 06/24-01/25, and GPQA Diamond). Reported metrics are mean accuracy with standard errors not exceeding 0.7 percentage points. The model achieves:

  • 53.3% on AIME 2025
  • 51.7% on LiveCodeBench 06/24-01/25
  • 53.7% on GPQA Diamond

These results represent a +12.4 percentage point average improvement over DeepSeek-R1-Distill-Qwen-7B, and +2.1 points over the next-best open-data 8B model. Sampling multiple answers per question (16× augmentation) is identified as a driver of substantial downstream gains. Notably, supervised fine-tuning on QwQ-32B outputs outperforms using a nominally "stronger" teacher on held-out benchmarks (Guha et al., 4 Jun 2025).

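| Model                        | AIME 25 | LCB 06/24-01/25 | GPQA-D | Avg. |
|------------------------------|---------|-----------------|--------|------|
| OpenThinker3-7B              | 53.3    | 51.7            | 53.7   | 55.3 |
| DeepSeek-R1-Distill-Qwen-7B  | 39.7    | 30.7            | 24.6   | 24.0 |
| Next-best open-data 8B model | 50.7    | 44.3            | 52.9   | 52.9 |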
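
The text reports mean accuracy with a standard error. A minimal sketch of that computation follows, assuming the standard error is taken over repeated evaluation runs (for example, different sampling seeds); the source does not state how it is computed, and the function name is hypothetical.

```python
import math
import statistics


def mean_and_standard_error(run_accuracies: list[float]) -> tuple[float, float]:
    """Mean accuracy and standard error of the mean over repeated runs.

    Assumes each entry is the accuracy (in percentage points) of one
    independent evaluation run; at least two runs are required.
    """
    if len(run_accuracies) < 2:
        raise ValueError("need at least two runs to estimate a standard error")
    mean = statistics.fmean(run_accuracies)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    se = statistics.stdev(run_accuracies) / math.sqrt(len(run_accuracies))
    return mean, se
```

With purely illustrative inputs such as `[50.0, 52.0, 54.0, 52.0]` (made-up values, not results from the paper), the function returns a mean of 52.0 and a standard error of about 0.82 percentage points.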

5. Key Methodological Findings and Ablation Insights

Several outcomes from pipeline ablations and controlled experiments delineate data-centric best practices for SFT-based reasoning models:

  • Utilizing 16 sampled answers per question for SFT scaling produces large accuracy gains with minimal complexity.
  • Restricting data mixing to top 1–2 sources per domain, rather than maximizing diversity, improves overall dataset quality.
  • LLM-based question filtering strategies outperform classical text embedding or classifier (e.g., FastText) approaches.
  • Contrary to prior art, further answer filtering (using LLM verification, unit tests, or majority-vote) does not yield performance benefits; all generated answers are used directly.
  • The empirically best teacher for SFT is not necessarily the highest-scoring model on the target benchmarks; QwQ-32B produces better SFT results than the higher-ranked DeepSeek-R1 (Guha et al., 4 Jun 2025).
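
The first and fourth findings (16 sampled answers per question, no answer filtering) together with exact-string deduplication can be sketched as below. `teacher_generate` is a hypothetical stand-in for a call to the teacher model, and the dictionary keys are illustrative; neither reflects the project's actual code. The sketch also does not enforce that the sixteen samples differ from one another.

```python
from typing import Callable

ANSWERS_PER_QUESTION = 16  # from the text


def exact_dedup(questions: list[str]) -> list[str]:
    """Drop exact string duplicates, keeping the first occurrence of each."""
    seen: set[str] = set()
    unique = []
    for q in questions:
        if q not in seen:
            seen.add(q)
            unique.append(q)
    return unique


def build_sft_examples(
    questions: list[str],
    teacher_generate: Callable[[str, int], tuple[str, str]],
    n_samples: int = ANSWERS_PER_QUESTION,
    dedup: bool = True,
) -> list[dict[str, str]]:
    """Sample n_samples (chain-of-thought, answer) pairs per question.

    teacher_generate(question, sample_index) is a hypothetical callable that
    returns one sampled (chain_of_thought, answer) pair. Every sample is kept:
    no verification or majority-vote filtering is applied.
    """
    if dedup:  # per the text, math and science deduplicate; code does not
        questions = exact_dedup(questions)
    examples = []
    for q in questions:
        for sample_idx in range(n_samples):
            cot, answer = teacher_generate(q, sample_idx)
            examples.append({"prompt": q, "chain_of_thought": cot, "answer": answer})
    return examples
```

Passing `dedup=False` mirrors the code domain, where deduplication is skipped, so repeated questions each receive their own sixteen samples.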

6. Limitations and Open Directions

The OpenThinker3-7B work leaves several questions open and has known limitations:

  • The effect of tailoring the data-generation recipe to each domain versus cross-domain averaging remains unexplored.
  • It is uncertain whether further scaling of data and model size leads to diminishing returns, especially as student model performance converges toward that of the teacher.
  • The interaction between dataset scale, answer diversity, and question diversity at larger scales is not fully understood.
  • All OpenThinker generations to date show a degradation in safety alignment; approaches to reinforce safety without impairing reasoning capability remain an open challenge.
  • The potential for reinforcement learning from human feedback (RLHF) or actor-critic distillation to further improve performance is a subject for future research (Guha et al., 4 Jun 2025).

7. Release, Reproducibility, and Community Impact

The full dataset (OpenThoughts3-1.2M), model weights, prompts, and training code for OpenThinker3-7B are publicly released at https://openthoughts.ai, enabling reproduction of the results and targeted interventions in the model and data recipes. This open release directly addresses the opacity of state-of-the-art reasoning research and provides a foundation for further benchmarking, scaling, and methodological work on large-scale, open-source reasoning systems (Guha et al., 4 Jun 2025).

References (1)
