Qwen2.5-32B Reasoning Augmentation
- Reasoning-augmented Qwen2.5-32B is a method that boosts reasoning accuracy using a few high-quality chain-of-thought examples.
- It fine-tunes the 32B parameter model with expert CoT traces, significantly improving metrics like pass@1 and maj@64.
- Ablative studies reveal that maintaining structural consistency in reasoning patterns is more critical than superficial keyword cues.
Reasoning-augmented Qwen2.5-32B refers to a family of methodologies that enhance the reasoning abilities of the Qwen2.5-32B LLM without requiring full-scale reinforcement learning (RL) or distillation from larger models. Key approaches include targeted supervised fine-tuning on high-quality reasoning traces, ablation analyses that isolate the contribution of various data and architectural components, and empirical evaluations demonstrating that even minimal, carefully selected reasoning demonstrations can strongly activate reasoning behaviors in the base model.
1. Model Overview and Baseline Configuration
Qwen2.5-32B is a 32-billion-parameter decoder-only Transformer with rotary position embeddings, 64 layers, hidden size 5120, and 40 attention heads (grouped-query attention with 8 key-value heads). The model is instruction-tuned on broad general text corpora but not specifically adapted for long chain-of-thought (CoT) reasoning. For comparison, two other configurations are relevant:
- QwQ-32B-Preview: Equally sized (32B) LLM, further trained via RL (DeepSeek-R1 style) to generate 3K–8K token CoT traces featuring backtracking and reflection. Used solely to synthesize expert-level CoT data.
- Qwen2.5-Math-72B-Instruct: A 72B parameter instruction-tuned model for mathematics tasks, primarily trained on direct and short-form solutions, lacking explicit long-form CoT.
This architecture is not altered when augmenting reasoning capabilities; all advances are achieved through data and fine-tuning protocols.
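For reference, a minimal loading sketch is shown below, assuming the public Hugging Face checkpoint `Qwen/Qwen2.5-32B`; the original experiments run in NeMo-Aligner rather than `transformers`, so this is purely illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The base checkpoint is used unmodified; all reasoning gains come from the
# SFT data described in the following sections, not architectural changes.
model_id = "Qwen/Qwen2.5-32B"  # public Hugging Face checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
```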
2. Data Selection: Expert CoT, Human Curation, and Control Conditions
The experimental regime centers on small, high-quality CoT datasets drawn from advanced competition math. The selection protocol is as follows:
- Seed Problem Pool: 50 AIME/HMMT-level problems (number theory, combinatorics, algebra, geometry, some calculus), all multi-step with single numeric answers.
- Expert CoT Construction: 20 problems are sampled. For each, QwQ-32B-Preview generates 512 traces (via stochastic sampling), from which the longest correct CoT trace (~3400 tokens, on average) is selected.
- Difficulty Balancing: The 20-problem set is stratified by QwQ-estimated problem pass rate: uniformly distributing “easy” (pass ≥ 0.7), “medium” (0.3–0.7), and “hard” (< 0.3).
- Formatting: Each CoT is wrapped in a prompt template (“Problem: … Let's think step by step: … Final Answer: …”) with only BOS/EOS special tokens.
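A minimal sketch of the expert-trace selection step, assuming a hypothetical `sample_fn` wrapper around the QwQ-32B-Preview sampling backend and a simple answer parser (neither is specified in detail in the source):

```python
import re

def extract_final_answer(trace: str) -> str | None:
    # Traces end with "Final Answer: <value>" per the prompt template.
    m = re.search(r"Final Answer:\s*(\S+)", trace)
    return m.group(1) if m else None

def select_expert_trace(problem: str, gold: str, sample_fn, n: int = 512) -> str | None:
    """Draw n stochastic CoT traces from QwQ-32B-Preview and keep the longest
    one whose final answer matches the gold answer (~3,400 tokens on average)."""
    traces = sample_fn(problem, n)  # hypothetical sampling wrapper
    correct = [t for t in traces if extract_final_answer(t) == gold]
    return max(correct, key=len) if correct else None
```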
Three principal control conditions are tested:
- Non-reasoning CoT: Qwen2.5-32B is fine-tuned on 2,500–5,000 short/incomplete solutions lacking explicit multi-stage reasoning.
- Human-authored CoT: Human-written solutions (50; post-processed in four rounds to insert structural cues and LLM-guided self-verification steps) are evaluated for effect.
- Few-shot prompting baseline: The same 20 expert CoT are used only as demonstrations at inference, with no parameter update.
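The few-shot baseline amounts to simple prompt concatenation; a sketch follows (the template wording mirrors the SFT format above, and the demonstration ordering is an assumption):

```python
def build_few_shot_prompt(demos: list[tuple[str, str]], query: str) -> str:
    """Use the 20 expert (problem, CoT) pairs purely as in-context
    demonstrations; model weights are never updated in this condition."""
    blocks = [f"Problem: {p}\nLet's think step by step: {c}" for p, c in demos]
    blocks.append(f"Problem: {query}\nLet's think step by step:")
    return "\n\n".join(blocks)
```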
3. Fine-Tuning Methodology
Supervised fine-tuning uses the standard next-token cross-entropy loss $\mathcal{L} = -\sum_{t}\sum_{v} y_{t,v}\,\log \hat{y}_{t,v}$, where $y_{t,v}$ are one-hot targets over the vocabulary and $\hat{y}_{t,v}$ the predicted token probabilities.
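In PyTorch terms, this is a generic sketch (not the NeMo-Aligner implementation; the `-100` padding convention is an assumption):

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy.

    logits:  (batch, seq, vocab) model outputs
    targets: (batch, seq) token ids already shifted by one position;
             positions marked -100 (e.g. padding) are ignored.
    """
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )
```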
Key settings:
- Optimizer: AdamW (weight decay 0.01)
- Learning rate: small and constant throughout the 50 steps
- Batch size: the full example set per step when the dataset is smaller than the nominal batch
- Sequence packing: Up to 16,384 tokens
- Total steps: 50 (no early stopping)
- Infrastructure: data and tensor-model parallelism (4-way data, 8-way tensor) via NeMo-Aligner
- Evaluation metrics: pass@1 (fraction of single samples correct) and maj@64 (majority vote over 64 model samples per problem)
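A sketch of both metrics over the 64 samples per problem; exact estimator details (e.g. whether pass@1 is averaged over all samples) are assumptions:

```python
from collections import Counter

def pass_at_1(samples_per_problem: list[list[str]], golds: list[str]) -> float:
    """Expected single-sample accuracy, averaged over problems."""
    per_problem = [
        sum(s == g for s in samples) / len(samples)
        for samples, g in zip(samples_per_problem, golds)
    ]
    return sum(per_problem) / len(per_problem)

def maj_at_64(samples_per_problem: list[list[str]], golds: list[str]) -> float:
    """Majority vote over the 64 sampled final answers per problem."""
    hits = sum(
        Counter(samples).most_common(1)[0][0] == g
        for samples, g in zip(samples_per_problem, golds)
    )
    return hits / len(golds)
```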
Prompt engineering and structural guidance are heavily utilized:
- Each SFT example is consistently prepended (“Problem:”) and appended (“Final Answer:”).
- Scaffolding phrases (“Let’s think...,” “But wait...”) are included but not assigned special tokens.
- Human and non-reasoning data are iteratively edited—by LLMs with explicit self-check insertion—toward expert-like structure, emphasizing stepwise causal reasoning and error correction.
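Schematically, a single SFT example then looks as follows; the problem shown is illustrative, not from the actual 20-example set:

```python
# Abbreviated illustration of the structural cues in one training string.
example = (
    "Problem: Find the remainder when 7^2024 is divided by 100.\n"
    "Let's think step by step: The powers of 7 mod 100 should cycle. "
    "7^2 = 49, 7^3 = 343 ≡ 43, 7^4 ≡ 7 * 43 = 301 ≡ 1 (mod 100).\n"
    "But wait, let me verify: 7^4 = 2401, and 2401 mod 100 = 1. Confirmed.\n"
    "Since 2024 = 4 * 506, 7^2024 = (7^4)^506 ≡ 1^506 = 1 (mod 100).\n"
    "Final Answer: 1"
)
```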
4. Empirical Outcomes and Ablative Analyses
Comprehensive accuracy results on the Comp-Math-24-25 benchmark (256 problems):
| Model / Data | pass@1 | maj@64 | # Examples | Avg CoT length (tokens) |
|---|---|---|---|---|
| Qwen2.5-Math-72B-Instruct (baseline) | 11.72% | 16.14% | N/A | N/A |
| Qwen2.5-32B + Expert CoT (20 ex.) | 17.10% | 27.73% | 20 | 3,444 |
| Qwen2.5-32B + Non-reasoning CoT | ~12% | ~15% | ~2,500–5,000 | ~1,200 |
| Qwen2.5-32B + Human-written CoT | 5–10% | 13–18% | 50 | 2,600–3,200 |
| Qwen2.5-32B few-shot (20 ex.) | 5.38% | 13.28% | – | – |
Key findings:
- Fine-tuning with only 20 expertly synthesized long CoT examples beats the much larger 72B math-instruct model by +5.38 points pass@1 and +11.59 points maj@64.
- Large quantities of short, non-reasoning chains (even post-edited to include explicit verification, error correction, etc.) plateau at roughly 12–15% pass@1.
- Human-written CoT, even after intensive iterative LLM-aided editing and structural sectioning, does not surpass 10% pass@1.
Ablations indicate:
- Solution correctness: Training on 50 expert traces with incorrect final answers achieves nearly identical performance to 50 correct traces (21.2% vs 19.3% pass@1), indicating structural exposure is more important than correctness.
- Keyword removal: Eliminating high-frequency reasoning keywords (“but wait”, “I’ll check…”) from expert CoT has minimal impact (pass@1 drops only about 3%); see the sketch after this list.
- Problem difficulty and diversity: Performance is nearly constant for expert sets stratified by “easy,” “medium,” or “hard,” and for different counts of unique problems versus repeated solutions.
- CoT length scaling: Increasing average CoT length from 2K→8K tokens monotonically increases accuracy (pass@1 improves from 19.7%→22.3%; maj@64 from 27.0%→37.1%).
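A sketch of the keyword-removal ablation referenced above; the full keyword list is an assumption beyond the two phrases the source names:

```python
import re

# The source names "but wait" and "I'll check..."; the rest are assumed analogues.
KEYWORDS = [r"but wait", r"i'?ll check", r"let me verify", r"hold on"]

def strip_reasoning_keywords(trace: str) -> str:
    """Delete high-frequency reasoning keywords while leaving the global
    trace structure (steps, backtracking, verification) intact."""
    for kw in KEYWORDS:
        trace = re.sub(kw, "", trace, flags=re.IGNORECASE)
    return re.sub(r"[ \t]{2,}", " ", trace)  # tidy leftover double spaces
```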
5. Interpretation of the Structural Reasoning Signal
Analysis reveals that latent structural patterns—such as multi-stage backtracking, explicit self-verification, and hypothesis testing sequences—constitute the critical supervision signal. Performance is relatively insensitive to surface linguistic features, answer correctness, problem difficulty, or even example diversity.
- Style Homogeneity: Model traces from a single RL-trained “expert” are highly stylistically consistent, whereas human solutions are too variable.
- Backtracking and reflection: Trace features (reconsideration steps, explicit “wait, is this right?”) appear essential for knowledge transfer.
- Superficial keyword mimicry: Artificially inserting common phrases without preserving full trace structure fails to induce long reasoning behavior.
This suggests that structural alignment with the global flow of expert reasoning is a necessary (but not sufficient) condition, while lexical surface mimicry is not.
6. Limitations, Open Questions, and Transfer Considerations
Limitations
- Annotation consistency: Single-model-generated traces exhibit higher consistency; human data, even with extensive systematization, remain too heterogeneous.
- Domain specificity: Results are established on competition-style mathematics. Whether similar minimal SFT regimes suffice for tasks such as code generation, scientific reasoning, or commonsense is unproven.
- Style normalization: No explicit method for style-transfer between human and expert traces; annotator variance is a primary failure factor for human data.
Next Steps Proposed
- Development of style-normalization or curriculum learning schemes for human annotator pipelines.
- Investigation into training on partial reasoning chains (without gold final answers) to emphasize process over outcome.
- Integration of this minimal-SFT approach with retrieval components or tool-use routines, potentially scaling reasoning capacity to broader domains.
- Protocol development for structured, consistent human-LLM co-authored demonstrations—e.g., LLM-guided human annotation with tight structural scaffolding.
Implications
The results indicate that the “activation” of reasoning in large LLMs depends most critically on the quality and structural consistency of a very small number of expert demonstrations. This stands in contrast to prior emphasis on scale of data or model size, and points to the possibility of extremely data-efficient reasoning transfer, provided stylistic variance is controlled.
7. Conclusion
Reasoning-augmented Qwen2.5-32B can be constructed by fine-tuning on as few as 20 long, expertly generated CoT traces, yielding accuracy that surpasses much larger non-reasoning models on complex competition mathematics. The primary mechanism is the absorption of latent structural patterns typical of RL-trained expert models, rather than superficial mimicry or high-volume data. This paradigm raises new questions regarding the minimal sufficient ingredients for reasoning transfer and effective co-design of small but powerful reasoning corpora for open-domain LLMs.