
Qwen2.5-1.5B Transformer Model

Updated 27 November 2025
  • Qwen2.5-1.5B is a 1.5B-parameter, decoder-only transformer model designed for general language modeling and enhanced reasoning through instruction tuning and reinforcement learning.
  • It leverages advanced training techniques, including supervised fine-tuning, RL-based post-training, and knowledge distillation, to boost performance across tasks.
  • Empirical evaluations demonstrate significant improvements in math QA, code repair, and retrieval efficiency, underscoring its potential for lightweight, edge-deployable language applications.

Qwen2.5-1.5B is an open-source, decoder-only transformer LLM with approximately 1.5 billion parameters, belonging to the Qwen2.5 family. Designed as a compact yet instruction-capable LLM, Qwen2.5-1.5B incorporates architectural and training design choices that enable both general-purpose language modeling and reasonably strong reasoning capabilities, especially when augmented by recent post-training or distillation methodologies. This model serves as a foundation for research into reasoning, retrieval, code repair, multilingual instruction-following, and lightweight edge deployment, with empirical evaluations across QA, retrieval, mathematics, and specialized domains.

1. Model Architecture and Foundation

Qwen2.5-1.5B is a decoder-only transformer, architecturally similar to contemporary GPT-style LLMs. Canonical configurations from the Qwen2.5 technical lineage suggest approximately 24 transformer layers, a hidden (embedding) dimension of 2048, and 16–32 attention heads per layer, with GELU activations and layer normalization (or RMSNorm in some variants) (Sun et al., 19 Sep 2025, Yang et al., 18 Sep 2024, Cruz-Castañeda et al., 20 May 2025, Zheng et al., 22 Oct 2024). Rotary position embeddings are used for positional encoding, and no architectural modifications (MoE, expert layers, etc.) are introduced at this scale. The model employs next-token prediction as its foundational pretraining objective.
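
A minimal sketch for checking these hyperparameters against the released checkpoint via Hugging Face transformers; the hub identifier Qwen/Qwen2.5-1.5B and the Qwen2 configuration attribute names are assumptions about the published artifact, not values asserted by the cited reports:

```python
# Sketch: inspect the published checkpoint's architecture hyperparameters.
# Assumes the Hugging Face hub id "Qwen/Qwen2.5-1.5B" and the attribute
# names of the transformers Qwen2 configuration class.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen2.5-1.5B")

print("transformer layers:", config.num_hidden_layers)
print("hidden size:       ", config.hidden_size)
print("attention heads:   ", config.num_attention_heads)
print("rope theta:        ", getattr(config, "rope_theta", None))  # rotary position embeddings
```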

Pretraining utilizes a massive, diverse text and code corpus (hundreds of billions to trillions of tokens), but the explicit corpus composition is not detailed in recent technical reports specific to 1.5B. The vocabulary size is approximately 64K BPE tokens, and a context length of 4096–8192 tokens is supported depending on downstream fine-tuning (Chen et al., 23 May 2025, Cruz-Castañeda et al., 20 May 2025).
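
A similar hedged sketch for the tokenizer and maximum context length, again assuming the Qwen/Qwen2.5-1.5B hub identifier; the printed values are whatever the released checkpoint ships with:

```python
# Sketch: check the BPE vocabulary size and supported context window.
from transformers import AutoConfig, AutoTokenizer

model_id = "Qwen/Qwen2.5-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
config = AutoConfig.from_pretrained(model_id)

print("BPE vocabulary size:", tokenizer.vocab_size)
print("max positions:      ", config.max_position_embeddings)
print("example token ids:  ", tokenizer("Qwen2.5-1.5B is a decoder-only transformer.")["input_ids"][:8])
```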

2. Instruction Tuning and Specialization Paradigms

Instruction-tuned variants, such as Qwen2.5-1.5B-Instruct, are derived via supervised fine-tuning (SFT) on multi-task, instruction-following datasets containing synthetic or human-curated prompt-response pairs (Cruz-Castañeda et al., 20 May 2025, Chung et al., 27 May 2025, Wang et al., 21 Apr 2025). This SFT process is typically performed with the standard left-to-right cross-entropy loss $L = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$, where $x_{<t}$ denotes the prompt plus the partial answer generated so far.
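
A minimal PyTorch sketch of this objective for a single prompt-response pair, assuming the Qwen/Qwen2.5-1.5B-Instruct hub identifier; masking prompt tokens with -100 restricts the loss to the response, which is the usual SFT convention:

```python
# Sketch: supervised fine-tuning loss for one prompt-response pair.
# Assumes "Qwen/Qwen2.5-1.5B-Instruct" and standard transformers behavior:
# label positions set to -100 are ignored by the built-in cross-entropy.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Question: What is 17 + 25?\nAnswer:"
response = " 42"

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids

labels = full_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100  # only response tokens contribute to the loss

out = model(input_ids=full_ids, labels=labels)
print("SFT cross-entropy loss:", out.loss.item())
```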

For language specialization, the Amadeus-Verbo technical report demonstrates Portuguese language adaptation by fine-tuning Qwen2.5-1.5B-Instruct on 600,000+ domain-specific instruction–response pairs, using a batch size of 1, a learning rate of $1 \times 10^{-5}$, and gradient accumulation to cover two epochs over ≈78,800 examples (Cruz-Castañeda et al., 20 May 2025).
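
A hedged sketch of a comparable fine-tuning setup with the transformers Trainer, mirroring the reported batch size, learning rate, and epoch count; the two-example dataset and the accumulation step count are placeholders, not the report's configuration:

```python
# Sketch: SFT setup mirroring the reported hyperparameters
# (per-device batch size 1, lr 1e-5, gradient accumulation, 2 epochs).
# The tiny dataset below is a stand-in for the 600k-pair corpus.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "Qwen/Qwen2.5-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

pairs = [
    {"text": "Instruction: Explain what a transformer is.\nResponse: ..."},
    {"text": "Instruction: Translate 'hello' into Portuguese.\nResponse: olá"},
]
dataset = Dataset.from_list(pairs).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512)
)

args = TrainingArguments(
    output_dir="qwen2.5-1.5b-pt-sft",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,   # assumed value; the report only states accumulation is used
    learning_rate=1e-5,
    num_train_epochs=2,
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```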

Chain-of-thought (CoT) capabilities are further developed via SFT on data with explicit reasoning traces, or via distillation from larger LLM teachers as in SleepCoT (Zheng et al., 22 Oct 2024), or through knowledge distillation/teacher-student frameworks in DistilQwen2.5 (Wang et al., 21 Apr 2025). The distilled models retain the full 1.5B-parameter architecture; typically only the weights are updated through supervised and/or KD objectives, augmented by teacher-generated rationales.

3. Reinforcement Learning and Reasoning Augmentation

Qwen2.5-1.5B supports advanced RL-based post-training paradigms, notably R1-style RL, PPO, GRPO (Group Relative Policy Optimization), and multi-stage cognitive task decompositions. The "Thinker" framework (Chung et al., 27 May 2025) stages QA as a four-phase process: Fast Thinking (concise, low-token answer under 1K tokens), Verification (internal confidence estimation), Slow Thinking (deliberative answer, up to 6K tokens), and Summarization, each with stage-specific reward structures. Policy updates use actor–critic PPO with $\gamma = 1$, GAE $\lambda = 1$, clip ratio $\epsilon = 0.2$, and no KL penalty, with 4,096 trajectories per update. Empirically, Thinker-finetuned Qwen2.5-1.5B shows an 11.9% relative accuracy improvement on math QA (27.85% vs 24.88% baseline) and achieves high inference efficiency by leveraging Fast Thinking termination (Chung et al., 27 May 2025).
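
With $\gamma = 1$ and GAE $\lambda = 1$, the advantage estimate collapses to the remaining return minus the value baseline; the following minimal sketch illustrates that computation on a toy trajectory (the rewards and values are placeholders, and the stage-specific reward shaping is omitted):

```python
# Sketch: generalized advantage estimation with gamma=1, lambda=1.
# With these settings the estimator reduces to
# (sum of remaining rewards) - V(s_t) for each step t.
from typing import List

def gae(rewards: List[float], values: List[float],
        gamma: float = 1.0, lam: float = 1.0) -> List[float]:
    advantages = [0.0] * len(rewards)
    next_value = 0.0
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

# Toy trajectory with a sparse terminal reward.
print(gae(rewards=[0.0, 0.0, 1.0], values=[0.4, 0.5, 0.6]))
```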

In the domain of mathematical reasoning, Qwen2.5-Math-Instruct-1.5B applies a full self-improvement pipeline: pre-training on synthetic math data, iterative SFT with a reward model (RM) for sample selection, RLHF via GRPO using RM guidance, and output reranking at inference. This yields strong Chain-of-Thought and Tool-Integrated Reasoning (TIR) skills, with GSM8K accuracy up to 94.1% under RM@8 reranking (Yang et al., 18 Sep 2024).
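
A hedged sketch of the RM@$k$ reranking step at inference, assuming a generator and a scalar reward model; the `generate_answer` and `reward_score` callables are illustrative placeholders rather than the released pipeline:

```python
# Sketch: best-of-k (RM@k) reranking. Sample k candidate solutions,
# score each with a reward model, and return the highest-scoring one.
import random
from typing import Callable, List

def rm_at_k(question: str,
            generate_answer: Callable[[str], str],
            reward_score: Callable[[str, str], float],
            k: int = 8) -> str:
    candidates: List[str] = [generate_answer(question) for _ in range(k)]
    scored = [(reward_score(question, ans), ans) for ans in candidates]
    return max(scored, key=lambda pair: pair[0])[1]

# Toy stand-ins so the sketch runs end to end.
def generate_answer(q: str) -> str:
    return f"candidate-{random.randint(0, 99)}"

def reward_score(q: str, a: str) -> float:
    return random.random()

print(rm_at_k("Solve: 2x + 3 = 11", generate_answer, reward_score, k=8))
```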

Additionally, the "Reasoning Vector" methodology (Zbeeb et al., 1 Sep 2025) isolates RL-induced reasoning pathways as a model-parameter difference $v_{\text{reason}} = \theta_{\text{GRPO}} - \theta_{\text{SFT}}$, offering a reproducible means to transfer reasoning capabilities to other compatible instruction-tuned models with zero retraining.
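
A minimal sketch of extracting and applying such a vector with PyTorch state dicts; all three checkpoint identifiers below are placeholders, and the transfer assumes the models share an identical architecture and tokenizer:

```python
# Sketch: reasoning-vector extraction and transfer.
# v_reason = theta_GRPO - theta_SFT, added to a compatible target model.
import torch
from transformers import AutoModelForCausalLM

def load_state(name: str):
    return AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float32).state_dict()

grpo = load_state("org/qwen2.5-1.5b-grpo")   # placeholder id for the RL checkpoint
sft = load_state("org/qwen2.5-1.5b-sft")     # placeholder id for the SFT checkpoint

reasoning_vector = {k: grpo[k] - sft[k] for k in grpo}

target = AutoModelForCausalLM.from_pretrained("org/another-1.5b-instruct")  # placeholder
patched = {k: v + reasoning_vector[k] for k, v in target.state_dict().items()}
target.load_state_dict(patched)
```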

4. Retrieval, Code Repair, and Applied Task Performance

Qwen2.5-1.5B-Instruct is deployed in reinforced query rewriting for IR, notably within the TongSearch-QR system (Qin et al., 13 Jun 2025). Using GRPO and a semi-rule-based reward (incremental relevance via cosine similarity with a frozen embedding model), the model is fine-tuned to perform reasoning-intensive retrieval. On BRIGHT, TongSearch-QR-1.5B achieves nDCG@10=24.6 with BM25, rivaling commercial-scale LLMs at a much lower inference cost (efficiency ratio nDCG@10/cost = 2460).
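
A hedged sketch of the incremental-relevance reward idea, using a frozen sentence-embedding model to test whether the rewritten query moves closer to a relevant document than the original; the embedding model identifier and the exact reward formulation are assumptions, not the TongSearch-QR implementation:

```python
# Sketch: incremental-relevance reward for query rewriting.
# Reward = cos(rewrite, relevant doc) - cos(original query, relevant doc),
# computed with a frozen embedding model (the model id below is an assumption).
import torch
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def rewrite_reward(original_query: str, rewritten_query: str, relevant_doc: str) -> float:
    vecs = embedder.encode([original_query, rewritten_query, relevant_doc],
                           convert_to_tensor=True, normalize_embeddings=True)
    q0, q1, d = vecs[0], vecs[1], vecs[2]
    return float(torch.dot(q1, d) - torch.dot(q0, d))

print(rewrite_reward(
    "why sky blue",
    "physical explanation for Rayleigh scattering causing the sky to appear blue",
    "Rayleigh scattering of sunlight by air molecules makes the sky appear blue.",
))
```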

In compilation repair, the model is fine-tuned (again from the "Instruct" checkpoint) via RL on a high-fidelity C++ error corpus (CCrepair) with a hybrid reward combining (a) LLM-as-a-Judge semantic correctness (0.5) and (b) GCC compilability (0.5). The result is a compact agent that matches or outperforms the 14B model in both Genuine Fix Rate and Compilation Success Rate after RL (70.8% GFR, 81.9% CSR for 1.5B vs 71.1%/78.3% for 14B) (Sun et al., 19 Sep 2025).
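
A minimal sketch of the hybrid reward under the 0.5/0.5 weighting described above; the compilability term invokes `gcc` in syntax-only mode as a lightweight stand-in for full compilation, and `judge_score` is a placeholder for the LLM-as-a-Judge call:

```python
# Sketch: hybrid repair reward = 0.5 * judge score + 0.5 * compiles.
import os
import subprocess
import tempfile

def compiles(cpp_source: str) -> bool:
    # Syntax-only check via gcc; full compilation/linking would be stricter.
    with tempfile.NamedTemporaryFile("w", suffix=".cpp", delete=False) as f:
        f.write(cpp_source)
        path = f.name
    try:
        result = subprocess.run(["gcc", "-fsyntax-only", "-x", "c++", path],
                                capture_output=True)
        return result.returncode == 0
    finally:
        os.unlink(path)

def hybrid_reward(cpp_source: str, judge_score: float) -> float:
    # judge_score stands in for an LLM-as-a-Judge score in [0, 1].
    return 0.5 * judge_score + 0.5 * (1.0 if compiles(cpp_source) else 0.0)

fixed = "#include <iostream>\nint main() { std::cout << \"ok\\n\"; return 0; }\n"
print(hybrid_reward(fixed, judge_score=0.9))
```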

In domain-specialized health modeling, SleepCoT fine-tunes Qwen2.5-1.5B (using LoRA adapters) on synthetic chain-of-thought sleep reports and Q&A, achieving near parity with much larger LLMs in human-rated quality (4.7/5 mean vs 4.25 for Qwen2.5-7B) and fast on-device inference (Zheng et al., 22 Oct 2024).
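
A hedged sketch of attaching LoRA adapters with the `peft` library; the rank, alpha, and target module names below follow common Qwen2 attention-projection conventions and are illustrative, not the SleepCoT settings:

```python
# Sketch: LoRA adapter setup for parameter-efficient fine-tuning.
# Rank/alpha and target modules are assumptions, not the paper's exact values.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct",
                                             torch_dtype=torch.bfloat16)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```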

5. Distillation, Model Efficiency, and Resource-Scaled Deployment

DistilQwen2.5 (Wang et al., 21 Apr 2025) introduces a two-stage distillation pipeline for Qwen2.5-1.5B:

  • Black-box knowledge distillation, using multi-agent (expansion, rewriting, verification, selection) pipelines with teacher LLMs (e.g., GPT-4o) generating augmented, chain-of-thought-labeled SFT data.
  • White-box KD (model fusion), storing per-example top-K teacher logits and aligning student and teacher token distributions under temperature scaling (a minimal sketch of this loss follows below).
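
The sketch below illustrates the white-box KD term under the stated assumptions: stored top-K teacher logits, temperature scaling, and a KL-style match of the student distribution restricted to the teacher's top-K token ids; shapes and the temperature are illustrative:

```python
# Sketch: white-box KD loss on stored top-K teacher logits.
# For each position we keep the teacher's top-K token ids and logits,
# gather the student's logits at those ids, and match softened distributions.
import torch
import torch.nn.functional as F

def topk_kd_loss(student_logits: torch.Tensor,      # [seq, vocab]
                 teacher_topk_ids: torch.Tensor,    # [seq, K]
                 teacher_topk_logits: torch.Tensor, # [seq, K]
                 temperature: float = 2.0) -> torch.Tensor:
    student_at_topk = torch.gather(student_logits, dim=-1, index=teacher_topk_ids)
    teacher_probs = F.softmax(teacher_topk_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_at_topk / temperature, dim=-1)
    # KL(teacher || student) over the restricted top-K support, scaled by T^2.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2

# Toy shapes: sequence length 4, vocabulary 100, K = 8.
seq, vocab, K = 4, 100, 8
student = torch.randn(seq, vocab)
teacher_full = torch.randn(seq, vocab)
top_logits, top_ids = teacher_full.topk(K, dim=-1)
print(topk_kd_loss(student, top_ids, top_logits).item())
```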

The resultant DistilQwen2.5-1.5B achieves significant gains over base Qwen2.5-1.5B (e.g., AlpacaEval 2.0: 13.69 vs 6.69; MT-Bench (full): 7.35 vs 7.09). No model pruning is performed; the gains arise from the distillation data and KD objectives applied at the fine-tuning stage.

Resource-wise, Qwen2.5-1.5B delivers compelling inference efficiency (e.g., sub-$0.01/Mtoken in TongSearch-QR-1.5B), with real-time serving on single-node clusters using DeepSpeed ZeRO-3 (Qin et al., 13 Jun 2025, Zheng et al., 22 Oct 2024).

6. Empirical Limitations and Analysis of Cognitive Bias

The analysis of positional bias (Dimino et al., 25 Aug 2025) demonstrates that Qwen2.5-1.5B exhibits pronounced primacy bias in financial binary-choice tasks (effect size $r = 0.87$, $p < .001$, all categories), with bias mechanistically traceable to specific mid-to-late transformer layers and a small set of heads. Unlike the 7B or 14B Qwen2.5 models, which partially scale out the effect, 1.5B's bias is both large-magnitude and sharply localized. Mitigation strategies include prompt engineering (moderate framing, option randomization), architectural head regularization, and ongoing interpretability-driven bias audits.

A further limitation noted in production-level technical reports is that small-scale SFT in isolation is sample-inefficient compared with RL. The "Re-distillation" approach (Chen et al., 23 May 2025) addresses this by distilling high-impact RL rollouts back into SFT data, yielding SFT models that match RL performance (e.g., 0.82 pass@1 on Knight & Knave using just 1K RL-generated SFT samples vs full RL, at tenfold lower computational cost).
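
A minimal sketch of this data-construction step: keep rollouts that earned full reward and reuse them as SFT pairs. The rollout records, reward threshold, and field names are placeholders, not the paper's pipeline:

```python
# Sketch: "re-distillation" data construction — retain successful RL
# rollouts and convert them into SFT prompt-response pairs.
rollouts = [
    {"prompt": "Who lies? A says ...", "response": "A is a knave because ...", "reward": 1.0},
    {"prompt": "Who lies? B says ...", "response": "an incorrect chain ...", "reward": 0.0},
]

sft_data = [{"prompt": r["prompt"], "response": r["response"]}
            for r in rollouts if r["reward"] >= 1.0]  # keep only fully rewarded rollouts

print(len(sft_data), "rollouts retained for SFT")
```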

7. Benchmarks, Metrics, and Evaluation Results

Below is a summary table capturing representative results for Qwen2.5-1.5B and select distilled/fine-tuned variants across major benchmarks:

| Task / Metric | Qwen2.5-1.5B (Base/Instruct) | DistilQwen2.5-1.5B | RL/Thinker Fine-Tuned | Math-Instruct-1.5B | Notes |
|---|---|---|---|---|---|
| AlpacaEval 2.0 (LC) | 6.69 | 13.69 | – | – | (Wang et al., 21 Apr 2025) |
| MT-Bench (Full) | 7.09 | 7.35 | – | – | (Wang et al., 21 Apr 2025) |
| IFEval (Loose) | 55.40 | 61.10 | – | – | (Wang et al., 21 Apr 2025) |
| Math QA (Pass@1) | 3.81–24.88% | – | 27.85% (Thinker) | – | (Chung et al., 27 May 2025) |
| Compilation repair: GFR / CSR | 49.9% / 63.9% (SFT base) | – | 70.8% / 81.9% (RL) | – | (Sun et al., 19 Sep 2025) |
| Retrieval (nDCG@10, BRIGHT) | – | – | 24.6 | – | (Qin et al., 13 Jun 2025) |
| GSM8K (RM@8) | – | – | – | 94.1% | (Yang et al., 18 Sep 2024) |
| MATH (RM@8) | – | – | – | 83.9% | (Yang et al., 18 Sep 2024) |

Performance boosts from RL/post-hoc reasoning augmentation (e.g., the "reasoning vector" improves GSM8K by +4.9% absolute for 1.5B) establish that such methods are effective even at this modest scale, subject to alignment of architecture and tokenizer (Zbeeb et al., 1 Sep 2025).
