
Qwen-4B: Concept and Context

Updated 23 December 2025
  • Qwen-4B is not an officially released checkpoint in the Qwen language model series; it is treated here as a conceptual reference point, discussed through the closest documented configuration, Qwen-7B.
  • That configuration exemplifies contemporary transformer design, with 32 layers, rotary positional embeddings, and BFloat16 mixed-precision training for efficient computation.
  • Qwen models demonstrate competitive performance in reasoning, mathematics, and code tasks, setting a high bar among open-source LLMs.

Qwen-4B is not an enumerated or released checkpoint within the Qwen LLM series as described by Bai et al. in the "Qwen Technical Report" (Bai et al., 2023). The Qwen series comprises base LLMs of 1.8 billion (Qwen-1.8B), 7 billion (Qwen-7B), and 14 billion (Qwen-14B) parameters. For coverage of medium-scale methodology and performance, Qwen-7B is the closest available configuration and serves as the basis for the technical discussion below in the absence of a formally released Qwen-4B variant.

1. Model Overview and Architecture

The Qwen-7B model is a member of the Qwen LLM series, based on a Transformer architecture with 32 layers ($L = 32$), a hidden dimension of $d_\mathrm{model} = 4096$, and 32 attention heads per self-attention block. The context window for input sequences is 2048 tokens.
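For reference, these stated hyperparameters can be gathered into a small configuration object. This is an illustrative sketch; the class and field names are invented here and not taken from any official Qwen codebase:

```python
from dataclasses import dataclass

@dataclass
class QwenConfig:
    """Architecture hyperparameters for Qwen-7B as reported above."""
    n_layers: int = 32       # L = 32 Transformer blocks
    d_model: int = 4096      # hidden dimension
    n_heads: int = 32        # attention heads per self-attention block
    max_seq_len: int = 2048  # pretraining context window

    @property
    def d_head(self) -> int:
        # per-head dimension d_k = d_model / n_heads = 128
        return self.d_model // self.n_heads
```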

Each Transformer block consists of multi-head self-attention with rotary positional embeddings (RoPE) and NTK-aware interpolation for context extension, a feed-forward network with SwiGLU activation and a hidden size of $(8/3) \times d_\mathrm{model}$, and pre-normalization applied via RMSNorm in place of standard LayerNorm. The attention mechanism follows

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where $Q$, $K$, and $V$ are produced by learned projections of the input with $W \in \mathbb{R}^{d \times d_{kh}}$.
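As an illustration, the following is a minimal PyTorch sketch of the attention computation above together with an interleaved-pair RoPE of the kind described (an NTK-aware variant would rescale `base` to stretch the context window). It is a didactic sketch, not the released implementation, and omits the causal mask for brevity:

```python
import torch
import torch.nn.functional as F

def apply_rope(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotary positional embedding over interleaved dimension pairs.
    x: (..., seq, d); positions: (seq,). NTK-aware interpolation would
    adjust `base` when extending beyond the training context."""
    d = x.size(-1)
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    angles = positions.float()[:, None] * inv_freq[None, :]  # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """softmax(Q K^T / sqrt(d_k)) V for tensors of shape (batch, heads, seq, d_k)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    return F.softmax(scores, dim=-1) @ v

# Toy shapes: batch 1, 32 heads, 16 tokens, head dim 128 (= 4096 / 32)
q = torch.randn(1, 32, 16, 128)
k, v = torch.randn_like(q), torch.randn_like(q)
pos = torch.arange(16)
out = attention(apply_rope(q, pos), apply_rope(k, pos), v)  # (1, 32, 16, 128)
```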

2. Training Corpus and Data Processing

Pretraining for Qwen models uses a corpus comprising approximately 3 trillion tokens. The data includes deduplicated public web text, multilingual content with a primary focus on English and Chinese, code, books, encyclopedias, and high-quality instruction-style data. Rigorous filtering and deduplication are applied (including HTML extraction, language detection, exact and fuzzy deduplication, rule-based and model-based quality filtering), and high-quality data sources are up-sampled. Instructional content is filtered to prevent overlap with evaluation benchmarks.
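To make one stage of this pipeline concrete, the sketch below shows exact deduplication by hashing normalized text. It is a stand-in for a single step of the process described above; the actual pipeline (fuzzy deduplication, model-based quality scoring) is substantially heavier:

```python
import hashlib
import re

def normalize(text: str) -> str:
    # Collapse whitespace and lowercase so trivially different
    # copies of the same document hash identically.
    return re.sub(r"\s+", " ", text).strip().lower()

def exact_dedup(docs: list[str]) -> list[str]:
    """Drop exact duplicates by hashing normalized text."""
    seen, kept = set(), []
    for doc in docs:
        h = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept
```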

3. Optimization and Pretraining Procedure

Qwen-7B is trained with the AdamW optimizer (hyperparameters: $\beta_1 = 0.9$, $\beta_2 = 0.95$, $\epsilon = 1 \times 10^{-8}$, weight decay $= 0$), with a peak learning rate of $3 \times 10^{-4}$ decayed by a cosine schedule to $0.1\times$ the peak. Training proceeds with global batches of 4 million tokens in BFloat16 mixed precision using FlashAttention kernels. The 7B model sees up to 2.4 trillion tokens. Training infrastructure is based on multi-node GPU clusters (e.g., NVIDIA A100), employing both model and data parallelism; the precise hardware allocation remains proprietary.
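A minimal PyTorch sketch of this optimizer setup follows. The stated AdamW hyperparameters, peak learning rate, and 10% decay floor come from the description above; the stand-in module and `total_steps` value are illustrative (the report specifies a token budget, not a step count):

```python
import math
import torch

model = torch.nn.Linear(4096, 4096)  # stand-in module; the real network is the full Transformer

peak_lr = 3e-4
min_ratio = 0.1        # decay floor: 0.1 x peak, as stated above
total_steps = 100_000  # illustrative; chosen here only to make the schedule runnable

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=peak_lr,
    betas=(0.9, 0.95),  # beta_1, beta_2 as reported
    eps=1e-8,
    weight_decay=0.0,   # weight decay = 0 per the text above
)

def cosine_to_floor(step: int) -> float:
    # Cosine decay of the learning-rate multiplier from 1.0 down to min_ratio.
    progress = min(1.0, step / total_steps)
    return min_ratio + (1.0 - min_ratio) * 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, cosine_to_floor)
# Per training step: optimizer.step(); scheduler.step()
```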

4. Alignment and Chat Model Adaptation

After pretraining, the base models are adapted into Qwen-Chat models via supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) using Proximal Policy Optimization (PPO). SFT is conducted on 2048-token ChatML-formatted dialogs (batch size 128, 4000 steps, peak learning rate $2 \times 10^{-6}$ with AdamW, linear warmup for 1430 steps, weight decay $= 0.1$, dropout $= 0.1$, gradient clipping $= 1.0$). RLHF leverages a reward model trained from paired human-preference data (batch size 64, learning rate $3 \times 10^{-6}$), with PPO policy/value learning rates of $1 \times 10^{-6}$ and $5 \times 10^{-6}$, KL penalty $= 0.04$, and value loss clipping $= 0.15$.
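For concreteness, the sketch below serializes a dialog into the ChatML format mentioned above. The `<|im_start|>`/`<|im_end|>` delimiters are standard ChatML; exact special-token and whitespace handling in the released tokenizer may differ:

```python
def to_chatml(messages: list[dict[str, str]]) -> str:
    """Render a dialog into ChatML-style text for SFT.
    messages: list of {"role": ..., "content": ...} dicts."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>"
        for m in messages
    ]
    return "\n".join(parts)

dialog = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "4"},
]
print(to_chatml(dialog))
```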

5. Evaluation and Comparative Performance

Performance evaluations (reported for Qwen-7B, which serves as the proxy for any hypothetical Qwen-4B) employ 5-shot settings across knowledge, reasoning, mathematics, and code tasks; a sketch of 5-shot prompt assembly follows the results table:

| Task      | Qwen-7B (%) | Comparator 1 (%)   | Comparator 2 (%)    |
|-----------|-------------|--------------------|---------------------|
| MMLU      | 58.2        | LLaMA-7B: 35.6     | Baichuan2-7B: 54.7  |
| C-Eval    | 63.5        | LLaMA2-7B: 32.5    | ChatGLM2-6B: 51.7   |
| GSM8K     | 51.7        | LLaMA-7B: 11.0     | InternLM-7B: 31.2   |
| MATH      | 11.6        | LLaMA-7B: 2.9      | Baichuan2-7B: 5.6   |
| HumanEval | 29.9        | CodeLLaMA-7B: 33.5 | StarCoder-15B: 40.8 |
| MBPP      | 31.6        | CodeLLaMA-7B: 41.4 | StarCoder-15B: 49.5 |
| BBH       | 45.0        | LLaMA2-7B: 38.2    | Falcon-7B: 28.0     |
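As a minimal illustration of the 5-shot setting, the sketch below assembles five demonstrations plus a test question into a single prompt. The Q:/A: template is a generic placeholder, not the evaluation harness actually used for these benchmarks:

```python
def five_shot_prompt(examples: list[tuple[str, str]], question: str) -> str:
    """Build a 5-shot prompt from (question, answer) pairs and a test question."""
    assert len(examples) == 5, "5-shot evaluation uses exactly five demonstrations"
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\n\nQ: {question}\nA:"
```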

Qwen-7B-Chat achieves 98% tool-selection accuracy and an all-task executable-code rate of approximately 70% in code-interpreter benchmarks, compared with 85%/72.9% for GPT-3.5 and 95%/86.8% for GPT-4.

6. Specialized Derivatives: Coding and Mathematics

Qwen also serves as the backbone for coding-specialized (Code-Qwen, Code-Qwen-Chat) and mathematics-focused (Math-Qwen-Chat) models. These variants are derived from base Qwen models via further domain adaptation and perform considerably better than open-source competitors, though still somewhat below proprietary systems.

7. Contextualization and Model Positioning

Among open-source models at the 7B-parameter scale, Qwen-7B demonstrates state-of-the-art or near state-of-the-art results across language, reasoning, math, and code tasks, consistently exceeding the performance of LLaMA, Falcon, MPT, Baichuan2, and InternLM. The scaling trajectory exemplified by Qwen-14B narrows the gap to proprietary LLMs such as GPT-3.5 and GPT-4 while maintaining open-source access at its capability level. The absence of a Qwen-4B variant in the published release reflects the series' strategy of targeting distinct parameter milestones to balance corpus size, training budget, and downstream task coverage (Bai et al., 2023).

References

Bai, J., et al. (2023). Qwen Technical Report. arXiv:2309.16609.
