Qwen-4B: Concept and Context
- Qwen-4B is not a released checkpoint in the Qwen language model series; it is treated here as a conceptual reference point, grounded in the closest released configuration, Qwen-7B.
- That reference configuration exemplifies modern Transformer design, with 32 layers, rotary positional embeddings, and mixed-precision training for efficient computation.
- Qwen models demonstrate competitive performance in reasoning, mathematics, and code tasks, setting a high bar among open-source LLMs.
Qwen-4B is not an enumerated or released checkpoint in the Qwen LLM series described in the "Qwen Technical Report" (Bai et al., 2023). The series comprises base LLMs of 1.8 billion (Qwen-1.8B), 7 billion (Qwen-7B), and 14 billion (Qwen-14B) parameters. For medium-scale methodology and performance, Qwen-7B is the closest available configuration and serves as the basis for the technical discussion below, in the absence of a formally released Qwen-4B variant.
1. Model Overview and Architecture
The Qwen-7B model is a member of the Qwen LLM series, based on a Transformer architecture with 32 layers, a hidden dimension of 4096, and 32 attention heads per self-attention block. The context window for input sequences during pretraining is 2048 tokens.
Each Transformer block consists of multi-head self-attention with rotary positional embeddings (RoPE) and NTK-aware interpolation for context extension, plus a feed-forward network with SwiGLU activation whose intermediate dimension is roughly 8/3 of the hidden size (the usual adjustment for SwiGLU's gating projection). Normalization uses RMSNorm in a pre-normalization (Pre-Norm) arrangement. The attention mechanism follows

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

where $Q = XW_Q$, $K = XW_K$, and $V = XW_V$ are produced by learned projections of the input $X$, with $d_k = d_{\text{model}} / n_{\text{heads}} = 128$.
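To make the formula concrete, the following is a minimal single-head NumPy sketch of RoPE plus scaled dot-product attention. It is illustrative only, not the official implementation: the NTK-aware interpolation used for context extension is omitted, and the head dimension of 128 simply follows from the 7B configuration (4096 hidden / 32 heads).

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary positional embeddings to x of shape (seq_len, head_dim)."""
    seq_len, head_dim = x.shape
    half = head_dim // 2
    # One rotation frequency per rotated pair of dimensions.
    inv_freq = 1.0 / (base ** (np.arange(half) / half))
    angles = np.outer(np.arange(seq_len), inv_freq)          # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) pair by its position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def attention(Q, K, V):
    """softmax(QK^T / sqrt(d_k)) V for a single head, with a causal mask."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # (seq_len, seq_len)
    # Causal mask: each position attends only to itself and earlier positions.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Single-head example with the 7B head dimension (4096 / 32 = 128).
seq_len, head_dim = 16, 128
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((seq_len, head_dim)) for _ in range(3))
out = attention(rope(Q), rope(K), V)   # RoPE is applied to queries and keys only
print(out.shape)                       # (16, 128)
```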
2. Training Corpus and Data Processing
Pretraining for Qwen models uses a corpus comprising approximately 3 trillion tokens. The data includes deduplicated public web text, multilingual content with a primary focus on English and Chinese, code, books, encyclopedias, and high-quality instruction-style data. Rigorous filtering and deduplication are applied (including HTML extraction, language detection, exact and fuzzy deduplication, rule-based and model-based quality filtering), and high-quality data sources are up-sampled. Instructional content is filtered to prevent overlap with evaluation benchmarks.
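As an illustration of the exact and fuzzy deduplication step, here is a minimal sketch that hashes normalized text for exact duplicates and uses character-shingle Jaccard similarity for near duplicates. The Qwen report does not publish its pipeline code, so the shingle size and similarity threshold below are illustrative assumptions (production pipelines at trillion-token scale typically use MinHash/LSH rather than pairwise comparison).

```python
import hashlib

def exact_key(text: str) -> str:
    """Normalize whitespace and hash, so byte-identical documents collide."""
    return hashlib.md5(" ".join(text.split()).encode("utf-8")).hexdigest()

def shingles(text: str, n: int = 5) -> set:
    """Character n-gram shingles used for fuzzy (near-duplicate) matching."""
    text = " ".join(text.split()).lower()
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a or b else 1.0

def deduplicate(docs, fuzzy_threshold=0.8):
    """Drop exact duplicates, then drop docs too similar to an already kept doc."""
    seen_hashes, kept, kept_shingles = set(), [], []
    for doc in docs:
        h = exact_key(doc)
        if h in seen_hashes:
            continue
        sh = shingles(doc)
        if any(jaccard(sh, prev) >= fuzzy_threshold for prev in kept_shingles):
            continue
        seen_hashes.add(h)
        kept.append(doc)
        kept_shingles.append(sh)
    return kept

corpus = [
    "Qwen is a series of large language models.",
    "Qwen  is a series of large language models.",   # exact duplicate after normalization
    "Qwen is a series of large language models!!",   # near duplicate
    "Completely unrelated document about astronomy.",
]
print(len(deduplicate(corpus)))  # 2
```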
3. Optimization and Pretraining Procedure
Qwen-7B is trained with the AdamW optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.95$, $\epsilon = 10^{-8}$, weight decay 0.1), with a peak learning rate of $3.0 \times 10^{-4}$ decayed by a cosine schedule to 10% of the peak. Training proceeds with global batches of roughly 4 million tokens in BFloat16 mixed precision, using FlashAttention kernels. The 7B model sees up to 2.4 trillion tokens. Training infrastructure is based on multi-node GPU clusters (e.g., NVIDIA A100), employing both model and data parallelism; the precise hardware allocation remains undisclosed.
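A sketch of the cosine schedule described above, decaying from the peak learning rate to 10% of the peak. The warmup length is an illustrative assumption (the report does not state it for pretraining); the step count is simply 2.4T tokens divided by the 4M-token batch, about 600k steps.

```python
import math

def cosine_lr(step: int, total_steps: int,
              peak_lr: float = 3.0e-4, min_ratio: float = 0.1,
              warmup_steps: int = 2000) -> float:
    """Cosine decay from peak_lr to min_ratio * peak_lr after a linear warmup.

    warmup_steps is an assumed value for illustration only.
    """
    if step < warmup_steps:                      # linear warmup to the peak
        return peak_lr * step / max(warmup_steps, 1)
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    min_lr = min_ratio * peak_lr
    return min_lr + (peak_lr - min_lr) * cosine

# 2.4T training tokens at 4M tokens per batch gives roughly 600k optimizer steps.
total_steps = 2_400_000_000_000 // 4_000_000
for s in (0, 2_000, total_steps // 2, total_steps):
    print(s, f"{cosine_lr(s, total_steps):.2e}")
```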
4. Alignment and Chat Model Adaptation
After pretraining, the base models are adapted into Qwen-Chat models via supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) using Proximal Policy Optimization (PPO). SFT is conducted on 2048-token ChatML-formatted dialogs with AdamW (batch size 128, 4000 steps, peak learning rate $2 \times 10^{-6}$ with linear warmup over 1430 steps, weight decay 0.1, dropout 0.1, gradient clipping 1.0). RLHF leverages a reward model trained from paired human-preference data (batch size 64), followed by PPO with separate policy and value learning rates, a KL penalty coefficient of 0.04, and value loss clipping at 0.15.
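The ChatML dialog format referenced above wraps each turn in role-tagged delimiters (`<|im_start|>`, `<|im_end|>`). A minimal formatter might look like the following sketch; the system prompt text is an illustrative placeholder, not the one used for Qwen-Chat.

```python
IM_START, IM_END = "<|im_start|>", "<|im_end|>"

def to_chatml(messages):
    """Render a list of {'role', 'content'} turns in ChatML, ending with an
    open assistant turn for the model to complete."""
    parts = [f"{IM_START}{m['role']}\n{m['content']}{IM_END}" for m in messages]
    parts.append(f"{IM_START}assistant\n")      # generation starts here
    return "\n".join(parts)

dialog = [
    {"role": "system", "content": "You are a helpful assistant."},   # assumed prompt
    {"role": "user", "content": "Summarize rotary positional embeddings."},
]
print(to_chatml(dialog))
```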
5. Evaluation and Comparative Performance
Performance evaluations (primarily for Qwen-7B, serving as a proxy for any hypothetical Qwen-4B) employ standard few-shot and zero-shot settings (e.g., 5-shot for MMLU and C-Eval) across knowledge, reasoning, mathematics, and code tasks:
| Task | Qwen-7B (%) | Comparator 1 (%) | Comparator 2 (%) |
|---|---|---|---|
| MMLU | 58.2 | LLaMA-7B: 35.6 | Baichuan2-7B: 54.7 |
| C-Eval | 63.5 | LLaMA2-7B: 32.5 | ChatGLM2-6B: 51.7 |
| GSM8K | 51.7 | LLaMA-7B: 11.0 | InternLM-7B: 31.2 |
| MATH | 11.6 | LLaMA-7B: 2.9 | Baichuan2-7B: 5.6 |
| HumanEval | 29.9 | CodeLLaMA-7B: 33.5 | StarCoder-15B: 40.8 |
| MBPP | 31.6 | CodeLLaMA-7B: 41.4 | StarCoder-15B: 49.5 |
| BBH | 45.0 | LLaMA2-7B: 38.2 | Falcon-7B: 28.0 |
Qwen-7B-Chat achieves 98% tool-selection accuracy and approximately 70% all-task executable-code rate in code-interpreter benchmarks, compared to GPT-3.5’s 85%/72.9% and GPT-4’s 95%/86.8%.
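For context on the few-shot protocol behind the table above, a typical harness simply concatenates k solved exemplars before the test question and parses the model's continuation. The sketch below shows only the prompt-construction step; the toy exemplars and the `Question:`/`Answer:` template are illustrative assumptions, not the official evaluation setup.

```python
def few_shot_prompt(exemplars, question, k=5):
    """Concatenate k (question, answer) exemplars before the test question."""
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in exemplars[:k]]
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)

# Toy GSM8K-style exemplars (illustrative, not drawn from the benchmark).
exemplars = [
    ("What is 2 + 3?", "5"),
    ("What is 10 - 4?", "6"),
    ("What is 7 * 6?", "42"),
    ("What is 9 + 8?", "17"),
    ("What is 12 / 3?", "4"),
]
print(few_shot_prompt(exemplars, "What is 15 + 27?"))
```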
6. Specialized Derivatives: Coding and Mathematics
Qwen also serves as the backbone for coding-specialized (Code-Qwen, Code-Qwen-Chat) and mathematics-focused (Math-Qwen-Chat) models. These variants are derived from the base Qwen models via continued domain-specific training and perform considerably better than open-source competitors, though still below proprietary systems.
7. Contextualization and Model Positioning
Among open-source models at the 7B-parameter scale, Qwen-7B demonstrates state-of-the-art or near state-of-the-art results across language, reasoning, math, and code tasks, consistently exceeding LLaMA, Falcon, MPT, Baichuan2, and InternLM. The scaling trajectory exemplified by Qwen-14B narrows the gap to proprietary LLMs such as GPT-3.5 and GPT-4 while remaining openly available. The absence of a Qwen-4B variant in the published release reflects the series' strategy of targeting a small set of parameter milestones that balance corpus size, training budget, and downstream task coverage (Bai et al., 2023).