Qwen3-8B-Base: Open 8B LLM Overview
- Qwen3-8B-Base is an open, dense 8B parameter decoder-only transformer built for versatile multilingual and reasoning-intensive tasks.
- It employs Rotary Position Embeddings (RoPE), YaRN and Dual Chunk Attention (DCA) for long-context support, and dual inference modes with optional chain-of-thought reasoning.
- Benchmark results demonstrate competitive performance on tasks such as GSM8K and MATH, underpinned by a robust pretraining regimen spanning 36 trillion tokens across 119 languages.
Qwen3-8B-Base is an open, dense LLM comprising 8 billion parameters, designed for versatile multilingual and reasoning-intensive tasks. It serves as a foundational member of the Qwen3 series, integrating flexible inference modes and supporting advanced deployment settings, with performance competitive against both contemporary open and closed-source models (Yang et al., 14 May 2025).
1. Architectural Features
Qwen3-8B-Base is a standard decoder-only transformer with 36 blocks, each using Grouped-Query Attention (GQA): 32 query heads and 8 key/value heads per block. The head dimension is 128, giving a hidden size of 4096 (32 × 128). The MLP components use the SwiGLU nonlinearity, and normalization is RMSNorm in the pre-normalization arrangement. Positional information is encoded with Rotary Position Embeddings (RoPE), whose base frequency is raised to 1,000,000 via the Adjusted Base Frequency (ABF) technique for long-context support. Tokenization uses a byte-level BPE with a vocabulary of 151,669 tokens. Qwen3-8B-Base does not tie input and output embeddings, and supports context lengths of up to 128,000 tokens via YaRN and Dual Chunk Attention (DCA) at inference (Yang et al., 14 May 2025).
| Property | Value | Comment |
|---|---|---|
| Model size | 8B parameters | Dense, non-MoE |
| Layers/Heads | 36 layers, 32 Q / 8 KV heads per block | Grouped-Query Attention (GQA) |
| Hidden size | 4096 | 32 heads × 128 head dim |
| Tokenizer | BPE, 151,669 vocab | Byte-level |
| Positional Encoding | RoPE (base 1,000,000), ABF | Long-context support |
| Context Length | up to 128K tokens | YaRN + DCA at inference |
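These numbers can be sanity-checked directly against the published checkpoint. A minimal sketch, assuming the Hugging Face checkpoint Qwen/Qwen3-8B-Base and the field names of transformers' Qwen3Config:

```python
# Inspect the architectural hyperparameters from the table above.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("Qwen/Qwen3-8B-Base")
print(cfg.num_hidden_layers)    # transformer blocks (36)
print(cfg.num_attention_heads)  # query heads per block (32)
print(cfg.num_key_value_heads)  # shared K/V heads for GQA (8)
print(cfg.head_dim)             # per-head dimension (128)
print(cfg.rope_theta)           # RoPE base frequency (1,000,000)
print(cfg.tie_word_embeddings)  # False: untied input/output embeddings
print(cfg.vocab_size)           # embedding rows; may exceed the 151,669
                                # tokenizer entries due to padding
```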
2. Training Regimen and Data
Pretraining spans 36 trillion tokens covering 119 languages and dialects, sourced from web crawls, OCR-extracted PDFs, Wikipedia, code repositories, and synthetic STEM/coding datasets. The process is divided into three stages:
- General stage: 30T tokens, sequence length 4096, broad mixture.
- Reasoning stage: 5T high-quality STEM/code tokens, seq_len 4096, sharp LR decay.
- Long-context stage: hundreds of billions of tokens, seq_len up to 32,768 (75% of samples in [16K, 32K], 25% in [4K, 16K]), with the RoPE base frequency raised via ABF; YaRN and DCA extend the usable context further at inference.
Data mixture weights are optimized via instance-level labeling and ablations. AdamW is the optimizer, and the loss is standard autoregressive cross-entropy. Scaling-law predictions in the style of Kaplan et al. and Hoffmann et al. (Chinchilla) inform batch sizes and learning rates; explicit per-stage numbers for Qwen3-8B are not public (Yang et al., 14 May 2025).
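For scale, a quick back-of-envelope comparison against the Chinchilla-optimal heuristic (~20 tokens per parameter) shows how heavily over-trained the model is relative to compute-optimal training:

```python
# Back-of-envelope check on the pretraining scale, from the figures above.
total_tokens = 36e12            # 36T pretraining tokens
n_params = 8e9                  # 8B parameters
print(total_tokens / n_params)  # -> 4500.0 tokens per parameter, far beyond the
                                # ~20 tokens/param Chinchilla-optimal heuristic,
                                # i.e. deliberate over-training for inference efficiency
```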
3. Inference Modes and Thinking Budget
Qwen3-8B-Base provides two inference regimes:
- Thinking mode: the model emits explicit chain-of-thought reasoning enclosed in <think>…</think> tags.
- Non-thinking mode: only the final answer is produced, with an empty <think></think> block retained for template consistency.
Modes are controlled by /think or /no_think flags in the prompt. The "thinking budget" mechanism enforces an upper bound on the reasoning token count; once the budget is reached, the model injects a stop-thinking instruction and switches to generating the answer. Empirically, increasing the thinking budget consistently improves performance on STEM, code, and math tasks, at the cost of latency (Yang et al., 14 May 2025).
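A minimal sketch of budgeted generation with Hugging Face transformers follows. It targets the post-trained Qwen/Qwen3-8B chat model (the Base checkpoint ships without the chat template), and the budget value and the injected stop-thinking string are illustrative assumptions, not the official implementation:

```python
# Sketch of the thinking-budget mechanism: cap reasoning tokens, then force an answer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"  # post-trained chat model with the <think> template
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# The /think soft switch requests thinking mode; /no_think would disable it.
messages = [{"role": "user", "content": "What is 17 * 24? /think"}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt").to(model.device)

THINKING_BUDGET = 512  # upper bound on reasoning tokens (assumed value)
out = model.generate(**inputs, max_new_tokens=THINKING_BUDGET)
text = tok.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=False)

if "</think>" not in text:
    # Budget exhausted before reasoning closed: inject a stop-thinking
    # instruction and let the model produce the final answer.
    text += "\nConsidering the limited budget, I will answer now.\n</think>\n\n"
    cont = tok(prompt + text, return_tensors="pt").to(model.device)
    out = model.generate(**cont, max_new_tokens=256)
    text += tok.decode(out[0, cont.input_ids.shape[1]:], skip_special_tokens=True)
print(text)
```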
4. Resource and Deployment Characteristics
Qwen3-8B-Base requires approximately 16–24 GB of A100-class GPU memory (FP16) for 128K context windows. Throughput is context- and mode-dependent: on a single A100, non-thinking mode achieves ≈600–800 tokens/s and thinking mode ≈400–600 tokens/s. Recommended deployment optimizations include INT8/INT4 quantization with flash attention, careful context-window management with YaRN and DCA, and disabling reasoning when latency matters most (Yang et al., 14 May 2025). With LoRA adapters and 4-bit quantization, fine-tuning and inference fit on a single A100 40 GB GPU (Amorin et al., 30 Nov 2025).
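A sketch of the 4-bit + LoRA setup referenced above, assuming bitsandbytes NF4 quantization and the peft library; the LoRA rank, target modules, and other hyperparameters are illustrative, not the recipe of (Amorin et al., 30 Nov 2025):

```python
# QLoRA-style setup: 4-bit base weights plus trainable low-rank adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B-Base", quantization_config=bnb, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,                   # assumed values
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only a small fraction of the 8B is trained
```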
5. Benchmark and Downstream Performance
Qwen3-8B-Base outperforms or matches open LLMs such as Llama-3-8B and Qwen2.5-7B on a broad set of tasks, notably exhibiting:
- MMLU (5-shot): 76.89
- GSM8K (4-shot, CoT): 89.84
- MATH (4-shot, CoT): 60.80
- HumanEval+ (0-shot): 67.65
- MBPP (0-shot): 69.80
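Figures of this kind can in principle be reproduced with EleutherAI's lm-evaluation-harness; the sketch below assumes the v0.4-style simple_evaluate API and the harness's gsm8k task, which may not match the exact evaluation protocol of the technical report:

```python
# Hedged evaluation sketch with lm-evaluation-harness (task name and API
# are version-dependent assumptions).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Qwen/Qwen3-8B-Base,dtype=bfloat16",
    tasks=["gsm8k"],   # 4-shot CoT setting assumed
    num_fewshot=4,
    batch_size=8,
)
print(results["results"]["gsm8k"])  # per-task metrics dict
```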
In agentic tool-use and coding tasks, its out-of-the-box baseline scores in thinking/non-thinking modes, e.g., BFCL v3 (68.2%/59.8%), τ-bench Retail (45.2%/35.7%), and SWE-bench Verified (9.8%/8.0%), serve as reference points for post-training comparison (Wang et al., 8 Nov 2025). For financial sentiment classification, it leads or matches comparable LLMs in both zero-/few-shot (e.g., 0.82 accuracy / 0.78 F1 on FPB) and supervised settings, achieving strong results even with only 5% of the training set (Amorin et al., 30 Nov 2025).
6. Post-training and Fine-tuning Applications
Qwen3-8B-Base is widely used as a policy backbone for RL and other post-training pipelines. Notable examples include:
- Minimal Test-Time Intervention (MTI): Selective classifier-free guidance (CFG) and lightweight negative-prompt correction based on per-token entropy, achieving +1.35% average accuracy with <5% overhead by gating CFG to ≈4–10% of tokens (Yang et al., 15 Oct 2025); a sketch of the entropy gate follows this list.
- Asymmetric PPO Fine-tuning: Using two mini-critics from Qwen3-1.7B, prompt-level data sharding, and advantage/entropy masking gated on mini-critic agreement, AsyPPO yields +3.2 pp over PPO on math reasoning tasks (Liu et al., 2 Oct 2025).
- Reinforcement Learning with Verifiable Rewards (RLVR): Fine-tuning Qwen3-8B-Base with GRPO on synthetic, verifiable music-theory problems yields +13 pp accuracy (57.9% → 70.94%) on SSMR-Bench, improves math and music-generation performance, and surpasses GPT-4 zero-shot on the reasoning metrics of MusicTheoryBench (Wang et al., 4 Sep 2025).
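The entropy gate at the heart of MTI can be illustrated in a few lines. This is a hedged sketch, not the authors' implementation: the threshold tau, guidance scale w, and function name are illustrative assumptions, and in practice the unconditional (negative-prompt) forward pass would only be run for the gated tokens:

```python
# Entropy-gated classifier-free guidance for one decoding step, in the
# spirit of MTI: guide only where the model is uncertain.
import torch
import torch.nn.functional as F

def mti_step(logits_cond: torch.Tensor, logits_uncond: torch.Tensor,
             tau: float = 2.0, w: float = 1.5) -> torch.Tensor:
    """Combine (vocab,)-shaped logits from the guided and negative-prompt passes."""
    p = F.softmax(logits_cond, dim=-1)
    entropy = -(p * p.clamp_min(1e-12).log()).sum()
    if entropy <= tau:
        return logits_cond  # confident step: skip guidance, zero extra cost
    # High-entropy step: standard CFG combination of the two passes.
    return (1 + w) * logits_cond - w * logits_uncond
```

Because decoding entropy is low for most tokens, the guidance branch fires rarely, which is what keeps the measured overhead under 5%.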
7. Multilingual and Multimodal Capabilities
Qwen3-8B-Base extends Qwen2.5’s 29-language support to 119 languages/dialects, enabled by data mixture optimization and explicit multilingual pretraining. It is natively instruction-following and supports both natural and code-based queries in diverse scripts and encodings. Performance on multilingual datasets is not detailed explicitly for Qwen3-8B-Base, but the technical report attributes its broad generalization to this extensive linguistic pretraining. Multimodal variants (not covered here) originate from analogous training on vision-language datasets (Yang et al., 14 May 2025).
References:
- (Yang et al., 14 May 2025) Qwen3 Technical Report
- (Wang et al., 8 Nov 2025) Klear-AgentForge: Forging Agentic Intelligence through Posttraining Scaling
- (Amorin et al., 30 Nov 2025) Fine-tuning of lightweight LLMs for sentiment classification on heterogeneous financial textual data
- (Liu et al., 2 Oct 2025) Asymmetric Proximal Policy Optimization: mini-critics boost LLM reasoning
- (Yang et al., 15 Oct 2025) Less is More: Improving LLM Reasoning with Minimal Test-Time Intervention
- (Wang et al., 4 Sep 2025) Synthesizing Sheet Music Problems for Evaluation and Reinforcement Learning