Olmo 3: Open-Source Transformer Models
- Olmo 3 is a family of fully open-source decoder-only transformer models at 7B and 32B scales, designed for long-context reasoning, function calling, coding, and instruction following.
- The architecture uses Sliding-Window Attention (SWA) and Grouped-Query Attention (GQA) to efficiently handle context lengths up to 65K tokens with rotary position embeddings.
- Its transparent training pipeline publishes all stages, data compositions, and safety evaluations, ensuring reproducibility and rigorous validation in open LLM ecosystems.
Olmo 3 is a family of fully open-source decoder-only transformer-based LLMs released at two parameter scales: 7 billion (7B) and 32 billion (32B). Designed specifically for advanced long-context reasoning, function calling, coding, instruction following, general chat, and knowledge recall, Olmo 3 establishes a transparent model flow by publishing every step, checkpoint, and dependency in its lifecycle. Its flagship, Olmo 3 Think 32B, is recognized as the strongest fully-open thinking model released to date (OLMo et al., 15 Dec 2025).
1. Model Architecture
Olmo 3 employs a dense transformer backbone, modified for scalable long-context processing and efficient attention. Both model sizes implement Sliding-Window Attention (SWA): in three out of every four layers, attention is limited to a local window of 4096 tokens, with unrestricted full attention in every fourth layer. This reduces per-layer attention complexity from $O(L^2)$ to $O(L \cdot w)$, where $L$ is the sequence length (up to 65,536 after extension) and $w$ is the window size (4096).
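The local-window restriction can be illustrated with a minimal mask-construction sketch (not the Olmo 3 implementation; the toy `L` and `w` below are illustrative, only the 4096-token window and the complexity argument come from the text):

```python
import numpy as np

def sliding_window_causal_mask(seq_len, window=None):
    """Boolean mask: True where query position i may attend to key position j.

    window=None gives ordinary full causal attention (the full-attention
    layers); a finite window limits each query to the previous `window`
    tokens, as in the SWA layers.
    """
    i = np.arange(seq_len)[:, None]  # query positions
    j = np.arange(seq_len)[None, :]  # key positions
    causal = j <= i
    if window is None:
        return causal
    return causal & (i - j < window)

# Per-layer score count: full attention grows as L^2, SWA as roughly L*w.
L, w = 16, 4  # toy sizes; Olmo 3 uses w = 4096
full = sliding_window_causal_mask(L)
swa = sliding_window_causal_mask(L, w)
assert full.sum() == L * (L + 1) // 2                      # O(L^2) entries
assert swa.sum() == sum(min(k + 1, w) for k in range(L))   # ~O(L*w) entries
```

Counting the `True` entries makes the quadratic-versus-linear scaling concrete: the full causal mask has on the order of $L^2/2$ active pairs, while the windowed mask caps each row at $w$.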
The 32B variant speeds up inference with Grouped-Query Attention (GQA), clustering attention heads into groups of size 5 so that each group shares a single key/value head. Both models use rotary position embeddings (RoPE), RMSNorm for normalization, and 8K-token context windows during pretraining. Post-training, context length is extended to 65K tokens via YaRN RoPE scaling, applied only to the full-attention layers.
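A simplified sketch of YaRN-style frequency interpolation ("NTK-by-parts") is shown below. The scale factor of 8 matches the 8K-to-65K extension described above, but `base`, `alpha`, and `beta` are illustrative placeholder values, not Olmo 3's settings, and YaRN's attention-temperature correction is omitted:

```python
import numpy as np

def yarn_inv_freq(dim, base=10_000.0, scale=8.0, orig_ctx=8192,
                  alpha=1.0, beta=32.0):
    """Simplified YaRN interpolation of RoPE inverse frequencies.

    High-frequency dimensions (short wavelength, many rotations over the
    original context) keep their frequencies; low-frequency dimensions are
    divided by the scale factor; a linear ramp blends between the regimes.
    """
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    # Number of full rotations each dimension completes over orig_ctx.
    rotations = orig_ctx * inv_freq / (2 * np.pi)
    ramp = np.clip((rotations - alpha) / (beta - alpha), 0.0, 1.0)
    # ramp = 1 -> leave unscaled (high freq); ramp = 0 -> interpolate (low freq).
    return inv_freq * (ramp + (1.0 - ramp) / scale)
```

Under these assumptions, the fastest-rotating dimension is left untouched while the slowest is scaled by exactly $1/8$, which is what lets the extended 65K positions reuse the rotation range the model saw during 8K pretraining.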
Architecture Specifications
| Model | Layers | Hidden Size | Attention Heads | Attention Modifications |
|---|---|---|---|---|
| 7B | 32 | 4096 | 32 | SWA |
| 32B | 64 | 5120 | 40 (GQA) | SWA, GQA |
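The head grouping in the table can be sketched as follows; this is a minimal reference implementation of GQA, not Olmo 3's kernel, and the sequence length and head dimension are illustrative:

```python
import numpy as np

def gqa_attention(q, k, v, group_size):
    """Grouped-Query Attention: each group of `group_size` query heads
    shares one key/value head, shrinking the KV cache by that factor.

    q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d),
    with n_q_heads == n_kv_heads * group_size.
    """
    n_q, seq, d = q.shape
    # Repeat each KV head group_size times to align with the query heads.
    k_rep = np.repeat(k, group_size, axis=0)
    v_rep = np.repeat(v, group_size, axis=0)
    scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(d)
    # Causal mask: queries cannot attend to future keys.
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v_rep

# Olmo 3 32B: 40 query heads in groups of 5 -> 8 shared KV heads.
rng = np.random.default_rng(0)
q = rng.normal(size=(40, 6, 16))
k = rng.normal(size=(8, 6, 16))
v = rng.normal(size=(8, 6, 16))
out = gqa_attention(q, k, v, group_size=5)
assert out.shape == (40, 6, 16)
```

The design choice is a memory/quality trade-off: with groups of 5, the 32B model stores 8 rather than 40 key/value heads per layer in the KV cache, which matters at 65K-token contexts.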
2. Training Pipeline and Data Composition
Olmo 3's training pipeline is fully open, with all code, data mixes, checkpoints, and configurations published. Training progresses through three stages:
- Stage 1: Pretraining on Dolma 3 Mix (6T tokens), sourced from a 9T-token pool of web text, academic PDFs (238M processed via olmOCR), GitHub code, FineMath, arXiv LaTeX, and Wikipedia/Wikibooks. Three-pass trillion-scale deduplication (exact hashing, MinHash clustering, and fuzzy suffix arrays) reduces the document count by 75%. Data are classified into 480 topic-quality buckets using FastText classifiers derived from WebOrganizer, and the mixture is optimized via swarm-based mixture optimization (Olmix). Quality-aware upsampling biases the mix toward high-quality web data.
- Stage 2: Midtraining on Dolma 3 Dolmino Mix (100B tokens), distilled from a 2T-token pool, focuses on code, math, general QA, instruction data, and science PDFs. A two-part selection process employs micro-anneals (small proxy runs for quick signal) and integration tests (full 100B-token midtrains). Synthetic sources include TinyMATH, CraneCode, Reddit-to-Flashcards, TinyMATH-style meta-reasoning, Tulu 3, and Flan. Model-soup merging raises 32B midtrain scores.
- Stage 3: Long-Context Extension leverages Dolma 3 Longmino Mix (50B for 7B, 100B for 32B), with a 639B token pool of >8K token PDFs (filtered for compressibility) and synthetic aggregation tasks (CWE, REX). Tokens are mixed (34% long, 66% short) and trained with YaRN on full-attention layers, document packing, and intra-document masking to extend context up to 65K.
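The MinHash pass of Stage 1's deduplication can be sketched as follows. This is a generic near-duplicate detector, not the Dolma 3 pipeline; shingle size, signature length, and the seeded-hash approximation of random permutations are all illustrative choices:

```python
import hashlib

def shingles(text, n=3):
    """Word n-grams of a document, the set MinHash approximates Jaccard over."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def minhash_signature(sh, num_hashes=64):
    """One minimum per seeded hash function stands in for one permutation."""
    return [min(int(hashlib.sha1(f"{seed}:{s}".encode()).hexdigest(), 16)
                for s in sh)
            for seed in range(num_hashes)]

def est_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over a lazy dog",    # near-duplicate
    "transformers use attention to model long sequences",
]
sigs = [minhash_signature(shingles(d)) for d in docs]
assert est_jaccard(sigs[0], sigs[1]) > est_jaccard(sigs[0], sigs[2])
```

At trillion-token scale the point of signatures is that clustering compares fixed-length integer vectors instead of full documents; candidate pairs with high estimated Jaccard are then dropped or merged.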
All resources are released openly through public repositories.
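Stage 3's document packing with intra-document masking can be sketched as a mask built from packed-document boundaries; a minimal illustration, not the Olmo 3 training code:

```python
import numpy as np

def intra_document_mask(doc_ids):
    """Causal attention mask restricted to tokens of the same packed document.

    doc_ids[t] labels which document token t belongs to; a token may only
    attend to earlier tokens with the same label, so documents packed into
    one training sequence never attend across their boundaries.
    """
    pos = np.arange(len(doc_ids))
    causal = pos[None, :] <= pos[:, None]
    same_doc = doc_ids[:, None] == doc_ids[None, :]
    return causal & same_doc

# Two documents of lengths 3 and 2 packed into one sequence.
mask = intra_document_mask(np.array([0, 0, 0, 1, 1]))
assert not mask[3, 2]   # first token of doc 1 cannot see doc 0
assert mask[4, 3]       # but later tokens see earlier tokens of their own doc
```

Packing keeps long-context batches dense (no padding waste), while the mask preserves the statistical independence of the concatenated documents.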
3. Design Objectives and Advanced Capabilities
Olmo 3 excels in long-context inference, function calling, coding (including fill-in-the-middle tasks), instruction following, chat, and knowledge recall. Design advances include:
- Structured Thinking Traces: Supervised finetuning (Dolci Think SFT) incorporates meta-reasoning—self-awareness, backward-chaining, verification, strategy selection, and conceptual reasoning.
- Delta Learning via Direct Preference Optimization (DPO): Dolci Think applies a delta objective pairing responses from strong and weak models to maximize capability deltas. The DPO objective is
  $$\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right],$$
  where $y_w$ and $y_l$ are the preferred (strong-model) and rejected (weak-model) responses, $\pi_{\mathrm{ref}}$ is the reference policy, and $\beta$ controls deviation from it.
- RL with Verifiable Rewards (RLVR): Built on GRPO and DAPO, Olmo 3 Think uses off-policy sampling and a token-level PPO loss with advantage clipping and truncated importance sampling, eschewing a KL penalty. The RL objective is
  $$\mathcal{J}(\theta) = \mathbb{E}\!\left[\frac{1}{\sum_{i=1}^{G} |y_i|} \sum_{i=1}^{G} \sum_{t=1}^{|y_i|} \min\!\Big(r_{i,t}(\theta)\, \hat{A}_i,\; \mathrm{clip}\big(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_i\Big)\right],$$
  where $r_{i,t}(\theta) = \pi_\theta(y_{i,t} \mid x, y_{i,<t}) \,/\, \pi_{\theta_{\mathrm{old}}}(y_{i,t} \mid x, y_{i,<t})$ is the token-wise ratio of the current to the prior policy and $\hat{A}_i$ is the group-wise advantage. Rewards are diverse and verifiable: exact-answer correctness in math (checked with SymPy), pass@k for code, adherence to instruction-following constraints, and quality judged by an LM.
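A minimal numpy sketch of a token-level clipped objective with group-wise advantages is given below. It follows the GRPO/DAPO family described above but is not Olmo 3's training code: the clipping range `eps`, the group size, and the random log-probs are illustrative, and truncated importance sampling is omitted:

```python
import numpy as np

def group_advantages(rewards):
    """GRPO-style group-wise advantage: normalize each sampled response's
    scalar reward within its group of G responses to the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def clipped_token_objective(logp_new, logp_old, advantages, eps=0.2):
    """Token-level PPO-style clipped objective with no KL penalty.

    logp_new, logp_old: (G, T) per-token log-probs under the current and
    prior policies; advantages: (G,) group-wise advantages, broadcast to
    every token of a response.
    """
    ratio = np.exp(logp_new - logp_old)        # token-wise ratio r_{i,t}
    adv = advantages[:, None]                  # broadcast over tokens
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv
    return np.minimum(unclipped, clipped).mean()

rewards = np.array([1.0, 0.0, 0.0, 1.0])       # verifiable 0/1 rewards
adv = group_advantages(rewards)
rng = np.random.default_rng(0)
logp_old = rng.normal(-2.0, 0.1, size=(4, 8))
logp_new = logp_old + rng.normal(0.0, 0.05, size=(4, 8))
obj = clipped_token_objective(logp_new, logp_old, adv)
assert np.isfinite(obj)
```

Normalizing rewards within the group replaces a learned value baseline, and the `min` of clipped and unclipped terms caps how far a single update can push the policy away from the sampling policy.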
4. Evaluation Protocols and Performance Metrics
Olmo 3 is assessed through base and post-training evaluations across diverse benchmarks.
Base Model Performance
| Benchmark | Olmo 3 7B | Marin/Apertus (7–8B) | Olmo 3 32B | Marin/Apertus (32–70B) |
|---|---|---|---|---|
| Math | 54.7 | 39.6 (Marin 8B) | 69.7 | 49.3/39.7 |
| Code | 30.7 | 21.4 | 39.7 | 30.8/23.3 |
| STEM MCQA | 66.4 | 68.1 | 75.6 | 75.9/70.0 |
| GenQA | 72.5 | 71.6 | 79.4 | 80.3/75.0 |
Long-Context Performance
RULER dev scores (needle-in-haystack, aggregation):
| Model | 4K | 8K | 16K | 32K | 65K |
|---|---|---|---|---|---|
| 7B | 94.9 | 91.2 | 84.1 | 78.8 | 67.9 |
| 32B | 96.1 | 94.6 | 90.4 | 86.2 | 79.7 |
HELMET held-out: 7B up to 36.8, 32B up to 52.11.
Post-training Models
Olmo 3 Think 32B achieves:
- MATH: 96.2 (vs Marin 32B 36.8, Apertus 70B 36.2)
- AIME 24: 80.6
- BigBenchHard: 88.6
- HumanEval+: 91.5
- CodexPass: 91.5
- IFEval: 93.8
- MMLU: 86.4
- GPQA: 57.5
Extended RL (Think 3.1) improves math and instruction scores further.
5. Safety and Robustness Evaluations
Olmo 3 is validated on 12 safety tasks, including HarmBench, DAN, WildGuard, WildJailbreak, XSTest, TrustLLM, Toxigen, StrongReject, WMDP, and BBQ. The Think and Instruct variants outperform prior open models on refusal accuracy, with Olmo 3 Think 32B achieving up to 100% on Toxigen and above 90% on HarmBench, DAN, WildGuard, and related benchmarks.
6. Openness, Release Practices, and Resources
Olmo 3’s fully-open model flow distinguishes it in the open-source LLM ecosystem. All code for pretraining, midtraining, long-context extension, supervised finetuning, DPO, RLVR, data recipes, deduplication, and evaluation (OLMES, decon) is published on GitHub. Data mixes and pools are available on Hugging Face, and checkpoints are distributed via Hugging Face and Weights & Biases.
All stages, configurations, and model artifacts are fully documented and accessible, enabling comprehensive reproducibility and further investigation within the research community (OLMo et al., 15 Dec 2025).