GPT-OSS-120B: Open-Source 120B MoE LLM

Updated 2 July 2026

GPT-OSS-120B is a 120-billion-parameter open-weight mixture-of-experts LLM featuring sparse expert routing and chain-of-thought reinforcement learning.
Its architecture employs 80 transformer layers with 128 experts per layer, activating only about 3.4% of parameters per forward pass to optimize efficiency.
Empirical evaluations show strong performance in mathematics, coding, and logic tasks, with notable explainability and robust safety for on-premise deployment.

gpt-oss-120b is a 120-billion-parameter open-weight Mixture-of-Experts (MoE) LLM released by OpenAI in August 2025. It represents a significant advance in open-source reasoning LLMs, combining architectural scale, chain-of-thought optimized reinforcement learning, high efficiency via sparsely activated expert routing, and an Apache 2.0 license that explicitly permits broad research and commercial use. While it delivers strong performance on mathematics, coding, logic, and some multilingual benchmarks, it demonstrates varied cost-accuracy trade-offs, unique safety properties, and distinct behavioral patterns compared to both its smaller sibling, gpt-oss-20b, and leading proprietary LLMs.

1. Model Architecture and Training Paradigm

gpt-oss-120b is implemented as an autoregressive transformer extended with Mixture-of-Experts (MoE) layers in place of traditional dense feed-forward sublayers. The canonical configuration comprises approximately 80 transformer layers, each with a residual stream dimension of 12,288, grouped-query attention (typically 96 heads/layer), rotary position embeddings, and a sequence length of up to 131,072 when used with YaRN extension (OpenAI et al., 8 Aug 2025, Michelet et al., 3 Dec 2025). Each MoE block hosts 128 experts; per-token routing activates the top-4 experts, with routing scores produced by a learned linear router $W_g \in \mathbb{R}^{E \times d}$ .

The expert MLPs employ a two-layer SwiGLU activation with a clamp+residual function. Sparsity is extreme: only $\approx$ 3.4% of the model's parameters (5.1B) are active per forward pass, yielding a substantial reduction in inference FLOPs compared to a dense 120B transformer (OpenAI et al., 8 Aug 2025). This design enables the model to achieve similar or superior reasoning capacity within the memory and latency budgets of a single 80GB GPU (60.8GiB MXFP4-quantized checkpoint).

Pretraining is performed on trillions of web, code, and curated instruction-tuning tokens, followed by reinforcement learning from human/AI feedback (RLHF) targeting chain-of-thought (CoT) clarity, correctness, and tool-use capability. The post-training "CoT RL" phase employs supervised CoT trajectory collection, PPO-based reward optimization, and role/format tokens in the custom "harmony" chat protocol (OpenAI et al., 8 Aug 2025).

2. Reasoning Mechanisms and Explainability

The model explicitly supports chain-of-thought reasoning as a controllable inference modality. User instructions such as "Reasoning: LOW/MEDIUM/HIGH" guide the model to emit varying depths of intermediate "analysis" tokens, recorded in a separate output channel (Michelet et al., 3 Dec 2025). The mechanism parses complex queries into sequential sub-inferences, with the internal reasoning trace emitted prior to the "final" answer.

CoT quality metrics extend Lee et al.'s four-criterion scoring: factuality ( $F$ ), validity ( $V$ ), coherence ( $C$ ), and utility ( $U$ ), each computed over the number of reasoning steps $S$ :

$F = 1 - \frac{e_f}{S},\qquad V = 1 - \frac{e_v}{S},\qquad C = 1 - \frac{e_c}{S},\qquad U = 1 - \frac{e_u}{S}$

$\mathrm{CoT}_{\mathrm{score}} = \frac{F + V + C + U}{4}$

Medium-depth reasoning (8–12 steps) maximizes explainability and reviewer trust, with low-depth CoTs being underspecified, and high-depth traces frequently degenerating into redundant or looping statements (Michelet et al., 3 Dec 2025).

3. Empirical Performance and Benchmarking

General and Subject-Specific Benchmarks

Empirical evaluation situates gpt-oss-120b in the mid-to-upper tier of contemporary open-source models, with notable strengths and limitations:

Model	MMLU	GSM8K	HumanEval	FinQA	PIQA	MedQA	C-Eval	Avg.
gpt-oss-120b	66	75	71	65	71	59	42	64.8
gpt-oss-20b	69	78	73	68	74	62	45	67.7
Phi-4 14.7B	90	87	88	79	83	72	56	79.5
DeepSeek-R1 70B	88	91	88	82	85	75	68	82.4

gpt-oss-120b underperforms its 20B variant and denser models (Phi-4, DeepSeek-R1) on broad-knowledge, code, and multilingual tasks (Bi et al., 17 Aug 2025). On real-world clinical diagnostic tasks, gpt-oss-120b achieves 80–84% average accuracy, near-parity with proprietary models like o4-mini and GPT-5, and exceeds the performance of DeepSeek-R1 on some specialty subtasks (Munim et al., 18 Dec 2025).

Competitive Programming and Mathematical Reasoning

On LiveOIBench—a 403-task informatics olympiad benchmark—gpt-oss-120b achieves a 60th human percentile, 73.61% any-medal, and 47.78% pass rate. Sub-task breakdown reveals relative strength in implementation and mathematics but marked difficulty with graph algorithms, greedy methods, and dynamic programming (Zou et al., 10 Oct 2025).

In the International Olympiad in Informatics (IOI) gold-medal setting, gpt-oss-120b, coupled with the GenCluster test-time compute protocol, surpasses the gold threshold with a submitted score of 446.75/600 when K=5000 candidates are generated per task—marking the first open-weight model to do so (Samadi et al., 16 Oct 2025).

On the SAIR equational theories Stage 1 competition, prompt engineering saturates at a 71–79% balanced accuracy ceiling, with further prompt complexity yielding no gains due to undecidability and cognitive limits—gpt-oss-120b matches a single-prompt ceiling that remains lower than some stronger dense models (Cazares, 20 Apr 2026).

4. Architectural Efficiency, Trade-offs, and Scaling

The model's MoE design (128 experts/layer, top-4 routing) reduces inference costs by activating only a small subset of experts per token. The quantized (MXFP4) model fits on a single H100 (80GB). Active parameter count (5.1B) and token throughput (128 tok/s) are substantially higher cost than the 20B sibling (3.6B, 178 tok/s), with 2.6× higher energy per response (Bi et al., 17 Aug 2025).

Post-training architecture search (Puzzle framework) yields the gpt-oss-puzzle-88B derivative, demonstrating up to 2.82× throughput gains on a single H100 via heterogeneous expert pruning, selective window-attention replacement, and FP8 KV-cache quantization. This maintains suite-average reasoning accuracy (100.8–108.2% retention) and establishes an accuracy-speed Pareto frontier, showing the potential for MoE model post hoc adaptation without quality loss (Bercovich et al., 12 Feb 2026).

Scaling MoE does not guarantee proportional capability gains; routing imbalance and limited gradient flow to "cold" experts induce inverse scaling phenomena, with the 120B variant sometimes underperforming the 20B sibling (Bi et al., 17 Aug 2025).

5. Model Internal Analysis and Safety

Weight Matrix SVD and Secret Dictionary

Singular value decomposition (SVD) of the lm_head weight matrix, performed in five lines of PyTorch, reveals interpretable semantic subspaces in the model's vocabulary. The left singular vectors of U identify token clusters corresponding to functional registers: punctuation, numerals, attribute nouns, formal specification terms, and multilingual scripts (Simplified/Traditional Chinese) (Miyashita, 21 May 2026).

The Vocabulary Cluster Score (VCS) quantifies token cluster coherence, while the Weighted Projection Score (WPS) detects glitch tokens, with shokubutsu-hyakka-tsu (ID 137606) correctly flagged as a CJK glitch. High-energy subspaces show no problematic content; detected glitches arise from rare inclusions (Category C per paper's taxonomy).

Interpretability and Jailbreak Robustness

Interpretability-based jailbreaking audits (Universal Steering and RepE) show that gpt-oss-120b is uniquely robust among eight SOTA models: no steering coefficient in the examined range can induce policy-violating responses, in stark contrast to high jailbreaking rates (up to 94%) in the 20B sibling, Llama-3, and Qwen3-32B (Agarwal et al., 22 Apr 2026). Scale, expert diversity, and aggressive RLHF contribute to this robustness.

6. Reasoning Patterns and Generalization

Controlled studies of chain-of-thought (CoT) data reveal that gpt-oss-120b naturally produces convergent, deductive reasoning trajectories—~75% of steps labeled Deduce, with low frequency of branching and backtracking (Propose→Propose ~0.34). By contrast, competing teacher models (e.g., DeepSeek-R1) favor branch-heavy exploration and frequent proposal switching (Li et al., 2 Apr 2026).

Supervised fine-tuning (SFT) of student models on gpt-oss-120b CoT data, despite higher token-level SFT loss, yields a consistent 3–5% gain in downstream math reasoning accuracy versus DeepSeek-R1 data. Filtering branch-heavy trajectories in the training set closes much of this gap, identifying compact, logically dense reasoning traces as superior generalization scaffolds.

7. Practical Deployment and Tooling

gpt-oss-120b supports full on-premise inference with local explainability: all internal CoT, decision traces, and tool interactions are accessible and auditable. The quantized model can be deployed on clinical edge devices (5.5GB RAM for 4-bit quantized, 180–220ms/token on ARM CPUs) with privacy-preserving guarantees (Munim et al., 18 Dec 2025). The "harmony" chat and tool APIs permit deep research browsing, Python execution, and developer-defined function calls within a strict role-delineated protocol (OpenAI et al., 8 Aug 2025).

Agentic capabilities include deep research, stateful code execution, and function-calling within an explicit reasoning framework, supporting tasks ranging from clinical diagnosis to digital forensics traceability (Michelet et al., 3 Dec 2025, Munim et al., 18 Dec 2025).

References

(OpenAI et al., 8 Aug 2025) gpt-oss-120b & gpt-oss-20b Model Card
(Bi et al., 17 Aug 2025) Is GPT-OSS Good? A Comprehensive Evaluation of OpenAI's Latest Open Source Models
(Zou et al., 10 Oct 2025) LiveOIBench: Can LLMs Outperform Human Contestants in Informatics Olympiads?
(Samadi et al., 16 Oct 2025) Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models
(Michelet et al., 3 Dec 2025) Hey GPT-OSS, Looks Like You Got It - Now Walk Me Through It! An Assessment of the Reasoning LLMs Chain of Thought Mechanism for Digital Forensics
(Munim et al., 18 Dec 2025) Benchmarking and Adapting On-Device LLMs for Clinical Decision Support
(Bercovich et al., 12 Feb 2026) Extending Puzzle for Mixture-of-Experts Reasoning Models with Application to GPT-OSS Acceleration
(Li et al., 2 Apr 2026) On the Role of Reasoning Patterns in the Generalization Discrepancy of Long Chain-of-Thought Supervised Fine-Tuning
(Cazares, 20 Apr 2026) Less Is More: Cognitive Load and the Single-Prompt Ceiling in LLM Mathematical Reasoning
(Agarwal et al., 22 Apr 2026) Breaking Bad: Interpretability-Based Safety Audits of State-of-the-Art LLMs
(Miyashita, 21 May 2026) Check Your LLM's Secret Dictionary! Five Lines of Code Reveal What Your LLM Learned (Including What It Shouldn't Have)