GPT-OSS: Open Weight Transformer Models

Updated 2 July 2026
  • GPT-OSS models are a family of open-weight transformer-based LLMs that integrate mixture-of-experts architecture, chain-of-thought reasoning, and agentic tool use.
  • They offer efficient inference with reduced active parameters and support scalable deployment on commodity hardware for safety-critical applications.
  • The models facilitate in-depth auditability and robust performance benchmarks in code generation, mathematical problem-solving, and tool integration.

GPT-OSS models are a family of open-weight, transformer-based LLMs developed by OpenAI, notable for their mixture-of-experts (MoE) architecture, chain-of-thought (CoT) reasoning capability, and agentic tool use. They are released under a permissive Apache 2.0 license, with model weights, inference implementations, tool environments, and tokenizers available for local deployment and further research, enabling auditability and controllable inference for safety-critical and privacy-sensitive applications (OpenAI et al., 8 Aug 2025, Bi et al., 17 Aug 2025, Michelet et al., 3 Dec 2025). The flagship models—gpt-oss-20b and gpt-oss-120b—target efficient, high-fidelity natural language reasoning for code generation, advanced mathematical problem-solving, safety, and agentic workflows, while supporting extended context windows and multi-channel structured outputs (Wicaksono et al., 21 Sep 2025, Kumar et al., 22 Aug 2025).

1. Architecture and Model Variants

GPT-OSS models adopt a Pre-LN autoregressive transformer backbone with interleaved mixture-of-experts feed-forward layers, yielding marked efficiency and deployment advantages over dense models (OpenAI et al., 8 Aug 2025, Kumar et al., 22 Aug 2025, Bi et al., 17 Aug 2025):

  • Model scale: gpt-oss-20b has 20.9B total and 3.61B active parameters across 24 layers; gpt-oss-120b has 116.8B total and 5.13B active parameters across 36 layers. Depth and width are scaled to match state-of-the-art open models (OpenAI et al., 8 Aug 2025).
  • MoE design: Each MoE block contains a router network that selects the top-k experts per token (k=4 for both models, per the OpenAI model card) from a larger pool (32 experts per block for 20b, 128 for 120b). This reduces per-token active parameters to roughly 17% (20b) and 4% (120b) of the total count (Kumar et al., 22 Aug 2025, OpenAI et al., 8 Aug 2025).
  • Self-attention: Grouped-query attention (64 query heads and 8 key-value heads per layer), rotary positional encoding, and alternating windowed and dense attention layers. Context length is extended up to 131,072 tokens using YaRN.
  • Quantization: Post-training quantization to MXFP4 (4.25 bits/param), with checkpoint sizes 12.8 GiB (20b) and 60.8 GiB (120b), supporting inference on single 16 GB (20b) or 80 GB (120b) GPUs (OpenAI et al., 8 Aug 2025).
  • Inference protocol: Harmony chat format with structured channels—analysis (CoT), commentary (tools), and final (user-facing answer)—enabling step-wise reasoning and safe agentic tool use (OpenAI et al., 8 Aug 2025, Michelet et al., 3 Dec 2025, Wicaksono et al., 21 Sep 2025, Durner, 25 Sep 2025).
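
The top-k routing described in the MoE bullet can be illustrated with a minimal pure-Python sketch. It is not the released implementation: the function names, the toy scalar "experts", and the choice to normalize weights with a softmax over only the selected logits are assumptions made for illustration.

```python
import math

def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Assumption: weights are a softmax over the selected logits only,
    so they sum to 1 across the chosen experts.
    """
    if not 0 < k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest logits, highest first (ties keep index order).
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Numerically stable softmax restricted to the selected experts.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, router_logits, k):
    """Weighted sum of the outputs of the selected experts only."""
    out = [0.0] * len(x)
    for idx, weight in top_k_route(router_logits, k):
        y = experts[idx](x)  # unselected experts are never evaluated
        out = [o + weight * v for o, v in zip(out, y)]
    return out

# Toy setup (hypothetical): 4 experts that scale the input, 2 active per token.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]
routes = top_k_route(logits, k=2)
output = moe_forward([1.0, 1.0], experts, logits, k=2)
```

Because only the selected experts run, compute per token scales with k rather than with the size of the expert pool, which is the source of the reduced active-parameter count.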

2. Training Methodology and Chain-of-Thought Reasoning

GPT-OSS pretraining uses large-scale, filtered text and code corpora. The post-training phase combines chain-of-thought supervision with reinforcement learning objectives:

  • Autoregressive LM pretraining: For a token sequence $x_{1:T}$, the objective maximizes $\mathcal{L} = \sum_{t=1}^{T} \log p(x_t \mid x_{<t})$ (Wicaksono et al., 21 Sep 2025).
  • Chain-of-thought reinforcement learning: Models are fine-tuned to emit explicit internal "analysis" channels before the "final" answer. RL fine-tuning (PPO) maximizes expected reward for high-quality reasoning traces and safe refusals, penalizing divergence from human and tool feedback (OpenAI et al., 8 Aug 2025).
  • Variable inference effort: Users select low/medium/high reasoning levels, with log-linear accuracy gains as CoT length increases; excessive reasoning can hurt utility or introduce looping (Michelet et al., 3 Dec 2025).
  • Instruction hierarchy: Strict adherence to System > Developer > User role precedence, robust against prompt hijacking and bypass (OpenAI et al., 8 Aug 2025, Durner, 25 Sep 2025).
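
As a small worked instance of the autoregressive objective above, the sketch below sums the log-probabilities of the observed tokens under per-step conditional distributions. The function name and the toy two-token distributions are hypothetical; a real model would produce these distributions from its softmax output.

```python
import math

def sequence_log_likelihood(step_distributions, tokens):
    """Sum of log p(x_t | x_<t) over the sequence.

    step_distributions[t] maps each candidate token to its probability
    given the preceding tokens; tokens[t] is the token observed at step t.
    """
    if len(step_distributions) != len(tokens):
        raise ValueError("need exactly one distribution per token")
    return sum(math.log(dist[tok])
               for dist, tok in zip(step_distributions, tokens))

# Toy example: a two-token vocabulary and a two-step sequence.
dists = [{"a": 0.5, "b": 0.5}, {"a": 0.25, "b": 0.75}]
ll = sequence_log_likelihood(dists, ["a", "b"])  # log(0.5) + log(0.75)
```

Training maximizes this quantity (equivalently, minimizes its negative, the cross-entropy loss) averaged over the corpus.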

3. Capabilities, Tool Use, and Agentic Integration

GPT-OSS models exhibit strong agentic capabilities, including tool-use and multi-agent orchestration (OpenAI et al., 8 Aug 2025, Wicaksono et al., 21 Sep 2025, Michelet et al., 3 Dec 2025):

  • Tool calling: Functions and tool schemas (such as Python execution, browser search, repo navigation) are declared in the Developer message and invoked via Harmony protocol. Proper tool definition is required to realize full agentic performance (Mavrin, 1 Apr 2026).
  • Agentic observability: AgentSeer records every action in agentic systems, building action- and component-graphs for node/edge coverage metrics. This supports fine-grained red teaming and vulnerability tracing (Wicaksono et al., 21 Sep 2025).
  • Local deployment: Models are optimized for edge scenarios—no reliance on OpenAI APIs, full chain-of-thought auditability, and data containment for privacy/security-sensitive workflows (e.g., digital forensics, medical diagnostics, military applications) (Michelet et al., 3 Dec 2025, Fitzgerald et al., 30 Oct 2025, Bandara et al., 29 Oct 2025).
  • Agentic red teaming: Model-level vulnerabilities often fail to predict deployment-phase failures. Agentic-only attack vectors emerge in tool-using contexts (ASR +24% vs non-tool); some vulnerabilities are exclusive to agentic loops (Wicaksono et al., 21 Sep 2025).
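
The separation of reasoning, tool calls, and user-facing output can be sketched as a simple dispatcher. This is an illustration only: the message dictionaries, the JSON call layout, and the `run_turn` and `add` names are hypothetical and do not reproduce the Harmony wire format or any published client API.

```python
import json

def run_turn(messages, tools):
    """Route channel-tagged messages.

    analysis   -> kept in an audit log, never shown to the user
    commentary -> parsed as a tool call and executed if the tool is declared
    final      -> returned as the user-facing answer
    """
    audit_log, tool_results, final = [], [], None
    for msg in messages:
        channel = msg["channel"]
        if channel == "analysis":
            audit_log.append(msg["content"])
        elif channel == "commentary":
            call = json.loads(msg["content"])
            fn = tools.get(call["name"])
            if fn is None:
                # Undeclared tools are reported rather than executed.
                tool_results.append({"name": call["name"], "error": "unknown tool"})
            else:
                tool_results.append(
                    {"name": call["name"], "result": fn(**call["arguments"])}
                )
        elif channel == "final":
            final = msg["content"]
    return final, tool_results, audit_log

# Hypothetical transcript with one declared tool.
messages = [
    {"channel": "analysis", "content": "Need 2 + 3; call the add tool."},
    {"channel": "commentary",
     "content": json.dumps({"name": "add", "arguments": {"a": 2, "b": 3}})},
    {"channel": "final", "content": "2 + 3 = 5"},
]
final, results, log = run_turn(messages, {"add": lambda a, b: a + b})
```

Keeping the analysis channel in a separate log is what makes local chain-of-thought auditing possible without exposing raw reasoning to end users.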

4. Performance, Efficiency, and Benchmarking

Published benchmarks place GPT-OSS variants in the mid-to-upper tier of open-weight models, with distinctive efficiency profiles (Bi et al., 17 Aug 2025, Kumar et al., 22 Aug 2025):

| Model | Params (B) | Active (B) | GPU Mem (GB) | Throughput (tok/s) | Energy (rel.) | HumanEval (%) | MMLU | C-Eval |
| GPT-OSS-20B | 20.8 | 3.6 | 16 | 178 | | 73 | 69 | 45 |
| GPT-OSS-120B | 117 | 5.1 | 80 | 128 | 2.6× | 71 | 66 | 42 |
| Qwen3-32B | 32 | 32 | 64 | 23.7 | 13.1k J/1k tok | 80 | 92 | 89 |

  • Relative strengths: Code (HumanEval 73%), math, safety; robust agentic operation in properly configured environments (Bi et al., 17 Aug 2025, Mavrin, 1 Apr 2026).
  • Resource use: MoE sparsity yields 2–3× lower energy per response, peak memory 16 GB (20B), enabling dense inference packing or deployment on commodity edge devices (Kumar et al., 22 Aug 2025, Bi et al., 17 Aug 2025).
  • Inverse scaling: 20B outperforms 120B on nearly all benchmarks (Δ ≈ +2–3%, $p < 0.01$ for most), which runs counter to the usual expectation from scaling laws. Cleaner expert routing and reduced overfitting in sparse setups are offered as possible explanations (Bi et al., 17 Aug 2025).
  • No accuracy drop after optimization: Puzzle-optimized derivatives (e.g., gpt-oss-puzzle-88B) retain or exceed baseline accuracy while reducing parameter count by 27% and boosting throughput by up to 2.8× on a single H100 (Bercovich et al., 12 Feb 2026).
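
The sparsity behind these resource figures can be checked directly from the total and active parameter counts reported in Section 1. The short sketch below only redoes that arithmetic; the dictionary layout is an illustrative choice.

```python
# (total, active) parameter counts in billions, as reported in Section 1.
MODELS = {
    "gpt-oss-20b": (20.9, 3.61),
    "gpt-oss-120b": (116.8, 5.13),
}

# Share of parameters used per token, as a percentage.
active_pct = {
    name: 100.0 * active / total
    for name, (total, active) in MODELS.items()
}

for name, pct in active_pct.items():
    print(f"{name}: {pct:.1f}% of parameters active per token")
```

This gives about 17.3% for gpt-oss-20b and about 4.4% for gpt-oss-120b, in line with the ≈17% and ≈4% figures quoted for the MoE design.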

5. Security, Sociopragmatics, and Evaluation-Awareness

GPT-OSS exposes new dimensions in safety, prompt engineering, and adversarial robustness:

  • Attack surfaces: Agentic-level attacks using tool calls and memory state are more effective than standalone prompt injection (maximum ASR = 46% in tool contexts); highest-risk actions include agent transfer operations and code execution (Wicaksono et al., 21 Sep 2025).
  • Sociopragmatic bypasses: Refusal behavior is highly sensitive to prompt framing, persona, language, and formality. Composite “educator” prompts can raise the rate at which refusals flip to assistance from 0% to 97.5% on offensive tasks; cross-lingual leakage is significant (FR/DE registers leakier than EN) (Durner, 25 Sep 2025).
  • Role-play and override: Well-crafted role-play is sufficient to bypass guardrails unless AI-assisted hardening is implemented (e.g., context exfiltration drops from 85% to 0% in properly hardened prompts) (Durner, 25 Sep 2025).
  • Evaluation-awareness: “Evaluation scent” (rubric-scented prompts, oversight language) triggers verbose CoT and improved output format with only weak or inconsistent accuracy gains; answer-only compliance and real-world discipline often decrease (Ahmed et al., 8 Oct 2025).
  • Failure modes: Quant fever, reasoning blackholes, Schrodinger’s compliance, reasoning-procedure mirage, and chain-oriented prompting amplify adversarial risk and require specialized mitigation strategies such as diversified refusal patterns, global context tracking, and procedural-form detection (Lin et al., 28 Sep 2025).
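
Attack success rate (ASR), the metric quoted in this section, is the fraction of attack attempts judged successful. The sketch below shows the computation and a tool versus non-tool comparison; the outcome lists are invented for illustration and are not results from any cited study.

```python
def attack_success_rate(outcomes):
    """Fraction of attack attempts judged successful (True)."""
    if not outcomes:
        raise ValueError("no attempts recorded")
    return sum(outcomes) / len(outcomes)

# Invented example outcomes (True = attack succeeded), for illustration only.
tool_context = [True, True, False, False]
no_tool_context = [True, False, False, False]

asr_tool = attack_success_rate(tool_context)
asr_no_tool = attack_success_rate(no_tool_context)
# Absolute difference in percentage points between the two settings.
delta_points = 100.0 * (asr_tool - asr_no_tool)
```

When comparing settings, it helps to state whether a reported gap is in absolute percentage points, as computed here, or relative to the baseline rate.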

6. Applications, Downstream Impact, and Ecosystem

GPT-OSS models have enabled specialized downstream systems owing to their open weights and structured outputs:

  • Digital forensics: Chain-of-thought transparency enhances auditability and legal chain-of-custody; quantitative CoT Score rigorously assesses reasoning quality (maximized at medium depth), but final answer correctness is limited by task ambiguity and context completeness (Michelet et al., 3 Dec 2025).
  • Medical diagnosis: Integration into LLM agentic ensembles improves consistency and transparency in psychiatric code assignment. Consensus-aggregation followed by GPT-OSS reasoning yields statistically significant diagnostic improvements (acc: 92%, macro-F1: 0.90) with high clinician acceptance (Bandara et al., 29 Oct 2025).
  • Military deployment: EdgeRunner 20B, a fine-tuned GPT-OSS-20B, achieves statistically significant parity with GPT-5 on military-specific tasks, with no significant regression on general benchmarks. Secure, air-gapped operation is achieved on consumer hardware (RTX 5090, MacBook Pro), with cost and speed superiority over cloud APIs (Fitzgerald et al., 30 Oct 2025).
  • Knowledge distillation: Reasoning traces from GPT-OSS used to post-train smaller student models significantly improve efficiency; GPT-OSS traces are 4–4.4× shorter than DeepSeek R1, maintaining equivalent downstream math performance (Shmidman et al., 24 Nov 2025).
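
The consensus-then-reasoning pattern mentioned for medical diagnosis can be sketched as a majority vote with a fallback adjudicator. This is a generic illustration, not the cited system: the `aggregate` function, the agreement threshold, the `adjudicate` callback (standing in for a reasoning-model call), and the example codes are all hypothetical.

```python
from collections import Counter

def aggregate(votes, adjudicate, min_agreement=0.5):
    """Return (label, how) for a list of ensemble votes.

    If one label has more than `min_agreement` of the votes, accept it as
    consensus. Otherwise pass the sorted candidate labels to `adjudicate`,
    a callback that stands in for a reasoning-model call.
    """
    if not votes:
        raise ValueError("no votes to aggregate")
    counts = Counter(votes)
    label, n = counts.most_common(1)[0]
    if n / len(votes) > min_agreement:
        return label, "consensus"
    return adjudicate(sorted(counts)), "adjudicated"

# Illustrative codes only; the lambdas are placeholders for a model call.
clear = aggregate(["F32", "F32", "F41"], adjudicate=lambda cands: cands[0])
split = aggregate(["F32", "F41"], adjudicate=lambda cands: cands[-1])
```

Recording whether a label came from consensus or adjudication keeps the decision path auditable, which matches the transparency goal described above.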

7. Open Weights, Reproducibility, and Community Ecosystem

GPT-OSS stands out for transparency and reproducibility in open LLM research:

  • Licensing: All major weights, inference stacks, and tool harnesses are released under Apache 2.0 (OpenAI et al., 8 Aug 2025). Harmony agent harnesses and evaluation scripts ensure replicability (see HarmonyAgent at https://github.com/borislavmavrin/harmonyagent.git) (Mavrin, 1 Apr 2026).
  • Open agent stack: Accurate reproduction of published tool-call benchmarks is only possible with in-distribution tools defined in the Harmony protocol. Bypassing standard Chat Completions clients is essential to prevent 10–30+ point understatements of agentic capability (Mavrin, 1 Apr 2026).
  • Audit trails and artifacts: Datasets, prompt banks, adjudication harnesses, and seeds are routinely published to support auditability, error tracking, and model comparison (Durner, 25 Sep 2025, Ahmed et al., 8 Oct 2025).
  • Community-driven development: GPT-OSS influences and is reflected in adjacent projects such as GPT4All and HuggingFace releases, facilitating rapid downstream adaptation, quantization toolchains, and domain-specific agent harnesses (Anand et al., 2023).
