
Open Pre-Trained Transformer (OPT)

Updated 24 November 2025
  • Open Pre-Trained Transformer (OPT) is a suite of decoder-only models ranging from 125M to 175B parameters that emphasizes transparency, reproducibility, and environmental efficiency.
  • The models are pre-trained on roughly 180B English tokens using GPT-2 byte-level BPE tokenization, MinHash-based deduplication, and carefully tuned learning-rate and stability settings.
  • Systematic benchmarking shows OPT-175B performing comparably to GPT-3 while incurring roughly one seventh of GPT-3's estimated training carbon footprint, setting a precedent for open, carbon-aware research.

Open Pre-Trained Transformer (OPT) refers to a suite of decoder-only Transformer LLMs developed to mirror the capabilities of GPT-3 while prioritizing transparency, reproducibility, and environmental efficiency. Ranging from 125 million to 175 billion parameters, the OPT family is designed to facilitate open research in large-scale language modeling by providing public access not only to model checkpoints but also to the accompanying infrastructure logbook, training codebase, and detailed documentation. The suite was developed by Meta AI and is positioned as an open, responsible resource for the academic and research communities (Zhang et al., 2022).

1. Model Architecture and Parameterization

The OPT models employ a standard “GPT-style” architecture, consisting of decoder-only Transformer blocks with several notable characteristics (summarized in the sketch after this list):

  • LayerNorm-First Residual Blocks: Layer normalization is applied before attention and MLP layers (LayerNorm-first).
  • Activation Function: ReLU is used in place of GELU.
  • Tokenization: Models use GPT-2 byte-level BPE vocabulary.
  • Context Window: All models accept up to 2048 tokens as context.
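
A minimal sketch of one such block, assuming the pre-LN ordering and ReLU MLP listed above and the $4\,d_\mathrm{model}$ feedforward width noted below (dimensions, dropout handling, and masking details are illustrative rather than taken from the released configs):

```python
import torch
import torch.nn as nn

class OPTStyleBlock(nn.Module):
    """Decoder block sketch: LayerNorm-first residuals and a ReLU MLP (illustrative)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln_attn = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln_mlp = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),  # feedforward width ~ 4 * d_model
            nn.ReLU(),                        # ReLU in place of GELU
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor, causal_mask: torch.Tensor) -> torch.Tensor:
        # Pre-LN: normalize before attention, then add the residual.
        h = self.ln_attn(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask, need_weights=False)
        x = x + attn_out
        # Pre-LN again before the MLP residual branch.
        return x + self.mlp(self.ln_mlp(x))

# Example at OPT-125M scale (d_model=768, 12 heads) on a toy 16-token sequence.
block = OPTStyleBlock(d_model=768, n_heads=12)
seq_len = 16
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
print(block(torch.randn(1, seq_len, 768), causal_mask=mask).shape)  # (1, 16, 768)
```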

The family comprises a broad range of model sizes, all sharing this micro-architecture but varying in depth ($L$), hidden dimension ($d_\mathrm{model}$), and number of attention heads. The parameter count ($P$) scales approximately as

$$P \approx 12\,L\,d_\mathrm{model}^{2}$$

where $L$ is the number of layers and $d_\mathrm{model}$ is the hidden dimension. FLOPs per token also scale as $L\,d_\mathrm{model}^{2}$.

Key configurations are organized as follows:

| Model | Layers ($L$) | Hidden Size ($d_\mathrm{model}$) | Attention Heads ($H$) | Parameter Count |
|---|---|---|---|---|
| OPT-125M | 12 | 768 | 12 | $\sim 1.25 \times 10^{8}$ |
| OPT-350M–66B | varies | varies | varies | see released configs |
| OPT-175B | 96 | 12,288 | 96 | $\sim 1.75 \times 10^{11}$ |

All models use an intermediate size of approximately $4\,d_\mathrm{model}$ in their feedforward networks.
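
As a rough sanity check, the scaling relation above can be evaluated against these configurations (a minimal sketch; the embedding term assumes the ~50k-entry GPT-2 BPE vocabulary and 2048 learned position embeddings, while biases and LayerNorm parameters are ignored):

```python
# Approximate parameter count for OPT-style decoder-only models:
# 12 * L * d_model^2 for the Transformer blocks plus token/position embeddings.

VOCAB_SIZE = 50272    # GPT-2 byte-level BPE vocabulary size (assumed)
MAX_POSITIONS = 2048  # context window

def approx_params(num_layers: int, d_model: int) -> int:
    blocks = 12 * num_layers * d_model ** 2              # P ~ 12 * L * d_model^2
    embeddings = (VOCAB_SIZE + MAX_POSITIONS) * d_model  # token + position embeddings
    return blocks + embeddings

for name, L, d in [("OPT-125M", 12, 768), ("OPT-175B", 96, 12288)]:
    print(f"{name}: ~{approx_params(L, d) / 1e9:.2f}B parameters")
# OPT-125M: ~0.13B parameters
# OPT-175B: ~174.59B parameters
```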

2. Pre-training Regimen and Data Composition

OPT models were pre-trained via language modeling over approximately 180 billion English tokens. The data pipeline emphasized deduplication, diversity, and tokenization consistency:

  • Corpus Composition:
    • RoBERTa pre-training corpora: BookCorpus, Stories, and an updated CC-News v2 (articles through September 2021).
    • A subset of "The Pile": sources such as CommonCrawl, DM Mathematics, Gutenberg, HackerNews, OpenSubtitles, OpenWebText2, USPTO, and Wikipedia.
    • Pushshift.io Reddit: restricted to the longest comment chain in each thread.
  • Deduplication: documents are deduplicated with MinHash LSH (Jaccard similarity ≥ 0.95); a sketch follows this list.
  • Tokenization: GPT-2 byte-level BPE is applied for consistency with other open LMs.
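
A minimal illustration of this style of document-level deduplication, using the datasketch library (the shingling scheme and toy documents here are assumptions for illustration; the paper's exact pipeline is not reproduced):

```python
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from whitespace tokens (illustrative shingling)."""
    m = MinHash(num_perm=num_perm)
    for token in text.split():
        m.update(token.encode("utf-8"))
    return m

# LSH index at a Jaccard threshold of 0.95, mirroring the reported setting.
lsh = MinHashLSH(threshold=0.95, num_perm=128)

documents = {
    "doc-1": "the quick brown fox jumps over the lazy dog",
    "doc-2": "the quick brown fox jumps over the lazy dog",   # near-duplicate
    "doc-3": "an entirely different document about transformers",
}

kept = []
for doc_id, text in documents.items():
    sig = minhash_of(text)
    if lsh.query(sig):      # any already-indexed document above the threshold?
        continue            # treat as a duplicate and drop it
    lsh.insert(doc_id, sig)
    kept.append(doc_id)

print(kept)  # ['doc-1', 'doc-3']
```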

Pre-training used a sequence length of 2048 tokens, a global batch size that scales with model size (roughly 0.5M to 4M tokens), and AdamW optimization ($\beta_1 = 0.9$, $\beta_2 = 0.95$, weight decay $0.1$). Dropout is set to $0.1$ (excluding embeddings), and gradient norm clipping is applied, initially at $1.0$ and later reduced to $0.3$ for stability. Learning rates follow a linear warm-up to a peak value, then decay linearly to 10% of the peak over 300B tokens (e.g., OPT-175B uses a peak LR of $1.2 \times 10^{-4}$).
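
A minimal PyTorch sketch of this optimizer and learning-rate schedule (the warm-up length, total step budget, and stand-in model are placeholders; the released metaseq configs define the actual values per model size):

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(768, 768)  # stand-in for the actual Transformer

# AdamW with the reported hyperparameters (peak LR shown for OPT-175B).
optimizer = AdamW(model.parameters(), lr=1.2e-4, betas=(0.9, 0.95), weight_decay=0.1)

warmup_steps = 2_000    # placeholder
total_steps = 140_000   # placeholder for the ~300B-token budget expressed in steps

def lr_lambda(step: int) -> float:
    """Linear warm-up to the peak LR, then linear decay to 10% of the peak."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max(0.1, 1.0 - 0.9 * min(1.0, progress))

scheduler = LambdaLR(optimizer, lr_lambda)

# One illustrative optimization step with gradient clipping (1.0 early, 0.3 later).
loss = model(torch.randn(4, 768)).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```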

3. Infrastructure, Training Logistics, and Monitoring

Training the largest OPT configurations required specialized infrastructure and elaborate monitoring:

  • Hardware: 992 NVIDIA A100 GPUs with 80 GB VRAM each, achieving up to 147 TFLOP/s per GPU using mixed precision (FP16 for weights, FP32 for Adam state).
  • Parallelism: Fully Sharded Data Parallel (FSDP) combined with Megatron-LM tensor parallelism was integral to efficient scaling (see the sketch after this list).
  • Timeline: Training the 175B model required approximately two months, including 35 manual and about 70 automatic restarts due to hardware failures.
  • Monitoring: Frequent checkpoints, dynamic loss scaling, and a comprehensive logbook recording all training instabilities, rate schedule changes, and hardware interruptions were maintained for reproducibility.
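
A minimal sketch of FSDP wrapping with FP16 compute and FP32 optimizer state in PyTorch (the placeholder module and launch setup are assumptions; metaseq's actual sharding and tensor-parallel configuration differ in detail):

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

# Intended to be launched with `torchrun --nproc_per_node=<num_gpus> train.py`.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.TransformerDecoderLayer(d_model=768, nhead=12).cuda()  # placeholder

# Parameters, gradients, and buffers are cast to FP16 for compute; FSDP keeps
# FP32 master copies, so the AdamW state below remains in full precision.
fp16_policy = MixedPrecision(
    param_dtype=torch.float16,
    reduce_dtype=torch.float16,
    buffer_dtype=torch.float16,
)

sharded_model = FSDP(model, mixed_precision=fp16_policy)
optimizer = torch.optim.AdamW(sharded_model.parameters(), lr=1.2e-4, betas=(0.9, 0.95))
```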

4. Evaluation, Benchmarks, and Model Performance

OPT-175B was systematically benchmarked using zero- and few-shot protocols analogous to those of GPT-3.

  • Zero-Shot Benchmarking: On 14 tasks, including HellaSwag, PIQA, StoryCloze, ARC, OBQA, WinoGrande, and SuperGLUE subsets, OPT-175B matches GPT-3's performance.
  • Few-Shot Learning: OPT-175B tracks GPT-3 on most tasks with minor divergences (notably MultiRC, RTE) attributed to prompt engineering.
  • Dialog Modeling: On datasets such as ConvAI2, Wizard-of-Wikipedia, Empathetic Dialogue, BST (Blended), and Wizard-of-Internet, OPT-175B yields perplexity in the 10–13 range and unigram F1 near 0.15–0.18, outperforming Reddit-2.7B (unsupervised) and approaching BlenderBot-1 (supervised) on several metrics.
  • Validation Perplexity: At convergence, OPT-175B achieves validation perplexity \sim7.0.

This demonstrates comparable language modeling and few-shot capabilities to GPT-3 at significantly reduced environmental cost (Zhang et al., 2022).
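
As an illustration of the zero-shot protocol, the publicly released checkpoints can be scored on multiple-choice items by comparing per-candidate log-likelihoods (a minimal sketch using the Hugging Face transformers OPT integration; the prompt and candidates are hypothetical, and the scoring assumes the prompt tokenization is a prefix of the prompt+completion tokenization):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# A small public OPT checkpoint keeps the example lightweight.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
model.eval()

def completion_logprob(prompt: str, completion: str) -> float:
    """Sum of token log-probabilities of `completion` conditioned on `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)  # predict token t+1 from t
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    start = prompt_ids.shape[1] - 1  # score only the completion tokens
    return token_lp[0, start:].sum().item()

# Hypothetical item: pick the candidate the model assigns the highest likelihood.
prompt = "The capital of France is"
candidates = [" Paris.", " Berlin."]
print(max(candidates, key=lambda c: completion_logprob(prompt, c)))
```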

5. Environmental Impact and Efficiency

OPT models were designed with explicit attention to environmental impact, incorporating monitoring and reporting methodologies:

  • CO₂ Footprint: OPT-175B training resulted in approximately 75 tons CO₂-equivalent emissions, about 1/7th of GPT-3's estimated 500 tons. For context, Gopher (DeepMind) required ≃380 tons.
  • Measurement: Power draw of the GPU cluster was combined with the data center’s power usage effectiveness (PUE) factor; compute downtime, ablation runs, hardware restarts, and other interruptions are logged for transparency (an illustrative calculation follows this list).
  • Significance: The achieved efficiency establishes a precedent for carbon-aware training of LLMs in academic and industrial research.
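
An illustrative back-of-the-envelope version of this methodology (apart from the GPU count, every number below is an assumed placeholder rather than a figure reported in the paper):

```python
# Carbon estimate: energy = GPUs * average power * hours * PUE,
# emissions = energy * grid carbon intensity.

num_gpus = 992               # A100 count used for OPT-175B
gpu_power_kw = 0.4           # assumed average draw per 80 GB A100, in kW
training_hours = 33 * 24     # assumed ~33 days of continuous training
pue = 1.1                    # assumed data-center power usage effectiveness
carbon_intensity = 0.000231  # assumed tCO2eq per kWh of grid electricity

energy_kwh = num_gpus * gpu_power_kw * training_hours * pue
emissions_tco2 = energy_kwh * carbon_intensity

print(f"Energy: {energy_kwh / 1000:.0f} MWh, emissions: {emissions_tco2:.0f} tCO2eq")
# With these placeholder inputs the estimate lands in the same order of
# magnitude as the reported ~75 tCO2eq.
```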

6. Release, Documentation, and Reproducibility

Meta AI's approach with OPT prioritizes comprehensive openness:

  • Model Availability: Weights for all models from 125M to 66B parameters are publicly released via the metaseq GitHub repository (see the usage sketch after this list); 175B weights are obtainable upon request under a non-commercial research license, granted to academic, government, civil society, and industry researchers.
  • Codebase: The full training and inference pipeline is released as "metaseq."
  • Documentation: Includes model cards (with bias, toxicity, and limitations data), datasheets for the training corpus, and a detailed infrastructure logbook chronicling operational challenges and adjustments.
  • Licensing: Smaller models use Apache 2.0; OPT-175B is licensed for non-commercial research.
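
For example, the publicly released checkpoints can be loaded directly through the Hugging Face transformers integration (a minimal generation sketch; the model size, prompt, and sampling settings are illustrative):

```python
from transformers import pipeline

# Load one of the publicly released OPT checkpoints (size chosen for illustration).
generator = pipeline("text-generation", model="facebook/opt-350m")

output = generator(
    "Open science in language modeling matters because",
    max_new_tokens=40,
    do_sample=True,
    top_p=0.9,
)
print(output[0]["generated_text"])
```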

This degree of transparency enables both the reproduction of principal results and a nuanced understanding of the operational realities inherent in training models at 175B parameter scale (Zhang et al., 2022).

7. Impact, Context, and Significance

The OPT release exemplifies a new paradigm in responsible large-scale language modeling, establishing widely recognized standards for open science in neural language modeling. By making available not only model weights but also training logs, hyperparameters, and code, OPT addresses reproducibility barriers in large model research, lowers the threshold for empirical validation and ablation studies, and catalyzes broader community scrutiny of large pre-trained models’ limitations and capabilities.

A plausible implication is the acceleration of research into scaling laws, prompt engineering, and ethical deployment, as open scrutiny makes it feasible to interrogate model behavior, optimize efficiency, and propose mitigations more rapidly. The full spectrum of transparency—spanning data, code, checkpoints, and operational logbooks—positions OPT as a reference point for future work in both the methodology and societal impact of LLMs (Zhang et al., 2022).

References

Zhang, S., Roller, S., Goyal, N., et al. (2022). OPT: Open Pre-trained Transformer Language Models. arXiv:2205.01068.