
Step 3.5 Flash: Sparse MoE Transformer

Updated 13 February 2026
  • Step 3.5 Flash is a sparse Mixture-of-Experts Transformer that activates only a fraction of its 196B total parameters per token for efficient inference.
  • It employs a hybrid attention mechanism combining sliding-window and global layers with Multi-Token Prediction to reduce latency and memory usage on long-context tasks.
  • Its unified training regimen, including pre-training, supervised fine-tuning, and reinforcement learning with off-policy optimization, delivers competitive performance on agentic, mathematical, and coding benchmarks.

Step 3.5 Flash is a sparse Mixture-of-Experts (MoE) Transformer designed to combine frontier-level reasoning and agentic intelligence with highly efficient inference, achieved by activating only a small fraction of its total parameters per token. Its architecture, training methodology, RL framework, and performance results position it as a reference design for practical deployment of complex agents in computationally constrained environments. Below, the major facets of Step 3.5 Flash are systematically addressed, integrating model innovations, optimization techniques, achieved results, and implications for scalable agentic systems (Huang et al., 11 Feb 2026).

1. Model Architecture and Mixture-of-Experts Backbone

Step 3.5 Flash is fundamentally a sparse MoE Transformer with top-K expert routing. The network consists of 45 transformer layers: the first three are dense, followed by 42 MoE layers. Each MoE layer contains 288 dedicated experts and one shared expert, with routing parameter K=8, such that each token activates its top-8 experts per MoE layer. Total model size is 196 billion parameters, but only 11 billion are active per token during inference.
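The top-K routing described above can be sketched as follows. This is a minimal single-token illustration, assuming softmax gating over the selected experts and toy linear experts (the paper does not specify the gating normalization or expert internals); dimensions other than the expert counts are shrunk for readability.

```python
import numpy as np

rng = np.random.default_rng(0)

D_MODEL, N_EXPERTS, TOP_K = 16, 288, 8  # 288 experts, top-8 routing per the paper; D_MODEL is a toy size

# Hypothetical toy experts: each expert is a single linear map here.
expert_weights = rng.standard_normal((N_EXPERTS, D_MODEL, D_MODEL)) * 0.02
shared_weight = rng.standard_normal((D_MODEL, D_MODEL)) * 0.02
router_weight = rng.standard_normal((D_MODEL, N_EXPERTS)) * 0.02

def moe_layer(x):
    """Route one token through its top-K experts plus the always-on shared expert."""
    logits = x @ router_weight                  # router scores, shape (N_EXPERTS,)
    top_k = np.argsort(logits)[-TOP_K:]         # indices of the top-8 experts
    gates = np.exp(logits[top_k] - logits[top_k].max())
    gates /= gates.sum()                        # softmax over the selected experts only
    out = shared_weight.T @ x                   # shared expert is always active
    for g, e in zip(gates, top_k):
        out += g * (expert_weights[e].T @ x)    # weighted sum of selected expert outputs
    return out, top_k

token = rng.standard_normal(D_MODEL)
y, selected = moe_layer(token)
print(selected.size)  # 8 experts active out of 288
```

Only the 8 selected experts (plus the shared one) are evaluated, which is the mechanism behind the 11B-active-of-196B parameter budget.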

Attention is managed using a repeated 4-layer block motif: three sliding-window-attention (SWA) layers (window size 512) followed by one full global attention layer with GQA-8. This S3F1 pattern yields O(TWd) complexity for most layers and O(T^2 d) only every fourth layer, markedly reducing both latency and memory consumption during prefill for long contexts, while preserving global context mixing. Head-wise gated attention further increases efficiency by inserting an input-dependent gate per head:

g_i = \sigma(w_{\text{gate}}^\top x_i), \qquad o_i = g_i \sum_j \alpha_{i,j} v_j

This mechanism is equivalent to adding a learnable "sink mass" to the softmax, allowing attention heads to ignore unused window positions with minimal computational overhead.
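A single-head sketch of the gated attention above, under the assumption of standard causal softmax attention (toy sizes, random weights standing in for learned projections):

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 6, 8  # toy sequence length and per-head dimension

x = rng.standard_normal((T, d))
q, k, v = (x @ rng.standard_normal((d, d)) for _ in range(3))
w_gate = rng.standard_normal(d)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Standard causal attention weights alpha_{i,j}.
scores = (q @ k.T) / np.sqrt(d)
scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -np.inf)
alpha = np.exp(scores - scores.max(axis=-1, keepdims=True))
alpha /= alpha.sum(axis=-1, keepdims=True)

# Per-position gate g_i scales the whole head output; a gate near 0
# lets the head opt out of a position, acting like a learnable "sink mass".
g = sigmoid(x @ w_gate)            # (T,)
o = g[:, None] * (alpha @ v)       # o_i = g_i * sum_j alpha_{i,j} v_j
print(o.shape)  # (6, 8)
```

The gate adds only one dot product and one sigmoid per head per position, consistent with the "minimal computational overhead" claim.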

Additionally, a Multi-Token Prediction (MTP-3) head predicts up to three future tokens in parallel, with the auxiliary MTP losses:

\mathcal{L}_{\text{LM}} = -\sum_{t=1}^T \log P(x_t \mid x_{<t}), \qquad \mathcal{L}_{\text{MTP}} = \sum_{h=1}^3 \lambda_h \sum_{t=1}^{T-h} -\log P(x_{t+h} \mid h_t)

This approach shortens wall-clock time for both prefill and autoregressive decoding, which is essential for interactive, multi-turn agentic tasks.
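The combined LM and MTP-3 objectives can be sketched numerically. Here random logits stand in for the LM head and the three MTP heads (each conditioned on the hidden state h_t), and the per-horizon weights `lambdas` are hypothetical; the paper does not report their values.

```python
import numpy as np

rng = np.random.default_rng(2)
T, V, H = 10, 32, 3              # sequence length, vocab size, MTP horizon (3 future tokens)
lambdas = [0.3, 0.2, 0.1]        # hypothetical per-horizon weights lambda_h

tokens = rng.integers(0, V, size=T + H)

def log_probs(logits):
    """Row-wise log-softmax."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

# Toy logits standing in for the LM head and the three MTP heads.
lm_logp = log_probs(rng.standard_normal((T, V)))
mtp_logp = [log_probs(rng.standard_normal((T, V))) for _ in range(H)]

# L_LM = -sum_t log P(x_t | x_<t)
loss_lm = -lm_logp[np.arange(T), tokens[:T]].sum()

# L_MTP = sum_h lambda_h * sum_{t=1..T-h} -log P(x_{t+h} | h_t)
loss_mtp = sum(
    lam * -logp[np.arange(T - h), tokens[h:T]].sum()
    for h, (lam, logp) in enumerate(zip(lambdas, mtp_logp), start=1)
)

total = loss_lm + loss_mtp
```

At inference, the same MTP heads can propose several tokens per forward pass for speculative-style decoding, which is where the wall-clock savings come from.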

2. Training Regimen and Optimization Framework

Training proceeds through several unified, staged steps:

  • Pre-training and Mid-training: The model is pre-trained on 17.6T tokens with context lengths scaled from 4k to 32k. The corpus blends open-web ("StepCrawl"), curated STEM/coding, programmatic contributions, and synthetic tool/reasoning exemplars. Diagnostics (loss monitoring, expert activity, activation spikes) guide mitigation, including load balancing and activation clipping in FFNs.
  • Supervised Fine-Tuning (SFT): Two-stage SFT over 7.23B tokens balances tasks across Math, Coding, QA, Logic, Tool use, and long-context interaction.
  • Expert RL and Self-distillation: Domain-specific RL (Math, Code, Tools, Search, long-context, human preference) is first applied to create specialized expert agents. These are then distilled via trajectory-matching into a single generalist student, unifying capabilities.
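The load balancing used during pre-training can be sketched with the standard auxiliary load-balancing loss; this is an assumption for illustration, as the paper does not specify the exact form it uses.

```python
import numpy as np

rng = np.random.default_rng(3)
N_TOKENS, N_EXPERTS, TOP_K = 64, 288, 8

# Toy router probabilities for a batch of tokens.
logits = rng.standard_normal((N_TOKENS, N_EXPERTS))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

# f_e: fraction of dispatch slots routed to expert e under top-K selection;
# p_e: mean router probability mass assigned to expert e.
top_k = np.argsort(logits, axis=-1)[:, -TOP_K:]
f = np.bincount(top_k.ravel(), minlength=N_EXPERTS) / (N_TOKENS * TOP_K)
p = probs.mean(axis=0)

# Switch-Transformer-style auxiliary loss: minimized when load is uniform.
aux_loss = N_EXPERTS * (f * p).sum()
```

Monitoring f alongside this loss is one way to surface the expert-activity imbalances the diagnostics in the bullet list are meant to catch.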

Scalable Off-Policy RL—Metropolis Independence Sampling Policy Optimization (MIS-PO)

Traditional off-policy RL struggles with high variance in token-level probabilities accumulated over long horizons, especially for MoE routing. MIS-PO addresses these challenges via discrete filtering of importance weights, rather than continuous scaling:

x_t = \frac{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}{\pi_{\theta_{\text{vllm}}}(a_t \mid s_t)}, \qquad \bar{\rho}(\tau) = \left(\prod_t x_t\right)^{1/T}

The actor loss discards samples outside [\rho_{\min}, \rho_{\max}] at token or trajectory granularity,

\mathcal{L}_{\text{actor}} = -\mathbb{E}_{\tau \sim \pi_{\text{vllm}}}\Big[\mathbb{I}(\rho_{\min} \le x_t \le \rho_{\max})\, \mathbb{I}(\rho_{\min} \le \bar{\rho}(\tau) \le \rho_{\max})\, \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\Big]

MIS-PO empirically yields threefold faster convergence on long-horizon reasoning, fivefold lower gradient-norm variance, and more stable entropy decay during exploration compared to PPO or GSPO.
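The discrete filtering step can be sketched on a single toy trajectory. The log-probabilities and advantages below are synthetic stand-ins, and the filter band [RHO_MIN, RHO_MAX] is a hypothetical choice; the point is the accept/reject masking of importance weights rather than PPO-style continuous clipping.

```python
import numpy as np

rng = np.random.default_rng(4)
RHO_MIN, RHO_MAX = 0.5, 2.0      # hypothetical filter band [rho_min, rho_max]

# Toy per-token log-probs under the training-time policy (old) and the
# inference engine's policy (vllm), which can drift apart numerically.
T = 12
logp_old = rng.normal(-2.0, 0.3, T)
logp_vllm = logp_old + rng.normal(0.0, 0.2, T)
logp_theta = logp_old + rng.normal(0.0, 0.05, T)  # current policy
adv = rng.standard_normal(T)                      # advantage estimates A_hat_t

x = np.exp(logp_old - logp_vllm)        # token-level ratios x_t
rho_bar = np.exp(np.mean(np.log(x)))    # geometric mean over the trajectory

token_mask = (x >= RHO_MIN) & (x <= RHO_MAX)
traj_mask = RHO_MIN <= rho_bar <= RHO_MAX

# Discrete accept/reject: tokens outside the band contribute zero gradient,
# instead of being continuously down-weighted as in PPO clipping.
mask = token_mask & traj_mask
actor_loss = -(mask * logp_theta * adv).sum() / max(mask.sum(), 1)
```

Because rejected tokens are dropped entirely, a few extreme ratios cannot dominate the gradient, which is the intuition behind the reported variance reduction.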

Additional methods for robustness include truncation-aware value bootstrapping and monitoring of routing confidence (\Sigma_k)—average expert mass—to detect brittle updates.

3. Reward Modeling, Verification, and Generalization

Step 3.5 Flash uses a hybrid RL signal design:

  • Verifiable RL Rewards (RLVR): Automated, rule-based, or model-checked signals are provided for Math, Code, and Tool-use tasks.
  • Preference Rewards: For domains lacking verifiable signals (e.g., open-domain dialog), pairwise LLM scoring (GenRM + MetaRM) is utilized alongside logic-error penalties.
  • Agentic Rewards: Human-like tasks such as search and report generation use rubric-based LLM scoring.

This dual reward pathway enables robust agentic learning across highly structured and unstructured domains.
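The dual reward pathway amounts to dispatching on task type. A minimal sketch, in which `check_answer` and `preference_score` are hypothetical stand-ins for the rule-based verifiers and the GenRM + MetaRM preference scorer:

```python
# Verifiable domains get an exact rule-based check (the RLVR pathway);
# everything else falls back to a model-based preference score.
VERIFIABLE_DOMAINS = {"math", "code", "tool_use"}

def check_answer(response, reference):
    """Stand-in for a rule-based or model-checked verifier."""
    return 1.0 if response.strip() == reference.strip() else 0.0

def preference_score(response):
    """Placeholder for pairwise LLM scoring (GenRM + MetaRM in the paper)."""
    return 0.5

def reward(domain, response, reference=None):
    if domain in VERIFIABLE_DOMAINS and reference is not None:
        return check_answer(response, reference)   # verifiable pathway
    return preference_score(response)              # preference pathway

print(reward("math", "42", "42"))   # 1.0
print(reward("dialog", "hello"))    # 0.5
```

The same dispatch extends naturally to the rubric-based agentic rewards: add another branch keyed on the task type.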

4. Empirical Results and Benchmarking

The model demonstrates performance competitive with, or superior to, closed-source frontier models on several agentic, mathematical, and coding tasks:

Task                    Step 3.5 Flash   Gemini 3.0 Pro   GPT-5.2 xHigh
IMO-AnswerBench         85.4             83.3             86.3
LiveCodeBench-v6        86.4             90.7             87.7
τ²-Bench                88.2             90.7             92.5
BrowseComp (Ctx-Mgr)    69.0             59.2             65.8
Terminal-Bench 2.0      51.0             56.9             54.0

Throughput is approximately 170 tokens/s at 128k context length, with a sub-128GB memory requirement. Multi-round agentic interactions (e.g., Terminal-Bench) complete within 2,000 turns, with mean per-turn latency of ∼240 ms.

A notable real-world deployment pattern uses Step 3.5 Flash for cloud-level planning in conjunction with edge agents (e.g., Step-GUI), increasing GUI task success on AndroidDaily Hard from 40% to 57% by decomposing cognitive and control sub-tasks between cloud and device.

5. Design Trade-offs, Practical Implications, and Limitations

  • Efficiency Frontier: By decoupling model capacity from inference compute via MoE sparsity (196B total vs. 11B active), agentic reasoning is feasible on resource-bound infrastructures.
  • Hybrid Attention and MTP: The interleaved S3F1 pattern and MTP parallel decoding combine to minimize both prefill and generation latency, especially for long, interactive sessions.
  • Deployment: The efficient memory/performance envelope of Step 3.5 Flash supports industrial, edge-cloud, and collaborative multi-agent deployments.

However, token efficiency remains a challenge: despite speculative decoding and MTP, the model sometimes produces longer outputs than Gemini 3.0 Pro to reach well-calibrated reasoning. Further, while reinforcement learning achieves strong results on benchmarks, scaling to open-world, professional-level tasks remains an open problem. Distribution shift and stability for ultralong (≫128k) multi-turn sessions are largely unproven, and achieving seamless expert integration with minimal distillation cost for universal generalists is still unsolved.

6. Significance, Impact, and Future Directions

Step 3.5 Flash marks a convergence of architectural sparsity, hybrid attention, parallel decoding, and stabilized RL training for open-agent intelligence at industrial scale. The design suggests that strong performance and robust multi-domain reasoning are not constrained by inference bottlenecks, provided token-activation and training objectives are suitably aligned. Prospective work includes:

  • Gradient-based pruning of unnecessary thought traces to further optimize output length and efficiency.
  • Expansion of RLVR to cover more professional and scientific workloads.
  • Techniques for dynamic expert recruitment within MoE during online adaptation and transfer.
  • Robustification against distribution shift and compositional generalization in long-horizon dialogues.

The model's open reporting of benchmarks, latency, and system-level integration protocols establishes a template for reproducibility and further research on scalable, real-world agentic architectures (Huang et al., 11 Feb 2026).
