Ring-mini-linear-2.0: Hybrid Transformer

Updated 23 October 2025
  • Ring-mini-linear-2.0 is a hybrid language model that combines linear and softmax attention layers within a transformer architecture for efficient long-context reasoning.
  • It leverages a sparse Mixture-of-Experts design to activate only a fraction of its 16B parameters, drastically reducing computational and memory overhead.
  • Innovative operator fusion with FP8 training and systematic architecture tuning deliver state-of-the-art performance on complex reasoning benchmarks while optimizing resource efficiency.

Ring-mini-linear-2.0 is a 16-billion parameter LLM representing the second generation of the Ring-linear family, designed for efficient long-context reasoning. It combines linear and softmax attention mechanisms in a hybrid transformer architecture and adopts a sparse Mixture-of-Experts (MoE) design to optimize inference cost, memory usage, and training efficiency. The model achieves substantial reductions in computational resources relative to comparably sized dense models and preserves high performance across a variety of complex reasoning benchmarks.

1. Hybrid Attention Architecture

Ring-mini-linear-2.0 employs a structured hybrid attention scheme, interleaving multiple linear attention layers with periodic softmax attention layers. Specifically, the transformer stack is divided into layer groups with group size $M = 4$: four Lightning Attention (linear attention) layers are followed by one softmax attention layer. Lightning Attention computes the token outputs $\mathbf{O}$ via

$$\mathbf{O} = \mathbf{Q}\,(\mathbf{K}^\top \mathbf{V})$$

with query $\mathbf{Q}$, key $\mathbf{K}$, and value $\mathbf{V}$ matrices in $\mathbb{R}^{n\times d}$, where $n$ is the sequence length and $d$ the feature dimension. For recurrent inference at step $t$, the output is

$$o_t = q_t \sum_{s \leq t} \lambda^{t-s}\, k_s^\top v_s$$

or recursively,

$$\text{kv}_0 = 0, \qquad \text{kv}_t = \lambda \cdot \text{kv}_{t-1} + k_t^\top v_t, \qquad o_t = q_t \cdot \text{kv}_t$$

where $\text{kv}_t \in \mathbb{R}^{d \times d}$ is a compressed, constant-size key–value cache. This configuration ensures that cache requirements in the linear blocks remain constant with sequence length, as opposed to the $O(n)$ scaling of softmax-based caches.
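As a concrete illustration (not the model's actual implementation), the following NumPy sketch contrasts the parallel form with the recurrent, constant-size-cache form of the recurrence above; the decay factor, sequence length, and feature dimension are illustrative assumptions.

```python
import numpy as np

def lightning_attention_recurrent(Q, K, V, lam=0.99):
    """Recurrent form of the linear-attention update: a constant-size
    (d x d) kv state is decayed by lam and updated with k_t^T v_t.
    Shapes and the decay value are illustrative assumptions."""
    n, d = Q.shape
    kv = np.zeros((d, d))          # kv_0 = 0, fixed-size cache
    outputs = np.empty((n, d))
    for t in range(n):
        kv = lam * kv + np.outer(K[t], V[t])   # kv_t = lam * kv_{t-1} + k_t^T v_t
        outputs[t] = Q[t] @ kv                 # o_t = q_t kv_t
    return outputs

def lightning_attention_parallel(Q, K, V, lam=0.99):
    """Equivalent parallel form: o_t = q_t * sum_{s<=t} lam^{t-s} k_s^T v_s."""
    n, d = Q.shape
    outputs = np.empty((n, d))
    for t in range(n):
        decay = lam ** (t - np.arange(t + 1))           # lam^{t-s} for s <= t
        kv = (K[:t + 1] * decay[:, None]).T @ V[:t + 1]
        outputs[t] = Q[t] @ kv
    return outputs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((8, 4)) for _ in range(3))
    assert np.allclose(lightning_attention_recurrent(Q, K, V),
                       lightning_attention_parallel(Q, K, V))
```

Both paths produce identical outputs, which is why inference can retain only the $d \times d$ state rather than the full token history.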

The periodic inclusion of a softmax attention layer after every four linear blocks preserves capacity on tasks (e.g., information retrieval) where linear attention alone degrades accuracy. This forms the architectural basis of the hybrid regime, offering a tunable efficiency–performance trade-off.

2. Parameterization, Activation Sparsity, and MoE Design

The model consists of 16 billion total parameters. Critically, only approximately 1.6 billion parameters are activated at each inference step (957 million after excluding embedding parameters). This sparsity is achieved through a Mixture-of-Experts (MoE) approach in which only a subset of experts is conditionally activated depending on the input. MoE routing is deterministic, ensuring reproducible inference and reducing activation overhead per forward pass.

This design enables the model to offer the representational capacity of a large model while limiting compute and memory footprints during both training and inference.
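To make the sparse-activation idea concrete, here is a minimal top-k MoE routing sketch in NumPy; the expert count, the top-k value, and the purely linear experts are illustrative assumptions rather than Ring-mini-linear-2.0's actual routing configuration. Because the router is a deterministic argmax over gate logits, repeated forward passes on the same input select the same experts, mirroring the reproducibility property described above.

```python
import numpy as np

def topk_moe_forward(x, gate_w, expert_ws, k=2):
    """Minimal sparse MoE layer: route each token to its top-k experts
    and mix their outputs with softmax-normalized gate weights.
    Expert count, k, and sizes are illustrative assumptions."""
    logits = x @ gate_w                            # (tokens, num_experts)
    topk_idx = np.argsort(-logits, axis=-1)[:, :k] # deterministic top-k selection
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        chosen = logits[t, topk_idx[t]]
        weights = np.exp(chosen - chosen.max())
        weights /= weights.sum()                   # renormalize over the k chosen experts
        for w, e in zip(weights, topk_idx[t]):
            out[t] += w * (x[t] @ expert_ws[e])    # only k of the experts run per token
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, num_experts = 8, 16
    x = rng.standard_normal((4, d))
    gate_w = rng.standard_normal((d, num_experts))
    expert_ws = rng.standard_normal((num_experts, d, d))
    y = topk_moe_forward(x, gate_w, expert_ws)     # 2 of 16 experts active per token
    print(y.shape)                                 # (4, 8)
```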

3. Computational and Memory Efficiency

By leveraging Lightning Attention as the linear attention mechanism, Ring-mini-linear-2.0 exhibits constant-size KV cache requirements in its linear layers and linear time/space scaling with context length. Over long sequences, this eliminates the dominant $O(n^2)$ attention cost and the I/O load associated with pure softmax attention.

The combined hybrid scheme reduces inference cost by an order of magnitude: compared to a 32B dense model, inference requires roughly one-tenth of the computational cost, and relative to the previous-generation Ring series, cost is reduced by more than 50% (Team et al., 22 Oct 2025).

Efficiency is further improved by the architecture's optimized group structure, which uses softmax attention sparingly while exploiting Lightning Attention's linear scaling as fully as possible.
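For intuition about the cache-size gap, the short calculation below compares a per-layer softmax KV cache, which grows linearly with context length, against the fixed per-head state of a linear-attention layer; the head count, head dimension, and element size are illustrative assumptions, not the model's published configuration.

```python
def kv_cache_bytes(seq_len, num_heads=16, head_dim=128, bytes_per_elem=2):
    """Per-layer softmax KV cache: keys and values stored for every past token."""
    return 2 * seq_len * num_heads * head_dim * bytes_per_elem

def linear_state_bytes(num_heads=16, head_dim=128, bytes_per_elem=2):
    """Per-layer linear-attention state: one d x d matrix per head, independent of length."""
    return num_heads * head_dim * head_dim * bytes_per_elem

for n in (4_096, 32_768, 131_072):
    softmax_mb = kv_cache_bytes(n) / 2**20
    linear_mb = linear_state_bytes() / 2**20
    print(f"context {n:>7}: softmax cache {softmax_mb:8.1f} MiB vs linear state {linear_mb:5.1f} MiB")
```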

4. Training Optimization and Operator Fusion

Training efficiency is augmented by a custom operator library, “linghe,” targeting the FP8 numeric format. linghe fuses major pipeline operations, such as expert gating, normalization, and quantization, into single kernels, minimizing memory-movement overhead and kernel launches. FP8 is employed for GEMM (matrix multiplication) operations, offering higher throughput than BF16 and permitting larger micro-batch sizes within fixed memory budgets.

Additionally, the recomputation strategy is fine-grained and locality-aware, enabling deterministic MoE routing and fused quantization, as well as the storage of critical tensors (e.g., the KV cache) in FP32 for numerical stability. Cumulatively, these improvements result in a 50% increase in overall training throughput.
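As a rough numerical sketch of FP8-style (E4M3) quantization for GEMM operands, and not a description of linghe's fused kernels, the code below computes a per-tensor scale against the E4M3 dynamic range, simulates mantissa rounding, and undoes the scales after the matrix multiply.

```python
import numpy as np

E4M3_MAX = 448.0      # largest finite E4M3 value
MANTISSA_BITS = 3     # E4M3 has a 3-bit mantissa

def fake_quantize_e4m3(x):
    """Simulated per-tensor FP8 E4M3 quantization: scale into the E4M3 range,
    clip, and round to the 3-bit mantissa grid. A numerical sketch only;
    real FP8 kernels store values in 8 bits and handle subnormals/NaNs."""
    scale = E4M3_MAX / max(np.abs(x).max(), 1e-12)
    y = np.clip(x * scale, -E4M3_MAX, E4M3_MAX)
    exp = np.floor(np.log2(np.maximum(np.abs(y), 1e-12)))
    quantum = 2.0 ** (exp - MANTISSA_BITS)          # spacing of representable values
    y_q = np.round(y / quantum) * quantum
    return y_q, scale

def fp8_gemm_sketch(a, b):
    """Quantize both operands, multiply, then undo the scales, mimicking a
    scaled FP8 GEMM that accumulates in higher precision."""
    a_q, sa = fake_quantize_e4m3(a)
    b_q, sb = fake_quantize_e4m3(b)
    return (a_q @ b_q) / (sa * sb)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a, b = rng.standard_normal((64, 128)), rng.standard_normal((128, 32))
    rel_err = np.abs(fp8_gemm_sketch(a, b) - a @ b).max() / np.abs(a @ b).max()
    print(f"max relative error vs full precision: {rel_err:.3f}")
```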

5. Systematic Hybrid Architecture Tuning

The optimal layering of linear and softmax blocks was determined by extensive ablation and scaling-law experiments, systematically varying the group size parameter $M$. The resulting data indicated that $M = 4$ yields minimal training loss and the best performance/cost trade-off. Models with $M = 0$ revert to traditional softmax-only architectures, losing efficiency, while an $M$ that is too large (too few softmax layers) degrades expressivity and retrieval performance.

This parameterization supports fine-grained deployment tailoring to specific hardware or application constraints.
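The sketch below shows how a group-size parameter of this kind can be translated into a concrete layer pattern; the layer labels and total depth are hypothetical and only illustrate the interleaving, not the model's actual stack.

```python
from typing import List

def build_hybrid_layer_pattern(num_layers: int, group_size: int) -> List[str]:
    """Interleave linear and softmax attention layers: each group holds
    `group_size` linear layers followed by one softmax layer. group_size = 0
    degenerates to a softmax-only stack. Names and depth are illustrative."""
    if group_size == 0:
        return ["softmax"] * num_layers
    pattern = []
    while len(pattern) < num_layers:
        pattern.extend(["linear"] * min(group_size, num_layers - len(pattern)))
        if len(pattern) < num_layers:
            pattern.append("softmax")
    return pattern

if __name__ == "__main__":
    layers = build_hybrid_layer_pattern(num_layers=20, group_size=4)
    print(layers)                                           # four linear layers per softmax layer
    print(layers.count("linear"), layers.count("softmax"))  # 16 linear, 4 softmax
```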

6. Alignment for Reinforcement Learning and Stable Optimization

A critical design consideration in Ring-mini-linear-2.0 is the alignment of the training and inference operator pipelines. Key modules are shared and deterministic, including:

  • FP32 format for KV caches;
  • customized GEMM handling for the LM head;
  • deterministic MoE paths.

This alignment enables reinforcement learning (RL) policy optimization (e.g., PPO variants) to use rollout probabilities directly for KL regularization without requiring re-forwarding through distinct training engines. This ensures stable RL training over long rollouts and maintains SOTA metrics in downstream reasoning tasks.

The adaptive policy update is characterized by

$$\nabla_\theta J(\theta) = \mathbb{E}_{x\sim\pi_{\text{rollout}}} \left[ \nabla_\theta \min \left( \frac{\pi_{\text{rollout}}(x, \theta)}{\pi_{\text{rollout}}(x, \theta_{\text{old}})}\, \hat{A},\; \operatorname{clip}\!\left(\frac{\pi_{\text{rollout}}(x, \theta)}{\pi_{\text{rollout}}(x, \theta_{\text{old}})},\, 1-\epsilon,\, 1+\epsilon\right) \hat{A} \right) \right]$$

where $\hat{A}$ is the advantage term.
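The following sketch evaluates this clipped surrogate directly from stored rollout log-probabilities and advantages, which is the reuse that the aligned training/inference pipelines make possible; the NumPy framing, array shapes, and clipping constant are illustrative assumptions.

```python
import numpy as np

def ppo_clipped_surrogate(logp_new, logp_rollout, advantages, eps=0.2):
    """Clipped PPO surrogate (to be maximized): min(r * A, clip(r, 1-eps, 1+eps) * A),
    where r is the probability ratio between the current policy and the rollout
    policy. With aligned training and inference operators, logp_rollout can be
    reused as-is instead of re-forwarding prompts through a separate engine."""
    ratio = np.exp(logp_new - logp_rollout)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.minimum(unclipped, clipped).mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logp_rollout = rng.normal(-2.0, 0.3, size=256)              # log-probs saved at rollout time
    logp_new = logp_rollout + rng.normal(0.0, 0.05, size=256)   # slightly updated policy
    advantages = rng.normal(0.0, 1.0, size=256)
    print(f"surrogate objective: {ppo_clipped_surrogate(logp_new, logp_rollout, advantages):.4f}")
```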

7. Reasoning Benchmark Performance

Ring-mini-linear-2.0 demonstrates competitive results across multiple challenging reasoning benchmarks, including mathematical reasoning (AIME, OlympiadBench), coding (HumanEval+, LiveCodeBench), and general logic tasks (DROP, GPQA-Diamond).

Despite having fewer activated parameters (957M excluding embeddings) than models such as Ring-mini-2.0 (softmax-dominant), Qwen3-8B-Thinking, and GPT-OSS-20B-Medium, Ring-mini-linear-2.0 offers similar or superior reasoning accuracy in long-context scenarios. The hybrid architecture's ability to exploit both efficient and expressive attention mechanisms underlies this performance.

| Model | Total Params | Activated Params | Inference Cost (vs 32B dense) | Long-Context Efficiency | Benchmark Reasoning |
|---|---|---|---|---|---|
| Ring-mini-linear-2.0 | 16B | 957M (non-embedding) | ~1/10 | High | SOTA |
| Ring-mini-2.0 | 16B | >957M | Higher | Lower | Comparable |
| Qwen3-8B-Thinking | 8B | 8B | ~1 | Standard | Comparable |
| GPT-OSS-20B-Medium | 20B | 20B | ~1 | Standard | Comparable |

Summary

Ring-mini-linear-2.0 exemplifies a hybrid, highly efficient transformer model engineered for long-context tasks. By systematically optimizing the ratio of linear to softmax attention, leveraging operator-level FP8 training improvements, and employing sparse MoE designs, it achieves substantial resource savings without compromising on state-of-the-art reasoning accuracy. The alignment of training and inference—especially for reinforcement learning optimization—further distinguishes its approach among contemporary LLMs targeting long-form or complex reasoning workloads.
