Ring-mini-linear-2.0: Hybrid Transformer

Updated 23 October 2025
  • Ring-mini-linear-2.0 is a hybrid language model that combines linear and softmax attention layers within a transformer architecture for efficient long-context reasoning.
  • It leverages a sparse Mixture-of-Experts design to activate only a fraction of its 16B parameters, drastically reducing computational and memory overhead.
  • Innovative operator fusion with FP8 training and systematic architecture tuning deliver state-of-the-art performance on complex reasoning benchmarks while optimizing resource efficiency.

Ring-mini-linear-2.0 is a 16-billion parameter LLM representing the second generation of the Ring-linear family, designed for efficient long-context reasoning. It combines linear and softmax attention mechanisms in a hybrid transformer architecture and adopts a sparse Mixture-of-Experts (MoE) design to optimize inference cost, memory usage, and training efficiency. The model achieves substantial reductions in computational resources relative to comparably sized dense models and preserves high performance across a variety of complex reasoning benchmarks.

1. Hybrid Attention Architecture

Ring-mini-linear-2.0 employs a structured hybrid attention scheme, interleaving multiple linear attention layers with periodic softmax attention layers. Specifically, the transformer stack is divided into layer groups with group size $M = 4$: four Lightning Attention (linear attention) layers are followed by one softmax attention layer. Lightning Attention computes the token outputs $\mathbf{O}$ via

$$\mathbf{O} = \mathbf{Q}\,(\mathbf{K}^\top \mathbf{V})$$

with query $\mathbf{Q}$, key $\mathbf{K}$, and value $\mathbf{V}$ matrices in $\mathbb{R}^{n\times d}$, where $n$ is the sequence length and $d$ the feature dimension. For recurrent inference at step $t$, the output is

$$o_t = q_t \sum_{s \leq t} \lambda^{t-s}\, k_s^\top v_s$$

or recursively,

$$\text{kv}_0 = 0, \qquad \text{kv}_t = \lambda \cdot \text{kv}_{t-1} + k_t^\top v_t, \qquad o_t = q_t \cdot \text{kv}_t$$

where $\text{kv}_t \in \mathbb{R}^{d \times d}$ is a compressed, constant-size key–value cache. This configuration ensures that cache requirements in the linear blocks remain constant with sequence length, as opposed to the $O(n)$ scaling of softmax-based caches.
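As a concrete illustration (not the model's actual implementation), the following NumPy sketch contrasts the parallel form with the recurrent, constant-size-cache form of the recurrence above; the decay factor, sequence length, and feature dimension are illustrative assumptions.

```python
import numpy as np

def lightning_attention_recurrent(Q, K, V, lam=0.99):
    """Recurrent form of the linear-attention update: a constant-size
    (d x d) kv state is decayed by lam and updated with k_t^T v_t.
    Shapes and the decay value are illustrative assumptions."""
    n, d = Q.shape
    kv = np.zeros((d, d))          # kv_0 = 0, fixed-size cache
    outputs = np.empty((n, d))
    for t in range(n):
        kv = lam * kv + np.outer(K[t], V[t])   # kv_t = lam * kv_{t-1} + k_t^T v_t
        outputs[t] = Q[t] @ kv                 # o_t = q_t kv_t
    return outputs

def lightning_attention_parallel(Q, K, V, lam=0.99):
    """Equivalent parallel form: o_t = q_t * sum_{s<=t} lam^{t-s} k_s^T v_s."""
    n, d = Q.shape
    outputs = np.empty((n, d))
    for t in range(n):
        decay = lam ** (t - np.arange(t + 1))           # lam^{t-s} for s <= t
        kv = (K[:t + 1] * decay[:, None]).T @ V[:t + 1]
        outputs[t] = Q[t] @ kv
    return outputs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((8, 4)) for _ in range(3))
    assert np.allclose(lightning_attention_recurrent(Q, K, V),
                       lightning_attention_parallel(Q, K, V))
```

Both paths produce identical outputs, which is why inference can retain only the $d \times d$ state rather than the full token history.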

The periodic inclusion of a softmax attention layer after every four linear blocks preserves capacity on tasks (e.g., information retrieval) where linear attention alone degrades accuracy. This forms the architectural basis of the hybrid regime, offering a tunable efficiency–performance trade-off.

2. Parameterization, Activation Sparsity, and MoE Design

The model consists of 16 billion total parameters. Critically, only approximately 1.6 billion parameters are activated at each inference step (957 million after excluding embedding parameters). This sparsity is achieved through a Mixture-of-Experts (MoE) approach in which only a subset of experts is conditionally activated depending on the input. MoE routing is deterministic, ensuring reproducible inference and reducing activation overhead per forward pass.

This design enables the model to offer the representational capacity of a large model while limiting compute and memory footprints during both training and inference.
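To make the sparse-activation idea concrete, here is a minimal top-k MoE routing sketch in NumPy; the expert count, the top-k value, and the purely linear experts are illustrative assumptions rather than Ring-mini-linear-2.0's actual routing configuration. Because the router is a deterministic argmax over gate logits, repeated forward passes on the same input select the same experts, mirroring the reproducibility property described above.

```python
import numpy as np

def topk_moe_forward(x, gate_w, expert_ws, k=2):
    """Minimal sparse MoE layer: route each token to its top-k experts
    and mix their outputs with softmax-normalized gate weights.
    Expert count, k, and sizes are illustrative assumptions."""
    logits = x @ gate_w                            # (tokens, num_experts)
    topk_idx = np.argsort(-logits, axis=-1)[:, :k] # deterministic top-k selection
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        chosen = logits[t, topk_idx[t]]
        weights = np.exp(chosen - chosen.max())
        weights /= weights.sum()                   # renormalize over the k chosen experts
        for w, e in zip(weights, topk_idx[t]):
            out[t] += w * (x[t] @ expert_ws[e])    # only k of the experts run per token
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, num_experts = 8, 16
    x = rng.standard_normal((4, d))
    gate_w = rng.standard_normal((d, num_experts))
    expert_ws = rng.standard_normal((num_experts, d, d))
    y = topk_moe_forward(x, gate_w, expert_ws)     # 2 of 16 experts active per token
    print(y.shape)                                 # (4, 8)
```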

3. Computational and Memory Efficiency

By leveraging Lightning Attention as the linear attention mechanism, Ring-mini-linear-2.0 exhibits constant-size KV cache requirements in its linear layers and linear time/space scaling with context length. Over long sequences, this eliminates the dominant $O(n^2)$ attention cost and the I/O load associated with pure softmax attention.

The combined hybrid scheme reduces inference cost by an order of magnitude: compared to a 32B dense model, inference requires roughly one-tenth of the computational cost, and relative to the previous-generation Ring series, cost is reduced by more than 50% (Team et al., 22 Oct 2025).

Efficiency is further improved by the architecture's optimized group structure, which uses softmax attention sparingly while exploiting Lightning Attention's linear scaling as fully as possible.
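For intuition about the cache-size gap, the short calculation below compares a per-layer softmax KV cache, which grows linearly with context length, against the fixed per-head state of a linear-attention layer; the head count, head dimension, and element size are illustrative assumptions, not the model's published configuration.

```python
def kv_cache_bytes(seq_len, num_heads=16, head_dim=128, bytes_per_elem=2):
    """Per-layer softmax KV cache: keys and values stored for every past token."""
    return 2 * seq_len * num_heads * head_dim * bytes_per_elem

def linear_state_bytes(num_heads=16, head_dim=128, bytes_per_elem=2):
    """Per-layer linear-attention state: one d x d matrix per head, independent of length."""
    return num_heads * head_dim * head_dim * bytes_per_elem

for n in (4_096, 32_768, 131_072):
    softmax_mb = kv_cache_bytes(n) / 2**20
    linear_mb = linear_state_bytes() / 2**20
    print(f"context {n:>7}: softmax cache {softmax_mb:8.1f} MiB vs linear state {linear_mb:5.1f} MiB")
```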

4. Training Optimization and Operator Fusion

Training efficiency is augmented by a custom operator library, “linghe,” targeting the FP8 numeric format. linghe fuses major pipeline operations, such as expert gating, normalization, and quantization, into single kernels, minimizing memory-movement overhead and kernel launches. FP8 is employed for GEMM (matrix multiplication) operations, offering higher throughput than BF16 and permitting larger micro-batch sizes within fixed memory budgets.

Additionally, the recomputation strategy is fine-grained and locality-aware, enabling deterministic MoE routing and fused quantization, as well as the storage of critical tensors (e.g., the KV cache) in FP32 for numerical stability. Cumulatively, these improvements result in a 50% increase in overall training throughput.
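As a rough numerical sketch of FP8-style (E4M3) quantization for GEMM operands, and not a description of linghe's fused kernels, the code below computes a per-tensor scale against the E4M3 dynamic range, simulates mantissa rounding, and undoes the scales after the matrix multiply.

```python
import numpy as np

E4M3_MAX = 448.0      # largest finite E4M3 value
MANTISSA_BITS = 3     # E4M3 has a 3-bit mantissa

def fake_quantize_e4m3(x):
    """Simulated per-tensor FP8 E4M3 quantization: scale into the E4M3 range,
    clip, and round to the 3-bit mantissa grid. A numerical sketch only;
    real FP8 kernels store values in 8 bits and handle subnormals/NaNs."""
    scale = E4M3_MAX / max(np.abs(x).max(), 1e-12)
    y = np.clip(x * scale, -E4M3_MAX, E4M3_MAX)
    exp = np.floor(np.log2(np.maximum(np.abs(y), 1e-12)))
    quantum = 2.0 ** (exp - MANTISSA_BITS)          # spacing of representable values
    y_q = np.round(y / quantum) * quantum
    return y_q, scale

def fp8_gemm_sketch(a, b):
    """Quantize both operands, multiply, then undo the scales, mimicking a
    scaled FP8 GEMM that accumulates in higher precision."""
    a_q, sa = fake_quantize_e4m3(a)
    b_q, sb = fake_quantize_e4m3(b)
    return (a_q @ b_q) / (sa * sb)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a, b = rng.standard_normal((64, 128)), rng.standard_normal((128, 32))
    rel_err = np.abs(fp8_gemm_sketch(a, b) - a @ b).max() / np.abs(a @ b).max()
    print(f"max relative error vs full precision: {rel_err:.3f}")
```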

5. Systematic Hybrid Architecture Tuning

The optimal layering of linear and softmax blocks was determined by extensive ablation and scaling-law experiments, systematically varying the group size parameter $M$. The resulting data indicated that $M = 4$ yields minimal training loss and the best performance/cost trade-off. Models with $M = 0$ revert to traditional softmax-only architectures, losing efficiency, while an $M$ that is too large (too few softmax layers) degrades expressivity and retrieval performance.

This parameterization supports fine-grained deployment tailoring to specific hardware or application constraints.
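The sketch below shows how a group-size parameter of this kind can be translated into a concrete layer pattern; the layer labels and total depth are hypothetical and only illustrate the interleaving, not the model's actual stack.

```python
from typing import List

def build_hybrid_layer_pattern(num_layers: int, group_size: int) -> List[str]:
    """Interleave linear and softmax attention layers: each group holds
    `group_size` linear layers followed by one softmax layer. group_size = 0
    degenerates to a softmax-only stack. Names and depth are illustrative."""
    if group_size == 0:
        return ["softmax"] * num_layers
    pattern = []
    while len(pattern) < num_layers:
        pattern.extend(["linear"] * min(group_size, num_layers - len(pattern)))
        if len(pattern) < num_layers:
            pattern.append("softmax")
    return pattern

if __name__ == "__main__":
    layers = build_hybrid_layer_pattern(num_layers=20, group_size=4)
    print(layers)                                           # four linear layers per softmax layer
    print(layers.count("linear"), layers.count("softmax"))  # 16 linear, 4 softmax
```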

6. Alignment for Reinforcement Learning and Stable Optimization

A critical design consideration in Ring-mini-linear-2.0 is the alignment of the training and inference operator pipelines. Key modules are shared and deterministic, including:

  • FP32 format for KV caches;
  • customized GEMM handling for the LM head;
  • deterministic MoE paths.

This alignment enables reinforcement learning (RL) policy optimization (e.g., PPO variants) to use rollout probabilities directly for KL regularization without requiring re-forwarding through distinct training engines. This ensures stable RL training over long rollouts and maintains SOTA metrics in downstream reasoning tasks.

The adaptive policy update is characterized by

$$\nabla_\theta J(\theta) = \mathbb{E}_{x\sim\pi_{\text{rollout}}} \left[ \nabla_\theta \min \left( \frac{\pi_{\text{rollout}}(x, \theta)}{\pi_{\text{rollout}}(x, \theta_{\text{old}})}\, \hat{A},\; \operatorname{clip}\!\left(\frac{\pi_{\text{rollout}}(x, \theta)}{\pi_{\text{rollout}}(x, \theta_{\text{old}})},\, 1-\epsilon,\, 1+\epsilon\right) \hat{A} \right) \right]$$

where $\hat{A}$ is the advantage term.
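The following sketch evaluates this clipped surrogate directly from stored rollout log-probabilities and advantages, which is the reuse that the aligned training/inference pipelines make possible; the NumPy framing, array shapes, and clipping constant are illustrative assumptions.

```python
import numpy as np

def ppo_clipped_surrogate(logp_new, logp_rollout, advantages, eps=0.2):
    """Clipped PPO surrogate (to be maximized): min(r * A, clip(r, 1-eps, 1+eps) * A),
    where r is the probability ratio between the current policy and the rollout
    policy. With aligned training and inference operators, logp_rollout can be
    reused as-is instead of re-forwarding prompts through a separate engine."""
    ratio = np.exp(logp_new - logp_rollout)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.minimum(unclipped, clipped).mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logp_rollout = rng.normal(-2.0, 0.3, size=256)              # log-probs saved at rollout time
    logp_new = logp_rollout + rng.normal(0.0, 0.05, size=256)   # slightly updated policy
    advantages = rng.normal(0.0, 1.0, size=256)
    print(f"surrogate objective: {ppo_clipped_surrogate(logp_new, logp_rollout, advantages):.4f}")
```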

7. Reasoning Benchmark Performance

Ring-mini-linear-2.0 demonstrates competitive results across multiple challenging reasoning benchmarks, including mathematical reasoning (AIME, OlympiadBench), coding (HumanEval+, LiveCodeBench), and general logic tasks (DROP, GPQA-Diamond).

Despite having fewer activated parameters (957M excluding embeddings) than models such as Ring-mini-2.0 (softmax-dominant), Qwen3-8B-Thinking, and GPT-OSS-20B-Medium, Ring-mini-linear-2.0 offers similar or superior reasoning accuracy in long-context scenarios. The hybrid architecture's ability to exploit both efficient and expressive attention mechanisms underlies this performance.

| Model | Total Params | Activated Params | Inference Cost (vs 32B dense) | Long-Context Efficiency | Benchmark Reasoning |
|---|---|---|---|---|---|
| Ring-mini-linear-2.0 | 16B | 957M (non-embedding) | ~1/10 | High | SOTA |
| Ring-mini-2.0 | 16B | >957M | Higher | Lower | Comparable |
| Qwen3-8B-Thinking | 8B | 8B | ~1 | Standard | Comparable |
| GPT-OSS-20B-Medium | 20B | 20B | ~1 | Standard | Comparable |

Summary

Ring-mini-linear-2.0 exemplifies a hybrid, highly efficient transformer model engineered for long-context tasks. By systematically optimizing the ratio of linear to softmax attention, leveraging operator-level FP8 training improvements, and employing sparse MoE designs, it achieves substantial resource savings without compromising on state-of-the-art reasoning accuracy. The alignment of training and inference—especially for reinforcement learning optimization—further distinguishes its approach among contemporary LLMs targeting long-form or complex reasoning workloads.
