Ling and Ring 2.6 Technical Report: Efficient and Instant Agentic Intelligence at Trillion-Parameter Scale

Published 13 Jun 2026 in cs.CL and cs.AI | (2606.15079v1)

Abstract: Efficient and scalable agentic intelligence requires models that can deliver both low-latency responses and strong reasoning capabilities while remaining practical to train, serve, and deploy. In this report, we present Ling-2.6 and Ring-2.6, a family of models designed to address this challenge at scale. Ling-2.6 is optimized for instant response generation and high capability per output token, whereas Ring-2.6 is tailored for deeper reasoning and more advanced agentic workflows. Instead of training from scratch, we upgrade the Ling-2.0 base model through architectural migration pre-training and large-scale post-training. This upgrade is guided by a unified co-design of model architecture, optimization objectives, serving systems, and agent training environments, enabling improvements in both model capability and deployment efficiency. At the architectural level, we introduce a hybrid linear attention design that integrates Lightning Attention with MLA, improving the efficiency of long-context training and decoding. To further enhance token efficiency, we optimize capability per output token through Evolutionary Chain-of-Thought, Linguistic Unit Policy Optimization, bidirectional preference alignment, and shortest-correct-response distillation. For agentic capabilities, we propose KPop, a reinforcement learning framework designed to support stable training of Ring-2.6-1T on large-scale environment-grounded data. KPop improves training efficiency through asynchronous scheduling across coding, search, tool use, and workflow execution, enabling scalable learning from complex agent-environment interactions. Together, Ling-2.6 and Ring-2.6 provide a practical pathway toward efficient, scalable, and open agentic systems. We open-source all checkpoints in the 2.6 family to support further research and development in practical agentic intelligence.


Authors (218)

Summary

The paper introduces a migration-based upgrade using hybrid Lightning Attention and MoE to achieve improved token efficiency and throughput.
The methodology employs a four-stage pipeline with QK Norm absorption, partial RoPE adaptation, and extended context up to 256K tokens.
The report demonstrates enhanced agentic RL via KPop and ARouter, yielding higher model efficiency and robust performance across benchmarks.

Ling and Ring 2.6: Efficient and Instant Agentic Intelligence at Trillion-Parameter Scale

Model Motivation and Overview

The Ling-2.6 and Ring-2.6 model family is built to address a critical efficiency-capability tradeoff in agentic LLMs at trillion-parameter scale. Ling-2.6 targets rapid response and high capability per token, serving latency-sensitive workloads; Ring-2.6 is optimized for deep reasoning and robust agent-environment integration. The design leverages architectural migration rather than retraining from scratch, inheriting Ling-2.0's pretrained weights and retrofitting hybrid linear attention combined with advanced MoE. This approach is guided by unified co-design across architecture, data, optimization, and agentic RL environments, which is shown to yield improvements in both model expressivity and deployment efficiency.

Architectural Innovations

Hybrid Linear Attention and MoE Design

Ling-2.6 models adopt a hybrid attention stack that combines Lightning Attention and MLA layers at a 7:1 ratio, giving linear complexity in the Lightning Attention layers for long-context inference. Lightning Attention provides a linear FLOPs profile, MLA compresses the KV cache into a low-rank latent, and the hybrid supports context windows up to 256K (262,144) tokens. The MoE in the feed-forward layers is fine-grained: 256 routed experts per layer, 8 experts activated per token, bias-enabled routers, and a group routing design with normalized probabilities.

Figure 1: Architecture of Ling-2.6-1T-base, showing hybrid Lightning Attention/MLA and sparse MoE.
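The routing description above can be made concrete with a small sketch. The report summary only states 256 routed experts, 8 active per token, bias-enabled routers, group routing, and normalized probabilities; everything else here is an assumption for illustration: the sigmoid scoring, the number of groups (8), the number of groups kept (4), ranking groups by their best biased score, and the function name `route_token`. The bias influences which experts are selected, while the gate weights are normalized from the unbiased scores.

```python
import math
import random


def route_token(logits, bias, n_groups, top_groups, top_k):
    """Pick top_k experts for one token using group-limited, bias-adjusted routing.

    Assumed scheme (not confirmed by the report): sigmoid affinities, a
    per-expert bias used only for selection, and gate weights renormalized
    over the selected experts.
    """
    n_experts = len(logits)
    group_size = n_experts // n_groups
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    biased = [s + b for s, b in zip(scores, bias)]

    # Rank groups by their best biased score and keep only the top groups.
    group_best = [
        max(biased[g * group_size:(g + 1) * group_size]) for g in range(n_groups)
    ]
    kept = sorted(range(n_groups), key=lambda g: group_best[g], reverse=True)
    kept = kept[:top_groups]

    # Choose the top_k experts among the experts of the kept groups.
    candidates = [
        e for g in kept for e in range(g * group_size, (g + 1) * group_size)
    ]
    chosen = sorted(candidates, key=lambda e: biased[e], reverse=True)[:top_k]

    # Normalize the unbiased scores so the gate weights sum to 1.
    total = sum(scores[e] for e in chosen)
    return {e: scores[e] / total for e in chosen}


rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(256)]
weights = route_token(logits, [0.0] * 256, n_groups=8, top_groups=4, top_k=8)
print(len(weights), round(sum(weights.values()), 6))
```

Restricting selection to a few groups is one common way to limit how many devices a token's experts span; whether Ling-2.6 uses groups for that purpose is not stated in this summary.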

Scaling law experiments confirm that the 7:1 hybrid ratio yields the best tradeoff between loss and compute, outperforming both more aggressive (16:1) and more conservative (1:1) ratios.

Figure 2: Scaling law curves for varying hybrid ratios, illustrating optimal loss per FLOPs for 7:1 Lightning:MLA.
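To see what the compared ratios mean structurally, the following sketch lays out a layer stack for a given Lightning:MLA ratio and reports the share of MLA layers. The layer counts and the choice to place the MLA layer at the end of each block are assumptions for illustration; the summary does not give the layer count or the exact placement.

```python
def hybrid_layout(num_layers, linear_per_mla):
    """Return layer types with one MLA layer after every `linear_per_mla`
    Lightning Attention layers (placement is an assumption)."""
    period = linear_per_mla + 1
    return [
        "mla" if (i + 1) % period == 0 else "lightning" for i in range(num_layers)
    ]


def mla_fraction(layout):
    """Fraction of layers that use MLA (softmax attention with a KV cache)."""
    return layout.count("mla") / len(layout)


# Hypothetical depths chosen so each ratio divides evenly.
for ratio, depth in [(1, 16), (7, 16), (16, 34)]:
    layout = hybrid_layout(depth, ratio)
    print(f"{ratio}:1 -> {layout.count('mla')} MLA of {depth} layers "
          f"({mla_fraction(layout):.3f})")
```

Only the MLA layers keep a cache that grows with context length, while linear-attention layers carry a fixed-size state, so a higher ratio reduces long-context cost at the risk of higher loss. The scaling law experiments are what locate the 7:1 balance point.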

Migration Pipeline and Compatibility Handling

The migration from Ling-2.0 to Ling-2.6 follows a four-stage pipeline: Lightning Attention conversion with QK Norm absorption, partial RoPE adaptation, MLA conversion, and warmup. To resolve the incompatibility between QK Norm and MLA, QK Norm is absorbed into the $W_q$ and $W_k$ weights via calibration-based fusion, and RoPE is decoupled so that it can be applied partially.
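As a minimal illustration of partial RoPE in general (not the report's exact formulation), the sketch below rotates only the first `rot_dim` entries of a head vector and passes the remaining entries through unchanged. The adjacent-pair layout, the base of 10000, and the function name `partial_rope` are assumptions.

```python
import math


def partial_rope(vec, position, rot_dim, base=10000.0):
    """Apply rotary position embedding to the first `rot_dim` entries of `vec`.

    Adjacent entries (0,1), (2,3), ... form the rotated pairs; entries from
    `rot_dim` onward carry no positional signal.
    """
    if rot_dim % 2 != 0 or rot_dim > len(vec):
        raise ValueError("rot_dim must be even and no larger than len(vec)")
    out = list(vec)
    for i in range(0, rot_dim, 2):
        theta = position * base ** (-i / rot_dim)  # per-pair frequency
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
k = [0.5, -1.0, 2.0, 0.0, 1.0, -2.0]
# Attention scores depend only on the relative offset (here 2) between positions.
print(dot(partial_rope(q, 5, 4), partial_rope(k, 3, 4)))
print(dot(partial_rope(q, 2, 4), partial_rope(k, 0, 4)))
```

The two printed scores agree up to floating-point error, which is the relative-position property that makes rotary embeddings usable when only part of each head is rotated.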

Pre-training Data and Recipe

Agentic, Domain-specific, and Long-Context Corpus

Pre-training leverages extensive agentic corpora spanning tool use, MCP environments, coding, bash, QA, and repository traces. The long-context corpus is constructed via targeted retrieval, synthesis, and deep rule-based and model-based defect removal. STEM, web, atomic fact, math, code, and multilingual data are stratified in the mixture.

Multi-Stage Training and Context Extension

Training proceeds in three phases: migration, continued pre-training, and mid-training. The migration stage retrofits attention while minimizing loss, continued pre-training switches aggressively to high-quality data, and mid-training extends the context window to 256K tokens.

Figure 3: Multi-stage pre-training pipeline, from migration to mid-training and context window extension.

Post-training Specialization

Token Efficiency and Expert Distillation

Ling-2.6 pursues token efficiency via Evolutionary Chain-of-Thought (Evo-CoT), Linguistic Unit Policy Optimization (LPO), shortest-correct-response distillation, and bidirectional preference alignment. Together these yield a significant improvement: Ling-2.6-1T achieves an Artificial Analysis Intelligence Index score of 34 using just 16M output tokens, $4\times$ better token efficiency than Ling-2.0-1T.

Figure 4: Ling-2.6-1T AAI Index—strong token efficiency; Ring-2.6-1T benchmark comparison.
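The idea behind shortest-correct-response distillation can be sketched as a selection step: among several sampled responses to a prompt, keep the shortest one that a verifier accepts, and use it as the distillation target. This is a reading of the technique's name rather than the report's pipeline; the verifier, the whitespace token count, and the function names are hypothetical stand-ins.

```python
def shortest_correct(responses, is_correct, length):
    """Return the shortest response that passes `is_correct`, or None."""
    correct = [r for r in responses if is_correct(r)]
    return min(correct, key=length) if correct else None


def build_distillation_set(samples, is_correct, length):
    """Map each prompt to its shortest verified response; skip unsolved prompts."""
    dataset = []
    for prompt, responses in samples.items():
        best = shortest_correct(
            responses, lambda r: is_correct(prompt, r), length
        )
        if best is not None:
            dataset.append((prompt, best))
    return dataset


# Toy data: the verifier checks the final whitespace-separated token.
answers = {"2+3?": "5", "7*6?": "42"}
samples = {
    "2+3?": ["Let me think. 2 plus 3 equals 5", "5", "6"],
    "7*6?": ["41", "It is 43"],
}
pairs = build_distillation_set(
    samples,
    is_correct=lambda p, r: r.split()[-1] == answers[p],
    length=lambda r: len(r.split()),
)
print(pairs)
```

Prompts with no verified response are dropped rather than filled with an incorrect target, so the resulting set rewards brevity only when correctness is preserved.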

Specialist and Reinforcement Learning Pipelines

Post-training splits into four stages: SFT cold start, specialist SFT, RL, and distillation. Reinforcement learning uses GSPO with a dynamic pass rating (DPR) curriculum. For Ring-2.6, KPop replaces IcePop, using binary KL masking to bound mismatch, which yields improved solve rates and reward stability in agentic RL.

Figure 5: Ling-2.6 post-training pipeline exploiting specialist distillation and efficiency rewards.

Figure 6: Ring-2.6 post-training pipeline, highlighting specialist sequencing and adaptive thinking modes.

Agentic RL: KPop and Asynchronous Optimization

KPop introduces symmetric binary KL masking that adapts per token, in place of a fixed global ratio constraint, which improves agentic RL stability, especially in tool-use and coding environments.

Figure 7: Training dynamics of agentic RL on coding task, exhibiting reward growth with KPop.

Figure 8: SWE-bench Verified evaluation, showing agentic RL scaling for Ring-2.6-1T.
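The summary does not give KPop's formula, so the following is only a guess at what a symmetric, binary, per-token mask could look like. It assumes access to the log-probability of each sampled token under two policies (for example the training engine and the rollout engine), uses a divergence proxy `cosh(log r) - 1` that is symmetric under swapping the two policies, and zeroes out tokens whose proxy exceeds a threshold. The proxy, the threshold, and all names are assumptions, not the report's definition.

```python
import math


def symmetric_kl_mask(logp_a, logp_b, threshold):
    """Return a 0/1 mask per token based on a symmetric divergence proxy.

    With r = p_a / p_b for the sampled token, the proxy is
    0.5 * (r + 1 / r) - 1 = cosh(log r) - 1, which is unchanged when the
    two policies are swapped. Tokens above `threshold` are masked out.
    """
    mask = []
    for la, lb in zip(logp_a, logp_b):
        divergence = math.cosh(la - lb) - 1.0
        mask.append(1.0 if divergence <= threshold else 0.0)
    return mask


def apply_mask(per_token_values, mask):
    """Zero the contribution of masked tokens (e.g. advantages or loss terms)."""
    return [v * m for v, m in zip(per_token_values, mask)]


logp_train = [-1.00, -1.00, -1.00]
logp_rollout = [-1.00, -1.05, -3.00]
mask = symmetric_kl_mask(logp_train, logp_rollout, threshold=0.01)
print(mask)
print(apply_mask([0.5, 0.5, 0.5], mask))
```

A binary mask of this kind drops individual mismatched tokens from the update instead of clipping every token against one global ratio bound, which is the contrast the text draws with a fixed global constraint.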

The RL rollout infrastructure uses ARouter to minimize tail latency, FP8/BF16 quantization with module-aware precision, and asynchronous partial-rollout pipelines. This decouples environment-bound tasks from GPU scheduling, improving both throughput and stability.

Figure 9: ARouter architecture for global rollout scheduling, optimizing inference/training overlap.
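The partial-rollout idea can be illustrated with a toy scheduler: each trajectory generates at most a fixed token budget per scheduling turn, and unfinished trajectories are requeued and resumed later, so short trajectories finish without waiting behind long ones. This is a simplified, synchronous sketch of the general pattern, not ARouter's design; the token budget, the fixed target lengths, and the function name are hypothetical.

```python
from collections import deque


def partial_rollout(target_lengths, budget_per_turn):
    """Simulate budgeted generation with requeueing.

    `target_lengths[i]` is the (hypothetical) number of tokens trajectory i
    needs. Returns the order in which trajectories finish and the number of
    scheduling turns consumed.
    """
    queue = deque((idx, 0) for idx in range(len(target_lengths)))
    finished_order = []
    turns = 0
    while queue:
        idx, generated = queue.popleft()
        generated = min(generated + budget_per_turn, target_lengths[idx])
        turns += 1
        if generated == target_lengths[idx]:
            finished_order.append(idx)      # ready for training
        else:
            queue.append((idx, generated))  # resume in a later turn
    return finished_order, turns


order, turns = partial_rollout([3, 10, 4], budget_per_turn=4)
print(order, turns)  # trajectories 0 and 2 finish before the long trajectory 1
```

In a real asynchronous system the requeued trajectories would also wait on environment steps, which is where decoupling environment-bound work from GPU scheduling pays off.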

Evaluation and Numerical Results

Ling-2.6 outperforms most open and proprietary instant models across knowledge, reasoning, agentic tool use, instruction-following, and long-context domains. Ring-2.6-1T achieves leading scores on OpenClaw benchmarks (PinchBench 87.60, ClawEval 63.82), is competitive on SWE-bench Verified (74.0), is robust on GAIA-2 search, and maintains high function-calling reliability.

Token efficiency is a primary highlight: Ling-2.6-1T attains an AAI index of 34 at 16M output tokens, competitive with GPT-5.4 non-reasoning and $4\times$ better than Ling-2.0-1T. Inference throughput is maximized via hybrid attention and sparse MoE: Ling-2.6-flash delivers up to $4\times$ faster decode than state-of-the-art baselines.

Figure 10: Prefill and decode throughput of Ling-2.6-flash, demonstrating high inference efficiency.

Infrastructure and System Co-Design

Long-context parallelism employs AllGather-based context parallelism for Lightning Attention and fused kernels for variable-length (varlen) sequences, overcoming head-divisibility and kernel-launch issues and delivering a 68% speedup at 256K context. MoE kernels are updated for 64-bit token counts. Operator fusion and inference optimization (linghe) ensure training-inference alignment and maximize TPS.

Figure 11: Lightning Attention CP optimization schematic.

Figure 12: Linghe inference optimization, aligning fused kernel design for system throughput.
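Why an AllGather suffices for context-parallel linear attention can be shown with a small simulation. In causal linear attention the output at position $t$ depends on earlier tokens only through the running state $S_t = \sum_{s \le t} k_s^\top v_s$, so each rank can summarize its chunk as one small matrix, gather all summaries, and add those of earlier ranks as its starting state. The sketch below omits the decay term that Lightning Attention uses, simulates ranks in a single process, and assumes the sequence length divides evenly by the number of ranks; all function names are illustrative.

```python
def outer(k, v):
    return [[ki * vj for vj in v] for ki in k]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def zeros(d_k, d_v):
    return [[0.0] * d_v for _ in range(d_k)]


def read(q, state):
    """Output for one query: q (d_k) times state (d_k x d_v)."""
    return [sum(q[i] * state[i][j] for i in range(len(q)))
            for j in range(len(state[0]))]


def causal_linear_attention(qs, ks, vs, init_state):
    """Sequential reference: update the state with (k, v), then read with q."""
    state = [row[:] for row in init_state]
    outs = []
    for q, k, v in zip(qs, ks, vs):
        state = add(state, outer(k, v))
        outs.append(read(q, state))
    return outs, state


def context_parallel(qs, ks, vs, world_size):
    chunk = len(qs) // world_size
    d_k, d_v = len(ks[0]), len(vs[0])
    shards = [slice(r * chunk, (r + 1) * chunk) for r in range(world_size)]
    # Step 1: every rank computes the KV summary of its own chunk.
    local = [causal_linear_attention(qs[s], ks[s], vs[s], zeros(d_k, d_v))[1]
             for s in shards]
    # Step 2: AllGather (simulated: every rank can see the full `local` list).
    # Step 3: each rank starts from the sum of summaries of earlier ranks.
    outs = []
    for rank, s in enumerate(shards):
        prefix = zeros(d_k, d_v)
        for summary in local[:rank]:
            prefix = add(prefix, summary)
        outs.extend(causal_linear_attention(qs[s], ks[s], vs[s], prefix)[0])
    return outs


qs = [[float(t % 3), 1.0] for t in range(8)]
ks = [[1.0, float(t % 2)] for t in range(8)]
vs = [[float(t), 2.0] for t in range(8)]
reference, _ = causal_linear_attention(qs, ks, vs, zeros(2, 2))
print(context_parallel(qs, ks, vs, world_size=4) == reference)
```

The gathered payload per rank is a fixed-size state matrix, independent of chunk length, which is what makes this communication pattern attractive at long context.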

Agentic Coding Environment Construction

Docker environments are generated by combining LLM-driven repository exploration with tool-restricted MCP scaffolds; this balances open-ended exploration against deterministic rule-based validation.

Figure 13: Coding environment construction pipeline—hybrid LLM/tool methodology for secure evaluation.

Conclusion

Ling-2.6 and Ring-2.6 represent a scalable, open model family for practical agentic intelligence, illustrating that architectural migration, hybrid attention, MoE specialization, token efficiency optimization, and infrastructure co-design together yield superior efficiency and capability at trillion-parameter scale (2606.15079). The demonstrated gains in token efficiency, context scalability, real-environment robustness, and task throughput support the report's claims about practical agentic deployment. Persistent bottlenecks remain in high-complexity deliberation, factual/repetitive distinction, and long-horizon agentic robustness. These results point to the need for multimodal agentic extension and deeper stack co-design in subsequent generations.