
LongCat-Flash-Thinking-2601: Scalable MoE Transformer

Updated 28 January 2026
  • LFT-2601 is a 560B-parameter Mixture-of-Experts Transformer that employs sparse expert routing, Zigzag Attention, and Heavy Thinking Mode for long-context agentic reasoning.
  • The model’s training pipeline integrates domain-parallel expert training, reinforcement learning with noise injection, and model fusion to optimize performance across diverse tool-integrated tasks.
  • LFT-2601 achieves state-of-the-art results in open-weight agentic reasoning and tool use while delivering significant speedups and reduced memory overhead in long-context applications.

LongCat-Flash-Thinking-2601 (LFT-2601) is a 560-billion-parameter open-source Mixture-of-Experts (MoE) Transformer designed for high-efficiency, robust, and generalizable agentic reasoning across long contexts and complex tool-integrated environments. Engineered through the integration of sparse expert routing, long-horizon context scaling, domain-parallel training, and large-scale asynchronous reinforcement learning, LFT-2601 achieves state-of-the-art performance in open-weight agentic reasoning, search, and tool-augmented tasks. The model combines innovations in architecture (e.g., Zigzag/Streaming Sparse Attention for million-token contexts, Heavy Thinking Mode for parallel solution exploration), training (domain-specialized expert distillation and fusion, noise-robust RL), and data construction (automatic environment and task curriculum generation).

1. Model Architecture and Sparse Expert Routing

LFT-2601 is structured as a 560B-parameter Transformer with a Mixture-of-Experts backbone: only an average of 27B parameters are activated per token via top-2 sparse expert selection. Each MoE layer comprises $E$ (typically 64–128) independent two-layer MLP experts; a learned linear router computes per-token logits $g = W_g h + b_g$, applies a softmax, and activates the top-2 experts for each token. The MoE output is given by $\mathrm{MoE}(h) = p_{e_1}\,\mathrm{Expert}_{e_1}(h) + p_{e_2}\,\mathrm{Expert}_{e_2}(h)$, reducing per-token compute. Load balancing is regularized by an auxiliary loss over the token distribution across experts. Dense (standard FFN) and sparse-MoE layers alternate, and architectural optimizations include zero-computation experts and shortcut-connected experts for trivial routing scenarios.
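The routing step above can be sketched in a few lines. This is a minimal illustration with toy shapes and stand-in "experts" (simple linear maps), not the LFT-2601 implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, E = 16, 8                      # hidden size and expert count (toy values)
W_g = rng.normal(size=(E, d))     # router weight W_g
b_g = np.zeros(E)                 # router bias b_g

def moe_forward(h, experts):
    """Route token h to its top-2 experts and mix their outputs."""
    logits = W_g @ h + b_g                      # g = W_g h + b_g
    p = np.exp(logits - logits.max())
    p /= p.sum()                                # softmax over experts
    e1, e2 = np.argsort(p)[-2:][::-1]           # top-2 expert indices
    return p[e1] * experts[e1](h) + p[e2] * experts[e2](h)

# Toy "experts": linear maps standing in for two-layer MLPs.
experts = [lambda h, W=rng.normal(size=(d, d)): W @ h for _ in range(E)]
h = rng.normal(size=d)
out = moe_forward(h, experts)
print(out.shape)  # (16,)
```

Only two of the eight expert functions are ever evaluated per token, which is the source of the compute savings; the auxiliary load-balancing loss mentioned above would be added on top of the softmax probabilities during training.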

The base stack comprises input embeddings, rotary positional encoding (with YaRN for ultra-long context), Multi-Head Self-Attention (MHSA), Feed-Forward sublayers, and LayerNorm with residual connections. "Zigzag Attention" interleaves Streaming Sparse Attention (SSA) and full attention layers, permitting subquadratic compute scaling to million-token contexts without full-model retraining (Team et al., 23 Jan 2026, Zhang et al., 30 Dec 2025).

2. Long-Context Scaling via Zigzag Attention

For context lengths up to 1 million tokens, LFT-2601 employs LongCat ZigZag Attention (LoZA) (Zhang et al., 30 Dec 2025). In this scheme, 50% of MLA (Multi-head Latent Attention) modules are converted to sparse SSA layers using a two-tier "sink + local" pattern: each query attends to $s$ global sink blocks and $l$ local blocks (typical settings: $s=1$, $l=7$, block size $b=128$), yielding 1,024 attended keys per token. Sparse MLA is introduced via a gated calibration phase: a learned scalar $\alpha_i$ controls the interpolation between dense and sparse outputs, $\tilde{O}_i = \alpha_i O_i + (1-\alpha_i) O'_i$; layers are sorted by $\alpha_i$ after calibration to identify which can be sparsified.

Mid-training involves freezing all weights except the $\alpha_i$ gates, calibrating on 1B tokens, permanently sparsifying the selected layers, and unfreezing the model for continued long-context curriculum training. This process delivers linear-scaling memory/time cost ($O(n \cdot d \cdot 1{,}024)$), 30–83% speedups in prefill/decode rates, and reduced DRAM utilization at all context scales (Zhang et al., 30 Dec 2025). The resulting LFT-2601 model can efficiently handle multipart codebases and mathematical proofs spanning up to 1M tokens.
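The "sink + local" pattern can be made concrete by enumerating which key blocks a query block may attend to. The helper below is an illustrative sketch under the quoted settings ($s=1$, $l=7$, $b=128$), not code from the LoZA implementation:

```python
def attended_blocks(q_block, s=1, l=7):
    """Key-block indices a causal query block attends to: s global sink
    blocks at the start of the sequence plus a trailing window of l blocks."""
    sinks = set(range(min(s, q_block + 1)))                    # global sinks
    local = set(range(max(0, q_block - l + 1), q_block + 1))   # local window
    return sinks | local

b = 128
blocks = attended_blocks(q_block=500)   # a query block deep in a long context
print(len(blocks) * b)                  # 1024 attended keys per token
```

For any query block beyond the first few, the reachable set is the one sink block plus seven local blocks, i.e. $(s+l)\cdot b = 8 \times 128 = 1{,}024$ keys regardless of sequence length, which is where the linear $O(n \cdot d \cdot 1{,}024)$ scaling comes from.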

3. Training Pipeline: Domain-Parallel Expert Training, RL, and Fusion

The LFT-2601 training framework follows a staged progression:

  • Mid-training Curriculum: The cold-start phase synthesizes balanced, filtered datasets from LongCat-Flash-Base and reasoning-intensive corpora, with curriculum mixing to incrementally increase complex reasoning exposure. Competence is monitored by repeated-sampling pass@k metrics.
  • Supervised Fine-Tuning: Three streams—general reasoning, formal (automated theorem proving), and agentic/tool-based reasoning—undergo SFT on carefully stratified data. Instruction- and tool-based queries are selected through model-driven voting, deduplication, and algorithmic filtering.
  • Domain-Parallel RL: STEM, Code, and Agentic RL experts are trained in parallel environments using DORA (Dynamic ORchestration for Asynchronous rollout), with GRPO/GSPO objectives and stability augmentations. Notable features are streaming rollouts, multi-version policy staleness control, token-level normalization, triplet clipping, truncated importance sampling, and domain-specific reward models. Each expert is trained on domain-tuned context lengths (e.g., 48k–64k tokens) and RL objectives are scheduled with domain-adaptive clip and normalization parameters (Team et al., 23 Sep 2025, Team et al., 23 Jan 2026).
  • Model Fusion: Parameters from converged domain experts are fused using an adaptation of Ties-merging, dropout-based pruning (DARE-style), and minority-direction update erasure (SCE-style), resulting in a fused, nearly Pareto-optimal generalist. A short global PPO RL pass over open-domain tasks finalizes the process.
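The fusion step can be illustrated with the sign-election core of Ties-merging. This is a minimal sketch of that one ingredient (the DARE-style dropout pruning and SCE-style minority-direction erasure mentioned above are omitted), with made-up toy parameter vectors:

```python
import numpy as np

def ties_merge(base, experts):
    """Merge per-expert updates by electing the majority sign per parameter
    and averaging only the updates that agree with it (Ties-merging style)."""
    deltas = np.stack([e - base for e in experts])   # per-expert task vectors
    elected = np.sign(deltas.sum(axis=0))            # majority direction
    agree = np.sign(deltas) == elected               # mask dissenting updates
    masked = np.where(agree, deltas, 0.0)
    counts = np.maximum(agree.sum(axis=0), 1)        # avoid division by zero
    return base + masked.sum(axis=0) / counts        # mean of surviving terms

base  = np.zeros(4)                                  # toy base parameters
stem  = np.array([1.0,  2.0, -1.0, 0.0])             # hypothetical STEM expert
code  = np.array([1.0, -2.0, -3.0, 0.0])             # hypothetical Code expert
agent = np.array([3.0,  2.0, -1.0, 0.0])             # hypothetical Agentic expert
fused = ties_merge(base, [stem, code, agent])
print(fused)
```

Note how the second coordinate ignores the dissenting Code update (−2.0) and averages only the two agreeing ones, which is the interference-reduction effect the fusion stage relies on.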

4. Robustness via Noise Modeling and Large-Scale RL

LFT-2601 targets real-world deployment scenarios by incorporating principled noise modeling during RL. Noise sources are explicitly characterized as:

  • Instruction noise: user ambiguity, typos, rephrasings
  • Tool noise: execution errors, partial results, inconsistent/incomplete APIs

Noise is decomposed at the syntactic/semantic and turn/environment levels. During agentic RL, controlled noise is injected into both user instructions and tool outputs: $\mathrm{instr}'_t = \mathrm{instr}_t + \alpha \xi_t$ and $\mathrm{out}'_t = \mathrm{out}_t \oplus \mathrm{noise}_t$, sampled from empirical distributions. The curriculum schedule ramps $\alpha$ to increase robustness; the RL objective combines clean and noisy rollouts: $J(\theta) = J_\mathrm{GSPO}(\theta;\,\mathrm{noise}=0) + \lambda\, J_\mathrm{GSPO}(\theta;\,\mathrm{noise}=\alpha)$.
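The moving parts of this scheme can be sketched as follows. The noise table, ramp shape, and $\lambda$ value here are illustrative stand-ins, not the paper's empirical distributions or hyperparameters:

```python
import random

# Toy stand-ins for empirically sampled tool-noise events.
NOISE = ["<timeout>", "<partial result>", "<api error 503>"]

def inject_tool_noise(tool_out, alpha, rng):
    """With probability alpha, corrupt the tool output (out' = out ⊕ noise)."""
    if rng.random() < alpha:
        return tool_out + " " + rng.choice(NOISE)
    return tool_out

def alpha_schedule(step, total_steps, alpha_max=0.3):
    """Linear curriculum ramp of the noise level alpha (alpha_max is made up)."""
    return alpha_max * min(1.0, step / total_steps)

def combined_objective(j_clean, j_noisy, lam=0.5):
    """J(theta) = J_GSPO(noise=0) + lambda * J_GSPO(noise=alpha)."""
    return j_clean + lam * j_noisy

rng = random.Random(0)
print(alpha_schedule(500, 1000))              # 0.15 halfway up the ramp
print(inject_tool_noise("result: 42", 1.0, rng))
```

Training on the mixed objective rather than noisy rollouts alone keeps the clean-environment policy from degrading while the noise tolerance is learned.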

Ablation benchmarks show significant improvements in generalization under noise injection. For example, VitaBench-Noise improves from 6.3 (cold start) to 20.5 (noise-trained), and τ²-Bench-Noise from 58.8 to 67.1 (Team et al., 23 Jan 2026).

5. Efficient Reasoning: FlashThink Early-Exit Mechanism

LFT-2601 implements FlashThink (Jiang et al., 20 May 2025), a reasoning-phase early-exit method: during chain-of-thought (CoT) inference, reasoning output is chunked at delimiters; after each chunk, a lightweight verification model $\pi$ (e.g., Qwen2.5-7B-Instruct with a binary classification head) evaluates whether the reasoning so far is sufficient. If $\pi(x, c_1 \dots c_i) = \mathrm{True}$, the model halts reasoning and issues the answer; otherwise, generation continues.

Verification models are fine-tuned (FT²) on positive/negative labels derived from the correctness of partial reasoning chains. The exit criterion is: if $p_i = P_\varphi(\mathrm{yes} \mid x, c_1 \dots c_i) \ge \tau$ (default $\tau = 0.5$), exit and answer. FlashThink provides up to 94% reduction in generated reasoning-trace length with negligible loss in final accuracy. For QwQ-32B and DeepSeek-R1, mean reasoning length was reduced by over 77% with no discernible accuracy reduction. The $\pi$-model architecture and threshold $\tau$ are critical for the efficiency/accuracy trade-off (Jiang et al., 20 May 2025).
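The exit loop itself is simple; the cost is all in the verifier. Below is a minimal sketch in which `verifier_prob` is a deterministic stub standing in for the fine-tuned $\pi$-model:

```python
TAU = 0.5   # exit threshold from the text (default tau = 0.5)

def verifier_prob(question, chunks):
    """Stub for P(yes | x, c_1..c_i): pretend confidence grows with
    each accumulated reasoning chunk (a real pi-model scores content)."""
    return min(1.0, 0.2 * len(chunks))

def flashthink(question, chunk_stream, tau=TAU):
    """Consume CoT chunks until p_i >= tau, then stop and answer."""
    seen = []
    for chunk in chunk_stream:
        seen.append(chunk)
        if verifier_prob(question, seen) >= tau:
            break                      # sufficient reasoning: early exit
    return seen

chunks = ["c1", "c2", "c3", "c4", "c5", "c6"]
used = flashthink("what is 2+2?", chunks)
print(len(used))  # 3 of 6 chunks: exits once p_3 = 0.6 >= 0.5
```

With this stub the loop stops after three of six chunks; the reported 77–94% trace-length reductions correspond to a real verifier firing much earlier than the model's natural end-of-reasoning token.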

6. Heavy Thinking Mode and Adaptive Test-Time Scaling

Heavy Thinking Mode enables LFT-2601 to scale test-time compute for especially challenging queries (Team et al., 23 Jan 2026):

  • Stage I: Parallel Exploration: $B$ reasoning trajectories $\{\tau_i\}$ are generated in parallel.
  • Stage II: Summary & Refinement: A summarizer module $S$ aggregates the set into a final answer $y^*$ via RL-enhanced voting or synthesis.

Resource allocation is tunable: for a compute budget $C$, $B = \lfloor C/(T+S) \rfloor$ parallel chains of length $T$ are generated, and the summary costs $S$ tokens. Higher $C$ enables deeper (longer chains) and/or wider (more chains) parallel search, improving solution robustness and reasoning reliability. The summarizer is itself RL-finetuned for optimal selection/aggregation.
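The budget split is a one-line computation; the numbers below are made up for illustration:

```python
def num_chains(C, T, S):
    """B = floor(C / (T + S)): parallel trajectories under budget C,
    given per-chain length T and summary cost S (symbols as in the text)."""
    return C // (T + S)

C = 200_000   # total token budget (hypothetical)
T = 48_000    # tokens per reasoning chain (hypothetical)
S = 2_000     # summarizer cost in tokens (hypothetical)
print(num_chains(C, T, S))  # 4 parallel trajectories under this budget
```

Doubling $C$ here doubles $B$ (wider search); alternatively, the same budget can be spent on fewer but longer chains by raising $T$, which is the depth/width trade-off the text describes.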

7. Benchmark Performance and Comparative Results

LFT-2601 demonstrates state-of-the-art open-source results across agentic reasoning, agentic tool use, and tool-integrated reasoning contexts:

| Benchmark | Best OSS | GPT-5 | Claude | LFT-2601 |
|---|---|---|---|---|
| BrowseComp (w/ ctx mgmt) | 73.1 | 65.8 | 65.8 | 73.1 |
| RWSearch | 79.5 | 82.0 | 75.5 | 79.5 |
| τ²-Bench | 88.6 | 98.9 | 88.9 | 88.2 |
| τ²-Bench-Noise | 67.1 | 65.0 | 59.4 | 67.1 |
| VitaBench-Noise | 20.5 | 19.0 | 20.3 | 20.5 |
| AIME-25 (Avg@16, tool reasoning) | – | – | – | 100.0 |

Further, on AIME-25 tool-integrated tasks, average token consumption drops from 19,653 to 6,965 (a 64.5% reduction) without accuracy loss. General QA and code benchmarks indicate robust competitiveness with both open- and closed-weight leaders (Team et al., 23 Jan 2026, Team et al., 23 Sep 2025).


LongCat-Flash-Thinking-2601 establishes a comprehensive paradigm for open-weight agentic reasoning LLMs: combining scalable MoE routing, long-context and reasoning trace optimization, robust RL, and dynamic, noise-tolerant training methodologies, it delivers efficiency, robustness, and broad generalization at scales previously only accessible to closed-weight models (Team et al., 23 Jan 2026, Zhang et al., 30 Dec 2025, Team et al., 23 Sep 2025).
