LongCat-Flash-Thinking-2601: Scalable MoE Transformer
- LFT-2601 is a 560B-parameter Mixture-of-Experts Transformer that employs sparse expert routing, Zigzag Attention, and Heavy Thinking Mode for long-context agentic reasoning.
- The model’s training pipeline integrates domain-parallel expert training, reinforcement learning with noise injection, and model fusion to optimize performance across diverse tool-integrated tasks.
- LFT-2601 achieves state-of-the-art results in open-weight agentic reasoning and tool use while delivering significant speedups and reduced memory overhead in long-context applications.
LongCat-Flash-Thinking-2601 (LFT-2601) is a 560-billion-parameter open-source Mixture-of-Experts (MoE) Transformer designed for high-efficiency, robust, and generalizable agentic reasoning across long contexts and complex tool-integrated environments. Engineered through the integration of sparse expert routing, long-horizon context scaling, domain-parallel training, and large-scale asynchronous reinforcement learning, LFT-2601 achieves state-of-the-art performance in open-weight agentic reasoning, search, and tool-augmented tasks. The model combines innovations in architecture (e.g., Zigzag/Streaming Sparse Attention for million-token contexts, Heavy Thinking Mode for parallel solution exploration), training (domain-specialized expert distillation and fusion, noise-robust RL), and data construction (automatic environment and task curriculum generation).
1. Model Architecture and Sparse Expert Routing
LFT-2601 is structured as a 560B-parameter Transformer with a Mixture-of-Experts backbone: only an average of 27B parameters are activated per token via top-2 sparse expert selection. Each MoE layer comprises E (typically 64–128) independent two-layer MLP experts; a learned linear router computes per-token logits $z_t = W_r x_t$, applies a softmax to obtain routing probabilities $p_t = \mathrm{softmax}(z_t)$, and activates the top-2 experts for each token. The MoE output is given by $y_t = \sum_{e \in \mathrm{Top2}(p_t)} p_{t,e}\, E_e(x_t)$, reducing per-token compute. Load balancing is regularized by an auxiliary loss over the token distribution across experts. Dense (standard FFN) and sparse-MoE layers alternate, and architectural optimizations include zero-computation experts and shortcut-connected experts for trivial routing scenarios.
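The routing scheme above can be sketched in a few lines. This is a minimal NumPy illustration of top-2 sparse expert routing, not the model's implementation; the router weights, expert MLPs, and all dimensions are illustrative stand-ins.

```python
# Minimal sketch of top-2 sparse expert routing: a linear router scores
# experts per token, and each token's output is the probability-weighted
# sum of its top-2 experts' MLP outputs. All shapes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, top_k = 8, 16, 4, 2

W_router = rng.standard_normal((d_model, n_experts)) * 0.1  # hypothetical router
experts = [
    (rng.standard_normal((d_model, d_ff)) * 0.1,
     rng.standard_normal((d_ff, d_model)) * 0.1)
    for _ in range(n_experts)
]

def moe_layer(x):
    """Route each token to its top-2 experts, weighted by softmax probs."""
    logits = x @ W_router                          # per-token expert logits z_t
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)          # softmax over experts
    top = np.argsort(-probs, axis=-1)[:, :top_k]   # top-2 expert indices
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for e in top[t]:
            W1, W2 = experts[e]
            h = np.maximum(x[t] @ W1, 0.0)         # two-layer expert MLP (ReLU)
            out[t] += probs[t, e] * (h @ W2)
    return out

tokens = rng.standard_normal((3, d_model))
y = moe_layer(tokens)
print(y.shape)  # (3, 8)
```

Only `top_k` of the `n_experts` MLPs run per token, which is where the activated-parameter savings comes from; the auxiliary load-balancing loss mentioned above would be computed from `probs` and is omitted here.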
The base stack comprises input embeddings, rotary positional encoding (with YaRN for ultra-long context), Multi-Head Self-Attention (MHSA), Feed-Forward sublayers, and LayerNorm with residual connections. "Zigzag Attention" interleaves Streaming Sparse Attention (SSA) and full attention layers, permitting subquadratic compute scaling to million-token contexts without full-model retraining (Team et al., 23 Jan 2026, Zhang et al., 30 Dec 2025).
2. Long-Context Scaling via Zigzag Attention
For context lengths up to 1 million tokens, LFT-2601 employs LongCat ZigZag Attention (LoZA) (Zhang et al., 30 Dec 2025). In this scheme, 50% of MLA (Multi-head Latent Attention) modules are converted to sparse SSA layers using a two-tier "sink + local" pattern: each query attends to $n_{\text{sink}}$ global sink blocks and $n_{\text{local}}$ local blocks of block size $B$, yielding roughly $(n_{\text{sink}} + n_{\text{local}})\,B$ attended keys per token. Sparse MLA is introduced via a gated calibration phase: a learned per-layer scalar $\alpha_\ell$ controls the interpolation between dense and sparse outputs, $o_\ell = \alpha_\ell\, o_\ell^{\text{sparse}} + (1-\alpha_\ell)\, o_\ell^{\text{dense}}$, and layers are sorted by $\alpha_\ell$ post-calibration to identify sparsifiable layers.
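The "sink + local" pattern can be visualized as a block-level attention mask. The sketch below builds such a mask under stated assumptions (the block counts and sizes are illustrative, not the model's actual settings): each query block attends to the first `n_sink` blocks plus a causal window of its `n_local` most recent blocks.

```python
# Sketch of the two-tier "sink + local" sparsity pattern: each query block
# attends to the leading sink blocks plus a local causal window.
# Parameters are illustrative only.
import numpy as np

def sink_local_mask(n_blocks, n_sink, n_local):
    """Boolean block-level attention mask (query blocks x key blocks)."""
    mask = np.zeros((n_blocks, n_blocks), dtype=bool)
    for q in range(n_blocks):
        mask[q, :min(n_sink, q + 1)] = True   # global sink blocks (causal)
        lo = max(0, q - n_local + 1)
        mask[q, lo:q + 1] = True              # local causal window
    return mask

m = sink_local_mask(n_blocks=8, n_sink=2, n_local=3)
print(m.sum(axis=1))  # attended key blocks per query block
```

Because each query touches at most `n_sink + n_local` key blocks regardless of sequence length, attention cost in the converted layers grows linearly rather than quadratically with context.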
Mid-training involves freezing all non-$\alpha$ weights, calibrating on $1$B tokens, permanently sparsifying the selected layers, and unfreezing the model for continued long-context curriculum training. This process delivers linearly scaling memory/time cost ($O(L)$ rather than $O(L^2)$ in context length $L$), 30–83% speedups in prefill/decode rates, and reduced DRAM utilization at all context scales (Zhang et al., 30 Dec 2025). The resulting LFT-2601 model can efficiently handle multipart codebases and mathematical proofs spanning up to 1M tokens.
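The calibrate-then-select step can be sketched as follows. This is a toy illustration, assuming hypothetical per-layer gate values: a high $\alpha_\ell$ means the sparse path already reproduces the dense output, so those layers are the safest to sparsify permanently.

```python
# Sketch of gated calibration: a learned scalar alpha interpolates each
# converted layer's sparse and dense outputs; after calibration, layers are
# ranked by alpha to decide which to sparsify. Values are hypothetical.
def calibrated_output(o_sparse, o_dense, alpha):
    """alpha * sparse + (1 - alpha) * dense, as in the gating equation."""
    return alpha * o_sparse + (1.0 - alpha) * o_dense

alphas = {"layer_3": 0.96, "layer_7": 0.91, "layer_12": 0.42}  # hypothetical
ranked = sorted(alphas, key=alphas.get, reverse=True)
print(ranked)  # layers most amenable to permanent sparsification first
```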
3. Training Pipeline: Domain-Parallel Expert Training, RL, and Fusion
The LFT-2601 training framework follows a staged progression:
- Mid-training Curriculum: The cold-start phase synthesizes balanced, filtered datasets from LongCat-Flash-Base and reasoning-intensive corpora, with curriculum mixing to incrementally increase complex reasoning exposure. Competence is monitored by repeated-sampling pass@k metrics.
- Supervised Fine-Tuning: Three streams—general reasoning, formal (automated theorem proving), and agentic/tool-based reasoning—undergo SFT on carefully stratified data. Instruction- and tool-based queries are selected through model-driven voting, deduplication, and algorithmic filtering.
- Domain-Parallel RL: STEM, Code, and Agentic RL experts are trained in parallel environments using DORA (Dynamic ORchestration for Asynchronous rollout), with GRPO/GSPO objectives and stability augmentations. Notable features are streaming rollouts, multi-version policy staleness control, token-level normalization, triplet clipping, truncated importance sampling, and domain-specific reward models. Each expert is trained on domain-tuned context lengths (e.g., 48k–64k tokens) and RL objectives are scheduled with domain-adaptive clip and normalization parameters (Team et al., 23 Sep 2025, Team et al., 23 Jan 2026).
- Model Fusion: Parameters from converged domain experts are fused using an adaptation of Ties-merging, dropout-based pruning (DARE-style), and minority-direction update erasure (SCE-style), resulting in a fused, nearly Pareto-optimal generalist. A short global PPO RL pass over open-domain tasks finalizes the process.
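The fusion step above combines three ideas: task vectors relative to a shared base, DARE-style random pruning of updates, and Ties-style sign-consensus merging that erases minority-direction components. The sketch below is a simplified stand-in for that pipeline, not the authors' implementation; all shapes and rates are illustrative.

```python
# Simplified sketch of expert fusion: DARE-style dropout of task vectors,
# then a Ties-style sign-consensus merge that zeroes out updates whose sign
# disagrees with the per-parameter majority. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def fuse(base, experts, drop_p=0.5):
    deltas = [e - base for e in experts]           # task vectors per expert
    kept = []
    for d in deltas:
        m = rng.random(d.shape) >= drop_p          # DARE-style random pruning
        kept.append(np.where(m, d / (1.0 - drop_p), 0.0))  # rescale survivors
    stacked = np.stack(kept)
    sign = np.sign(np.sign(stacked).sum(axis=0))   # majority sign per parameter
    agree = np.where(np.sign(stacked) == sign, stacked, 0.0)
    counts = (agree != 0).sum(axis=0)
    merged = agree.sum(axis=0) / np.maximum(counts, 1)  # mean of agreeing updates
    return base + merged

base = rng.standard_normal(6)
domain_experts = [base + rng.standard_normal(6) * 0.1 for _ in range(3)]
fused = fuse(base, domain_experts)
print(fused.shape)  # (6,)
```

In practice this merge is applied parameter-tensor by parameter-tensor across the full model, after which the short global RL pass described above fine-tunes the fused generalist.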
4. Robustness via Noise Modeling and Large-Scale RL
LFT-2601 targets real-world deployment scenarios by incorporating principled noise modeling during RL. Noise sources are explicitly characterized as:
- Instruction noise: user ambiguity, typos, rephrasings
- Tool noise: execution errors, partial results, inconsistent/incomplete APIs
Noise is decomposed at syntactic/semantic and turn/environment levels. During agentic RL, controlled noise is injected into both user instructions and tool outputs: $\tilde{u} = \mathcal{N}_{\text{instr}}(u)$ and $\tilde{o} = \mathcal{N}_{\text{tool}}(o)$, with perturbations sampled from empirical noise distributions. A curriculum schedule ramps the injection probability $p_{\text{noise}}$ to increase robustness; the RL objective combines clean and noisy rollouts, $J = \mathbb{E}_{\text{clean}}[R] + \lambda\, \mathbb{E}_{\text{noisy}}[R]$.
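A minimal sketch of this curriculum noise injection, under stated assumptions: the specific perturbations here (an adjacent-character "typo" and a truncated tool result) are toy stand-ins for the empirical instruction- and tool-noise distributions described above, and the linear ramp is one simple choice of schedule.

```python
# Sketch of curriculum noise injection for agentic RL rollouts: with
# probability p_noise (ramped over training steps), perturb the instruction
# and/or the tool output before the policy sees them. Perturbations are
# illustrative stand-ins for the empirical noise distributions.
import random

random.seed(0)

def inject_instruction_noise(text):
    """Toy syntactic noise: swap two adjacent characters (a 'typo')."""
    if len(text) < 2:
        return text
    i = random.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def inject_tool_noise(result):
    """Toy environment noise: return a truncated / partial tool result."""
    return result[: max(1, len(result) // 2)] + " …[truncated]"

def noisy_rollout_inputs(instruction, tool_output, step, ramp_steps=1000):
    p_noise = min(1.0, step / ramp_steps)   # linear curriculum ramp
    if random.random() < p_noise:
        instruction = inject_instruction_noise(instruction)
    if random.random() < p_noise:
        tool_output = inject_tool_noise(tool_output)
    return instruction, tool_output

ins, out = noisy_rollout_inputs("book a flight to NYC", "FLIGHTS: AA100, DL200", step=900)
print(ins, "|", out)
```

Early in training (`step` near 0) rollouts are almost always clean; late in training most rollouts carry injected noise, matching the ramped-robustness curriculum described above.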
Ablation benchmarks show significant improvements in generalization under noise injection. For example, VitaBench-Noise improves from 6.3 (cold start) to 20.5 (noise-trained), and τ²-Bench-Noise from 58.8 to 67.1 (Team et al., 23 Jan 2026).
5. Efficient Reasoning: FlashThink Early-Exit Mechanism
LFT-2601 implements FlashThink (Jiang et al., 20 May 2025), a reasoning phase early-exit method: during chain-of-thought (CoT) inference, reasoning output is chunked at delimiters; after each chunk, a lightweight verification model π (e.g., Qwen2.5-7B-Instruct, binary classification) evaluates sufficiency. If π(x | c₁…cᵢ) = True, the model halts reasoning and issues the answer; otherwise, generation continues.
Verification models are fine-tuned (FT²) on positive/negative labels derived from partial reasoning-chain correctness. The mathematical exit criterion is: if $p_\pi(\text{sufficient} \mid c_1 \ldots c_i) > \tau$ for a tunable threshold $\tau$, exit and answer. FlashThink provides up to a 94% reduction in generated reasoning-trace length, with negligible loss in final accuracy. For QwQ-32B and DeepSeek-R1, mean reasoning length was reduced by over 77% with no discernible accuracy reduction. The π-model architecture and the threshold $\tau$ are critical to the efficiency/accuracy trade-off (Jiang et al., 20 May 2025).
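The early-exit loop itself is simple. The sketch below assumes a chunk generator and a verifier callable; both are hypothetical stand-ins for the actual policy model and the fine-tuned verification model π.

```python
# Sketch of the FlashThink-style early-exit loop: accumulate CoT chunks and
# stop as soon as the verifier's sufficiency score exceeds the threshold tau.
# The generator and verifier here are toy stand-ins.
def flashthink_generate(gen_chunks, verifier, tau=0.5):
    """Stop reasoning as soon as the verifier deems the chunks sufficient."""
    chunks = []
    for chunk in gen_chunks:
        chunks.append(chunk)
        if verifier(chunks) > tau:   # exit criterion: p_exit > tau
            break
    return chunks

# Toy usage: the verifier fires once a chunk containing "therefore" appears.
fake_cot = ["let x = 2.", "then x^2 = 4.", "therefore the answer is 4.", "extra step."]
fake_verifier = lambda cs: 1.0 if "therefore" in cs[-1] else 0.0
kept = flashthink_generate(fake_cot, fake_verifier)
print(len(kept))  # 3 — the trailing chunk is never generated
```

Raising `tau` trades longer traces for more conservative exits, which is the efficiency/accuracy knob described above.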
6. Heavy Thinking Mode and Adaptive Test-Time Scaling
Heavy Thinking Mode enables LFT-2601 to scale test-time compute for especially challenging queries (Team et al., 23 Jan 2026):
- Stage I: Parallel Exploration: $k$ reasoning trajectories are generated in parallel.
- Stage II: Summary & Refinement: A summarizer module aggregates the $k$ trajectories into a final answer via RL-enhanced voting or synthesis.
Resource allocation is tunable: for a compute budget $B$, $k$ parallel chains of length $L$ are generated, so $B \approx kL$ plus the summarization cost, itself on the order of $kL$ input tokens. A higher $B$ enables deeper (longer chains) and/or wider (more chains) parallel search, improving solution robustness and reasoning reliability. The summarizer is itself RL-finetuned for optimal selection/aggregation.
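The budget split and the aggregation step can be sketched as below. The allocation rule and the summarization allowance are illustrative assumptions, and a simple majority vote stands in for the RL-trained summarizer.

```python
# Sketch of Heavy Thinking resource allocation: split a token budget B
# between k parallel chains of length L (reserving a fixed summarization
# allowance), then aggregate answers by majority vote as a stand-in for
# the RL-trained summarizer. All numbers are illustrative.
from collections import Counter

def allocate(budget, k, summary_tokens=512):
    """Return per-chain length L given total budget B and chain count k."""
    return max(0, (budget - summary_tokens) // k)

def aggregate(answers):
    """Majority vote over parallel-chain answers (summarizer stand-in)."""
    return Counter(answers).most_common(1)[0][0]

L = allocate(budget=65536, k=8)
print(L)                                     # per-chain token budget
print(aggregate(["42", "41", "42", "42"]))   # winning answer
```

Holding `budget` fixed, increasing `k` widens the search at the cost of shallower chains; the RL-finetuned summarizer replaces the naive vote in the actual system.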
7. Benchmark Performance and Comparative Results
LFT-2601 demonstrates state-of-the-art open-source results across agentic reasoning, agentic tool use, and tool-integrated reasoning contexts:
| Benchmark | Best OSS | GPT-5 | Claude | LFT-2601 |
|---|---|---|---|---|
| BrowseComp (w/ ctx mgmt) | 73.1 | 65.8 | 65.8 | 73.1 |
| RWSearch | 79.5 | 82.0 | 75.5 | 79.5 |
| τ²-Bench | 88.6 | 98.9 | 88.9 | 88.2 |
| τ²-Bench-Noise | 67.1 | 65.0 | 59.4 | 67.1 |
| VitaBench-Noise | 20.5 | 19.0 | 20.3 | 20.5 |
| AIME-25 (Avg@16, tool reasoning) | – | – | – | 100.0 |
Further, on AIME-25 tool-integrated tasks, average token consumption drops from 19,653 to 6,965 tokens (a 64.5% reduction) without accuracy loss. General QA and code benchmarks indicate robust competitiveness with both open- and closed-weight leaders (Team et al., 23 Jan 2026, Team et al., 23 Sep 2025).
LongCat-Flash-Thinking-2601 establishes a comprehensive paradigm for open-weight agentic reasoning LLMs: combining scalable MoE routing, long-context and reasoning trace optimization, robust RL, and dynamic, noise-tolerant training methodologies, it delivers efficiency, robustness, and broad generalization at scales previously only accessible to closed-weight models (Team et al., 23 Jan 2026, Zhang et al., 30 Dec 2025, Team et al., 23 Sep 2025).