LongCat-Flash-Thinking-2601: Scalable MoE Transformer
- LFT-2601 is a 560B-parameter Mixture-of-Experts Transformer that employs sparse expert routing, Zigzag Attention, and Heavy Thinking Mode for long-context agentic reasoning.
- The model’s training pipeline integrates domain-parallel expert training, reinforcement learning with noise injection, and model fusion to optimize performance across diverse tool-integrated tasks.
- LFT-2601 achieves state-of-the-art results in open-weight agentic reasoning and tool use while delivering significant speedups and reduced memory overhead in long-context applications.
LongCat-Flash-Thinking-2601 (LFT-2601) is a 560-billion-parameter open-source Mixture-of-Experts (MoE) Transformer designed for high-efficiency, robust, and generalizable agentic reasoning across long contexts and complex tool-integrated environments. Engineered through the integration of sparse expert routing, long-horizon context scaling, domain-parallel training, and large-scale asynchronous reinforcement learning, LFT-2601 achieves state-of-the-art performance in open-weight agentic reasoning, search, and tool-augmented tasks. The model combines innovations in architecture (e.g., Zigzag/Streaming Sparse Attention for million-token contexts, Heavy Thinking Mode for parallel solution exploration), training (domain-specialized expert distillation and fusion, noise-robust RL), and data construction (automatic environment and task curriculum generation).
1. Model Architecture and Sparse Expert Routing
LFT-2601 is structured as a 560B-parameter Transformer with a Mixture-of-Experts backbone: only an average of 27B parameters are activated per token via top-2 sparse expert selection. Each MoE layer comprises E (typically 64–128) independent two-layer MLP experts; a learned linear router computes per-token logits $z_t = W_r x_t$, applies a softmax to obtain routing probabilities $p_t = \mathrm{softmax}(z_t)$, and activates the top-2 experts for each token. The MoE output is given by $y_t = \sum_{e \in \mathrm{Top2}(p_t)} p_{t,e}\, E_e(x_t)$, reducing per-token compute. Load balancing is regularized by an auxiliary loss over the token distribution across experts. Dense (standard FFN) and sparse-MoE layers alternate, and architectural optimizations include zero-computation experts and shortcut-connected experts for trivial routing scenarios.
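The routing scheme above can be sketched in a few lines. This is a minimal NumPy illustration of top-2 sparse expert routing, not the model's implementation; the router weights, expert MLPs, and all dimensions are illustrative stand-ins.

```python
# Minimal sketch of top-2 sparse expert routing: a linear router scores
# experts per token, and each token's output is the probability-weighted
# sum of its top-2 experts' MLP outputs. All shapes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, top_k = 8, 16, 4, 2

W_router = rng.standard_normal((d_model, n_experts)) * 0.1  # hypothetical router
experts = [
    (rng.standard_normal((d_model, d_ff)) * 0.1,
     rng.standard_normal((d_ff, d_model)) * 0.1)
    for _ in range(n_experts)
]

def moe_layer(x):
    """Route each token to its top-2 experts, weighted by softmax probs."""
    logits = x @ W_router                          # per-token expert logits z_t
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)          # softmax over experts
    top = np.argsort(-probs, axis=-1)[:, :top_k]   # top-2 expert indices
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for e in top[t]:
            W1, W2 = experts[e]
            h = np.maximum(x[t] @ W1, 0.0)         # two-layer expert MLP (ReLU)
            out[t] += probs[t, e] * (h @ W2)
    return out

tokens = rng.standard_normal((3, d_model))
y = moe_layer(tokens)
print(y.shape)  # (3, 8)
```

Only `top_k` of the `n_experts` MLPs run per token, which is where the activated-parameter savings comes from; the auxiliary load-balancing loss mentioned above would be computed from `probs` and is omitted here.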
The base stack comprises input embeddings, rotary positional encoding (with YaRN for ultra-long context), Multi-Head Self-Attention (MHSA), Feed-Forward sublayers, and LayerNorm with residual connections. "Zigzag Attention" interleaves Streaming Sparse Attention (SSA) and full attention layers, permitting subquadratic compute scaling to million-token contexts without full-model retraining (Team et al., 23 Jan 2026, Zhang et al., 30 Dec 2025).
2. Long-Context Scaling via Zigzag Attention
For context lengths up to 1 million tokens, LFT-2601 employs LongCat ZigZag Attention (LoZA) (Zhang et al., 30 Dec 2025). In this scheme, 50% of MLA (Multi-head Latent Attention) modules are converted to sparse SSA layers using a two-tier "sink + local" pattern: each query attends to $n_{\text{sink}}$ global sink blocks and $n_{\text{local}}$ local blocks of block size $B$, yielding roughly $(n_{\text{sink}} + n_{\text{local}})\,B$ attended keys per token. Sparse MLA is introduced via a gated calibration phase: a learned per-layer scalar $\alpha_\ell$ controls the interpolation between dense and sparse outputs, $o_\ell = \alpha_\ell\, o_\ell^{\text{sparse}} + (1-\alpha_\ell)\, o_\ell^{\text{dense}}$, and layers are sorted by $\alpha_\ell$ post-calibration to identify sparsifiable layers.
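The "sink + local" pattern can be visualized as a block-level attention mask. The sketch below builds such a mask under stated assumptions (the block counts and sizes are illustrative, not the model's actual settings): each query block attends to the first `n_sink` blocks plus a causal window of its `n_local` most recent blocks.

```python
# Sketch of the two-tier "sink + local" sparsity pattern: each query block
# attends to the leading sink blocks plus a local causal window.
# Parameters are illustrative only.
import numpy as np

def sink_local_mask(n_blocks, n_sink, n_local):
    """Boolean block-level attention mask (query blocks x key blocks)."""
    mask = np.zeros((n_blocks, n_blocks), dtype=bool)
    for q in range(n_blocks):
        mask[q, :min(n_sink, q + 1)] = True   # global sink blocks (causal)
        lo = max(0, q - n_local + 1)
        mask[q, lo:q + 1] = True              # local causal window
    return mask

m = sink_local_mask(n_blocks=8, n_sink=2, n_local=3)
print(m.sum(axis=1))  # attended key blocks per query block
```

Because each query touches at most `n_sink + n_local` key blocks regardless of sequence length, attention cost in the converted layers grows linearly rather than quadratically with context.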
Mid-training involves freezing all non-$\alpha$ weights, calibrating on $1$B tokens, permanently sparsifying the selected layers, and unfreezing the model for continued long-context curriculum training. This process delivers linearly scaling memory/time cost ($O(L)$ rather than $O(L^2)$ in context length $L$), 30–83% speedups in prefill/decode rates, and reduced DRAM utilization at all context scales (Zhang et al., 30 Dec 2025). The resulting LFT-2601 model can efficiently handle multipart codebases and mathematical proofs spanning up to 1M tokens.
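The calibrate-then-select step can be sketched as follows. This is a toy illustration, assuming hypothetical per-layer gate values: a high $\alpha_\ell$ means the sparse path already reproduces the dense output, so those layers are the safest to sparsify permanently.

```python
# Sketch of gated calibration: a learned scalar alpha interpolates each
# converted layer's sparse and dense outputs; after calibration, layers are
# ranked by alpha to decide which to sparsify. Values are hypothetical.
def calibrated_output(o_sparse, o_dense, alpha):
    """alpha * sparse + (1 - alpha) * dense, as in the gating equation."""
    return alpha * o_sparse + (1.0 - alpha) * o_dense

alphas = {"layer_3": 0.96, "layer_7": 0.91, "layer_12": 0.42}  # hypothetical
ranked = sorted(alphas, key=alphas.get, reverse=True)
print(ranked)  # layers most amenable to permanent sparsification first
```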
3. Training Pipeline: Domain-Parallel Expert Training, RL, and Fusion
The LFT-2601 training framework follows a staged progression:
- Mid-training Curriculum: The cold-start phase synthesizes balanced, filtered datasets from LongCat-Flash-Base and reasoning-intensive corpora, with curriculum mixing to incrementally increase complex reasoning exposure. Competence is monitored by repeated-sampling pass@k metrics.
- Supervised Fine-Tuning: Three streams—general reasoning, formal (automated theorem proving), and agentic/tool-based reasoning—undergo SFT on carefully stratified data. Instruction- and tool-based queries are selected through model-driven voting, deduplication, and algorithmic filtering.
- Domain-Parallel RL: STEM, Code, and Agentic RL experts are trained in parallel environments using DORA (Dynamic ORchestration for Asynchronous rollout), with GRPO/GSPO objectives and stability augmentations. Notable features are streaming rollouts, multi-version policy staleness control, token-level normalization, triplet clipping, truncated importance sampling, and domain-specific reward models. Each expert is trained on domain-tuned context lengths (e.g., 48k–64k tokens) and RL objectives are scheduled with domain-adaptive clip and normalization parameters (Team et al., 23 Sep 2025, Team et al., 23 Jan 2026).
- Model Fusion: Parameters from converged domain experts are fused using an adaptation of Ties-merging, dropout-based pruning (DARE-style), and minority-direction update erasure (SCE-style), resulting in a fused, nearly Pareto-optimal generalist. A short global PPO RL pass over open-domain tasks finalizes the process.
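The fusion step above combines three ideas: task vectors relative to a shared base, DARE-style random pruning of updates, and Ties-style sign-consensus merging that erases minority-direction components. The sketch below is a simplified stand-in for that pipeline, not the authors' implementation; all shapes and rates are illustrative.

```python
# Simplified sketch of expert fusion: DARE-style dropout of task vectors,
# then a Ties-style sign-consensus merge that zeroes out updates whose sign
# disagrees with the per-parameter majority. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def fuse(base, experts, drop_p=0.5):
    deltas = [e - base for e in experts]           # task vectors per expert
    kept = []
    for d in deltas:
        m = rng.random(d.shape) >= drop_p          # DARE-style random pruning
        kept.append(np.where(m, d / (1.0 - drop_p), 0.0))  # rescale survivors
    stacked = np.stack(kept)
    sign = np.sign(np.sign(stacked).sum(axis=0))   # majority sign per parameter
    agree = np.where(np.sign(stacked) == sign, stacked, 0.0)
    counts = (agree != 0).sum(axis=0)
    merged = agree.sum(axis=0) / np.maximum(counts, 1)  # mean of agreeing updates
    return base + merged

base = rng.standard_normal(6)
domain_experts = [base + rng.standard_normal(6) * 0.1 for _ in range(3)]
fused = fuse(base, domain_experts)
print(fused.shape)  # (6,)
```

In practice this merge is applied parameter-tensor by parameter-tensor across the full model, after which the short global RL pass described above fine-tunes the fused generalist.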
4. Robustness via Noise Modeling and Large-Scale RL
LFT-2601 targets real-world deployment scenarios by incorporating principled noise modeling during RL. Noise sources are explicitly characterized as:
- Instruction noise: user ambiguity, typos, rephrasings
- Tool noise: execution errors, partial results, inconsistent/incomplete APIs
Noise is decomposed at syntactic/semantic and turn/environment levels. During agentic RL, controlled noise is injected into both user instructions and tool outputs: $\tilde{u} = \mathcal{N}_{\text{instr}}(u)$ and $\tilde{o} = \mathcal{N}_{\text{tool}}(o)$, with perturbations sampled from empirical noise distributions. A curriculum schedule ramps the injection probability $p_{\text{noise}}$ to increase robustness; the RL objective combines clean and noisy rollouts, $J = \mathbb{E}_{\text{clean}}[R] + \lambda\, \mathbb{E}_{\text{noisy}}[R]$.
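A minimal sketch of this curriculum noise injection, under stated assumptions: the specific perturbations here (an adjacent-character "typo" and a truncated tool result) are toy stand-ins for the empirical instruction- and tool-noise distributions described above, and the linear ramp is one simple choice of schedule.

```python
# Sketch of curriculum noise injection for agentic RL rollouts: with
# probability p_noise (ramped over training steps), perturb the instruction
# and/or the tool output before the policy sees them. Perturbations are
# illustrative stand-ins for the empirical noise distributions.
import random

random.seed(0)

def inject_instruction_noise(text):
    """Toy syntactic noise: swap two adjacent characters (a 'typo')."""
    if len(text) < 2:
        return text
    i = random.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def inject_tool_noise(result):
    """Toy environment noise: return a truncated / partial tool result."""
    return result[: max(1, len(result) // 2)] + " …[truncated]"

def noisy_rollout_inputs(instruction, tool_output, step, ramp_steps=1000):
    p_noise = min(1.0, step / ramp_steps)   # linear curriculum ramp
    if random.random() < p_noise:
        instruction = inject_instruction_noise(instruction)
    if random.random() < p_noise:
        tool_output = inject_tool_noise(tool_output)
    return instruction, tool_output

ins, out = noisy_rollout_inputs("book a flight to NYC", "FLIGHTS: AA100, DL200", step=900)
print(ins, "|", out)
```

Early in training (`step` near 0) rollouts are almost always clean; late in training most rollouts carry injected noise, matching the ramped-robustness curriculum described above.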
Ablation benchmarks show significant improvements in generalization under noise injection. For example, VitaBench-Noise improves from 6.3 (cold start) to 20.5 (noise-trained), and τ²-Bench-Noise from 58.8 to 67.1 (Team et al., 23 Jan 2026).
5. Efficient Reasoning: FlashThink Early-Exit Mechanism
LFT-2601 implements FlashThink (Jiang et al., 20 May 2025), a reasoning phase early-exit method: during chain-of-thought (CoT) inference, reasoning output is chunked at delimiters; after each chunk, a lightweight verification model π (e.g., Qwen2.5-7B-Instruct, binary classification) evaluates sufficiency. If π(x | c₁…cᵢ) = True, the model halts reasoning and issues the answer; otherwise, generation continues.
Verification models are fine-tuned (FT²) on positive/negative labels derived from partial reasoning-chain correctness. The mathematical exit criterion is: if $p_\pi(\text{sufficient} \mid c_1 \ldots c_i) > \tau$ for a tunable threshold $\tau$, exit and answer. FlashThink provides up to a 94% reduction in generated reasoning-trace length, with negligible loss in final accuracy. For QwQ-32B and DeepSeek-R1, mean reasoning length was reduced by over 77% with no discernible accuracy reduction. The π-model architecture and the threshold $\tau$ are critical to the efficiency/accuracy trade-off (Jiang et al., 20 May 2025).
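The early-exit loop itself is simple. The sketch below assumes a chunk generator and a verifier callable; both are hypothetical stand-ins for the actual policy model and the fine-tuned verification model π.

```python
# Sketch of the FlashThink-style early-exit loop: accumulate CoT chunks and
# stop as soon as the verifier's sufficiency score exceeds the threshold tau.
# The generator and verifier here are toy stand-ins.
def flashthink_generate(gen_chunks, verifier, tau=0.5):
    """Stop reasoning as soon as the verifier deems the chunks sufficient."""
    chunks = []
    for chunk in gen_chunks:
        chunks.append(chunk)
        if verifier(chunks) > tau:   # exit criterion: p_exit > tau
            break
    return chunks

# Toy usage: the verifier fires once a chunk containing "therefore" appears.
fake_cot = ["let x = 2.", "then x^2 = 4.", "therefore the answer is 4.", "extra step."]
fake_verifier = lambda cs: 1.0 if "therefore" in cs[-1] else 0.0
kept = flashthink_generate(fake_cot, fake_verifier)
print(len(kept))  # 3 — the trailing chunk is never generated
```

Raising `tau` trades longer traces for more conservative exits, which is the efficiency/accuracy knob described above.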
6. Heavy Thinking Mode and Adaptive Test-Time Scaling
Heavy Thinking Mode enables LFT-2601 to scale test-time compute for especially challenging queries (Team et al., 23 Jan 2026):
- Stage I: Parallel Exploration: $k$ reasoning trajectories are generated in parallel.
- Stage II: Summary & Refinement: A summarizer module aggregates the $k$ trajectories into a final answer via RL-enhanced voting or synthesis.
Resource allocation is tunable: for a compute budget $B$, $k$ parallel chains of length $L$ are generated, so $B \approx kL$ plus the summarization cost, itself on the order of $kL$ input tokens. A higher $B$ enables deeper (longer chains) and/or wider (more chains) parallel search, improving solution robustness and reasoning reliability. The summarizer is itself RL-finetuned for optimal selection/aggregation.
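The budget split and the aggregation step can be sketched as below. The allocation rule and the summarization allowance are illustrative assumptions, and a simple majority vote stands in for the RL-trained summarizer.

```python
# Sketch of Heavy Thinking resource allocation: split a token budget B
# between k parallel chains of length L (reserving a fixed summarization
# allowance), then aggregate answers by majority vote as a stand-in for
# the RL-trained summarizer. All numbers are illustrative.
from collections import Counter

def allocate(budget, k, summary_tokens=512):
    """Return per-chain length L given total budget B and chain count k."""
    return max(0, (budget - summary_tokens) // k)

def aggregate(answers):
    """Majority vote over parallel-chain answers (summarizer stand-in)."""
    return Counter(answers).most_common(1)[0][0]

L = allocate(budget=65536, k=8)
print(L)                                     # per-chain token budget
print(aggregate(["42", "41", "42", "42"]))   # winning answer
```

Holding `budget` fixed, increasing `k` widens the search at the cost of shallower chains; the RL-finetuned summarizer replaces the naive vote in the actual system.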
7. Benchmark Performance and Comparative Results
LFT-2601 demonstrates state-of-the-art open-source results across agentic reasoning, agentic tool use, and tool-integrated reasoning contexts:
| Benchmark | Best OSS | GPT-5 | Claude | LFT-2601 |
|---|---|---|---|---|
| BrowseComp (w/ ctx mgmt) | 73.1 | 65.8 | 65.8 | 73.1 |
| RWSearch | 79.5 | 82.0 | 75.5 | 79.5 |
| τ²-Bench | 88.6 | 98.9 | 88.9 | 88.2 |
| τ²-Bench-Noise | 67.1 | 65.0 | 59.4 | 67.1 |
| VitaBench-Noise | 20.5 | 19.0 | 20.3 | 20.5 |
| AIME-25 (Avg@16, tool reasoning) | – | – | – | 100.0 |
Further, on AIME-25 tool-integrated tasks, average token consumption drops from 19,653 to 6,965 tokens (a 64.5% reduction) without accuracy loss. General QA and code benchmarks indicate robust competitiveness with both open- and closed-weight leaders (Team et al., 23 Jan 2026, Team et al., 23 Sep 2025).
LongCat-Flash-Thinking-2601 establishes a comprehensive paradigm for open-weight agentic reasoning LLMs: combining scalable MoE routing, long-context and reasoning trace optimization, robust RL, and dynamic, noise-tolerant training methodologies, it delivers efficiency, robustness, and broad generalization at scales previously only accessible to closed-weight models (Team et al., 23 Jan 2026, Zhang et al., 30 Dec 2025, Team et al., 23 Sep 2025).