LongCat-Flash-Thinking Transformer MoE
- LongCat-Flash-Thinking is a transformer Mixture-of-Experts model designed to support explicit chain-of-thought reasoning and dynamically scalable inference with long input contexts.
- It employs domain-parallel expert training with reinforcement learning optimization, achieving specialized performance across STEM, coding, formal proofs, and agentic tool-use tasks.
- The integration of LoZA sparse attention enables significant reductions in GPU time and token consumption while maintaining high accuracy on benchmarks such as MMLU-Pro and BeyondAIME.
LongCat-Flash-Thinking refers to a class of large-scale transformer Mixture-of-Experts (MoE) models, exemplified by the LongCat-Flash-Thinking and LongCat-Flash-Thinking-2601 series, that are engineered for explicit, high-efficiency chain-of-thought reasoning, agentic behavior in complex environments, and dynamically scalable inference with practical support for extremely long input contexts. This architecture is characterized by the orchestration of domain-parallel expert training, robust fusion and RL-driven optimization, and the integration of context-efficient sparse attention, delivering best-in-class, open-source performance on formal, agentic, coding, and tool-use benchmarks (Team et al., 23 Sep 2025, Team et al., 23 Jan 2026, Zhang et al., 30 Dec 2025). The following sections detail architectural design, training pipeline, context scaling, advanced inference modes, empirical results, and robustness strategies underpinning LongCat-Flash-Thinking.
1. Model Architecture and MoE Design
LongCat-Flash-Thinking is built on a 560-billion-parameter transformer MoE backbone, with ≈27B active parameters per token. MoE blocks, interleaved with standard transformer layers, feature $E$ experts per layer (typically 64–128), each a two-layer FFN. Gating is performed via a learned projection: for each token input $x$, a gating network computes a score $g_i(x)$ for each of the $E$ experts, and the top-$k$ experts are selected per token. Only the selected experts process the token, and their outputs, weighted by the corresponding gating scores $g_i(x)$, are summed. This design yields a high parameter count without incurring full dense-compute cost: the average number of active FFNs per token, $\mathbb{E}[K_{\mathrm{FFN}}]$, is maintained via adaptive expert bias scheduling and a load-balance loss, resulting in dynamic compute budgets (18.6B–31.3B activated parameters, mean 27B) (Team et al., 1 Sep 2025, Team et al., 23 Sep 2025, Team et al., 23 Jan 2026). Zero-computation experts further allow the model to allocate compute adaptively by routing tokens to identity paths when appropriate.
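The routing described above can be illustrated with a minimal sketch. It assumes softmax gating over a linear projection and uses the raw (not renormalized) scores as mixing weights; the text fixes neither choice. The names `moe_forward`, `gate_weights`, and the use of `None` to mark a zero-computation expert are hypothetical and chosen only for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, gate_weights, experts, k):
    """Route one token vector x through its top-k experts.

    gate_weights: one weight row per expert (the learned projection; assumed linear).
    experts: callables mapping a vector to a vector; None marks a
             zero-computation expert, treated here as the identity path.
    Returns (output vector, selected expert indices, number of real FFN calls).
    """
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    scores = softmax(logits)  # assumption: softmax gating
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    out = [0.0] * len(x)
    ffn_calls = 0
    for i in top:
        if experts[i] is None:
            y = x  # identity path: no FFN compute is spent on this token
        else:
            y = experts[i](x)
            ffn_calls += 1
        # Weight each selected expert's output by its gating score and sum.
        out = [o + scores[i] * yi for o, yi in zip(out, y)]
    return out, top, ffn_calls
```

Because `ffn_calls` varies with how many of the selected experts are identity paths, the per-token compute is dynamic, which is the effect the zero-computation experts are described as providing.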
2. Training Pipeline and Domain-Parallel Optimization
A hallmark of LongCat-Flash-Thinking is its domain-parallel expert training and fusion framework. The process involves:
- Cold-start chain-of-thought (CoT) pretraining on long-form, multi-step reasoning data across STEM, code, formal proofs, and agentic tool-use dialogues. This pretraining employs standard cross-entropy loss over prompt+CoT+answer, without custom regularizers (Team et al., 23 Sep 2025).
- Domain-parallel reinforcement learning (RL): Separate instances of the base model are specialized via RL on disjoint domains (STEM, coding, agentic tool use, and up to 20+ application domains). Each expert is optimized under the GSPO objective, optionally with a diversity regularizer that incentivizes domain specialization in the gating policy (Team et al., 23 Jan 2026).
- Pareto-optimal fusion: Domain-specialized models are merged by summing the normalized RL update deltas from each domain, with sparsity dropout and directional erasure for robust multi-domain generalization. Post-fusion, the unified model is fine-tuned on a mixed-domain RL corpus with the same GSPO objective (Team et al., 23 Sep 2025, Team et al., 23 Jan 2026).
- DORA system: The RL pipeline is orchestrated by Dynamic ORchestration for Asynchronous rollout (DORA), which implements streaming, fully asynchronous PPO with multi-version policies, separating rollout management from training—yielding >3× speedup over synchronous PPO and 1.5× with expert-parallel kernel fusion, deployed on tens of thousands of accelerators (Team et al., 23 Sep 2025, Team et al., 23 Jan 2026).
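The fusion step in the list above can be sketched as follows. This is a toy illustration on flat parameter lists, not the authors' procedure. Three choices are assumptions made here: each domain delta is rescaled to the mean delta norm, dropout rescales surviving coordinates by `1 / (1 - drop_prob)`, and "directional erasure" is read as discarding coordinates whose sign disagrees with the sign of the summed delta. The function name `fuse_domain_deltas` is hypothetical.

```python
import math
import random

def l2_norm(v):
    return math.sqrt(sum(a * a for a in v))

def fuse_domain_deltas(base, domain_params, drop_prob=0.0, seed=0):
    """Merge domain-specialized parameter vectors onto a shared base.

    base: list of floats (parameters before domain RL).
    domain_params: one parameter list per domain-specialized model.
    """
    rng = random.Random(seed)
    deltas = [[p - b for p, b in zip(params, base)] for params in domain_params]
    norms = [l2_norm(d) for d in deltas]
    nonzero = [n for n in norms if n > 0]
    target = sum(nonzero) / len(nonzero) if nonzero else 0.0

    processed = []
    for d, n in zip(deltas, norms):
        # Assumed normalization: give every domain delta the same L2 norm.
        scale = target / n if n > 0 else 0.0
        d = [a * scale for a in d]
        if drop_prob > 0:
            # Sparsity dropout: zero random coordinates, rescale the survivors.
            keep = 1.0 - drop_prob
            d = [a / keep if rng.random() < keep else 0.0 for a in d]
        processed.append(d)

    fused = list(base)
    for j in range(len(base)):
        column = [d[j] for d in processed]
        total = sum(column)
        # Assumed reading of directional erasure: keep only the entries whose
        # sign agrees with the summed update for this coordinate.
        fused[j] += sum(a for a in column if a * total > 0)
    return fused
```

In this sketch the fused model would then be fine-tuned on mixed-domain data, as the text describes; that step is outside the scope of the example.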
3. Context Scaling and Attention Optimization
LongCat-Flash incorporates LongCat ZigZag Attention (LoZA), a sparse attention mechanism that enables efficient context window scaling to 1 million tokens while preserving quality. LoZA partitions inputs into non-overlapping blocks and defines a zigzag sparse mask in which each query attends only to local (7) and sink (1) blocks. The attention cost per token therefore depends on a fixed number of blocks rather than on the full context length, for major savings relative to dense attention (Zhang et al., 30 Dec 2025). Integration into LongCat-Flash proceeds via calibration and retraining: per-layer learned gating scalars identify the layers most tolerant of sparsification, and 50% of attention layers are replaced by LoZA.
This results in LongCat-Flash-Exp, which achieves parity or improvement relative to the full-attention base on MMLU-Pro, GPQA, BBH, GSM8K, HumanEval+, and long-context LongEval, while cutting both prefill time and decode cost at long context. LoZA enables explicit “infinite” working memory for retrieval-augmented generation, long-horizon agentic workflows, and multi-step tool reasoning, with empirical gains reported on MRCR at long context lengths and on BeyondAIME (Zhang et al., 30 Dec 2025).
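To make the sink-plus-local block selection concrete, the sketch below lists which key blocks a query block may attend to and compares the number of keys visited per query against dense causal attention. It assumes a causal mask, that the 7 local blocks include the query's own block, and a block size of 128 tokens; none of these details are stated in the text, and the function names are hypothetical. It does not model the zigzag layout itself, only the per-query block budget.

```python
def loza_attended_blocks(query_block, local_blocks=7, sink_blocks=1):
    """Key-block indices visible to a query block under a causal sink+local mask."""
    # Sink blocks are the first blocks of the sequence (never beyond the query).
    sinks = set(range(min(sink_blocks, query_block + 1)))
    # Local blocks are the most recent ones, ending at the query's own block.
    first_local = max(0, query_block - local_blocks + 1)
    local = set(range(first_local, query_block + 1))
    return sorted(sinks | local)

def keys_visited_by_last_query(seq_len, block_size, sparse=True):
    """Upper bound on keys read by the final query token of the sequence."""
    if not sparse:
        return seq_len  # dense causal attention reads every earlier token
    last_block = (seq_len - 1) // block_size
    return len(loza_attended_blocks(last_block)) * block_size

# Example: a 1M-token context with a hypothetical block size of 128.
dense = keys_visited_by_last_query(1_000_000, 128, sparse=False)
sparse = keys_visited_by_last_query(1_000_000, 128, sparse=True)
print(dense, sparse)
```

Under these assumptions the sparse budget stays at 8 blocks per query regardless of sequence length, while the dense budget grows linearly with it.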
4. Heavy and Flash Thinking Modes
LongCat-Flash-Thinking exposes configurable thinking modes:
- Flash Thinking: A test-time mode enabling explicit chain-of-thought traces up to a token budget—controllable via API parameter (e.g., thinkingBudget on Gemini-2.5-Flash, thinking.type on Seed1.5-VL), allowing step-wise internal deliberation before generating a final answer (Hong et al., 5 Nov 2025). This does not alter network topology or introduce new layers; it is a prompt- or parameter-level switch.
- Heavy Thinking: An advanced inference configuration in LongCat-Flash-Thinking-2601 involving two stages: (I) parallel generation of multiple independent chain-of-thought trajectories, and (II) a summarization stage in which a secondary model consumes all candidate answers and produces a unified final answer (Team et al., 23 Jan 2026). Ring-buffer context management ensures full history retention, and the summary model is RL fine-tuned to maximize correctness while penalizing hallucinations.
Empirically, Heavy Thinking yields accuracy gains of up to 7pp over self-consistency baselines at fixed inference cost on AIME and agentic tool-use tasks, reflecting gains from deep and parallelized exploration in reasoning-intensive scenarios.
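The two-stage structure of Heavy Thinking, and the self-consistency baseline it is compared against, can be sketched as below. The callables `generate`, `summarize`, and `extract_answer` are hypothetical stand-ins for model calls, and the thread pool is only one way to run the trajectories in parallel; this is an illustration of the control flow, not the deployed system.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def heavy_think(prompt, generate, summarize, n_paths=4):
    """Stage I: sample n_paths trajectories in parallel.
    Stage II: pass all of them to a summarizer that returns one answer."""
    with ThreadPoolExecutor(max_workers=n_paths) as pool:
        # pool.map preserves input order, so trajectories[i] came from path i.
        trajectories = list(pool.map(lambda i: generate(prompt, i), range(n_paths)))
    return summarize(prompt, trajectories)

def self_consistency(prompt, generate, extract_answer, n_paths=4):
    """Baseline: majority vote over the final answers of sampled paths."""
    answers = [extract_answer(generate(prompt, i)) for i in range(n_paths)]
    return Counter(answers).most_common(1)[0][0]
```

The difference the sketch highlights is that the summarizer sees the full reasoning traces rather than only the extracted answers, so it can in principle prefer a well-supported minority answer over a majority one.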
5. Empirical Evaluation and Efficiency
LongCat-Flash-Thinking series models achieve state-of-the-art or near-best open-source results on benchmark suites including MATH-500 (99.2% Mean@1), AIME-24 (93.3%), BeyondAIME (+19.0pp over legacy), and MiniF2F-Test (67.6% Pass@1) (Team et al., 23 Sep 2025). In agentic tasks (AIME-25), reasoning efficiency is demonstrated by a roughly 65% reduction in average token consumption (from 19,653 to 6,965 tokens), with accuracy maintained at 90.6% (Team et al., 23 Sep 2025). In clinical multimodal benchmarks, flash thinking yields marginal 0.1–2% gains (peaking in captioning), but at a 4–10× latency cost and with reduced consistency (Hong et al., 5 Nov 2025).
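As a quick check on the efficiency figure, the relative reduction implied by the two token counts quoted above can be computed directly. The helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(before, after):
    """Fraction by which `after` is smaller than `before`."""
    return (before - after) / before

# Average tokens per AIME-25 problem, as quoted in the text.
reduction = relative_reduction(19_653, 6_965)
print(f"{reduction:.1%}")
```

The result is close to two thirds, meaning the tool-augmented setting reaches the same reported accuracy with about one third of the tokens.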
LoZA’s context scaling unlocks long-horizon agentic planning: prefill- and decode-intensive tasks run with reduced GPU time at long context lengths. The combined effect is to support multi-step, persistent context workflows (long-document QA, multi-turn retrieval, agentic simulations) previously infeasible with dense attention.
6. Robustness, Task Construction, and Practical Considerations
Recognizing real-world noise, LongCat-Flash-Thinking-2601 incorporates systematic noise modeling, explicitly injecting instruction ambiguity and tool-execution imperfections during curriculum learning, with a gradually annealed mixture as robustness increases (Team et al., 23 Jan 2026). Performance gaps are minimized under this curriculum, which improves robustness on noisy agentic benchmarks.
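One way to picture an annealed noise mixture is a schedule that interpolates the fraction of perturbed episodes over training. The sketch below is an assumption-laden illustration: the linear schedule, the `start` and `end` ratios, the direction of the anneal, and the two perturbation labels are all hypothetical, since the text does not specify them.

```python
import random

def noise_ratio(step, total_steps, start=0.0, end=0.5):
    """Linearly interpolate the share of noisy episodes from start to end."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac

def sample_episode_kind(step, total_steps, rng, start=0.0, end=0.5):
    """Decide which perturbation, if any, to apply to one training episode."""
    if rng.random() >= noise_ratio(step, total_steps, start, end):
        return "clean"
    # Two noise sources named in the text: unclear instructions, flaky tools.
    return rng.choice(["ambiguous_instruction", "tool_failure"])

# Example: count episode kinds halfway through a hypothetical 1000-step run.
rng = random.Random(0)
kinds = [sample_episode_kind(500, 1000, rng) for _ in range(200)]
print({k: kinds.count(k) for k in sorted(set(kinds))})
```

Swapping `start` and `end` reverses the direction of the anneal, so the same sketch covers either reading of the curriculum.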
Environment scaling employs synthesized tool-dependency graphs with verifiability-preserving expansion, and rollout budgets are dynamically assigned via knapsack optimization on value functions. Heavy agentic domains and long-tailed tasks are oversampled, ensuring robust skill acquisition across 20+ domains and 10,000 environments.
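The knapsack-style budget assignment mentioned above can be sketched as a standard 0/1 knapsack over tasks, where each task has an integer rollout cost and an estimated value. How the values are obtained from the value functions, and whether the real system uses a 0/1 or a fractional formulation, is not stated in the text; the function name `allocate_rollouts` and the task tuples are hypothetical.

```python
def allocate_rollouts(tasks, budget):
    """Choose tasks maximizing total estimated value under a rollout budget.

    tasks: list of (name, cost, value), with cost a non-negative integer
           number of rollouts and value an estimate of training benefit.
    Returns (best_total_value, chosen_task_names).
    """
    # best[b] holds the best (value, names) achievable with total cost <= b.
    best = [(0.0, [])] * (budget + 1)
    for name, cost, value in tasks:
        new_best = list(best)
        for b in range(cost, budget + 1):
            candidate = best[b - cost][0] + value
            if candidate > new_best[b][0]:
                # Build a new list so earlier entries are never mutated.
                new_best[b] = (candidate, best[b - cost][1] + [name])
        best = new_best
    return best[budget]

# Example with hypothetical costs and values.
tasks = [("math", 3, 4.0), ("code", 4, 5.0), ("tool", 2, 3.0)]
print(allocate_rollouts(tasks, budget=5))
```

Oversampling of long-tailed or heavily agentic tasks, as described in the text, could be expressed in this formulation by raising their estimated values.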
Practical deployment is optimized through deterministic kernels, quantization, and Single-Batch-Overlap scheduling; the resulting inference throughput and deployment cost per 1M output tokens are reported in (Team et al., 1 Sep 2025).
7. Limitations and Directions
While flash and heavy thinking unlock explicit and parallelized reasoning, observed accuracy improvements are concentrated in the most complex, open-ended tasks. For routine or closed-ended domains, performance gains are marginal and sometimes offset by increased latency and stochasticity in output. General-purpose models—even with extended CoT or parallel exploration—may lack domain-specific knowledge and precision, suggesting the need for hybrid architectures combining flash/parallel reasoning with external retrieval or knowledge graph augmentation (Hong et al., 5 Nov 2025). Generalizing to other modalities or task classes (e.g., pathology, non-radiology medicine) may require tailored data and evaluation protocols.
The LongCat-Flash-Thinking architecture and methodology illustrate a shift toward agentic, adaptable, and context-scalable reasoning models, establishing a foundation for future advances in open-source, high-efficiency AI systems (Team et al., 1 Sep 2025, Team et al., 23 Sep 2025, Zhang et al., 30 Dec 2025, Team et al., 23 Jan 2026).