LongCat-Flash: Scalable Mixture-of-Experts Models

Updated 3 July 2026
  • LongCat-Flash is a family of large-scale mixture-of-experts models characterized by sparse activation, long-context attention, and agentic reasoning.
  • It employs innovative architectural elements such as shortcut-connected MoE layers and zero-computation experts to optimize computational efficiency and reduce latency.
  • The design supports extensive multimodal and formal reasoning capabilities, enabling breakthroughs in scalable, efficient, and agentic AI research.

LongCat-Flash is a family of large-scale Mixture-of-Experts (MoE) LLMs and foundation models optimized for scalable efficiency, sparse activation, agentic reasoning, and long-context understanding. Originating with a 560-billion-parameter architecture, LongCat-Flash and its derivatives have introduced a set of architectural and system-level innovations—most notably shortcut-connected MoE layers, zero-computation experts (ZCE), long-context attention mechanisms, agentic RL pipelines, and multimodal extensions—that have defined new Pareto frontiers in LLM and agentic model design. This article reviews the core architecture, scaling methodology, agentic enhancements, context extension techniques, and specialized derivatives, situating LongCat-Flash within the current landscape of efficient, agentic, and multimodal foundation models.

1. Architectural Principles and Model Variants

At the core of LongCat-Flash is a decoder-only Transformer whose feed-forward layers use a Mixture-of-Experts design. Each FFN block is replaced by a shortcut-connected MoE (ScMoE) layer that combines a large pool of standard experts with a substantial allocation of zero-computation experts (ZCE). Activation is sparse: typically 12 experts are selected per token, for instance 8 standard FFN experts and 4 ZCEs, so the model keeps a total parameter count of 560B while activating only ≈27B parameters per token on average (Team et al., 1 Sep 2025, Team et al., 31 Oct 2025). The "shortcut" connection lets dense computation overlap with MoE dispatch/combine operations, which reduces latency.

Zero-computation experts give contextually simple tokens a path that adds no FLOPs, so more compute is left for harder tokens and the per-token compute budget can be adjusted at fine granularity. Gating networks, using softmax projections and dynamic bias updates, keep the compute budget stable and the expert load balanced.
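
The routing idea above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`route_token`, `combine`, `update_bias`), the step size, and the sign-based bias rule are all hypothetical, and treating a zero-computation expert as an identity map is an assumption made for the sketch. Selection uses bias-adjusted scores, while the combine weights come from the unbiased softmax.

```python
import math

def route_token(logits, bias, num_ffn_experts, top_k):
    """Pick top_k experts for one token.

    Experts with index >= num_ffn_experts are treated as zero-computation
    experts (an indexing convention assumed for this sketch).
    """
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # The bias only influences which experts are selected, not their weights.
    ranked = sorted(range(len(logits)), key=lambda i: probs[i] + bias[i], reverse=True)
    chosen = ranked[:top_k]
    ffn = [i for i in chosen if i < num_ffn_experts]
    zce = [i for i in chosen if i >= num_ffn_experts]
    return chosen, ffn, zce, probs

def combine(x, chosen, probs, num_ffn_experts, ffn_experts):
    """Weighted sum of expert outputs; a ZCE returns x unchanged (assumed)."""
    out = [0.0] * len(x)
    for i in chosen:
        y = ffn_experts[i](x) if i < num_ffn_experts else x  # ZCE: no extra FLOPs
        out = [o + probs[i] * v for o, v in zip(out, y)]
    return out

def update_bias(bias, zce_fraction, target_fraction, num_ffn_experts, step=0.01):
    """Nudge ZCE biases so the observed ZCE share drifts toward the target.

    A simple sign-based controller, used here only as a stand-in.
    """
    direction = (zce_fraction < target_fraction) - (zce_fraction > target_fraction)
    delta = step * direction
    return [b + delta if i >= num_ffn_experts else b for i, b in enumerate(bias)]
```

Raising the bias of the zero-computation experts makes them win more top-k slots, which lowers the average number of FFN experts executed per token; lowering it does the opposite.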

The LongCat-Flash model suite comprises several notable variants:

  • LongCat-Flash-Base: The original 560B-parameter model, optimized for agentic tasks, reasoning, and coding under sparse activation.
  • LongCat-Flash-Exp: Extends context processing to 1 million tokens via LongCat ZigZag Attention (LoZA).
  • LongCat-Flash-Omni: Adds vision and audio encoders for state-of-the-art open-source multimodal performance, retaining sparse ScMoE backbone and ZCE (Team et al., 31 Oct 2025).
  • LongCat-Flash-Lite: Introduces a new sparsity regime using massive N-gram embeddings, demonstrating superior cost–quality tradeoff compared to MoE at high parameter activation ratios (Liu et al., 29 Jan 2026).
  • LongCat-Flash-Prover: Extends formal reasoning capacity with agentic tool-integrated RL pipelines for Lean4 theorem proving (Wang et al., 22 Mar 2026).

2. Scaling, Stability, and Training Infrastructure

LongCat-Flash employs a comprehensive scaling methodology across data, compute, and infrastructure:

  • Scaling Framework: Hyperparameter transfer using width-scale factors, model-growth stacking (e.g., stacking pre-trained checkpoints twofold to reach desired depth and capacity), and deterministic computation with custom kernels (e.g., deterministic FlashAttention backward, ScatterAdd, MoE permute/unpermute) ensure both training regularization and strict reproducibility.
  • Stability Suite: Control mechanisms include router gradient norm monitoring, activation regularization (hidden z-loss), and Adam optimizer adjustments for large-scale models (ε down to 1e-16).
  • Training Corpus and Regimes: Large-scale pre-training of 20T tokens over 30 days (>98% cluster uptime) is conducted in three main phases: general/coding data, reasoning/coding mid-training, and long-context extension up to 128k tokens. For long-context derivatives, further mid-training on up to 1M-token windows is performed (Team et al., 1 Sep 2025, Zhang et al., 30 Dec 2025).
  • Asynchronous Parallelism: Infrastructure features multi-dimensional parallelism (Expert, Data, Pipeline, Context) and supports model and modality decoupling (critical for multimodal extensions). Checkpoint recovery is highly efficient (<10 min).
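
To see why a very small Adam ε can matter, consider the standard Adam update, whose magnitude is lr · m̂ / (√v̂ + ε). The sketch below assumes a constant gradient so that m̂ equals the gradient and v̂ equals its square; the gradient magnitude 1e-10 is an illustrative value, not a figure from the source, and `adam_step_size` is a hypothetical helper name.

```python
import math

def adam_step_size(grad, eps, lr=1.0):
    """Magnitude of a bias-corrected Adam update when every gradient equals `grad`."""
    m_hat = grad          # first-moment estimate for a constant gradient
    v_hat = grad * grad   # second-moment estimate for a constant gradient
    return lr * m_hat / (math.sqrt(v_hat) + eps)

# Illustrative tiny gradient: with eps = 1e-8 the epsilon term dominates the
# denominator and the update shrinks by about two orders of magnitude,
# while eps = 1e-16 leaves the normalized update close to lr.
small_grad = 1e-10
damped = adam_step_size(small_grad, eps=1e-8)
preserved = adam_step_size(small_grad, eps=1e-16)
```

When ε is comparable to or larger than the gradient scale, Adam loses its per-parameter normalization and behaves more like scaled SGD, which is one plausible reading of why the ε adjustment is listed among the stability measures.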

3. Sparsity, Efficiency, and Latency

The ScMoE backbone, ZCE, and communication-computation overlap yield significant efficiency gains:

  • Compute Budgeting per Token: Only 27B of 560B parameters are activated per token, with further ∼20% total FLOPs reduction due to ZCE (Team et al., 31 Oct 2025).
  • Latency and Throughput: Inference rates exceed 100 tokens/sec/user (BF16, Nvidia H800); the theoretical single-token latency is reduced to 16 ms through ScMoE and custom communication kernels.
  • Cost: Inference costs approximate $0.70 per 1M tokens at high throughput, outperforming parameter-equivalent and larger peer models, including DeepSeek-V3 and Qwen3 (Team et al., 1 Sep 2025).
  • Long Context Scaling: LongCat ZigZag Attention (LoZA) sparsifies 50% of attention layers using lottery-ticket–inspired calibration and blockwise local/global attention. The resulting LongCat-Flash-Exp processes up to 1M tokens with minimal quality loss, 90% kernel cost savings in decode, and 50% prefill speed-up at 256K tokens (Zhang et al., 30 Dec 2025).
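
The blockwise local/global pattern mentioned for LoZA can be illustrated with a small block mask. This is a sketch under assumptions: the source does not give the actual block layout, so the choice of "first few blocks are global, last few blocks are local", the parameter names, and the helper names `block_sparse_mask` and `kept_fraction` are all hypothetical.

```python
def block_sparse_mask(num_blocks, local_blocks, global_blocks):
    """Return mask[q][k] = True if query block q may attend to key block k.

    Assumed layout: causal attention restricted to the first `global_blocks`
    key blocks plus the `local_blocks` most recent key blocks.
    """
    mask = []
    for q in range(num_blocks):
        row = []
        for k in range(num_blocks):
            causal = k <= q
            is_global = k < global_blocks
            is_local = q - k < local_blocks
            row.append(causal and (is_global or is_local))
        mask.append(row)
    return mask

def kept_fraction(mask):
    """Share of causal block pairs that the sparse pattern still computes."""
    kept = sum(1 for row in mask for v in row if v)
    n = len(mask)
    causal_total = n * (n + 1) // 2
    return kept / causal_total
```

With a fixed number of local and global blocks, the work per query block stays constant as the sequence grows, so the kept fraction falls with length; that is the general reason such patterns pay off most at long contexts.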

4. Agentic Reasoning and Reinforcement Learning Pipelines

LongCat-Flash is explicitly engineered for agentic intelligence:

  • Pre-, Mid-, Post-Training Framework: Reasoning-rich pretraining (70% STEM and code), mid-training on synthetic and multi-agent curriculum datasets, and specialized post-training for agentic tool use.
  • Agentic RL: Dynamic ORchestration for Asynchronous rollout (DORA) delivers >3× RL speedup by separating experience making and training pools, enabling multi-version rollouts and policy staleness handling (Team et al., 23 Sep 2025). RL policies are trained asynchronously across up to 32,000 environments/spawned actors, covering search, code, and tool-use domains (Team et al., 23 Jan 2026).
  • Domain-Parallel Training and Fusion: To optimize for diverse domains (STEM, code, agentic), domain experts are trained independently then fused with careful delta normalization, pruning, and dropout (Team et al., 23 Sep 2025, Team et al., 23 Jan 2026).
  • Reasoning Optimization: "Heavy Thinking" mode enables test-time scaling of both reasoning width (parallel chains) and depth (per-chain tokens) under a global token budget, outperforming both self-consistency and chain-of-thought baselines.
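
The width/depth trade-off under a global token budget can be made concrete with a small helper. This is only a sketch of the budgeting arithmetic: the function name `width_depth_options` and the example budget are hypothetical, and how the parallel chains are aggregated is not described in the text above, so it is left out.

```python
def width_depth_options(global_budget, widths):
    """For each candidate width (number of parallel chains), return the
    per-chain token depth that fits within the global token budget."""
    options = []
    for w in widths:
        depth = global_budget // w  # tokens available to each chain
        if depth > 0:
            options.append((w, depth))
    return options

# Example: a hypothetical 32,000-token budget split across 1, 4, or 8 chains.
configs = width_depth_options(32_000, [1, 4, 8])
```

Each returned pair satisfies width × depth ≤ budget, so widening the search necessarily shortens each chain unless the budget itself grows.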

5. Specialized Innovations: N-gram Embeddings and Formal Reasoning

LongCat-Flash-Lite pioneers a new scaling axis:

  • Embedding Scaling Regime: By allocating up to 46% of model capacity (31.4B parameters) to a large N-gram embedding table, LongCat-Flash-Lite shifts sparsity from conditional expert FFNs to direct embedding lookups, achieving 3–4B activated parameters per token (Liu et al., 29 Jan 2026).
  • Pareto Efficiency: Embedding-based sparsity outperforms MoE at parameter activation ratios R≳12–15. Keeping the embedding ≤50% of total parameters is critical; larger fractions exhibit diminishing returns.
  • Downstream Impact: LongCat-Flash-Lite demonstrates improved loss and perplexity versus MoE baselines of equal scale (e.g., PPL 9.8 vs 10.5 on English/Chinese) and large gains on coding and agentic tasks.
  • Formal Proving: LongCat-Flash-Prover incorporates a three-expert architecture for auto-formalization, sketching, and proving in Lean4. Agentic RL is stabilized by Hierarchical Importance Sampling Policy Optimization (HisPO), including gradient masking for kernel discrepancy and staleness. Dedicated legality detection blocks reward hacking, and the model achieves new SOTA pass rates on MiniF2F, ProverBench, and PutnamBench under strict inference budgets (Wang et al., 22 Mar 2026).
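
The embedding-lookup form of sparsity described for LongCat-Flash-Lite can be sketched as a hashed N-gram table. Everything specific here is an assumption for illustration: the rolling hash, the choice to add the N-gram vector to the token vector, and the helper names `ngram_bucket` and `embed` are hypothetical and are not taken from the paper.

```python
def ngram_bucket(token_ids, n, table_size):
    """Hash the n-gram ending at each position into a bucket of the table."""
    buckets = []
    for pos in range(len(token_ids)):
        if pos + 1 < n:
            buckets.append(None)  # not enough left context for a full n-gram
            continue
        h = 0
        for t in token_ids[pos + 1 - n : pos + 1]:
            h = (h * 1_000_003 + t) % table_size  # illustrative polynomial hash
        buckets.append(h)
    return buckets

def embed(token_ids, token_table, ngram_table, n):
    """Token embedding plus the looked-up n-gram embedding.

    The n-gram contribution is a table lookup, so only one row of the large
    table is touched per position regardless of the table's total size.
    """
    buckets = ngram_bucket(token_ids, n, len(ngram_table))
    out = []
    for t, b in zip(token_ids, buckets):
        vec = list(token_table[t])
        if b is not None:
            vec = [x + y for x, y in zip(vec, ngram_table[b])]
        out.append(vec)
    return out
```

Because the table can grow without increasing the per-token work, parameters placed there raise total capacity while leaving the activated parameter count nearly unchanged, which is the sense in which the text describes sparsity shifting from expert FFNs to embedding lookups.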

6. Evaluation Metrics and Benchmark Performance

LongCat-Flash and its derivatives consistently achieve competitive or SOTA results across agentic, coding, reasoning, and multimodal benchmarks:

  • General & Reasoning: MMLU, CEval, GSM8K, BBH, GPQA, DROP, AIME-25 (e.g. 90.6% Mean@32 on AIME-25, 93.7% harmful refusal accuracy).
  • Coding: HumanEval+, MBPP+, LiveCodeBench, MultiPL-E.
  • Agentic Tool Use: τ2-Bench, VitaBench, AceBench, TerminalBench (e.g., 73.7% avg@4 on τ2-Bench).
  • Long-Context: LongEval, LongBenchV2, MRCR, HELMET.
  • Multimodal: MMBench-EN, DocVQA, VideoMME, LibriSpeech.
  • Formal Reasoning: MiniF2F-Test pass@1 of 67.6%, 97.1% pass@72 for formal proofs in Lean4 (LongCat-Flash-Prover).
  • Efficiency: Token efficiency in agentic tasks improved by up to 64.5% (tokens/episode reduced from 19,653 to 6,965) with no loss in accuracy (Team et al., 23 Sep 2025).
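
The metric names above (pass@k, avg@k, Mean@k) are commonly computed as sketched below. The text does not state which estimators were used, so this is an assumption: `pass_at_k` is the widely used unbiased estimator from n samples with c correct, `avg_at_k` is a plain mean over repeated runs, and `relative_reduction` simply rechecks the token-efficiency arithmetic quoted in the list.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def avg_at_k(scores):
    """Mean score over k independent runs of the same task (avg@k / Mean@k)."""
    return sum(scores) / len(scores)

def relative_reduction(before, after):
    """Fractional reduction, e.g. in tokens per episode."""
    return (before - after) / before

# Tokens per episode quoted above: 19,653 before and 6,965 after.
token_saving = relative_reduction(19_653, 6_965)
```

The last value comes out at roughly 0.6456, which matches the quoted figure of about 64.5% up to rounding.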

Table: Representative Downstream Benchmark Results (selection)

| Model | MMLU (%) | HumanEval+ (pass@1, %) | τ2-Bench (avg@4, %) | MiniF2F (pass@1, %) |
| --- | --- | --- | --- | --- |
| LongCat-Flash-Base | ~89-90 | up to 79 | 73.7 | -- |
| LongCat-Flash-Lite (3B act.) | 64.01 | 31.1 | 72.8 | -- |
| LongCat-Flash-Prover | -- | -- | -- | 67.6 (97.1#) |
| LongCat-Flash-Omni (text base) | 86.81 | -- | -- | -- |

# Pass@72 for MiniF2F.

7. Practical Implications, Community Release, and Future Directions

LongCat-Flash models redefine scalable, community-accessible foundation models:

  • Open Source and Research Ecosystem: All weights, code, and APIs are publicly released (Hugging Face, GitHub, interactive demos), facilitating further study in efficient MoE architectures, long-context scaling, multimodal integration, and agentic workflows (Team et al., 1 Sep 2025, Team et al., 31 Oct 2025).
  • Extensibility: LoZA is framework-agnostic and can sparsify any decoder-only LM using multi-expert attention; N-gram embeddings suggest orthogonal scaling axes for sparse models.
  • Deployment Guidance: Kernel and hardware-specific recommendations (e.g., FlashMLA-ETAP on H20/A100), LoZA calibration, and agentic RL recipes are detailed for practitioners (Zhang et al., 30 Dec 2025).
  • Open Challenges: Optimization of per-layer N-gram allocation, hybrid retrieval–embedding integration, multimodal N-gram embeddings, and robustness under extreme noise or very long trajectories remain active research targets (Liu et al., 29 Jan 2026, Team et al., 23 Jan 2026).

LongCat-Flash and its derivative family have advanced the frontiers of scalable, sparse, and agentic neural architectures, underpinning efficient state-of-the-art performance in both unimodal and multimodal domains and serving as a foundation for subsequent research in massive-scale, context-extended, and tool-integrated agentic intelligence.
