LongCat-Flash-Exp: Efficient Long-Context Modeling

Updated 1 January 2026
  • LongCat-Flash-Exp is a suite for efficient long-context language modeling that integrates a Mixture-of-Experts architecture with zero-compute experts and dynamic routing.
  • It leverages LongCat ZigZag Attention (LoZA) for block-sparse, streaming topologies, scaling performance to 1M tokens in agentic reasoning tasks.
  • The framework combines advanced training regimens, overlapping compute strategies, and best practices for optimal throughput and cost efficiency.

LongCat-Flash-Exp refers to a comprehensive suite of experiments and model designs centered on efficient long-context language modeling. It encompasses the evaluation and extension of the LongCat-Flash foundation model, its mid-training modification via LoZA (LongCat ZigZag Attention), and rigorous benchmarking, both as a Mixture-of-Experts (MoE) Transformer (560B–1.2T parameters) and against state-of-the-art linear-attention alternatives (e.g., LAWCAT). The primary emphasis is on agentic reasoning, throughput, and efficient inference at extreme sequence lengths, notably scaling to 1 million tokens with competitive or superior task performance.

1. Model Architecture and Innovations

LongCat-Flash-Exp is based on variants of the LongCat-Flash architecture, a Mixture-of-Experts (MoE) Transformer developed for scalable, cost-effective, and adaptable compute allocation. The core innovations are:

  • Zero-Compute Experts:

Many-token inference exhibits high per-token variability in computational complexity. LongCat-Flash therefore introduces $Z$ zero-compute experts, which simply forward the input for "easy" tokens, alongside $N$ standard FFN experts. A softmax router with adaptive biasing (PID-controlled) ensures that, of the $K$ experts routed per token, a target number $K_e$ are FFN experts and $K - K_e$ are zero-compute experts:

$$\text{MoE}(x_t) = \sum_{i=1}^{N+Z} g_i E_i(x_t), \qquad E_i(x_t) = \begin{cases} \mathrm{FFN}_i(x_t), & i \le N \\ x_t, & N < i \le N+Z \end{cases}$$

The router bias $b$ adapts dynamically so that, on average, $K_e$ FFN experts are activated per token.
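
A minimal sketch of this routing scheme, assuming PyTorch, is shown below; the module name, the softmax-plus-bias selection rule, and the purely proportional (PID-style) bias update are illustrative assumptions rather than the production router.

```python
import torch
import torch.nn as nn

class ZeroExpertRouter(nn.Module):
    """Top-K routing over N FFN experts plus Z zero-compute experts."""

    def __init__(self, d_model, n_ffn=8, n_zero=4, top_k=4, k_e_target=2, gain=1e-3):
        super().__init__()
        self.proj = nn.Linear(d_model, n_ffn + n_zero, bias=False)
        self.register_buffer("bias", torch.zeros(n_ffn + n_zero))
        self.n_ffn, self.top_k = n_ffn, top_k
        self.k_e_target, self.gain = k_e_target, gain

    def forward(self, x):                           # x: [tokens, d_model]
        gates = torch.softmax(self.proj(x), dim=-1)
        # The bias is used only for expert *selection*, not for the gate values.
        _, idx = (gates + self.bias).topk(self.top_k, dim=-1)
        g = gates.gather(-1, idx)                   # g_i for the selected experts
        is_ffn = idx < self.n_ffn                   # zero experts simply return x_t
        # Proportional bias update: if too many FFN experts fire on average,
        # raise the zero-expert bias so future tokens pick more zero experts.
        err = is_ffn.float().sum(-1).mean() - self.k_e_target
        self.bias[self.n_ffn:] += self.gain * err
        return g, idx, is_ffn
```

Given the selected experts, the MoE output follows the equation above: FFN experts contribute $g_i\,\mathrm{FFN}_i(x_t)$, while zero-compute experts contribute $g_i\,x_t$, so "easy" tokens incur almost no additional compute.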

  • Shortcut-Connected MoE (ScMoE):

The model introduces cross-layer shortcuts, allowing communication (dispatch/combine) and computation between experts and dense blocks to be efficiently overlapped, resulting in a significant reduction in time-per-output token (TPOT). For instance, TPOT is 16 ms in LongCat-Flash (SBO), versus 30 ms in DeepSeek V3 (TBO) (Team et al., 1 Sep 2025).
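
The overlap can be pictured with the hedged sketch below: the expert-dispatch all-to-all is launched asynchronously and the dense shortcut branch is computed while the communication is in flight. It assumes an initialized NCCL (or MPI) process group and pre-permuted token buffers; the function names and the single-shot overlap are illustrative, not the actual SBO schedule.

```python
import torch
import torch.distributed as dist

def sc_moe_block(x, dense_ffn, expert_ffn, routed):
    """x: local hidden states; routed: tokens already permuted for expert parallelism."""
    recv = torch.empty_like(routed)
    # Launch the dispatch all-to-all without blocking (requires an NCCL/MPI backend).
    work = dist.all_to_all_single(recv, routed, async_op=True)
    shortcut = dense_ffn(x)          # dense shortcut branch overlaps the communication
    work.wait()                      # dispatch finished; now run the local experts
    expert_out = expert_ffn(recv)
    # A symmetric combine all-to-all (omitted here) would likewise be overlapped
    # with the next block's dense computation.
    return shortcut + expert_out
```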

  • Layer and Width Scaling:

To achieve stable scaling to 560B–1.2T parameters, the model leverages a scheme where proxy hyperparameters are transferred to the final model via "Adam LR Full Align" rules, and layer stacking (half-depth student checkpoint duplicated) optimizes convergence.
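
The layer-stacking half of this scheme can be sketched as follows, assuming a flat PyTorch-style state_dict with `layers.<i>.` key prefixes; the key naming and the particular duplication order are assumptions, since the text does not specify the exact interleaving.

```python
import re

def stack_layers(half_ckpt: dict, n_half_layers: int) -> dict:
    """Duplicate a half-depth checkpoint's blocks to initialize a 2x-deep model."""
    full = {}
    for name, tensor in half_ckpt.items():
        m = re.match(r"layers\.(\d+)\.(.*)", name)
        if m is None:                        # embeddings, final norm, LM head, ...
            full[name] = tensor
            continue
        i, rest = int(m.group(1)), m.group(2)
        # Block i of the student initializes blocks i and i + n_half_layers.
        full[f"layers.{i}.{rest}"] = tensor
        full[f"layers.{i + n_half_layers}.{rest}"] = tensor.clone()
    return full
```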

  • Stability Suite:

The training pipeline includes mechanisms such as router-vs-LM gradient balancing ($R_g < 0.1$), a hidden $z$-loss to control activation spikes, and an extremely small Adam $\epsilon$ ($10^{-16}$) to manage adaptive step sizes at scale.
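
For concreteness, a minimal sketch of a $z$-loss-style penalty is given below, assuming PyTorch; the coefficient and the choice to apply it to hidden states (per the "hidden $z$-loss" naming) are assumptions, with the classic PaLM-style $z$-loss applying the same penalty to output logits.

```python
import torch

def hidden_z_loss(h: torch.Tensor, coeff: float = 1e-4) -> torch.Tensor:
    """Penalize large log-sum-exp values to damp rare activation spikes.

    h: [batch, seq, d] hidden states (or output logits for the classic z-loss).
    """
    z = torch.logsumexp(h, dim=-1)      # log-partition per position
    return coeff * (z ** 2).mean()      # added to the LM loss
```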

2. LongCat ZigZag Attention (LoZA) and Sparse Extension

LongCat-Flash-Exp incorporates LoZA for efficient sparse attention at extreme context lengths (Zhang et al., 30 Dec 2025):

  • Sparse Block Attention Pattern:

Each query attends to $s$ "sink blocks" (coarse-grained, e.g., one global block per $M$ tokens) and $l$ local blocks of size $b$; the total number of attended blocks per token is $S = s + l$. The block-sparse attention mask $M_{i,j}$ is defined as

$$M_{i,j} = \begin{cases} 0, & \text{if } \lfloor j/b \rfloor \in B_p \\ -\infty, & \text{otherwise} \end{cases}$$

where $B_p$ is the set of attended block indices for the block $p$ containing query $i$.

This forms a streaming-sparse attention topology.
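
A hedged sketch of this mask in PyTorch follows; it simplifies the sink set to the first $s$ blocks (the text describes coarser global sinks spaced every $M$ tokens), and the defaults for $b$, $s$, and $l$ are illustrative.

```python
import torch

def loza_block_mask(seq_len: int, b: int = 128, s: int = 1, l: int = 7) -> torch.Tensor:
    """Additive (0 / -inf) streaming block-sparse causal attention mask."""
    n_blocks = (seq_len + b - 1) // b
    allowed = torch.zeros(n_blocks, n_blocks, dtype=torch.bool)
    for p in range(n_blocks):
        allowed[p, : min(s, p + 1)] = True                 # sink blocks
        allowed[p, max(0, p - l + 1) : p + 1] = True       # local window
    blk = torch.arange(seq_len) // b                       # block index of each token
    token_allowed = allowed[blk][:, blk]                   # expand to token level
    causal = torch.arange(seq_len)[:, None] >= torch.arange(seq_len)[None, :]
    mask = torch.full((seq_len, seq_len), float("-inf"))
    mask[token_allowed & causal] = 0.0
    return mask                                            # added to attention scores
```

The resulting mask is added to the raw attention scores before the softmax, exactly as in the definition of $M_{i,j}$ above.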

  • Lottery-Ticket Layer Calibration:

During calibration, all model weights are frozen and a learnable scalar $\alpha_i$ interpolates between the full-attention and block-sparse outputs in each MLA layer. After optimizing $\{\alpha_i\}$ on out-of-distribution calibration data, the 50% of layers with the lowest $\alpha_i$ are converted to pure block-sparse attention. The resulting model undergoes further mid-training to recover any remaining performance gap.
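
A minimal sketch of the calibration wrapper, assuming PyTorch, is shown below; the sigmoid parameterization and the convention that $\alpha_i$ weights the full-attention branch (so a low $\alpha_i$ marks a layer as safe to sparsify) are assumptions consistent with the selection rule above.

```python
import torch
import torch.nn as nn

class CalibratedAttention(nn.Module):
    """Interpolates a frozen layer between full and block-sparse attention."""

    def __init__(self, full_attn: nn.Module, sparse_attn: nn.Module):
        super().__init__()
        self.full_attn, self.sparse_attn = full_attn, sparse_attn
        for p in self.parameters():          # freeze all attention weights
            p.requires_grad_(False)
        self.alpha = nn.Parameter(torch.tensor(0.0))   # the only trainable scalar

    def forward(self, x):
        a = torch.sigmoid(self.alpha)        # alpha_i in (0, 1)
        # A low a means the sparse branch already reproduces the layer's output,
        # so the layer is a candidate for conversion to pure block-sparse attention.
        return a * self.full_attn(x) + (1.0 - a) * self.sparse_attn(x)
```

After optimizing only the $\alpha_i$ on the calibration set, the half of the layers with the lowest learned $\alpha_i$ keep just the sparse branch.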

  • 1M Token Contexts and YaRN Extrapolation:

A staged context expansion (32K → 128K → 256K) is followed by YaRN-based scaling of the rotary position embeddings (with the accompanying attention-temperature adjustment) to extrapolate generalization to 1M tokens.
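
The text does not give the exact extension recipe; the sketch below follows the public YaRN formulation (per-dimension "NTK-by-parts" frequency interpolation plus a mild attention-logit temperature), with the constants, dimensions, and scale factor as illustrative assumptions rather than the LongCat settings.

```python
import math
import torch

def yarn_inv_freq(dim: int = 128, base: float = 1e4, orig_ctx: int = 256_000,
                  scale: float = 4.0, beta_slow: float = 1.0, beta_fast: float = 32.0):
    """Per-dimension interpolated RoPE inverse frequencies (YaRN-style)."""
    inv_freq = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    rotations = orig_ctx * inv_freq / (2 * math.pi)    # full turns over the original window
    ramp = ((rotations - beta_slow) / (beta_fast - beta_slow)).clamp(0.0, 1.0)
    # ramp = 1: high-frequency dims left untouched; ramp = 0: fully interpolated by `scale`.
    return inv_freq * (ramp + (1.0 - ramp) / scale)

# YaRN additionally scales attention logits by a context-dependent temperature.
attn_scale = 0.1 * math.log(4.0) + 1.0        # for a 4x context extension
```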

3. Training Regimen and Data Mixture

  • General Pretraining:
    • Data decontamination: test-set overlap removal (≥13-gram matches) and semantic filtering (BGE-m3, similarity threshold 0.7–0.9); a minimal n-gram check is sketched after this list.
    • Stage-2/3 target mix: 70% STEM/code (stage 2), 25% long-context data (stage 3).
  • Long Context and Agentic Data:

Mid- and post-training focuses on multi-turn reasoning, agentic tool-use, and lengthy context documents (up to 1M).

  • Curriculum Transfer and Layer Stacking:

A half-depth checkpoint is duplicated to form full-depth initialization, preserving optimizer state and accelerating convergence.
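
The n-gram side of the decontamination step referenced in the pretraining item above can be sketched as follows; whitespace tokenization and the helper names (`ngrams`, `is_contaminated`) are simplifying assumptions, and the semantic BGE-m3 filter is not shown.

```python
def ngrams(text: str, n: int = 13) -> set:
    """All n-grams of a whitespace-tokenized document."""
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(doc: str, test_ngrams: set, n: int = 13) -> bool:
    """Flag a training document that shares any >=13-gram with the test sets."""
    return not ngrams(doc, n).isdisjoint(test_ngrams)

# test_ngrams would be built once from every benchmark test example, e.g.:
# test_ngrams = set().union(*(ngrams(ex, 13) for ex in test_examples))
```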

4. Empirical Performance and Benchmarks

The following tables summarize key throughput and task results obtained in LongCat-Flash-Exp (Team et al., 1 Sep 2025, Zhang et al., 30 Dec 2025):

Inference Throughput and Cost (LongCat-Flash):

| Precision | Context (tokens) | GPUs | TPS/user | Cost ($/M tokens) |
|---|---|---|---|---|
| bf16 | 5,000 | 128 | 100.5 | 0.7 |
| fp8 | 8,192 | 128 | 33.8 | 0.7 |

End-to-End Prefill/Decode on LoZA-augmented Model (H20 cluster):

| Mode | LongCat-Flash | LongCat-Flash-Exp |
|---|---|---|
| Prefill | 12 tokens/ms | 25 tokens/ms |
| Decode | 0.8 tokens/ms | 1.1 tokens/ms |

Selected Leaderboard Scores—LongCat-Flash (27B active):

| Benchmark | DeepSeek | Qwen3 | Kimi-K2 | LongCat-Flash |
|---|---|---|---|---|
| MMLU (%) | 90.96 | 90.23 | 89.86 | 89.71 |
| ArenaHard-V2 | 84.10 | 88.20 | 85.70 | 86.50 |
| TerminalBench | 31.30 | 17.28 | 25.93 | 39.51 |
| τ²-Bench avg@4 | 49.13 | 43.01 | 64.17 | 67.65 |
| VitaBench | 20.3 | 8.5 | 18.2 | 24.3 |

LongBench and Agentic Benchmarks (LoZA extension):

  • LongEval: Exp-Base 99.3% (vs Flash-Base 95.7%)
  • SWE-Bench: 63.2 (vs 60.4)
  • Terminal-Bench: 42.5 (vs 39.5)
  • MRCR AUC: +5% over Qwen-3 (at 1M tokens)
  • HELMET: 64.7% (vs 59.1% baseline)
  • Ablation: hand-crafted interleaved sparsity on LongEval drops from 95.7 → 54.1; calibrated LoZA sparsity retains 89.6.

5. Linear Attention Alternatives and Comparative Analysis

LAWCAT provides an O(N) kernel with favorable properties relative to quadratic baselines (Liu et al., 22 Sep 2025):

After a Conv1D smoothing over queries and keys (kernel size 4), outputs are fed into a normalized, gated linear recurrence:

$$S_t = G_t \odot S_{t-1} + \dot{k}_t^{\top} \dot{v}_t$$

with explicit normalization critical for long-range stability.
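
A hedged, step-by-step sketch of this recurrence follows (a parallel-scan kernel would be used in practice); the elementwise gate parameterization, the simple averaging Conv1D kernel, and the running normalizer are assumptions beyond what the excerpt states.

```python
import torch
import torch.nn.functional as F

def lawcat_recurrence(q, k, v, gate, eps: float = 1e-6):
    """q, k, v: [T, d]; gate: [T, d] elementwise decay values in (0, 1)."""
    T, d = q.shape
    w = torch.full((d, 1, 4), 0.25)                     # causal Conv1D smoothing, kernel size 4
    def smooth(x):                                      # [T, d] -> [T, d]
        return F.conv1d(F.pad(x.t().unsqueeze(0), (3, 0)), w, groups=d).squeeze(0).t()
    q, k = smooth(q), smooth(k)
    S = torch.zeros(d, d)                               # matrix-valued state
    n = torch.zeros(d)                                  # normalizer state
    out = []
    for t in range(T):
        S = gate[t].unsqueeze(-1) * S + torch.outer(k[t], v[t])   # S_t = G_t * S_{t-1} + k_t^T v_t
        n = gate[t] * n + k[t]
        out.append((q[t] @ S) / (torch.dot(q[t], n) + eps))       # explicit normalization
    return torch.stack(out)                             # [T, d]
```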

  • Distillation and Generalization:
    • Passkey retrieval: LAWCAT maintains 91–95% accuracy at context lengths up to 16–22K tokens, whereas FlashAttention-2 performance collapses beyond 8K.
    • Throughput: linear attention kernels surpass FA2 in tokens/sec at sequence lengths >8K; memory remains O(N).
  • Edge and Streaming Suitability:

Low memory footprint (<12GB at 32K), linear latency, and seamless streaming favor deployment on single GPUs or in edge contexts.

6. Analysis, Limitations, and Best Practices

  • Dynamic Compute Scaling:

Zero-compute experts and ScMoE enable substantial resource savings. Only "hard" tokens invoke FFN experts, lowering average active parameter count to ∼27B for a 560B model.

  • Scalability and Stability:

Layer stacking, hyperparameter transfer, and the stability suite enable reproducible scaling with high hardware availability (98.48%). Tuning of Adam's $\varepsilon$, the $z$-loss coefficient, and the router balance parameter $\alpha$ is required at massive scale.

  • Long-Context Generalization:

LoZA’s layer-level calibration, streaming block-sparse attention, and selective sparsification preserve or outperform full-attention baselines across both retrieval-augmented (prefill) and decode-intensive (tool-integrated) tasks up to 1M tokens.

  • Limitations:

Slightly reduced performance is observed on some extreme long-context structured-reasoning tasks (e.g., GraphWalks-128K), and the engineering burden (custom kernels, routing-loss tuning, distributed infrastructure) is nontrivial.

  • Best Practices:
  1. Implement zero-expert routing with a PID-controlled router bias.
  2. Calibrate router-vs-LM gradients to $R_g < 0.1$.
  3. Control rare activation spikes with the hidden $z$-loss.
  4. Use width scaling to accelerate hyperparameter search.
  5. Stack half-depth checkpoints for fast, reliable initialization.
  6. Set Adam $\varepsilon$ below the observed gradient RMS.
  7. Invest in overlapping communication and computation (ScMoE/SBO) for maximal throughput.
  8. Deploy speculative decoding at inference time with MTP heads and SBO.

7. Implications and Context in Long-Context Language Modeling

LongCat-Flash-Exp exemplifies a convergent trend in large-scale language modeling: fusing Mixture-of-Experts, dynamic compute routing, and sparse or linear attention paradigms. The combination of LoZA-based sparse extension and LAWCAT-style linearization demonstrates that the quadratic barrier to context scaling can be mitigated via both architectural and training-process innovations.

The reported suite also provides an empirical roadmap: integrate data decontamination, robust agentic/long-form training, and efficient, reproducible hardware and software pipelines. As a result, LongCat-Flash-Exp provides a foundational technical reference for the design, scaling, and evaluation of long-context agentic LLMs targeting both cloud and edge deployment scenarios (Team et al., 1 Sep 2025, Zhang et al., 30 Dec 2025, Liu et al., 22 Sep 2025).
