LongCat-Flash-Exp: Efficient Long-Context Modeling

Updated 1 January 2026
  • LongCat-Flash-Exp is a suite for efficient long-context language modeling that integrates a Mixture-of-Experts architecture with zero-compute experts and dynamic routing.
  • It leverages LongCat ZigZag Attention (LoZA) for block-sparse, streaming topologies, scaling performance to 1M tokens in agentic reasoning tasks.
  • The framework combines advanced training regimens, overlapping compute strategies, and best practices for optimal throughput and cost efficiency.

LongCat-Flash-Exp refers to a comprehensive suite of experiments and model designs centered on efficient long-context language modeling. It encompasses the evaluation and extension of the LongCat-Flash foundation model, its mid-training modification via LoZA (LongCat ZigZag Attention), and rigorous benchmarking, both as a Mixture-of-Experts (MoE) Transformer (560B–1.2T parameters) and against state-of-the-art linear-attention alternatives (e.g., LAWCAT). The primary emphasis is on agentic reasoning, throughput, and efficient inference at extreme sequence lengths, notably scaling to 1 million tokens with competitive or superior task performance.

1. Model Architecture and Innovations

LongCat-Flash-Exp is based on variants of the LongCat-Flash architecture, a Mixture-of-Experts (MoE) Transformer developed for scalable, cost-effective, and adaptable compute allocation. The core innovations are:

  • Zero-Compute Experts:

Many-token inference exhibits high per-token variability in computational complexity. LongCat-Flash therefore introduces $Z$ zero-compute experts, which simply forward the input for "easy" tokens, alongside $N$ standard FFN experts. A softmax router with adaptive biasing (PID-controlled) ensures that, of the $K$ experts routed per token, a target number $K_e$ are FFN experts and $K - K_e$ are zero-compute experts:

$$\text{MoE}(x_t) = \sum_{i=1}^{N+Z} g_i E_i(x_t), \qquad E_i(x_t) = \begin{cases} \mathrm{FFN}_i(x_t), & i \le N \\ x_t, & N < i \le N+Z \end{cases}$$

The router bias $b$ adapts dynamically so that, on average, $K_e$ FFN experts are activated per token.
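
A minimal sketch of this routing scheme, assuming PyTorch, is shown below; the module name, the softmax-plus-bias selection rule, and the purely proportional (PID-style) bias update are illustrative assumptions rather than the production router.

```python
import torch
import torch.nn as nn

class ZeroExpertRouter(nn.Module):
    """Top-K routing over N FFN experts plus Z zero-compute experts."""

    def __init__(self, d_model, n_ffn=8, n_zero=4, top_k=4, k_e_target=2, gain=1e-3):
        super().__init__()
        self.proj = nn.Linear(d_model, n_ffn + n_zero, bias=False)
        self.register_buffer("bias", torch.zeros(n_ffn + n_zero))
        self.n_ffn, self.top_k = n_ffn, top_k
        self.k_e_target, self.gain = k_e_target, gain

    def forward(self, x):                           # x: [tokens, d_model]
        gates = torch.softmax(self.proj(x), dim=-1)
        # The bias is used only for expert *selection*, not for the gate values.
        _, idx = (gates + self.bias).topk(self.top_k, dim=-1)
        g = gates.gather(-1, idx)                   # g_i for the selected experts
        is_ffn = idx < self.n_ffn                   # zero experts simply return x_t
        # Proportional bias update: if too many FFN experts fire on average,
        # raise the zero-expert bias so future tokens pick more zero experts.
        err = is_ffn.float().sum(-1).mean() - self.k_e_target
        self.bias[self.n_ffn:] += self.gain * err
        return g, idx, is_ffn
```

Given the selected experts, the MoE output follows the equation above: FFN experts contribute $g_i\,\mathrm{FFN}_i(x_t)$, while zero-compute experts contribute $g_i\,x_t$, so "easy" tokens incur almost no additional compute.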

  • Shortcut-Connected MoE (ScMoE):

The model introduces cross-layer shortcuts, allowing communication (dispatch/combine) and computation between experts and dense blocks to be efficiently overlapped, resulting in a significant reduction in time-per-output token (TPOT). For instance, TPOT is 16 ms in LongCat-Flash (SBO), versus 30 ms in DeepSeek V3 (TBO) (Team et al., 1 Sep 2025).
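
The overlap can be pictured with the hedged sketch below: the expert-dispatch all-to-all is launched asynchronously and the dense shortcut branch is computed while the communication is in flight. It assumes an initialized NCCL (or MPI) process group and pre-permuted token buffers; the function names and the single-shot overlap are illustrative, not the actual SBO schedule.

```python
import torch
import torch.distributed as dist

def sc_moe_block(x, dense_ffn, expert_ffn, routed):
    """x: local hidden states; routed: tokens already permuted for expert parallelism."""
    recv = torch.empty_like(routed)
    # Launch the dispatch all-to-all without blocking (requires an NCCL/MPI backend).
    work = dist.all_to_all_single(recv, routed, async_op=True)
    shortcut = dense_ffn(x)          # dense shortcut branch overlaps the communication
    work.wait()                      # dispatch finished; now run the local experts
    expert_out = expert_ffn(recv)
    # A symmetric combine all-to-all (omitted here) would likewise be overlapped
    # with the next block's dense computation.
    return shortcut + expert_out
```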

  • Layer and Width Scaling:

To achieve stable scaling to 560B–1.2T parameters, the model leverages a scheme where proxy hyperparameters are transferred to the final model via "Adam LR Full Align" rules, and layer stacking (half-depth student checkpoint duplicated) optimizes convergence.
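
The layer-stacking half of this scheme can be sketched as follows, assuming a flat PyTorch-style state_dict with `layers.<i>.` key prefixes; the key naming and the particular duplication order are assumptions, since the text does not specify the exact interleaving.

```python
import re

def stack_layers(half_ckpt: dict, n_half_layers: int) -> dict:
    """Duplicate a half-depth checkpoint's blocks to initialize a 2x-deep model."""
    full = {}
    for name, tensor in half_ckpt.items():
        m = re.match(r"layers\.(\d+)\.(.*)", name)
        if m is None:                        # embeddings, final norm, LM head, ...
            full[name] = tensor
            continue
        i, rest = int(m.group(1)), m.group(2)
        # Block i of the student initializes blocks i and i + n_half_layers.
        full[f"layers.{i}.{rest}"] = tensor
        full[f"layers.{i + n_half_layers}.{rest}"] = tensor.clone()
    return full
```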

  • Stability Suite:

The training pipeline includes mechanisms such as router-vs-LM gradient balancing ($R_g < 0.1$), a hidden $z$-loss to control activation spikes, and an extremely small Adam $\epsilon$ ($10^{-16}$) to manage adaptive step sizes at scale.
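
For concreteness, a minimal sketch of a $z$-loss-style penalty is given below, assuming PyTorch; the coefficient and the choice to apply it to hidden states (per the "hidden $z$-loss" naming) are assumptions, with the classic PaLM-style $z$-loss applying the same penalty to output logits.

```python
import torch

def hidden_z_loss(h: torch.Tensor, coeff: float = 1e-4) -> torch.Tensor:
    """Penalize large log-sum-exp values to damp rare activation spikes.

    h: [batch, seq, d] hidden states (or output logits for the classic z-loss).
    """
    z = torch.logsumexp(h, dim=-1)      # log-partition per position
    return coeff * (z ** 2).mean()      # added to the LM loss
```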

2. LongCat ZigZag Attention (LoZA) and Sparse Extension

LongCat-Flash-Exp incorporates LoZA for efficient sparse attention at extreme context lengths (Zhang et al., 30 Dec 2025):

  • Sparse Block Attention Pattern:

Each query attends to $s$ "sink blocks" (coarse-grained, e.g., one global block per $M$ tokens) and $l$ local blocks of size $b$; the total number of attended blocks per token is $S = s + l$. The block-sparse attention mask $M_{i,j}$ is defined as

$$M_{i,j} = \begin{cases} 0, & \text{if } \lfloor j/b \rfloor \in B_p \\ -\infty, & \text{otherwise} \end{cases}$$

where $B_p$ is the set of attended block indices for the block $p$ containing query $i$.

This forms a streaming-sparse attention topology.
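
A hedged sketch of this mask in PyTorch follows; it simplifies the sink set to the first $s$ blocks (the text describes coarser global sinks spaced every $M$ tokens), and the defaults for $b$, $s$, and $l$ are illustrative.

```python
import torch

def loza_block_mask(seq_len: int, b: int = 128, s: int = 1, l: int = 7) -> torch.Tensor:
    """Additive (0 / -inf) streaming block-sparse causal attention mask."""
    n_blocks = (seq_len + b - 1) // b
    allowed = torch.zeros(n_blocks, n_blocks, dtype=torch.bool)
    for p in range(n_blocks):
        allowed[p, : min(s, p + 1)] = True                 # sink blocks
        allowed[p, max(0, p - l + 1) : p + 1] = True       # local window
    blk = torch.arange(seq_len) // b                       # block index of each token
    token_allowed = allowed[blk][:, blk]                   # expand to token level
    causal = torch.arange(seq_len)[:, None] >= torch.arange(seq_len)[None, :]
    mask = torch.full((seq_len, seq_len), float("-inf"))
    mask[token_allowed & causal] = 0.0
    return mask                                            # added to attention scores
```

The resulting mask is added to the raw attention scores before the softmax, exactly as in the definition of $M_{i,j}$ above.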

  • Lottery-Ticket Layer Calibration:

During calibration, all model weights are frozen and a learnable scalar $\alpha_i$ interpolates between the full-attention and block-sparse outputs in each MLA layer. After optimizing $\{\alpha_i\}$ on out-of-distribution calibration data, the 50% of layers with the lowest $\alpha_i$ are converted to pure block-sparse attention. The resulting model undergoes further mid-training to recover any remaining performance gap.
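
A minimal sketch of the calibration wrapper, assuming PyTorch, is shown below; the sigmoid parameterization and the convention that $\alpha_i$ weights the full-attention branch (so a low $\alpha_i$ marks a layer as safe to sparsify) are assumptions consistent with the selection rule above.

```python
import torch
import torch.nn as nn

class CalibratedAttention(nn.Module):
    """Interpolates a frozen layer between full and block-sparse attention."""

    def __init__(self, full_attn: nn.Module, sparse_attn: nn.Module):
        super().__init__()
        self.full_attn, self.sparse_attn = full_attn, sparse_attn
        for p in self.parameters():          # freeze all attention weights
            p.requires_grad_(False)
        self.alpha = nn.Parameter(torch.tensor(0.0))   # the only trainable scalar

    def forward(self, x):
        a = torch.sigmoid(self.alpha)        # alpha_i in (0, 1)
        # A low a means the sparse branch already reproduces the layer's output,
        # so the layer is a candidate for conversion to pure block-sparse attention.
        return a * self.full_attn(x) + (1.0 - a) * self.sparse_attn(x)
```

After optimizing only the $\alpha_i$ on the calibration set, the half of the layers with the lowest learned $\alpha_i$ keep just the sparse branch.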

  • 1M Token Contexts and YaRN Extrapolation:

A staged context expansion (32K → 128K → 256K) is followed by YaRN-based scaling of the rotary position embeddings (with the accompanying attention-temperature adjustment) to extrapolate generalization to 1M tokens.
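
The text does not give the exact extension recipe; the sketch below follows the public YaRN formulation (per-dimension "NTK-by-parts" frequency interpolation plus a mild attention-logit temperature), with the constants, dimensions, and scale factor as illustrative assumptions rather than the LongCat settings.

```python
import math
import torch

def yarn_inv_freq(dim: int = 128, base: float = 1e4, orig_ctx: int = 256_000,
                  scale: float = 4.0, beta_slow: float = 1.0, beta_fast: float = 32.0):
    """Per-dimension interpolated RoPE inverse frequencies (YaRN-style)."""
    inv_freq = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    rotations = orig_ctx * inv_freq / (2 * math.pi)    # full turns over the original window
    ramp = ((rotations - beta_slow) / (beta_fast - beta_slow)).clamp(0.0, 1.0)
    # ramp = 1: high-frequency dims left untouched; ramp = 0: fully interpolated by `scale`.
    return inv_freq * (ramp + (1.0 - ramp) / scale)

# YaRN additionally scales attention logits by a context-dependent temperature.
attn_scale = 0.1 * math.log(4.0) + 1.0        # for a 4x context extension
```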

3. Training Regimen and Data Mixture

  • General Pretraining:
    • Data decontamination: test-set overlap removal (≥13-gram matches) and semantic filtering (BGE-m3, similarity threshold 0.7–0.9); a minimal n-gram check is sketched after this list.
    • Stage-2/3 target mix: 70% STEM/code (stage 2), 25% long-context data (stage 3).
  • Long Context and Agentic Data:

Mid- and post-training focuses on multi-turn reasoning, agentic tool-use, and lengthy context documents (up to 1M).

  • Curriculum Transfer and Layer Stacking:

A half-depth checkpoint is duplicated to form full-depth initialization, preserving optimizer state and accelerating convergence.
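
The n-gram side of the decontamination step referenced in the pretraining item above can be sketched as follows; whitespace tokenization and the helper names (`ngrams`, `is_contaminated`) are simplifying assumptions, and the semantic BGE-m3 filter is not shown.

```python
def ngrams(text: str, n: int = 13) -> set:
    """All n-grams of a whitespace-tokenized document."""
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(doc: str, test_ngrams: set, n: int = 13) -> bool:
    """Flag a training document that shares any >=13-gram with the test sets."""
    return not ngrams(doc, n).isdisjoint(test_ngrams)

# test_ngrams would be built once from every benchmark test example, e.g.:
# test_ngrams = set().union(*(ngrams(ex, 13) for ex in test_examples))
```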

4. Empirical Performance and Benchmarks

The following tables summarize key throughput and task results obtained in LongCat-Flash-Exp (Team et al., 1 Sep 2025, Zhang et al., 30 Dec 2025):

Inference Throughput and Cost (LongCat-Flash):

| Precision | Context (tokens) | GPUs | TPS/user | Cost ($/M tokens) |
|---|---|---|---|---|
| bf16 | 5,000 | 128 | 100.5 | 0.7 |
| fp8 | 8,192 | 128 | 33.8 | 0.7 |

End-to-End Prefill/Decode on LoZA-augmented Model (H20 cluster):

| Mode | LongCat-Flash | LongCat-Flash-Exp |
|---|---|---|
| Prefill | 12 tokens/ms | 25 tokens/ms |
| Decode | 0.8 tokens/ms | 1.1 tokens/ms |

Selected Leaderboard Scores—LongCat-Flash (27B active):

| Benchmark | DeepSeek | Qwen3 | Kimi-K2 | LongCat-Flash |
|---|---|---|---|---|
| MMLU (%) | 90.96 | 90.23 | 89.86 | 89.71 |
| ArenaHard-V2 | 84.10 | 88.20 | 85.70 | 86.50 |
| TerminalBench | 31.30 | 17.28 | 25.93 | 39.51 |
| τ²-Bench avg@4 | 49.13 | 43.01 | 64.17 | 67.65 |
| VitaBench | 20.3 | 8.5 | 18.2 | 24.3 |

LongBench and Agentic Benchmarks (LoZA extension):

  • LongEval: Exp-Base 99.3% (vs Flash-Base 95.7%)
  • SWE-Bench: 63.2 (vs 60.4)
  • Terminal-Bench: 42.5 (vs 39.5)
  • MRCR AUC: +5% over Qwen-3 (at 1M tokens)
  • HELMET: 64.7% (vs 59.1% baseline)
  • Ablation: hand-crafted interleaved sparsity on LongEval drops from 95.7 → 54.1; calibrated LoZA sparsity retains 89.6.

5. Linear Attention Alternatives and Comparative Analysis

LAWCAT provides an O(N) kernel with favorable properties relative to quadratic baselines (Liu et al., 22 Sep 2025):

After a Conv1D smoothing over queries and keys (kernel size 4), outputs are fed into a normalized, gated linear recurrence:

$$S_t = G_t \odot S_{t-1} + \dot{k}_t^{\top} \dot{v}_t$$

with explicit normalization critical for long-range stability.
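
A hedged, step-by-step sketch of this recurrence follows (a parallel-scan kernel would be used in practice); the elementwise gate parameterization, the simple averaging Conv1D kernel, and the running normalizer are assumptions beyond what the excerpt states.

```python
import torch
import torch.nn.functional as F

def lawcat_recurrence(q, k, v, gate, eps: float = 1e-6):
    """q, k, v: [T, d]; gate: [T, d] elementwise decay values in (0, 1)."""
    T, d = q.shape
    w = torch.full((d, 1, 4), 0.25)                     # causal Conv1D smoothing, kernel size 4
    def smooth(x):                                      # [T, d] -> [T, d]
        return F.conv1d(F.pad(x.t().unsqueeze(0), (3, 0)), w, groups=d).squeeze(0).t()
    q, k = smooth(q), smooth(k)
    S = torch.zeros(d, d)                               # matrix-valued state
    n = torch.zeros(d)                                  # normalizer state
    out = []
    for t in range(T):
        S = gate[t].unsqueeze(-1) * S + torch.outer(k[t], v[t])   # S_t = G_t * S_{t-1} + k_t^T v_t
        n = gate[t] * n + k[t]
        out.append((q[t] @ S) / (torch.dot(q[t], n) + eps))       # explicit normalization
    return torch.stack(out)                             # [T, d]
```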

  • Distillation and Generalization:
    • Passkey retrieval: LAWCAT maintains 91–95% accuracy at context lengths up to 16–22K tokens, whereas FlashAttention-2 performance collapses beyond 8K.
    • Throughput: linear attention kernels surpass FA2 in tokens/sec at sequence lengths >8K; memory remains O(N).
  • Edge and Streaming Suitability:

Low memory footprint (<12GB at 32K), linear latency, and seamless streaming favor deployment on single GPUs or in edge contexts.

6. Analysis, Limitations, and Best Practices

  • Dynamic Compute Scaling:

Zero-compute experts and ScMoE enable substantial resource savings. Only "hard" tokens invoke FFN experts, lowering average active parameter count to ∼27B for a 560B model.

  • Scalability and Stability:

Layer stacking, hyperparameter transfer, and the stability suite enable reproducible scaling with high hardware availability (98.48%). Tuning of Adam's $\varepsilon$, the $z$-loss coefficient, and the router balance parameter $\alpha$ is required at massive scale.

  • Long-Context Generalization:

LoZA’s layer-level calibration, streaming block-sparse attention, and selective sparsification preserve or outperform full-attention baselines across both retrieval-augmented (prefill) and decode-intensive (tool-integrated) tasks up to 1M tokens.

  • Limitations:

Slightly reduced performance is observed on some extreme long-context structured-reasoning tasks (e.g., GraphWalks-128K), and the engineering burden (custom kernels, routing-loss tuning, distributed infrastructure) is nontrivial.

  • Best Practices:
  1. Implement zero-expert routing with a PID-controlled router bias.
  2. Calibrate router-vs-LM gradients to $R_g < 0.1$.
  3. Control rare activation spikes with the hidden $z$-loss.
  4. Use width scaling to accelerate hyperparameter search.
  5. Stack half-depth checkpoints for fast, reliable initialization.
  6. Set Adam $\varepsilon$ below the observed gradient RMS.
  7. Invest in overlapping communication and computation (ScMoE/SBO) for maximal throughput.
  8. Deploy speculative decoding at inference time with MTP heads and SBO.

7. Implications and Context in Long-Context Language Modeling

LongCat-Flash-Exp exemplifies a convergent trend in large-scale language modeling: fusing Mixture-of-Experts, dynamic compute routing, and sparse or linear attention paradigms. The combination of LoZA-based sparse extension and LAWCAT-style linearization demonstrates that the quadratic barrier to context scaling can be mitigated via both architectural and training-process innovations.

The reported suite also provides an empirical roadmap: integrate data decontamination, robust agentic/long-form training, and efficient, reproducible hardware and software pipelines. As a result, LongCat-Flash-Exp provides a foundational technical reference for the design, scaling, and evaluation of long-context agentic LLMs targeting both cloud and edge deployment scenarios (Team et al., 1 Sep 2025, Zhang et al., 30 Dec 2025, Liu et al., 22 Sep 2025).
