LongCat-Flash-Thinking: A 560B-Parameter Open-Source MoE Reasoning Model
Key Takeaways
- The paper introduces a two-phase training pipeline: a cold-start chain-of-thought phase followed by domain-parallel reinforcement learning.
- It uses a 560B-parameter sparse MoE architecture that activates roughly 27B parameters per token on average, enabling efficient, scalable inference.
- It achieves state-of-the-art open-source results in reasoning, coding, and formal theorem proving while maintaining safety and efficiency across diverse benchmarks.
Overview
LongCat-Flash-Thinking is a 560B-parameter Mixture-of-Experts (MoE) LLM designed for advanced reasoning, formal theorem proving, coding, and agentic tool use. The model is distinguished by a two-phase training pipeline: a cold-start curriculum focused on long Chain-of-Thought (CoT) reasoning, followed by a large-scale, domain-parallel reinforcement learning (RL) regime. The RL phase is powered by the DORA (Dynamic ORchestration for Asynchronous rollout) system, enabling efficient, stable, and scalable training across tens of thousands of accelerators. The model achieves state-of-the-art results among open-source models on a wide array of reasoning benchmarks, with notable efficiency in agentic reasoning and formal proof generation.
Model Architecture and Training Pipeline
LongCat-Flash-Thinking is built on a sparse MoE backbone that uses zero-computation experts and shortcut-connected expert parallelism for computational efficiency. Although the total parameter count is 560B, the model activates roughly 27B parameters per token on average, enabling high throughput and memory efficiency.
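To make the zero-computation idea concrete, here is a minimal, illustrative sketch of an MoE layer in which some experts are identity functions, so tokens routed to them incur no FFN compute. All names and sizes are assumptions for illustration; this is not the paper's implementation.

```python
import torch
import torch.nn as nn

class ZeroComputeMoE(nn.Module):
    """Toy MoE layer: a router mixes real FFN experts with identity
    ("zero-computation") experts that return the token unchanged."""
    def __init__(self, d_model=512, n_ffn_experts=6, n_zero_experts=2, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.n_ffn = n_ffn_experts
        self.router = nn.Linear(d_model, n_ffn_experts + n_zero_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_ffn_experts)
        )

    def forward(self, x):                        # x: (n_tokens, d_model)
        weights, idx = self.router(x).softmax(-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(self.router.out_features):
                mask = idx[:, k] == e
                if not mask.any():
                    continue
                # Experts with e >= n_ffn are identities: routing a token
                # there spends no FFN FLOPs, lowering average compute.
                y = self.experts[e](x[mask]) if e < self.n_ffn else x[mask]
                out[mask] += weights[mask, k].unsqueeze(-1) * y
        return out
```

With enough tokens routed to identity experts, the average activated parameter count falls well below the total, which is the mechanism behind the 560B-total / ~27B-active figures.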
The training pipeline consists of two phases:
- Long CoT Cold-Start Training:
  - Mid-training: curriculum learning on reasoning-intensive STEM and coding data, with careful data mixing to avoid degrading generalist capabilities.
  - Supervised Fine-Tuning (SFT): targeted exposure to formal reasoning (e.g., theorem proving) and agentic tool use, with rigorous data curation and quality control.
- Large-Scale RL with Domain-Parallelism: separate RL runs for the STEM, coding, and agentic domains, executed on the DORA system and subsequently fused into a single model (see "Domain-Parallel RL and Model Fusion" below).
Data Curation and Reasoning Enhancement
The cold-start phase addresses the scarcity of explicit long CoT patterns in pre-training corpora by constructing a curriculum of multi-step, competition-level STEM and coding problems. Data quality is ensured via LLM-as-a-Judge filtering, model-based voting, and difficulty stratification. For formal reasoning, an autoformalizer translates natural language problems into formal statements, which are then verified and iteratively expanded via expert iteration. Agentic reasoning data is curated using a dual-path evaluation pipeline to select queries that genuinely require tool use, with solution trajectories synthesized and validated for logical completeness and tool integration.
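As one concrete, hypothetical instance of the model-based voting and difficulty stratification described above, a filter might sample several solutions per problem and keep only problems whose empirical pass rate falls in a mid-difficulty band. All names and thresholds here are illustrative assumptions, not the paper's values.

```python
def stratify_by_difficulty(problems, solve, n_samples=8, band=(0.125, 0.875)):
    """Keep problems that are neither trivially easy nor unsolvable for the
    current model, judged by the empirical pass rate over sampled attempts.
    `problems` is an iterable of (question, reference_answer) pairs and
    `solve` is any callable returning a candidate answer."""
    kept = []
    for question, answer in problems:
        passes = sum(solve(question) == answer for _ in range(n_samples))
        rate = passes / n_samples
        if band[0] <= rate <= band[1]:
            kept.append((question, answer, rate))
    return kept
```

The same loop doubles as a difficulty label: the pass rate itself can be stored and used to order the curriculum from easy to hard.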
Reinforcement Learning Infrastructure and Algorithmic Innovations
DORA System
The DORA system is a distributed, asynchronous RL framework that addresses the inefficiencies of synchronous RL in long-context, reasoning-heavy tasks. Key features include:
- Streaming Rollout: Multiple policy versions are maintained, allowing responses to be generated and processed as soon as each completes, eliminating batch blocking by long-tail samples (a small sketch follows Figure 2).
- Elastic Colocation: RL roles (generation, training, reward computation) are dynamically assigned to devices, minimizing idle time and maximizing hardware utilization.
- KV-Cache Reuse and Load Balancing: Efficient management of KV-caches and dynamic resource redistribution further reduce overhead.
Figure 2: The workflow of Load Balancing. The Load Balancing Controller monitors the load on each device, and when predefined thresholds are met, it initiates resource redistribution, including weight transfer and KV-cache transfer.
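The streaming-rollout idea can be illustrated with a small asyncio sketch: each finished generation is handed to the trainer the moment it completes, tagged with the policy version that produced it, instead of waiting for the slowest sample in a batch. All names are illustrative; DORA's actual implementation is a distributed system, not a single-process loop.

```python
import asyncio
import random

async def generate(policy_version: int, prompt: str) -> dict:
    """Stand-in for an inference-engine call; latencies are long-tailed."""
    await asyncio.sleep(random.uniform(0.1, 2.0))
    return {"prompt": prompt, "policy_version": policy_version, "text": "..."}

async def streaming_rollout(prompts, policy_version, trainer_queue):
    """Push each response to the trainer as soon as it finishes, so a few
    slow long-tail generations never stall the whole batch."""
    tasks = [asyncio.create_task(generate(policy_version, p)) for p in prompts]
    for finished in asyncio.as_completed(tasks):
        await trainer_queue.put(await finished)

async def main():
    queue = asyncio.Queue()
    await streaming_rollout(["p1", "p2", "p3"], policy_version=7,
                            trainer_queue=queue)
    while not queue.empty():
        print(await queue.get())

asyncio.run(main())
```

Because each sample carries its policy version, the trainer can apply staleness controls and importance corrections downstream (see the algorithmic modifications below).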
RL Algorithmic Modifications
The RL phase employs a modified Group Relative Policy Optimization (GRPO) objective, with several critical adjustments for stability and efficiency (a loss sketch follows the list):
- Token-Level Loss: Replaces sample-level loss to improve stability and mitigate length bias.
- Triplet Clipping: Separate clipping bounds for positive and negative advantages to control variance, especially important for sparse MoE routing.
- Truncated Importance Sampling: Mitigates distribution mismatch between inference and training engines.
- Staleness Control and Data Reuse: Promotes sample efficiency while maintaining policy freshness.
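Below is a minimal sketch of the first two adjustments, under the assumption that "triplet clipping" means separate clip bounds for positive and negative advantages; the exact bounds and functional form are assumptions, not the paper's values.

```python
import torch

def grpo_token_loss(logp_new, logp_old, advantages, mask,
                    eps_pos=0.2, eps_neg_low=0.2, eps_neg_high=0.3):
    """Token-level GRPO-style loss with asymmetric clipping (illustrative).
    logp_new/logp_old: (batch, seq) per-token log-probs.
    advantages: (batch,) group-normalized rewards, broadcast over tokens.
    mask: (batch, seq) 0/1 float mask of valid tokens."""
    ratio = torch.exp(logp_new - logp_old)          # importance ratio
    adv = advantages.unsqueeze(-1).expand_as(ratio)
    # Asymmetric bounds: positive advantages get a standard upper clip;
    # negative advantages get their own bounds to limit the variance that
    # sparse MoE routing introduces into the ratios.
    clipped = torch.where(
        adv >= 0,
        torch.clamp(ratio, max=1 + eps_pos),
        torch.clamp(ratio, min=1 - eps_neg_low, max=1 + eps_neg_high),
    )
    per_token = -torch.min(ratio * adv, clipped * adv)
    # Token-level objective: average over all valid tokens in the batch
    # rather than per-sample means, which removes length bias.
    return (per_token * mask).sum() / mask.sum()
```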
The reward system is domain-adaptive: discriminative reward models for non-verifiable tasks, generative reward models with reasoning for STEM, and sandbox-based evaluation for code.
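A hypothetical dispatch for this domain-adaptive reward routing might look as follows; the backend functions are stubs standing in for the three mechanisms named above, and all identifiers are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    domain: str        # "stem" | "code" | anything else
    prompt: str
    response: str

def sandbox_reward(sample: Sample) -> float:
    return 1.0   # would execute code against tests in an isolated sandbox

def generative_judge_reward(sample: Sample) -> float:
    return 1.0   # would prompt a reasoning reward model to grade the answer

def discriminative_rm_reward(sample: Sample) -> float:
    return 0.5   # would score with a scalar-output reward model

def compute_reward(sample: Sample) -> float:
    """Route each rollout to a domain-appropriate reward backend."""
    if sample.domain == "code":
        return sandbox_reward(sample)
    if sample.domain == "stem":
        return generative_judge_reward(sample)
    return discriminative_rm_reward(sample)
```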
Domain-Parallel RL and Model Fusion
Domain-mixed RL was found to induce negative transfer due to distributional shifts in response characteristics. The domain-parallel approach trains separate expert models for STEM, code, and agentic domains, which are then merged using normalization, dropout, and parameter erasure strategies to minimize interference. A final general RL phase ensures broad alignment and safety.
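A rough sketch of the fusion step is given below, in the spirit of dropout-and-rescale delta merging (as in DARE-style methods): each expert's deviation from the shared base is sparsified, rescaled, and averaged. The paper's exact normalization and parameter-erasure scheme may differ, and all names here are assumptions.

```python
import torch

def merge_domain_experts(base, experts, drop_p=0.5):
    """Fuse domain-expert checkpoints into one model (illustrative).
    base: dict of name -> float tensor (shared starting checkpoint).
    experts: list of state dicts with the same keys as `base`."""
    merged = {}
    for name, w0 in base.items():
        deltas = []
        for sd in experts:
            d = sd[name] - w0                        # task vector vs. base
            keep = (torch.rand_like(d) > drop_p).float()
            deltas.append(d * keep / (1 - drop_p))   # drop and rescale
        merged[name] = w0 + torch.stack(deltas).mean(0)
    return merged
```

Dropping a random fraction of each delta reduces overlap between domain updates, which is one plausible way to realize the "minimize interference" goal stated above.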
Empirical Results
LongCat-Flash-Thinking demonstrates strong performance across a comprehensive suite of benchmarks:
- Mathematical Reasoning: 99.2% on MATH-500, 93.3% on AIME-24, 83.7% on HMMT-25.
- Coding: 79.4% on LiveCodeBench, 40.7% on OJBench.
- Agentic Tool Use: 67.5% on τ2-Bench-Airline, 64.5% reduction in average token consumption on AIME-25 (from 19,653 to 6,965) without loss of accuracy.
- Formal Theorem Proving: 67.6% pass@1 on MiniF2F-Test, outperforming the next best open-source model by 18%.
- Safety: Best-in-class refusal rates on harmful, criminal, and misinformation queries.
System and Scaling Considerations
The DORA system achieves over 3x speedup compared to synchronous RL on 560B-parameter models, scaling efficiently to tens of thousands of accelerators. Engineering optimizations include graph-level compilation for MoE parallelism, bidirectional streaming RPC, and efficient load balancing. The system is robust to the numerical inconsistencies between inference and training engines, and supports high-throughput, long-context rollouts.
Implications and Future Directions
LongCat-Flash-Thinking establishes a new standard for open-source reasoning models, particularly in formal mathematics, coding, and agentic tool use. The domain-parallel RL and model fusion methodology provides a template for future large-scale, multi-domain LLM training. The DORA system demonstrates that asynchronous, streaming RL is viable and efficient at industrial scale, with implications for both research and production deployments.
The model's efficiency in agentic reasoning—achieving substantial token savings without accuracy loss—suggests that hybrid approaches combining LLM reasoning with external tool use will be increasingly important. The formal reasoning pipeline, leveraging autoformalization and expert iteration, points toward scalable methods for instilling formal verification capabilities in LLMs.
Open-sourcing LongCat-Flash-Thinking is likely to accelerate research in high-quality data curation, scalable RL, and agentic AI systems. Future work may explore more granular expert routing, further improvements in model fusion, and tighter integration of formal verification and tool use within LLMs.
Conclusion
LongCat-Flash-Thinking demonstrates that a carefully engineered MoE LLM, trained via a domain-parallel, RL-centric pipeline and supported by a robust asynchronous infrastructure, can achieve state-of-the-art reasoning performance with high efficiency. The model's architecture, training methodology, and system design collectively advance the state of open-source reasoning models and provide a foundation for further research in scalable, agentic, and formally capable AI systems.