LongCat-Flash-Omni: Unified Omni-modal AI
- LongCat-Flash-Omni is a unified, open-source omni-modal model designed with a 560B-parameter backbone that processes text, audio, image, and video inputs.
- It employs a scalable Mixture-of-Experts architecture with adaptive sparsity and modality-decoupled parallelism to achieve low-latency, high-throughput real-time streaming.
- A progressive, curriculum-inspired multi-stage training regimen and advanced distributed systems underpin its strong performance across multimodal benchmarks.
LongCat-Flash-Omni is an open-source, unified omni-modal model comprising 560 billion parameters (with 27 billion activated per token at inference), engineered for real-time audio-visual interaction while remaining competitive across both modality-specific and multimodal benchmarks. Developed by the Meituan LongCat team, it integrates innovations in Mixture-of-Experts (MoE) architecture, curriculum-motivated multi-stage training, modality-decoupled parallelism for large-scale training, and real-time streaming pipelines. The architecture and training infrastructure extend the LongCat-Flash series, adding specialized perception and generation modules for audio, vision, and text.
1. Model Architecture and Components
LongCat-Flash-Omni employs an end-to-end unified framework handling arbitrary combinations of text, audio, image, and video inputs. All modalities are projected into a shared latent token space processed by a central Mixture-of-Experts (MoE) LLM backbone.
- Backbone: Shortcut-connected Mixture-of-Experts (ScMoE) architecture with zero-computation experts, enabling per-token adaptive sparsity (see the routing sketch at the end of this section). Only 18.6B–31.3B parameters (27B on average) are active per token at inference. ScMoE runs the dense and MoE branches in parallel to maximize computation–communication overlap.
- Vision Encoder (LongCat-ViT): Transformer-based encoder (~637M parameters) supporting arbitrary resolution and aspect ratio, with 2D RoPE positional encoding, LayerScale, SwiGLU, RMSNorm, and frame compression for video.
- Audio Encoder: Streaming FSMN-Transformer hybrid (~600M parameters), facilitating low-latency encoding of both continuous feature frames and tokenized sequences.
- Audio Tokenizer/Decoder: Multi-codebook VQ-based LongCat-Audio-Codec with streaming LSTM+GAN decoder for direct waveform synthesis.
- Unified Multimodal Handling: Modalities are chunked and interleaved with explicit timestamps, allowing precise alignment across temporally-resolved inputs and supporting streaming, overlapping inference.
All architectural components are optimized for low latency and high throughput, with decoder routing supporting parallel generation of text and speech tokens.
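To make the per-token adaptive sparsity concrete, below is a minimal, self-contained sketch of an MoE layer in which some experts are zero-computation (identity) experts, so tokens routed to them skip FFN compute entirely. This is an illustrative toy under assumed sizes and gating, not the LongCat-Flash implementation.

```python
# Toy MoE layer with zero-computation (identity) experts; hypothetical sizes, not LongCat code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyZeroExpertMoE(nn.Module):
    def __init__(self, d_model=64, n_ffn_experts=4, n_zero_experts=2, top_k=2):
        super().__init__()
        self.n_ffn = n_ffn_experts
        self.n_total = n_ffn_experts + n_zero_experts  # the last experts are identity "zero" experts
        self.top_k = top_k
        self.router = nn.Linear(d_model, self.n_total, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_ffn_experts))

    def forward(self, x):                                   # x: [tokens, d_model]
        gates = F.softmax(self.router(x), dim=-1)
        topv, topi = gates.topk(self.top_k, dim=-1)
        topv = topv / topv.sum(dim=-1, keepdim=True)        # renormalize the selected gates
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx, w = topi[:, slot], topv[:, slot:slot + 1]
            for e in range(self.n_total):
                mask = idx == e
                if not mask.any():
                    continue
                # Tokens routed to a zero-computation expert pass through unchanged (no FFN cost),
                # which is how per-token compute becomes adaptive.
                y = x[mask] if e >= self.n_ffn else self.experts[e](x[mask])
                out[mask] += w[mask] * y
        return out

print(ToyZeroExpertMoE()(torch.randn(8, 64)).shape)         # torch.Size([8, 64])
```

In the full model, the fraction of tokens landing on zero-computation experts is what pushes the activated parameter count below the dense ceiling (the 18.6B–31.3B range noted above).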
2. Training Regimen: Curriculum-inspired Progressive Strategy
LongCat-Flash-Omni adopts a progressive, curriculum-based multi-stage training pipeline to cultivate capability across modalities and tasks:
- Stage-0: LLM pre-training on 16T text tokens emphasizing reasoning, mathematics, and code.
- Stage-1: Speech-text interleaved pre-training, applying a weighted combination of text and audio losses, with the per-loss weights tuned empirically, to promote aligned text and audio understanding.
- Stage-2: Multimodal pre-training with joint vision-text data, vision encoder integration, and token ratio text:speech:vision ≈ 2:1:1.
- Stage-3: Multimodal annealing, incorporating video, OCR, STEM, and grounding data, with dynamic data mixing via perplexity-gap-based sampling (see the sketch after this list).
- Stage-4: Long-context extension—context window enlarged to 128,000 tokens using RoPE base adjustment, with both synthetic and real long-form data.
- Stage-5: Audio encoder alignment, projecting continuous features into LLM space with a frozen backbone.
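The perplexity-gap-based sampling referenced in Stage-3 is described only at a high level here; the sketch below shows one plausible reading, in which data domains where the current model's perplexity still exceeds a reference model's are up-weighted. The function name, inputs, and exact weighting rule are assumptions for illustration.

```python
# Hypothetical sketch of perplexity-gap-based data mixing (one plausible interpretation).
import math

def mixing_weights(ppl_current, ppl_reference, temperature=1.0):
    """Up-weight domains where the current model still lags a reference model,
    i.e., where the perplexity gap is large."""
    gaps = {d: max(ppl_current[d] - ppl_reference[d], 0.0) for d in ppl_current}
    scores = {d: math.exp(g / temperature) for d, g in gaps.items()}
    total = sum(scores.values())
    return {d: s / total for d, s in scores.items()}

# Made-up per-domain perplexities purely for illustration.
current = {"ocr": 9.1, "stem": 7.4, "grounding": 6.2, "video_qa": 8.8}
reference = {"ocr": 6.0, "stem": 6.8, "grounding": 6.0, "video_qa": 6.5}
print(mixing_weights(current, reference))   # largest weight goes to the "ocr" domain
```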
Supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) for human preference alignment complete the training regimen. DPO optimizes the standard pairwise preference objective

$$\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right],$$

applied alongside simultaneous multi-modal consistency objectives.
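A minimal sketch of the pairwise DPO objective above, computed from per-sequence log-probabilities; this is the standard formulation rather than LongCat-specific training code, and the numbers are illustrative.

```python
# Standard pairwise DPO loss from sequence log-probabilities (illustrative, not LongCat code).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # log pi_theta(y_w|x) - log pi_theta(y_l|x), and the same margin for the frozen reference.
    policy_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    # -log sigma(beta * (policy margin - reference margin)), averaged over the batch.
    return -F.logsigmoid(beta * (policy_logratios - ref_logratios)).mean()

print(dpo_loss(torch.tensor([-12.0, -9.5, -20.1]), torch.tensor([-14.2, -9.9, -25.0]),
               torch.tensor([-13.0, -9.7, -21.0]), torch.tensor([-13.5, -9.8, -24.0])))
```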
3. Modality-decoupled Parallelism and Infrastructure
LongCat-Flash-Omni introduces Modality-Decoupled Parallelism (MDP) as a solution to data and model heterogeneity during large-scale multimodal training. The vision, audio, and LLM components are independently sharded and scheduled.
- Encoders: Hybrid Sharding Data Parallelism (HSDP).
- LLM: Pipeline parallelism (PP), ZeRO-1 Data Parallelism (DP), context parallelism (CP), and expert parallelism (EP).
- ModalityBridge: Facilitates chunk-based redistribution of encoder outputs to LLM ranks, so that each device holds only its locally assigned chunks rather than the full multimodal sequence, reducing per-device memory.
- InnerDP: A custom inner data-parallel group that aligns microbatches across the encoder and LLM schedules.
At 560B scale, MDP sustains more than 90% of the training throughput of a comparable text-only LLM while keeping memory and communication overhead stable, an advance over conventional FSDP or monolithic parallelism schemes.
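As a rough illustration of the chunk-based redistribution behind ModalityBridge, the sketch below partitions a timestamp-ordered multimodal sequence into chunks and assigns them round-robin across ranks, so no single device materializes the full sequence. The data layout, chunking rule, and names are assumptions, not the actual infrastructure code.

```python
# Hypothetical sketch of chunk redistribution across parallel ranks (not the real ModalityBridge).
from dataclasses import dataclass

@dataclass
class Chunk:
    modality: str      # "text" | "audio" | "vision"
    start_ms: int      # timestamp of the chunk's first token
    tokens: list       # token ids or embeddings belonging to this chunk

def assign_chunks(chunks, world_size):
    """Distribute timestamp-ordered chunks round-robin so each rank holds
    roughly 1/world_size of the multimodal sequence instead of all of it."""
    ordered = sorted(chunks, key=lambda c: c.start_ms)
    shards = {rank: [] for rank in range(world_size)}
    for i, chunk in enumerate(ordered):
        shards[i % world_size].append(chunk)
    return shards

chunks = [Chunk("audio", 0, [1, 2]), Chunk("vision", 0, [3, 4]),
          Chunk("audio", 80, [5, 6]), Chunk("vision", 1000, [7, 8])]
for rank, shard in assign_chunks(chunks, world_size=2).items():
    print(rank, [(c.modality, c.start_ms) for c in shard])
```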
4. Real-Time Audio-Visual Processing
For real-time streaming, LongCat-Flash-Omni combines encoder efficiency, aggressive sparsity, and chunk-wise AV synchronization:
- Chunk-wise AV synchronization: Synchronous chunking and timestamping of audio and video token streams for temporal coherence.
- Streaming pipeline: Concurrent Voice Activity Detection (VAD), frame sampling, multimodal encoding, LLM prefill-decode, and audio waveform decoding via modular, asynchronous scheduling.
- Speculative prefill-decode switching: LLM computation occurs in parallel with VAD endpoint detection, yielding 100 ms post-endpoint first-packet latency.
- Sparse-dense sampling: Modalities sample densely during user input and sparsely during model outputs, optimizing both latency and informational throughput.
Despite the model's scale, the serving infrastructure sustains first-packet latencies on the order of 100 ms, enabling responsive multimodal conversational interfaces.
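The speculative prefill-decode switching described above can be illustrated with a small asyncio sketch: prefill over already-buffered chunks runs concurrently with VAD endpoint detection, so decoding can begin almost immediately once the endpoint is confirmed. The timings, function names, and VAD interface are hypothetical placeholders.

```python
# Toy asyncio sketch of speculative prefill overlapping VAD endpoint detection (illustrative only).
import asyncio, time

async def vad_wait_for_endpoint():
    await asyncio.sleep(0.30)          # pretend the user stops speaking after 300 ms
    return time.perf_counter()

async def speculative_prefill():
    await asyncio.sleep(0.25)          # pretend prefill over buffered chunks takes 250 ms
    return "kv-cache-ready"

async def decode_first_packet(state):
    await asyncio.sleep(0.05)          # pretend generating the first output packet takes 50 ms
    return time.perf_counter()

async def main():
    # Prefill is launched before the endpoint is known (speculatively), so by the time
    # VAD fires, the KV cache is usually ready and decoding can start at once.
    prefill_task = asyncio.create_task(speculative_prefill())
    endpoint_t = await vad_wait_for_endpoint()
    state = await prefill_task
    first_packet_t = await decode_first_packet(state)
    print(f"first packet {1000 * (first_packet_t - endpoint_t):.0f} ms after endpoint")

asyncio.run(main())
```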
5. Performance Benchmarks
LongCat-Flash-Omni ranks at or near the top among open models on leading omni-modal and single-modality benchmarks.
| Benchmark | LongCat-Flash-Omni | Gemini-2.5-Pro (closed) | Qwen3-Omni (open) |
|---|---|---|---|
| OmniBench | 61.38 | 66.80 | 58.41 |
| WorldSense | 60.89 | 63.96 | 52.01 |
| UNO-Bench | 49.90 | 64.48 | 42.10 |
- Image: Comparable to Gemini-2.5-Flash; strong results on MMBench, RealWorldQA, MMStar.
- Video: Best short-video performance (MVBench), competitive for long-video and temporal tasks (VideoMME, MMVU).
- Audio: State-of-the-art or competitive results in ASR (LibriSpeech, AISHELL), audio understanding (MMAU, ClothoAQA), and speech-to-text translation (CoVoST2).
- Text: No measurable degradation versus top text-only LLMs—SOTA on MMLU, CEval, GPQA.
- Subjective Evaluation: Human ratings on naturalness, fluency, memory, and relevance—only GPT-4o and Doubao score higher.
6. Open-Source Release and Research Impact
Complete weights, model code, training strategies, and infrastructure are open-sourced via https://longcat.ai, Hugging Face, and GitHub. Benchmarks, evaluation scripts, and standardized pipelines are provided to ensure transparency and enable reproducible research.
The model demonstrates that co-design of large-scale sparsity, architectural innovation (ScMoE with zero-computation experts), curriculum optimization, and advanced distributed systems enables unified, low-latency, cross-modal models at extreme parameter counts. The release establishes a reference point for next-generation real-time omni-modal AI, foundational to research on agentic, embodied, and human-interactive systems.
Key Technical Summary Table
| Component | Details |
|---|---|
| Backbone | ScMoE with zero-computation experts, 560B parameters (27B active per token) |
| Vision Encoder | LongCat-ViT, Transformer-based, arbitrary resolution/aspect ratio, 2D RoPE, video support |
| Audio Encoder | Streaming FSMN + Transformer, 80 ms frames, hybrid causal/non-causal layout |
| Audio Decoder | Multi-codebook, direct code2wave (LSTM+conv+GAN), streaming |
| Max Context | 128K tokens |
| Training | Curriculum: text → speech-text → vision → video → long-context → audio alignment |
| Parallelism | MDP; HSDP for encoders, PP+DP+CP+EP for LLM |
| Latency | 100 ms after endpoint (speculative prefill-decode) |
| Open-source | Model, infra, training, benchmarks |