
Yuan3.0 Flash: Enterprise Multimodal MoE LLM

Updated 6 January 2026
  • Yuan3.0 Flash is an open-source multimodal MoE large language model with 40B parameters and efficient token activation.
  • Its innovative RAPO algorithm reduces overthinking by optimizing token usage and enhancing reasoning accuracy.
  • The model excels in enterprise applications such as complex table understanding, retrieval-augmented generation, and text summarization.

Yuan3.0 Flash is an open-source Mixture-of-Experts (MoE) multimodal LLM (MLLM) designed to advance enterprise-oriented natural language and multimodal processing while delivering competitive general-purpose reasoning. The model has 40 billion total parameters with approximately 3.7 billion activated per token, an activation ratio of ≈9.25% that keeps computation sparse and efficient. Yuan3.0 Flash introduces the Reflection-aware Adaptive Policy Optimization (RAPO) training algorithm to mitigate the overthinking phenomenon prevalent in Large Reasoning Models (LRMs), thereby reducing unnecessary post-answer verification and optimizing token usage. It demonstrates superior results across real-world tasks such as retrieval-augmented generation, complex table understanding, and text summarization, and achieves reasoning efficiency comparable to frontier models with significantly reduced token consumption. The model is fully open-sourced, supporting extended context windows and multimodal enterprise deployment scenarios (ai et al., 5 Jan 2026).
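As a quick sanity check on the headline numbers (a back-of-envelope sketch, not from the paper's code):

```python
# Back-of-envelope check of the activation figures quoted above.
total_params = 40e9      # total parameters (40B)
active_params = 3.7e9    # parameters activated per token (~3.7B)

activation_ratio = active_params / total_params
print(f"activated per token: {activation_ratio:.2%}")           # ≈ 9.25%
print(f"parameters left inactive: {1 - activation_ratio:.2%}")  # ≈ 90.75%
```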

1. Model Architecture

Yuan3.0 Flash is built upon a multimodal MoE Transformer backbone with three principal components:

  • Visual Encoder & Projection: Images are processed via InternViT-300M (a pretrained vision transformer) to extract patch embeddings. These are projected into the token space using a lightweight multi-layer perceptron (MLP) with SwiGLU activations. An adaptive segmentation module slices high-resolution images into an m×n patch grid plus a global thumbnail, optimizing input processing for complex visual tasks.
  • MoE-Based Language Backbone: The architecture consists of 40 Transformer layers, each equipped with Localizing Filtering-based Attention (LFA), which prioritizes local token dependencies, and a Mixture-of-Experts sublayer with E = 32 experts. At each forward pass, K = 2 experts are selected per token. The model structure segregates ~30B parameters into expert networks and ~10B into feed-forward/attention layers.
  • Expert Routing Mechanism: For token-wise computation, the hidden state h ∈ ℝ^d is routed via a learned matrix W_g ∈ ℝ^{E×d}, yielding gating probabilities g = Softmax(W_g h). The final expert output for each token is MoE(h) = Σ_{i ∈ TopK(g)} g_i · E_i(h), where TopK(g) selects the top K = 2 experts. This Top-2 routing maintains sparsity and scales linearly with the number of experts, leading to efficient computation.
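The Top-2 routing above can be sketched in a few lines of NumPy (a minimal single-token illustration with toy linear experts; dimensions and random weights are placeholders, not the model's actual parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
d, E, K = 8, 32, 2                  # hidden size, experts, top-k (E=32, K=2 as in the text)

h = rng.standard_normal(d)          # hidden state h in R^d
W_g = rng.standard_normal((E, d))   # learned gating matrix W_g in R^{E x d}

logits = W_g @ h
g = np.exp(logits - logits.max())
g /= g.sum()                        # g = Softmax(W_g h)

top_k = np.argsort(g)[-K:]          # indices of the K largest gating probabilities

# Toy experts: each E_i here is a linear map; in the model they are FFN sublayers.
experts = [rng.standard_normal((d, d)) for _ in range(E)]

# MoE(h) = sum over top-K experts of g_i * E_i(h); only K of the E experts run.
moe_out = sum(g[i] * (experts[i] @ h) for i in top_k)
print(moe_out.shape)  # (8,)
```

Because only K = 2 of the 32 experts execute per token, compute per token stays near-constant while total capacity grows with E.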

2. Reflection-aware Adaptive Policy Optimization (RAPO)

RAPO is a reinforcement learning (RL) training algorithm targeting the overthinking issue in LRMs—reducing post-answer self-verification that expends excessive tokens and may reduce solution accuracy.

  • Reflection Inhibition Reward Mechanism (RIRM): RIRM introduces three scalar rewards:
    • R_ans: indicator for the presence of a correct answer.
    • A reflection penalty that discourages excessive reflection steps, with a functional form based on task-dependent estimates of how many reflection steps a problem warrants.
    • An indicator for the correctness of the final answer.
    The total reward is the sum of these three terms.

  • Optimized DAPO Objective Backbone: RAPO refines the DAPO objective with Adaptive Dynamic Sampling (ADS), an 80/20 entropy rule (policy gradients are applied only to the top 20% highest-entropy tokens), and a dual-clip mechanism to moderate large negative advantages. The per-batch RL loss takes the standard clipped policy-gradient form

    L(θ) = −E_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1−ε_low, 1+ε_high) Â_t ) ],

    where r_t(θ) = π_θ(a_t | s_t) / π_θ_old(a_t | s_t) is the likelihood ratio and Â_t the advantage.

  • Overthinking Mitigation: RAPO, via RIRM, demonstrably reduced output tokens by up to 47.1% and improved solution accuracy by up to 52.4% (on AIME-2024 and MATH-500), highlighting efficacy for reasoning tasks with chained verification procedures.
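The dual-clip mechanism mentioned above can be illustrated with a minimal NumPy sketch of a PPO/DAPO-style clipped objective (the clip thresholds eps_low, eps_high and the dual-clip constant c are illustrative assumptions, not values from the paper):

```python
import numpy as np

def dual_clip_objective(ratio, adv, eps_low=0.2, eps_high=0.28, c=3.0):
    """Per-token clipped policy-gradient objective with a dual clip.

    ratio: likelihood ratio pi_theta / pi_old for each token
    adv:   advantage estimate for each token
    For positive advantages this reduces to the usual PPO clip; for
    negative advantages the dual clip bounds the objective below by
    c * adv, so a very large ratio cannot produce an unbounded penalty.
    """
    ratio, adv = np.asarray(ratio, float), np.asarray(adv, float)
    clipped = np.clip(ratio, 1 - eps_low, 1 + eps_high)
    ppo = np.minimum(ratio * adv, clipped * adv)
    # Dual clip: applied only where the advantage is negative.
    return np.where(adv < 0, np.maximum(ppo, c * adv), ppo)

# A token with a large ratio and a negative advantage is bounded by c * adv:
obj = dual_clip_objective(ratio=[10.0], adv=[-1.0])
print(obj)  # capped at c * adv = -3.0 rather than -10.0
```

The lower bound is what "moderates negative advantages": without it, a stale policy with a large ratio would dominate the batch gradient.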

3. Enterprise-Oriented Capabilities

Yuan3.0 Flash is evaluated on rigorous enterprise benchmarks, demonstrating strong multimodal and textual reasoning:

  • Retrieval-Augmented Generation (RAG): On Docmatix (multimodal/multi-page), Yuan3.0 Flash achieved 65.07%, outperforming GPT-4o (56.79%) and Qwen2.5-VL (59.75%). In ChatRAG (10 text-retrieval tasks), Flash scored 64.47% (vs. GPT-4o 50.54%, DeepSeek-V3 50.47%).

  • Complex Table Understanding: On MMTAB (15 subtasks), Yuan3.0 Flash led with 58.29% average, surpassing GPT-5.1 (55.15%) and GLM-4.5V (52.00%), ranking first in over half the tasks (e.g., TABMWP at 95.09% accuracy).

  • Text Summarization: SummEval performance was 59.31% average, with component metrics (ROUGE-1 51.32, ROUGE-2 28.32, BERTScore 89.99, SummaC 45.34), exceeding GPT-5.1.

  • Tool Invocation: On BFCL V3, it achieved 57.97% average accuracy, with balanced results across static/live execution and multi-turn relevance tasks (cf. Qwen3-235B 67.94%, Claude-3.7 58.58%).

  • Long-Text Understanding: In Needle-in-a-Haystack (NIAH) tests up to 128K context, Yuan3.0 Flash maintained perfect retrieval accuracy.

A plausible implication is that the model’s architectural and RL training novelties contribute directly to its enterprise applicability, especially for document-intensive and multimodal scenarios.

4. General-Purpose Reasoning Performance

Yuan3.0 Flash supports hybrid inference modes—“non-thinking” for direct answer tasks and “thinking” for chain-of-thought reasoning.

  • Non-Thinking Mode: For standard benchmarks without chain-of-thought, results include MATH-500 (88.7% vs. DeepSeek-V3 94.0%) and MMLU (82.9% vs. 83.4%).

  • Thinking Mode: When equipped with chain-of-thought and RIRM, Flash achieves notable efficiency:

    • AIME-2024: 47.6% accuracy with 6,086 tokens (vs. DeepSeek-R1 91.4% with 17,164 tokens, roughly one third the token count).
    • MATH-500: 91.2% accuracy with 1,431 tokens (vs. 97.4% at 5,541 tokens, roughly one quarter the token count).
    • Comparable token efficiency is observed on HumanEval, MMLU, and MMLU-Pro.
    • A plausible implication is that Yuan3.0 Flash’s overthinking mitigation yields state-of-the-art reasoning output quality per token, optimizing compute for cost-sensitive deployments.
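The token-efficiency ratios quoted above follow directly from the reported counts (plain arithmetic, no modeling assumptions):

```python
# Reported output-token counts: (Yuan3.0 Flash, comparison model).
benchmarks = {
    "AIME-2024": (6_086, 17_164),  # vs. DeepSeek-R1
    "MATH-500":  (1_431, 5_541),
}
for name, (flash, baseline) in benchmarks.items():
    ratio = flash / baseline
    print(f"{name}: {ratio:.0%} of baseline tokens ({1 - ratio:.0%} reduction)")
```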

5. Open-Source Release and Deployment

Yuan3.0 Flash is publicly available (https://github.com/Yuan-lab-LLM/Yuan3.0) with full code and pretrained/SFT-only MoE checkpoints (40B total, 3.7B activated). The repository includes:

  • Scripts supporting multimodal inference and chain-of-thought reasoning.
  • Example pipelines for enterprise usage (long-document QA, chart/table interpretation, summarization).
  • Fine-tuning recipes for domain adaptation leveraging RAPO.

Deployment supports context windows up to 128K tokens and requires MoE-aware runtimes (e.g., DeepSpeed MoE, Megatron-LM). Hybrid inference modes (thinking vs. non-thinking) can be toggled to suit application needs. Major enterprise scenarios include intelligent customer service, report analysis, financial/table processing, visual-document assistants, code testing/generation, and scientific QA.

Limitations include the need for specialized hardware or optimized kernels for sparse MoE inference, a lag behind some proprietary large models on certain zero-shot general reasoning tasks, and the need to tune segmentation parameters for high-resolution image inputs.

6. Significance and Prospective Directions

Yuan3.0 Flash provides a highly efficient, fully open multimodal MoE platform advancing both enterprise and general-purpose benchmarks. By merging innovative architectural design and RL-based overthinking mitigation, it approaches state-of-the-art accuracy while reducing reasoning-phase token expenditure by 50–75%. This suggests increasing accessibility and cost-effectiveness for real-world deployments, though further research may address specialized hardware support and frontier-level performance in zero-shot general tasks (ai et al., 5 Jan 2026).
