
Yuan3.0 Flash: Enterprise Multimodal MoE LLM

Updated 6 January 2026
  • Yuan3.0 Flash is an open-source multimodal MoE large language model with 40B parameters and efficient token activation.
  • Its innovative RAPO algorithm reduces overthinking by optimizing token usage and enhancing reasoning accuracy.
  • The model excels in enterprise applications such as complex table understanding, retrieval-augmented generation, and text summarization.

Yuan3.0 Flash is an open-source Mixture-of-Experts (MoE) multimodal LLM (MLLM) designed to advance enterprise-oriented natural language and multimodal processing while delivering competitive general-purpose reasoning. The model has 40 billion total parameters with approximately 3.7 billion activated per token, i.e., only about 9.25% of its parameters are active per forward pass, yielding computational efficiency. Yuan3.0 Flash introduces the Reflection-aware Adaptive Policy Optimization (RAPO) training algorithm to mitigate the overthinking phenomenon prevalent in Large Reasoning Models (LRMs), thereby reducing unnecessary post-answer verification and optimizing token usage. It demonstrates superior results across real-world tasks such as retrieval-augmented generation, complex table understanding, and text summarization, and achieves reasoning efficiency comparable to frontier models with significantly reduced token consumption. The model is fully open-sourced, supporting extended context windows and multimodal enterprise deployment scenarios (ai et al., 5 Jan 2026).

1. Model Architecture

Yuan3.0 Flash is built upon a multimodal MoE Transformer backbone with three principal components:

  • Visual Encoder & Projection: Images are processed via InternViT-300M (a pretrained vision transformer) to extract patch embeddings, which are projected into the token space by a lightweight multi-layer perceptron (MLP) with SwiGLU activations. An adaptive segmentation module slices high-resolution images into an $m \times n$ patch grid plus a global thumbnail, optimizing input processing for complex visual tasks (see the slicing sketch after this list).
  • MoE-Based Language Backbone: The architecture consists of 40 Transformer layers, each equipped with Localizing Filtering-based Attention (LFA), which prioritizes local token dependencies, and a Mixture-of-Experts sublayer with $E$ experts. At each forward pass, $K = 2$ experts are selected per token. The model structure segregates ~30B parameters into expert networks and ~10B into feed-forward/attention layers.
  • Expert Routing Mechanism: For token-wise computation, the hidden state $h \in \mathbb{R}^d$ is routed via a learned matrix $W_g \in \mathbb{R}^{E \times d}$, yielding gating probabilities $g = \mathrm{Softmax}(W_g h)$. The final expert output for each token is $$\mathrm{MoE}(h) = \sum_{i \in \mathrm{TopK}(g)} g_i\, E_i(h),$$ where $\mathrm{TopK}(g)$ selects the top $K = 2$ experts. This Top-2 routing maintains sparsity and scales linearly with the number of experts, leading to efficient computation (a routing sketch also follows this list).
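
To make the adaptive segmentation step concrete, the sketch below slices a high-resolution image into an $m \times n$ tile grid plus a global thumbnail. The tile size, the grid-selection heuristic, and the function name are illustrative assumptions, not the released implementation:

```python
from PIL import Image

TILE = 448  # assumed tile size; the released code may differ


def adaptive_slices(img: Image.Image, max_tiles: int = 12):
    """Slice an image into an m x n grid of tiles plus a global
    thumbnail (illustrative heuristic, not the paper's exact rule)."""
    w, h = img.size
    # Pick the m x n grid (m * n <= max_tiles) whose aspect ratio
    # n/m best matches the input image's w/h.
    best, best_diff = (1, 1), float("inf")
    for m in range(1, max_tiles + 1):
        for n in range(1, max_tiles // m + 1):
            diff = abs(w / h - n / m)
            if diff < best_diff:
                best, best_diff = (m, n), diff
    m, n = best
    resized = img.resize((n * TILE, m * TILE))
    tiles = [
        resized.crop((j * TILE, i * TILE, (j + 1) * TILE, (i + 1) * TILE))
        for i in range(m) for j in range(n)
    ]
    thumbnail = img.resize((TILE, TILE))  # global view appended last
    return tiles + [thumbnail]
```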
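
The Top-2 routing equations above translate directly into code. The following is a minimal PyTorch sketch of the $\mathrm{MoE}(h)$ formula under assumed dimensions and placeholder expert MLPs; it illustrates the published routing rule, not the released implementation:

```python
import torch
import torch.nn as nn


class Top2MoE(nn.Module):
    """Minimal Top-K expert routing per MoE(h) = sum_{i in TopK(g)} g_i E_i(h)."""

    def __init__(self, d: int, num_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d, num_experts, bias=False)  # W_g in R^{E x d}
        # Placeholder expert MLPs; the real experts' shapes are unspecified here.
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, 4 * d), nn.SiLU(), nn.Linear(4 * d, d))
             for _ in range(num_experts)]
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # g = Softmax(W_g h): gating probabilities over the E experts
        g = torch.softmax(self.gate(h), dim=-1)      # (tokens, E)
        topv, topi = g.topk(self.k, dim=-1)          # Top-K gates and indices
        out = torch.zeros_like(h)
        # Only the K selected experts run per token, keeping computation sparse.
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e
                if mask.any():
                    out[mask] += topv[mask, slot, None] * expert(h[mask])
        return out


moe = Top2MoE(d=64, num_experts=8)
print(moe(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```
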

2. Reflection-aware Adaptive Policy Optimization (RAPO)

RAPO is a reinforcement learning (RL) training algorithm that targets the overthinking issue in LRMs by curbing post-answer self-verification, which expends excessive tokens and can reduce solution accuracy.

  • Reflection Inhibition Reward Mechanism (RIRM): RIRM introduces three scalar rewards (a minimal sketch of this reward appears after this list):
    • $R_{ans}$: indicator for the presence of a correct answer.
    • $R_{ver}(v)$: penalizes excessive reflection steps $v$, with the functional form

    $$R_{ver}(v) = \begin{cases} 1, & v \le r_{\min} \\ 1 - \frac{v - r_{\min}}{r_{\max} - r_{\min}}, & r_{\min} < v \le r_{\max} \\ 0, & v > r_{\max} \end{cases}$$

    for task-dependent estimates $r_{\min}$, $r_{\max}$.
    • $R_{acc}$: indicator for the correctness of the final answer.
    • The total reward is $R_{\mathrm{reflect}} = R_{ans} + R_{ver} + R_{acc} \in [0, 3]$.

  • Optimized DAPO Objective Backbone: RAPO refines the DAPO objective with Adaptive Dynamic Sampling (ADS), an 80/20 entropy rule (policy gradients are applied only to the top 20% highest-entropy tokens), and a dual-clip mechanism to moderate negative advantages. The RL loss per batch $\mathcal{B}_r$ is

    $$J^{\mathcal{B}_r} = \mathbb{E}_{(q,a)\sim\mathcal{B}_r,\; o\sim\pi_{\theta_{\mathrm{old}}}}\left[\frac{1}{\sum |o|}\sum_{t=1}^{|o|}\mathbb{I}[H_t \ge \tau_\rho]\, PG_t(\theta)\right],$$

    where

    $$PG_t(\theta) = \begin{cases} \mathrm{clip}(r_t,\, 1-\epsilon_{\mathrm{low}},\, 1+\epsilon_{\mathrm{high}})\,\hat A_t, & \hat A_t \le 0 \\ \min\bigl(r_t\,\hat A_t,\ \mathrm{clip}(r_t,\, 1-\epsilon_{\mathrm{low}},\, 1+\epsilon_{\mathrm{high}})\,\hat A_t\bigr), & \hat A_t > 0 \end{cases}$$

    with $r_t$ the likelihood ratio and $\hat{A}_t$ the advantage (see the sketches after this list).

  • Overthinking Mitigation: RAPO, via RIRM, demonstrably reduced output tokens by up to 47.1% and improved solution accuracy by up to 52.4% (on AIME-2024 and MATH-500), highlighting efficacy for reasoning tasks with chained verification procedures.
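
The RIRM reward follows directly from its definition above. In the sketch below, the $r_{\min}$/$r_{\max}$ defaults are placeholders; the paper describes them only as task-dependent estimates:

```python
def verification_reward(v: int, r_min: int, r_max: int) -> float:
    """R_ver(v): piecewise-linear penalty on the reflection-step count v."""
    if v <= r_min:
        return 1.0
    if v <= r_max:
        return 1.0 - (v - r_min) / (r_max - r_min)
    return 0.0


def reflect_reward(has_answer: bool, v: int, correct: bool,
                   r_min: int = 1, r_max: int = 4) -> float:
    """R_reflect = R_ans + R_ver + R_acc, bounded in [0, 3].
    The r_min/r_max defaults are illustrative placeholders."""
    return float(has_answer) + verification_reward(v, r_min, r_max) + float(correct)


# A correct answer reached with 2 reflection passes (r_min=1, r_max=4):
print(reflect_reward(True, 2, True))  # 1 + 2/3 + 1 ~= 2.67
```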
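
Likewise, the per-token policy-gradient term $PG_t$ with dual clipping, plus the 80/20 entropy mask, can be sketched as follows; the $\epsilon_{\mathrm{low}}$/$\epsilon_{\mathrm{high}}$ values and the quantile-based mask are illustrative assumptions:

```python
import torch


def pg_term(ratio: torch.Tensor, adv: torch.Tensor,
            eps_low: float = 0.2, eps_high: float = 0.28) -> torch.Tensor:
    """PG_t(theta) as defined above; eps values are placeholders."""
    clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high) * adv
    # A_t > 0: standard PPO min(); A_t <= 0: keep only the clipped branch.
    return torch.where(adv > 0, torch.minimum(ratio * adv, clipped), clipped)


def entropy_mask(token_entropy: torch.Tensor, keep_frac: float = 0.2) -> torch.Tensor:
    """80/20 rule: gradients flow only on the top-20% highest-entropy tokens."""
    tau = torch.quantile(token_entropy, 1 - keep_frac)
    return (token_entropy >= tau).float()
```

For negative advantages, only the clipped ratio scales the penalty, which bounds how strongly any single sample can push the policy down, the moderation of negative advantages described above.
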

3. Enterprise-Oriented Capabilities

Yuan3.0 Flash is evaluated on rigorous enterprise benchmarks demonstrating strong multimodal and textual reasoning:

  • Retrieval-Augmented Generation (RAG): On Docmatix (multimodal/multi-page), Yuan3.0 Flash achieved 65.07%, outperforming GPT-4o (56.79%) and Qwen2.5-VL (59.75%). In ChatRAG (10 text-retrieval tasks), Flash scored 64.47% (vs. GPT-4o 50.54%, DeepSeek-V3 50.47%).

  • Complex Table Understanding: On MMTAB (15 subtasks), Yuan3.0 Flash led with 58.29% average, surpassing GPT-5.1 (55.15%) and GLM-4.5V (52.00%), ranking first in over half the tasks (e.g., TABMWP at 95.09% accuracy).

  • Text Summarization: SummEval performance was 59.31% average, with component metrics (ROUGE-1 51.32, ROUGE-2 28.32, BERTScore 89.99, SummaC 45.34), exceeding GPT-5.1.

  • Tool Invocation: On BFCL V3, it achieved 57.97% average accuracy, with balanced results across static/live execution and multi-turn relevance tasks (cf. Qwen3-235B 67.94%, Claude-3.7 58.58%).

  • Long-Text Understanding: In Needle-in-a-Haystack (NIAH) tests up to 128K context, Yuan3.0 Flash maintained perfect retrieval accuracy.

A plausible implication is that the model’s architectural and RL training novelties contribute directly to its enterprise applicability, especially for document-intensive and multimodal scenarios.

4. General-Purpose Reasoning Performance

Yuan3.0 Flash supports hybrid inference modes: "non-thinking" for direct-answer tasks and "thinking" for chain-of-thought reasoning.

  • Non-Thinking Mode: For standard benchmarks without chain-of-thought, results include MATH-500 (88.7% vs. DeepSeek-V3 94.0%) and MMLU (82.9% vs. 83.4%).

  • Thinking Mode: When equipped with chain-of-thought and RIRM, Flash achieves notable efficiency:

    • AIME-2024: 47.6% accuracy with 6,086 tokens (vs. DeepSeek-R1 at 91.4% with 17,164 tokens; roughly one-third the token usage).
    • MATH-500: 91.2% accuracy with 1,431 tokens (vs. 97.4% at 5,541 tokens; roughly one-quarter the token usage).
    • Comparable token efficiency is observed on HumanEval, MMLU, and MMLU-Pro.
    • A plausible implication is that Yuan3.0 Flash’s overthinking mitigation yields state-of-the-art reasoning output quality per token, optimizing compute for cost-sensitive deployments.

5. Open-Source Release and Deployment

Yuan3.0 Flash is publicly available (https://github.com/Yuan-lab-LLM/Yuan3.0) with full code and pretrained/SFT-only MoE checkpoints (40B total, 3.7B activated). The repository includes:

  • Scripts supporting multimodal inference and chain-of-thought reasoning.
  • Example pipelines for enterprise usage (long-document QA, chart/table interpretation, summarization).
  • Fine-tuning recipes for domain adaptation leveraging RAPO.

Deployment supports context windows up to 128K tokens and requires MoE-aware runtimes (e.g., DeepSpeed MoE, Megatron-LM). Hybrid inference modes (thinking vs. non-thinking) can be toggled to suit application needs. Major enterprise scenarios include intelligent customer service, report analysis, financial/table processing, visual-document assistants, code testing/generation, and scientific QA.
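
As a rough illustration of deployment, the snippet below assumes a Hugging Face-compatible checkpoint. The model ID, the trust_remote_code requirement, and the enable_thinking chat-template switch are hypothetical and should be verified against the repository's own scripts:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Yuan-lab-LLM/Yuan3.0-Flash"  # hypothetical hub ID; see the GitHub repo

tok = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "Summarize the attached quarterly report."}]
# enable_thinking is an assumed template switch for the hybrid inference modes
prompt = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
print(tok.decode(out[0], skip_special_tokens=True))
```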

Limitations include the need for specialized hardware or optimized kernels for sparse MoE inference, a lag behind large proprietary models on certain zero-shot general reasoning tasks, and the need to tune segmentation parameters for high-resolution image inputs.

6. Significance and Prospective Directions

Yuan3.0 Flash provides a highly efficient, fully open multimodal MoE platform advancing both enterprise and general-purpose benchmarks. By merging innovative architectural design and RL-based overthinking mitigation, it approaches state-of-the-art accuracy while reducing reasoning-phase token expenditure by 50–75%. This suggests increasing accessibility and cost-effectiveness for real-world deployments, though further research may address specialized hardware support and frontier-level performance in zero-shot general tasks (ai et al., 5 Jan 2026).

