- The paper introduces JoyAI-LLM Flash, a mid-scale sparse MoE LLM that combines a top-8 gating mechanism with multi-stage training for high token efficiency.
- It demonstrates a comprehensive pretraining approach using both real and synthetic data, yielding competitive performance in reasoning, math, and coding tasks.
- The paper presents a novel RL alignment algorithm, FiberPO, which ensures stable compositional trust-region regulation and enhances long-context inference.
Detailed Summary of “JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency” (2604.03044)
Model Architecture and Pretraining Paradigm
JoyAI-LLM Flash is a sparse Mixture-of-Experts (MoE) LLM optimized for the sub-50B parameter regime, with 48B total parameters and only 2.7B activated per forward pass. The architecture augments a standard attention-based backbone with a Top-8 gating mechanism over 256 routed experts plus one shared expert, yielding a higher sparsity ratio than contemporary models at equivalent scale. The micro-architecture draws from DeepSeek-V3 and Kimi-K2, employing Multi-head Latent Attention (MLA), RMSNorm, RoPE, and SwiGLU activations. Robustness and convergence in large-scale optimization are achieved with the Muon optimizer, whose spectrally normalized matrix updates address instabilities seen in Adam-based alternatives.
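As a point of reference, top-k expert routing of the kind described above can be sketched in a few lines. The NumPy code below is illustrative only; the dimensions and router details are assumptions, not the paper's implementation.

```python
import numpy as np

def topk_route(hidden, gate_w, k=8):
    """Return (indices, weights) of the top-k routed experts per token.

    hidden: (tokens, d_model) activations
    gate_w: (d_model, n_experts) router projection
    """
    logits = hidden @ gate_w                          # (tokens, n_experts)
    top_idx = np.argpartition(logits, -k, axis=-1)[:, -k:]
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    # softmax over only the selected experts, as in most top-k routers
    top_logits -= top_logits.max(axis=-1, keepdims=True)
    weights = np.exp(top_logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return top_idx, weights

rng = np.random.default_rng(0)
h = rng.standard_normal((4, 64))     # 4 tokens, toy d_model=64
w = rng.standard_normal((64, 256))   # 256 routed experts
idx, wts = topk_route(h, w)          # each token picks 8 of 256 experts
```

The shared expert would simply be applied to every token in addition to the eight routed ones, so it carries no gating logits.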
Pretraining leverages a corpus of 20.7 trillion diversified tokens, partitioned into four stages: foundational linguistic exposure, code-math enhancement, ultra-high-quality refinement, and long-context extension (up to 128K context length). Data pipelines integrate aggressive rule-based and model-based quality filtering, advanced semantic safety screening, distributed MinHash-LSH deduplication, and staged synthetic augmentation (MAGA reformulation, Nemotron-CC QA synthesis, STEM solution generation, agentic trajectory distillation). Notably, synthetic data forms over 60% of mid-stage training tokens, prioritizing multi-step reasoning and agentic task learning. The curriculum incorporates both real-world and synthetic traces, balancing broad coverage and deep domain expertise.
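The MinHash-LSH deduplication step mentioned above can be illustrated with a minimal single-process sketch; the band and row counts here are arbitrary choices, not the pipeline's actual (distributed) configuration.

```python
import hashlib
from collections import defaultdict

def shingles(text, n=3):
    """Word n-gram shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash(shingle_set, num_perm=64):
    """MinHash signature: per seed, the minimum hash over all shingles."""
    return [min(int(hashlib.sha1(f"{seed}:{s}".encode()).hexdigest(), 16)
                for s in shingle_set)
            for seed in range(num_perm)]

def lsh_buckets(docs, bands=16, rows=4):
    """Group documents whose signatures agree on any full band."""
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        sig = minhash(shingles(text), num_perm=bands * rows)
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    return {k: v for k, v in buckets.items() if len(v) > 1}

docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy dog",   # exact duplicate
    "c": "sparse mixture of experts models route tokens to experts",
}
pairs = lsh_buckets(docs)   # "a" and "b" collide in every band
```

Exact duplicates collide in every band; near-duplicates collide in some band with probability rising steeply with their Jaccard similarity, which is what makes the scheme tractable at trillion-token scale.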
Empirical evaluation across general knowledge (MMLU, MMLU-Pro, CMMLU), math reasoning (GSM8K, MATH, MATH-500), coding (HumanEval, LiveCodeBench), and long-context benchmarks (RULER) demonstrates competitive or superior performance, particularly in reasoning and math, versus Qwen3-30B-A3B and Qwen3.5-35B-A3B baselines. Scaling laws from prior work inform architecture and data scaling, with observed alignment between theoretical predictions and empirical training curves.
Post-Training and Alignment Protocols
JoyAI-LLM Flash distinguishes itself with a multi-stage post-training pipeline consisting of supervised fine-tuning (SFT), Direct Preference Optimization (DPO), and domain-diverse Reinforcement Learning (RL). SFT employs a dynamically weighted mixture of cognitive modes (“thinking” and “non-thinking”), heavily upweights coding and agentic traces, and packs sequences to 128K context for maximal compute utilization. The SFT corpus spans mathematics, coding, agentic reasoning, tool use, safety, theorem proving, creative tasks, and multilingual dialogues.
DPO aligns outputs by training on curated preference pairs derived from SFT failure modes (hallucination, instruction deviation), efficiently penalizing undesirable responses and rapidly converging to improved alignment prior to RL.
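The DPO objective itself is the standard one; the contribution described here is the curation of preference pairs from SFT failure modes. A minimal sketch of the loss on one pair (beta value is illustrative):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Negative log-sigmoid of the beta-scaled implicit reward margin.

    Each argument is a sequence log-probability; the reference policy
    terms keep the policy anchored to the SFT model.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

base = dpo_loss(0.0, 0.0, 0.0, 0.0)   # zero margin: loss = ln 2
```

As the policy assigns relatively more mass to the chosen response than the reference does, the margin grows and the loss falls toward zero, which is why DPO converges quickly on cleanly separated pairs.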
For RL-based alignment, the paper introduces FiberPO, an RL algorithm anchored in fiber bundle theory that decomposes trust-region maintenance into trajectory-level (global) and token-level (local) components. FiberPO provides compositional multi-scale stability control, provably achieving first-order fidelity to the true RL objective near the on-policy regime along with a restorative gradient structure. Unlike PPO, GRPO, or GSPO objectives, FiberPO’s two-scale decomposition preserves token-level discriminative gradients even under substantial trajectory drift, eliminating the degenerate compression and collapse observed in baseline RL methods. Empirical results on math RLVR benchmarks show monotonic accuracy improvements, entropy preservation, and validation-accuracy gains for FiberPO over GRPO and GSPO, with corresponding reductions in mean response length.
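FiberPO's exact objective is defined in the paper. As background for its two scales, the sketch below shows the token-level and length-normalized trajectory-level importance ratios (the quantities that GRPO-style and GSPO-style objectives respectively clip) together with a generic PPO-style clipped surrogate. This is illustrative scaffolding for the baselines, not FiberPO itself.

```python
import numpy as np

def token_ratios(logp_new, logp_old):
    """Per-token importance ratios r_t, the quantity GRPO-style objectives clip."""
    return np.exp(logp_new - logp_old)

def sequence_ratio(logp_new, logp_old):
    """Length-normalized trajectory-level ratio, as clipped by GSPO-style objectives."""
    return np.exp((logp_new - logp_old).mean())

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-style pessimistic clipped surrogate for one ratio/advantage pair."""
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1 - eps, 1 + eps) * advantage)

logp_new = np.array([-1.0, -2.0])   # toy per-token log-probs, current policy
logp_old = np.array([-1.0, -1.0])   # same tokens under the rollout policy
r_tok = token_ratios(logp_new, logp_old)     # [1.0, exp(-1)]
r_seq = sequence_ratio(logp_new, logp_old)   # exp(-0.5)
```

Clipping only at the token scale ignores trajectory drift, while clipping only at the sequence scale averages away per-token signal; a two-scale decomposition aims to regulate both quantities at once.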
Multi-domain RL extension, leveraging domain-balanced curriculum sampling, demonstrates inherent generalization capabilities: FiberPO reallocates trajectory-level trust regions across domain-heterogeneous batches, retaining existing skill sets while optimizing across new environments. Catastrophic forgetting is mitigated without domain-specific tuning.
Inference Efficiency: Quantization and Multi-Token Prediction
Inference throughput is optimized via Quantization-Aware Training (QAT), Post-Training Quantization (PTQ), and a dense Multi-Token Prediction (MTP) head. QAT integrates simulated INT4 quantization, employing the Straight-Through Estimator for stable gradients, and achieves robust rollouts even under aggressive bit-width reduction. PTQ experiments with vLLM and TRT-LLM demonstrate significant throughput gains (up to 28% in FP8/W4AFP8 formats) with negligible accuracy loss, outperforming Qwen3-30B-A3B baselines despite larger model weights.
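Simulated ("fake") quantization in QAT typically looks like the sketch below; symmetric per-tensor scaling is an assumption for illustration (per-channel or per-block scaling is common in practice). The forward pass rounds to the INT4 grid, while training would bypass the rounding in the backward pass via the Straight-Through Estimator.

```python
import numpy as np

def fake_quant_int4(w):
    """Simulate symmetric INT4 quantization in the forward pass.

    In QAT the backward pass treats round() as identity (Straight-Through
    Estimator), so gradients flow to the underlying full-precision w.
    """
    qmax = 7                                     # signed INT4 range [-8, 7]
    max_abs = np.abs(w).max()
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -8, qmax)   # snap to the integer grid
    return q * scale                             # dequantized weights

w_hat = fake_quant_int4(np.array([0.1, -0.5, 0.7]))
```

Because the model trains against its own quantization noise, the final weights sit close to the INT4 grid, which is what makes aggressive bit-width reduction survivable at rollout time.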
The released GGUF variants, with a novel DoubleQuant strategy, partition weight matrices into blocks for block-wise and global quantization, storing quantized scales at reduced precision. This enables high-fidelity inference on edge devices with effective accuracy retention.
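A double-quantization scheme of the kind described usually quantizes the per-block scales themselves against a single global scale, so only integers plus one float need to be stored. The block size and precisions below are assumptions for illustration, not the released format.

```python
import numpy as np

def double_quant(w, block=32):
    """Block-wise INT4 weights whose per-block FP scales are themselves
    quantized to INT8 against one global scale (illustrative scheme)."""
    blocks = w.reshape(-1, block)
    scales = np.abs(blocks).max(axis=1) / 7            # per-block FP scales
    q = np.clip(np.round(blocks / scales[:, None]), -8, 7)
    g = scales.max() / 127                             # global scale-of-scales
    q_scales = np.round(scales / g)                    # INT8-quantized scales
    return q.astype(np.int8), q_scales.astype(np.int8), g

def dequant(q, q_scales, g, shape):
    """Reconstruct approximate weights from both quantization levels."""
    return (q * (q_scales[:, None].astype(np.float64) * g)).reshape(shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 64))
q, q_scales, g = double_quant(w)
w_hat = dequant(q, q_scales, g, w.shape)
```

Storing scales at reduced precision shaves the per-weight overhead from scales (here, 32 bits down to 8 per block) at the cost of a small second-order quantization error, which is the trade-off that makes such formats attractive on edge devices.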
JoyAI-LLM Flash’s dense MTP head achieves a speedup of 1.87× on speculative decoding benchmarks, surpassing models like GLM-5 and Step-3.5-Flash. Joint MTP-quantization optimization further boosts throughput (up to 1.96× in W4AFP8 formats), albeit with diminishing returns at high concurrency levels due to computational overhead. Real-world inference workloads are evaluated under both short-context and long-context scenarios, with deployment recommendations for intra-node aggregation, dynamic scaling, KV cache management trade-offs, and prefix reuse strategies.
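Speedups of this kind can be sanity-checked with the usual back-of-envelope model for speculative decoding: if the MTP head drafts k tokens, each accepted independently with probability a, the expected tokens emitted per verification pass is a truncated geometric sum. The numbers below are illustrative, not the paper's measurement methodology.

```python
def expected_tokens_per_verify(k, a):
    """Expected tokens emitted per target-model pass when the draft head
    proposes k tokens, each accepted i.i.d. with probability a:
    1 + a + a^2 + ... + a^k (the target pass always yields one token)."""
    return sum(a ** i for i in range(k + 1))

rate = expected_tokens_per_verify(3, 0.8)   # tokens per pass at 80% acceptance
```

With three drafted tokens at 80% acceptance, each target pass yields roughly 2.95 tokens on average; the realized wall-clock speedup is lower once draft-head and verification overheads are charged, consistent with the diminishing returns reported at high concurrency.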
Numerical Results and Token Efficiency
Comprehensive evaluation across open benchmarks (MMLU, HellaSwag, GPQA-Diamond, SuperGPQA, MATH-500, LiveCodeBench, SWE-bench, AlignBench, IFEval, RULER, LiveBench, τ²-Bench, PinchBench) highlights substantially improved token efficiency. For example, on LiveCodeBench the model scores 2.4% higher than GLM-4.7-Flash-Thinking while using 85% fewer tokens. Conversely, on PinchBench, JoyAI-LLM Flash achieves best-in-class accuracy despite consuming more tokens, underscoring its capability for high-fidelity long-context reasoning. Across quantization and MTP variants, throughput and accuracy metrics consistently surpass baselines with favorable accuracy-throughput trade-offs.
Practical and Theoretical Implications
JoyAI-LLM Flash establishes a new baseline for token-efficient, sparse inference in mid-scale LLMs. The FiberPO algorithm’s compositional trust-region regulation eliminates structural weaknesses in prior RL approaches, affording reliable scaling in heterogeneous, multi-domain, and agentic environments. The training–inference co-design (QAT, MTP, DoubleQuant) enables broad deployment, including edge and consumer-grade settings, without sacrificing high-level performance. The open-source release of model checkpoints across quantization formats supports reproducibility and adoption.
Theoretical implications include a deeper understanding of multi-scale RL stability via fiber bundle decompositions, with practical ramifications for future agentic LLMs as alignment objectives and deployment constraints become more complex.
Outlook and Future Directions
The authors propose future research aimed at integrating continual learning and persistent memory to empower dynamic adaptation and retention in LLMs. This trajectory aligns with increasing application demands for open-ended agentic models capable of robust alignment, efficient reasoning, and modular tool integration. Further exploration of hierarchical policy optimization, scaling laws in sparse architectures, and advanced low-bit quantization strategies is anticipated to drive new developments in efficient, general-purpose LLMs.