
MiniCPM: Efficient Open-Source LLM & MLLM

Updated 8 January 2026
  • MiniCPM is a family of compact language models that feature modified Transformer architectures and multimodal fusion for both text and vision tasks.
  • They employ scalable training strategies, μ-parametrization, and aggressive quantization to reduce computational resources and boost efficiency.
  • Empirical results show MiniCPM models matching or outperforming larger models on benchmarks while enabling practical deployment on cloud and edge devices.

MiniCPM is a family of open-source small LLMs (SLMs) and multimodal LLMs (MLLMs) engineered for efficiency, scalability, and broad deployment on both cloud and end-side devices. Across successive generations—MiniCPM, MiniCPM-V, MiniCPM4, and MiniCPM-V 4.5—the series advances compact Transformer architectures, scalable data/model training regimes, multi-modal fusion mechanisms, and specialized deployment strategies including aggressive quantization and sparse attention. Characterized by compute-efficient training and inference, MiniCPM models consistently match or outperform substantially larger open-source and proprietary models on standard benchmarks, while maintaining low resource footprints (Hu et al., 2024, Team et al., 9 Jun 2025, Yao et al., 2024, Yu et al., 16 Sep 2025).

1. Architectural Foundations and Model Variants

MiniCPM models are based on canonical Transformer architectures with systematic modifications to size, depth, attention type, and multimodal fusion. The foundational variants—MiniCPM-1.2B (52 layers, hidden size 1,536, FFN size 3,840, 24 query/8 KV heads) and MiniCPM-2.4B (40 layers, hidden size 2,304, FFN size 5,760, 36 query/36 KV heads)—were configured through extensive model wind-tunnel experiments, with μ-parametrization providing stable hyper-parameter transfer across scales (Hu et al., 2024).
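As a rough sanity check on these configurations, a minimal sketch (assuming a gated SwiGLU-style FFN and grouped-query attention, and ignoring embeddings, biases, and norms—assumptions not stated explicitly above) approximately recovers the reported parameter counts:

```python
def transformer_params(layers, hidden, ffn, q_heads, kv_heads):
    """Approximate non-embedding parameter count for a GQA Transformer
    with a gated (SwiGLU-style) FFN; biases and norms are ignored."""
    head_dim = hidden // q_heads
    attn = 2 * hidden * hidden                  # Q and output projections
    attn += 2 * hidden * (kv_heads * head_dim)  # shared K/V projections (GQA)
    mlp = 3 * hidden * ffn                      # gate, up, and down projections
    return layers * (attn + mlp)

print(transformer_params(52, 1536, 3840, 24, 8))    # ~1.25B (MiniCPM-1.2B)
print(transformer_params(40, 2304, 5760, 36, 36))   # ~2.44B (MiniCPM-2.4B)
```

Both results land within a few percent of the advertised 1.2B and 2.4B sizes, consistent with the counts excluding embedding parameters.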

Subsequent multimodal extensions—MiniCPM-V and MiniCPM-V-2.6—integrate visual encoders (ViT-based), Perceiver-style cross-attentive compressors, and deep LLMs (up to Llama3-Instruct 8B). MiniCPM-V-2.6 is structurally identical to MiniCPM-V with 32 Transformer layers (hidden size 4096, 32 attention heads) and interleaved self- and cross-attention over vision–language inputs. MiniCPM-V 4.5 introduces a unified 3D-Resampler for highly compressed tokenization of both images and videos, applying parameter sharing and compact cross-attention to reduce per-frame visual token cost from 64 to 21 and enable efficient multi-package encoding (Yu et al., 16 Sep 2025).
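The Perceiver-style compression idea underlying these resamplers can be sketched as a fixed set of learned queries cross-attending to a variable number of patch tokens. The weights below are random placeholders purely for illustration, not the actual 3D-Resampler parameters:

```python
import numpy as np

def resample_tokens(patch_feats, num_queries=21, seed=0):
    """Perceiver-style compressor: a fixed bank of learned queries
    cross-attends to patch tokens, yielding a fixed-length output
    regardless of how many frames/patches come in."""
    d = patch_feats.shape[-1]
    rng = np.random.default_rng(seed)
    queries = rng.normal(size=(num_queries, d))          # stand-in learned queries
    scores = queries @ patch_feats.T / np.sqrt(d)        # scaled dot-product
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over patches
    return weights @ patch_feats                         # (num_queries, d)

# e.g. 3 video frames x 196 patches each compress to 21 tokens
feats = np.random.default_rng(1).normal(size=(3 * 196, 64))
out = resample_tokens(feats)
print(out.shape)
```

The key property is that output length is decoupled from input length, which is what makes the per-frame token cost constant.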

Mixture-of-Experts (MiniCPM-MoE) and sparse-attention configurations (MiniCPM4 with InfLLM v2) further enhance efficiency: MiniCPM-MoE replaces each feed-forward layer with 8 experts, activating 2 of 8 per token, yielding a 13.6B-parameter model with a ~4B active subnet that outperforms Llama 2-34B. MiniCPM4 implements block- and semantic-kernel sparse attention for both prefilling and decoding, drastically reducing computation for long contexts (Team et al., 9 Jun 2025).
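Top-k expert routing can be sketched as follows; this is a generic simplification (linear experts, softmax over the selected gate logits), not MiniCPM-MoE's exact router:

```python
import numpy as np

def moe_ffn(x, gate_w, experts, top_k=2):
    """Top-k MoE routing: only top_k of len(experts) expert FFNs run
    per token, so active parameters stay a small fraction of the total."""
    logits = gate_w @ x                           # router scores, one per expert
    top = np.argsort(logits)[-top_k:]             # indices of the top-k experts
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()                          # renormalize over selected experts
    return sum(p * (experts[i] @ x) for p, i in zip(probs, top))

rng = np.random.default_rng(0)
d, n_experts = 4, 8
x = rng.normal(size=d)
gate_w = rng.normal(size=(n_experts, d))
experts = rng.normal(size=(n_experts, d, d))      # toy linear "experts"
y = moe_ffn(x, gate_w, experts)
```

With 2 of 8 experts active, only a quarter of the expert parameters participate in any given token's forward pass, which is how a 13.6B model keeps a ~4B active subnet.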

2. Scalable Training Strategies and Data–Model Regimes

MiniCPM’s training methodology is built upon the “model wind tunnel” strategy: hyper-parameters are searched on sub-100M models and transferred to high-capacity variants via μ-parametrization and batched scaling laws. The Warmup–Stable–Decay (WSD) scheduler partitions training into distinct exploration, stability, and annealing phases. WSD enables efficient study of data–model scaling laws, revealing a compute-optimal data-to-model ratio (D_opt/N_opt ≈ 192) far larger than the Chinchilla prescription (~20), showing that SLMs require vastly more data per parameter for optimal efficiency (Hu et al., 2024).
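The three WSD phases can be sketched as a simple schedule function; the linear anneal below is an illustrative choice (the paper also studies exponential annealing), and the phase fractions are assumed defaults, not the published hyper-parameters:

```python
def wsd_lr(step, total_steps, peak_lr, warmup_frac=0.01,
           decay_frac=0.1, min_lr_ratio=0.1):
    """Warmup-Stable-Decay: linear warmup, a long constant plateau at
    peak LR, then a short anneal toward a small floor."""
    warmup = max(int(total_steps * warmup_frac), 1)
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup:
        return peak_lr * step / warmup            # warmup phase
    if step < decay_start:
        return peak_lr                            # stable phase
    t = (step - decay_start) / (total_steps - decay_start)
    return peak_lr * (1 - t * (1 - min_lr_ratio)) # decay phase (linear anneal)
```

Because the stable phase runs at a constant learning rate, checkpoints taken before the short decay can be branched and annealed separately, which is what makes WSD convenient for sweeping data-model scaling laws cheaply.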

MiniCPM4 applies UltraClean filtering for pre-training data selection (using loss-improving “positive seed” detection and fastText classifiers), and UltraChat v2 for multi-turn, multi-capability SFT data. Reinforcement learning employs chunk-wise rollouts and GRPO-style group relative policy optimization with clipping/KL objectives for efficient sampling. BitCPM4, a ternary quantization-aware variant, transitions model weights to {-1, 0, +1} using continual QAT over two stages while maintaining competitive accuracy. ModelTunnel v2 and ScalingBench enable rapid pre-training strategy search by directly correlating scaling-bench loss with downstream accuracy, vastly reducing the resources needed for training-regime selection (Team et al., 9 Jun 2025).
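The forward pass of ternary quantization can be sketched as below (a BitNet-style mean-absolute-value scaling, used here as an assumed formulation; BitCPM4's exact scheme may differ). In QAT, the non-differentiable rounding is bypassed on the backward pass via a straight-through estimator:

```python
import numpy as np

def ternarize(w, eps=1e-8):
    """Map full-precision weights to {-1, 0, +1} with a per-tensor scale
    equal to the mean absolute weight (BitNet-style sketch)."""
    scale = np.abs(w).mean()
    q = np.clip(np.round(w / (scale + eps)), -1, 1)
    return q, scale

q, s = ternarize(np.array([0.9, -1.2, 0.01, 2.0]))
print(q)   # ternary codes; dequantized weights are q * s
```

Storing only the ternary codes plus one scale cuts weight memory to under 2 bits per parameter, at the cost of the continual QAT stages needed to recover accuracy.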

Supervised and preference fine-tuning with Direct Preference Optimization (DPO) enhances alignment and factuality. In multimodal variants, vision-language pretraining leverages staged resolution extension, high-res/OCR data mix, and joint SFT + RLAIF-V alignment.
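The DPO objective itself is compact: a logistic loss on the policy's preference margin over a frozen reference model. A minimal sketch on per-sequence summed log-probabilities:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization: -log sigmoid of beta times the
    policy's (chosen - rejected) log-prob margin, measured relative to
    the reference model's margin."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When policy and reference agree exactly, the loss sits at log 2; it falls as the policy widens its margin on the chosen response, which is the alignment pressure DPO applies without an explicit reward model.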

3. Multimodal and Extended Context Models

MiniCPM extends to MLLMs via the MiniCPM-V family, including MiniCPM-V-2.6 and MiniCPM-V 4.5. These models deploy strategies such as interleaved vision–language Transformer blocks and unified 3D-Resamplers for scalable and efficient multi-modal token fusion. The encoding pipeline compresses spatial-temporal features into fixed-length tokens, facilitating video and high-res image understanding with minimal memory usage.

The MiniCPM-128K variant removes embedding sharing, applies ABF and NTK-aware rotary position encoding, and mixes synthetic long-QA data for ultra-long context (up to 128K tokens). MiniCPM4 further exploits sparse attention for long-context processing—block-sparse partitioning and semantic-kernel selection reduce theoretical and measured FLOPs for prefilling/decoding, achieving up to 7× speedup over Qwen3-8B at 128K context (Team et al., 9 Jun 2025).
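The NTK-aware rotary-embedding adjustment mentioned above amounts to inflating the RoPE base so low-frequency bands interpolate across the longer context while high-frequency bands stay nearly intact. A sketch of the commonly used base-scaling formula (assumed here; the document does not give MiniCPM-128K's exact constants):

```python
def ntk_rope_base(base, orig_ctx, target_ctx, head_dim):
    """NTK-aware RoPE scaling: raise the rotary base by the context
    scale factor, exponent-adjusted by head dimension."""
    scale = target_ctx / orig_ctx
    return base * scale ** (head_dim / (head_dim - 2))

# e.g. extending a 4K-trained model to 128K with 64-dim heads
print(ntk_rope_base(10000, 4096, 131072, 64))
```

The exponent d/(d-2) ensures the lowest-frequency dimension is stretched by exactly the context scale factor, while higher-frequency dimensions are perturbed progressively less.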

4. Optimization for On-Device and Efficient Inference

MiniCPM places significant emphasis on mobile and efficient deployment. Aggressive quantization techniques (4-bit GGML, ternary QAT), memory and compilation optimizations, and NPU offloading enable end-to-end vision–language inference at low latency and low memory (e.g., ~3.7s encoding and ~8.2 tok/s decoding in 5GB RAM on a Xiaomi 14 Pro, Snapdragon 8 Gen 3) (Yao et al., 2024). CPM.cu, a bespoke CUDA-based inference engine, integrates sparse attention, speculative sampling (EAGLE-2, FR-Spec), and prefix-aware GPTQ for INT4 post-training quantization (Team et al., 9 Jun 2025). The full model (MiniCPM4-8B) and light model (MiniCPM4-0.5B) variants scale from ~2GB to ~15GB footprints.
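A back-of-the-envelope weight-memory estimate makes the quantization trade-off concrete (a sketch that ignores KV cache, activations, and quantization scale/zero-point overheads, which is why real footprints run somewhat higher):

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Rough weight-only memory estimate in GB (decimal)."""
    return n_params * bits_per_weight / 8 / 1e9

print(weight_memory_gb(8e9, 16))  # 8B model, FP16/BF16 -> ~16 GB
print(weight_memory_gb(8e9, 4))   # 8B model, INT4     -> ~4 GB
```

INT4 post-training quantization thus cuts an 8B model's weight storage roughly fourfold relative to 16-bit weights, which is what brings on-device footprints into the single-digit-GB range.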

MiniCPM-V models optimize for real-world usability: adaptive slicing and token compression allow any aspect ratio up to 1.8M pixels; sequential module loading and exhaustive device parameter tuning maximize throughput; offloaded visual encoding via NPU further reduces latency. These measures make GPT-4V–level multimodal performance feasible on phones and consumer hardware (Yao et al., 2024).
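The adaptive-slicing idea can be sketched as choosing a rows×cols grid whose aspect ratio best matches the input image within a slice budget. This is a simplification (MiniCPM-V's actual scoring also weighs how well each slice fills the ViT's native resolution):

```python
def choose_grid(width, height, max_slices=9):
    """Pick a rows x cols slicing grid whose aspect ratio best matches
    the image, subject to a total-slice budget."""
    best, best_err = (1, 1), float('inf')
    for rows in range(1, max_slices + 1):
        for cols in range(1, max_slices + 1):
            if rows * cols > max_slices:
                continue                          # over the slice budget
            err = abs(cols / rows - width / height)
            if err < best_err:
                best, best_err = (rows, cols), err
    return best

print(choose_grid(1344, 448))   # wide panorama -> 1 row x 3 cols
print(choose_grid(512, 512))    # square image  -> single slice
```

Each slice is then resized to the encoder's native resolution, so arbitrary aspect ratios and megapixel-scale inputs are handled without distorting the image.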

5. Empirical Results and Benchmarking

MiniCPM models routinely achieve or surpass the performance of much larger LLMs. MiniCPM-2.4B beats Mistral-7B and Llama2-13B on Chinese benchmarks and outperforms all SLMs (<7B) on general language and math/code tasks. MiniCPM-DPO exceeds ChatGLM2-6B and Mistral-7B-Instruct on QA and chat tasks (Hu et al., 2024). MiniCPM-MoE’s 13.6B-parameter model matches Llama2-34B on MMLU, CEval, and GSM8K.

MiniCPM-V 2.5 exceeds GPT-4V-1106 and Gemini Pro on OpenCompass (65.1 vs. 63.5/62.9), with lower hallucination rates and higher OCR accuracy (OCRBench 725 vs. 645/680). MiniCPM-V 4.5 achieves 77.0 on OpenCompass and 67.9 on Video-MME, with only 46.7% GPU memory and 8.7% inference time of Qwen2.5-VL 7B (Yu et al., 16 Sep 2025). Hybrid RL, unified doc/OCR objectives, and 3D-Resampler compression drive these results.

Contrastive fine-tuning turns MiniCPM into a high-quality sentence embedder: InfoNCE-based fine-tuning on NLI triplets yields an average 56.33 percentage point gain over strong baselines in nine STS tasks. MiniCPM’s deeper architecture and NLI-aligned pretraining enhance its contrastive responsiveness over peers such as Gemma and Phi-2 (Ukarapol et al., 2024).
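The InfoNCE objective used for this contrastive fine-tuning can be sketched over a batch similarity matrix whose diagonal holds the positive (e.g., NLI entailment) pairs and whose off-diagonals serve as in-batch negatives; the temperature value is an assumed common default:

```python
import math

def info_nce(sim, temperature=0.05):
    """InfoNCE over a similarity matrix: each anchor's positive sits on
    the diagonal, the rest of the row acts as in-batch negatives."""
    n, total = len(sim), 0.0
    for i in range(n):
        logits = [s / temperature for s in sim[i]]
        m = max(logits)                                   # log-sum-exp stabilizer
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]                        # -log softmax(positive)
    return total / n
```

The loss shrinks as positive-pair similarities pull away from the negatives, which is exactly the behavior that turns a generative LM's representations into usable sentence embeddings.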

MiniCPM4’s lightweight and full variants outperform Qwen3-0.6B and match/exceed Phi-4-14B, Gemma3-12B, and larger LLMs on MMLU, CMMLU, CEval, BBH, GSM8K, even with only 8T training tokens (vs. 36T for Qwen3) (Team et al., 9 Jun 2025).

6. Task Decomposition, Training Recipes, and Meta-Evaluation

MiniCPM-V-2.6 and related models introduce a task-decomposed evaluation framework for text-to-image generation. The evaluation task is decoupled into fine-grained sub-objectives—content extraction, individual question answering, explanation, scoring, and summary rationale—facilitating distilled training from commercial MLLMs like GPT-4o into a 7B open-source multimodal Transformer. Each sub-task is associated with a cross-entropy loss, sampled uniformly and aggregated for total optimization.

A meta-evaluation benchmark with chain-of-thought explanations and expert-annotated scores is provided. Experimental results show MiniCPM-V-2.6 fine-tuned on GPT-4o-distilled data outperforms the GPT-4o base by over 4.6% in Spearman and Kendall correlation with human judgments. Ablations demonstrate task decomposition and explanation/score separation are critical: merging caption/answer/explanation drops correlation ρ from 0.505→0.404, and omitting stages reduces agreement by 15–30% (Tu et al., 2024).
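The Spearman agreement metric used in this meta-evaluation is just Pearson correlation computed on ranks; a minimal sketch (without tie handling, which full implementations such as SciPy's add):

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation between two score lists (no ties),
    as used to compare model scores against human judgments."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank + 1.0
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var = (sum((a - mx) ** 2 for a in rx) *
           sum((b - my) ** 2 for b in ry)) ** 0.5
    return cov / var
```

Because it compares rankings rather than raw values, the metric rewards a judge model that orders images the same way humans do, even if its absolute scores are calibrated differently.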

7. Applications, Limitations, and Extension Directions

MiniCPM facilitates broad applications: trustworthy survey generation (plan–retrieve–write via RL), tool use via Model Context Protocol (MCP), factual QA, long-context reasoning, and robust OCR/document understanding. MiniCPM4-Survey matches OpenAI DeepResearch in relevance and depth, with doubled fact precision; MiniCPM4-MCP attains name/param/value accuracy exceeding GPT-4o. Multilingual tuning extends MiniCPM-V to 30+ languages—e.g., French 72.7, Japanese 88.0 post-SFT (Yao et al., 2024).

Documented limitations include reduced relational consistency due to data scarcity, sensitivity to extreme image quality, and remaining domain shifts in biomedical STS evaluation. Future directions focus on extending memory-efficient architectures to larger models, integrating continuous context adaptation, exploring multi-lingual and two-tower encoders, and broadening retrieval-augmented and multi-modal deployments.

The MiniCPM family, through systematic innovations in model design, scalable training, multimodal fusion, and aggressive optimization, defines efficient LLM and MLLM deployment for resource-constrained environments—serving both as state-of-the-art small-model agents for practical applications and as wind-tunnel proxies guiding next-generation LLM research (Hu et al., 2024, Team et al., 9 Jun 2025, Tu et al., 2024, Yao et al., 2024, Yu et al., 16 Sep 2025, Ukarapol et al., 2024).
