MindVL: Towards Efficient and Effective Training of Multimodal Large Language Models on Ascend NPUs (2509.11662v1)

Published 15 Sep 2025 in cs.CV, cs.AI, cs.CL, and eess.IV

Abstract: We propose MindVL, a multimodal large language model trained on Ascend NPUs. Similar to Qwen2.5-VL, MindVL adopts native-resolution Vision Transformers, which enables it to process images at their original variable resolutions. This design avoids the degradation caused by fixed-resolution tiling while preserving fine-grained details and global layouts, which is crucial for visually dense content such as complex charts and diagrams. To ensure the smooth training of MindVL on Ascend NPUs, we develop Mindspeed-MLLM, a distributed multimodal training framework tailored for Ascend NPUs. To maintain training accuracy, we implement equivalent replacements for certain operators. MindVL undergoes a three-phase training process, namely the warm-up phase, multitask training phase, and supervised instruction tuning phase, to gradually enhance its capabilities. This process starts with basic visual and multimodal pre-training, followed by large-scale multitask training and instruction tuning. We also adopt multimodal data packaging and hybrid parallelism techniques, which significantly improve end-to-end training speed. To further boost model performance, we specifically introduce test-time resolution search and model weight averaging. Notably, despite using about 1/10 of the training data required by Qwen2.5-VL, MindVL achieves performance on par with Qwen2.5-VL in evaluations of general multimodal understanding and document/table comprehension. Beyond overall scores, MindVL also delivers leading performance in OCR assessments.

Summary

  • The paper presents MindVL, a multimodal LLM trained efficiently and effectively on Ascend NPUs using a native-resolution ViT architecture and a three-phase training pipeline.
  • It introduces MindSpeed-MLLM, a distributed framework that employs operator fusion and system-level scheduling to optimize multimodal data processing.
  • Empirical results show leading OCR performance and parity with Qwen2.5-VL on general multimodal and document/table benchmarks, while using roughly one-tenth of its training data.

MindVL: Efficient Multimodal LLM Training on Ascend NPUs

Introduction

The paper presents MindVL, a multimodal LLM (MLLM) designed for efficient and effective training on Ascend NPUs. MindVL leverages a native-resolution Vision Transformer (ViT) architecture, enabling direct processing of images at their original resolutions and aspect ratios. This approach circumvents the limitations of fixed-resolution tiling, preserving both fine-grained details and global spatial layouts, which is critical for tasks involving visually dense content such as charts, tables, and documents. The work addresses two major gaps in the current MLLM landscape: the lack of high-performance models trained on non-NVIDIA hardware and the challenge of achieving competitive performance with reduced training data.

Model Architecture and Training Pipeline

MindVL's architecture consists of three primary components: a native-resolution ViT visual encoder, a two-layer MLP projector, and an LLM backbone. The ViT employs 2D RoPE positional encoding and processes images by patching with a stride of 14, followed by spatial grouping and projection into the LLM embedding space. The visual encoder is initialized from the Qwen2.5-VL ViT weights, while the LLM backbone uses Qwen3 or DeepSeek-V3-0324.
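
To make the visual front end concrete, here is a minimal PyTorch sketch of native-resolution patching with a stride of 14, 2x2 spatial grouping, and a two-layer MLP projector. This is an illustration under stated assumptions: the module names, hidden sizes, and the 2x2 merge factor follow the Qwen2.5-VL convention rather than the paper, and 2D RoPE plus the transformer blocks themselves are omitted.

```python
# Illustrative sketch only: dims and the 2x2 merge factor are assumptions
# (Qwen2.5-VL-style); 2D RoPE and the ViT blocks are omitted.
import torch
import torch.nn as nn

PATCH = 14   # patch stride reported in the paper
MERGE = 2    # 2x2 spatial grouping (assumed, following Qwen2.5-VL)

class PatchEmbed(nn.Module):
    """Splits an image of arbitrary (H, W) into 14x14 patches and embeds them."""
    def __init__(self, in_ch=3, vit_dim=1280):
        super().__init__()
        # Conv2d with kernel_size == stride is a linear map applied per patch.
        self.proj = nn.Conv2d(in_ch, vit_dim, kernel_size=PATCH, stride=PATCH)

    def forward(self, img):                     # img: (3, H, W), sides multiples of 28
        x = self.proj(img.unsqueeze(0))         # (1, vit_dim, H/14, W/14)
        return x.flatten(2).transpose(1, 2)     # (1, num_patches, vit_dim)

class PatchMergerMLP(nn.Module):
    """Groups 2x2 neighbouring patch tokens and projects them to the LLM width."""
    def __init__(self, vit_dim=1280, llm_dim=4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vit_dim * MERGE * MERGE, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, tokens, grid_h, grid_w):  # tokens: (1, grid_h*grid_w, vit_dim)
        d = tokens.shape[-1]
        x = tokens.view(1, grid_h // MERGE, MERGE, grid_w // MERGE, MERGE, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(1, -1, MERGE * MERGE * d)
        return self.mlp(x)                      # (1, grid_h*grid_w/4, llm_dim)

# Usage: any resolution whose sides are multiples of 28 (= PATCH * MERGE) works directly.
img = torch.randn(3, 28 * 16, 28 * 9)           # a tall, non-square page image
tokens = PatchEmbed()(img)
visual_inputs = PatchMergerMLP()(tokens, img.shape[1] // PATCH, img.shape[2] // PATCH)
```

Because the patch grid adapts to each image, no tiling or resizing to a fixed square is required, which is what preserves layout information for charts and documents.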

The training pipeline is structured into three phases:

  1. Warm-up Phase: Only the MLP projector (adapter) is trained to align the ViT and LLM, using curated image captions, visual grounding, OCR, and STEM data (see the freezing sketch after this list).
  2. Multitask Training Phase: All model parameters are unfrozen, and the model is exposed to diverse multimodal data, including interleaved image-text, video, VQA, mathematics, and agent-based tasks.
  3. Supervised Fine-tuning (SFT) Phase: Targeted instruction optimization is performed to bridge the gap between pretraining and downstream task requirements.
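
The practical difference between the phases is which parameters receive gradients. The sketch below shows one way to express that schedule; the parameter prefix "projector." is a hypothetical module name, not taken from the paper.

```python
# Hypothetical per-phase trainability schedule for the three phases above.
# The "projector." prefix is illustrative; adapt it to the actual module names.
def set_trainable(model, phase: str) -> None:
    unfreeze_all = phase in ("multitask", "sft")   # phases 2 and 3 train everything
    for name, param in model.named_parameters():
        if name.startswith("projector."):          # phase 1 trains only the MLP projector
            param.requires_grad = True
        else:
            param.requires_grad = unfreeze_all

# Rebuild the optimizer after each call so it only tracks trainable parameters, e.g.:
# set_trainable(model, "warmup")
# optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-5)
```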

The data curation process involves extensive filtering, clustering, and annotation to maximize diversity and relevance, particularly for long-tail visual concepts and text-rich content.

Ascend NPU Training Infrastructure: MindSpeed-MLLM

To facilitate efficient training on Ascend NPUs, the authors introduce MindSpeed-MLLM, a distributed multimodal training framework. MindSpeed-MLLM integrates optimizations from MindSpeed-Core, MindSpeed-LLM, and MindSpeed-MM, with additional enhancements for multimodal data processing, operator fusion, and system-level scheduling.

Key features include:

  • Distributed Multi-Modal Data Loader: Supports group-wise data loading and online packing to minimize I/O bottlenecks and balance pipeline stages (a packing sketch follows this list).
  • Operator Fusion Replacement: Implements hardware-friendly fused operators (e.g., RMSNorm, SoftMax, FlashAttention) and replaces inefficient Conv operations with Matmul (an equivalence sketch appears after Figure 1 below).
  • System-Level Scheduling: Employs fine-grained core binding and operator deployment overlap to reduce NUMA overhead and improve throughput.
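
As a rough illustration of the online packing performed by the data loader (first bullet above), the following sketch greedily groups variable-length samples into fixed-length packs. The length cap and the truncation policy are assumptions, and the attention-mask/position-id resets at sample boundaries are omitted; this is not the MindSpeed-MLLM API.

```python
# Illustrative greedy online packing; not the MindSpeed-MLLM API.
from typing import Iterable, Iterator, List

def pack_online(samples: Iterable[List[int]], max_len: int = 8192) -> Iterator[List[List[int]]]:
    """Group token sequences so each pack's total length fits within max_len."""
    pack: List[List[int]] = []
    used = 0
    for tokens in samples:
        tokens = tokens[:max_len]                # truncation policy is an assumption
        if used + len(tokens) > max_len and pack:
            yield pack                           # current pack is full: emit it
            pack, used = [], 0
        pack.append(tokens)
        used += len(tokens)
    if pack:
        yield pack

# Each emitted pack is concatenated into one training sequence; attention masks and
# position ids must be reset at sample boundaries so packed samples stay independent.
```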

Precision alignment experiments demonstrate that MindSpeed-MLLM achieves loss curves and benchmark results comparable to mainstream NVIDIA-based frameworks, with mean absolute and relative errors within acceptable bounds.

Figure 1: The hierarchical architecture of MindSpeed-MLLM, integrating core optimizations and multimodal data processing for Ascend NPUs.
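
To make the Conv-to-Matmul replacement from the operator list above concrete, the sketch below checks the mathematical equivalence for the patch-embedding case (kernel size equals stride), where a convolution reduces to a single GEMM over flattened patches. The shapes are illustrative and do not reflect the framework's actual operator signatures.

```python
# Equivalence check: Conv2d with kernel_size == stride vs. unfold + matmul.
import torch
import torch.nn as nn
import torch.nn.functional as F

patch, in_ch, dim = 14, 3, 1280
conv = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch, bias=False)
img = torch.randn(1, in_ch, 224, 224)

# Convolution formulation.
y_conv = conv(img).flatten(2).transpose(1, 2)                 # (1, 256, dim)

# Matmul formulation: extract patches, then one GEMM with the same weights.
patches = F.unfold(img, kernel_size=patch, stride=patch)      # (1, in_ch*14*14, 256)
weight = conv.weight.reshape(dim, -1)                         # (dim, in_ch*14*14)
y_matmul = (weight @ patches).transpose(1, 2)                 # (1, 256, dim)

assert torch.allclose(y_conv, y_matmul, atol=1e-4)
```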

Data Efficiency and Model Enhancement Techniques

MindVL achieves competitive performance using only 447B training tokens—approximately one-tenth the data required by Qwen2.5-VL. This is enabled by two key strategies:

  • Model Weight Averaging: Weights from models trained with different sequence lengths and image resolutions are averaged, enhancing robustness and generalization.
  • Test-Time Resolution Search: A grid search over input image resolutions is performed during inference, dynamically selecting optimal scaling parameters to mitigate distributional shifts between training and test data. Both techniques are sketched after Figure 2 below.

Figure 2: Benchmark performance of MindVL and its counterparts, highlighting MindVL's data efficiency and competitive results.
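
A minimal sketch of both techniques follows, under stated assumptions: the averaging is a plain arithmetic mean over checkpoints with identical architectures, and evaluate() is a hypothetical callback that scores the model on a validation set for a given image-resolution cap. Neither function reflects the paper's actual implementation.

```python
# Illustrative sketches; function names and the evaluate() callback are hypothetical.
import copy
import torch

def average_checkpoints(state_dicts):
    """Arithmetic mean of parameters from checkpoints trained with different
    sequence lengths / image resolutions (architectures must match)."""
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts], dim=0).mean(dim=0)
    return avg

def resolution_search(evaluate, candidate_max_pixels, val_set):
    """Grid search over the test-time image-resolution cap; keep the best score."""
    best_pixels, best_score = None, float("-inf")
    for max_pixels in candidate_max_pixels:
        score = evaluate(max_pixels, val_set)     # e.g., accuracy on a held-out split
        if score > best_score:
            best_pixels, best_score = max_pixels, score
    return best_pixels, best_score

# Example (the candidate caps are illustrative values, in pixels):
# best_cap, _ = resolution_search(my_eval_fn, [768**2, 1024**2, 1536**2], val_set)
```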

Experimental Results

MindVL is evaluated on a suite of multimodal benchmarks, including MMBench, MME, OCRBench, DocVQA, ChartQA, and InfoVQA. Despite its reduced training data budget, the model matches or outperforms leading open-source models such as Qwen2.5-VL 7B, GLM-4.1V 9B, Keye-VL 8B, and InternVL3.5 8B. Notably, MindVL achieves state-of-the-art results on OCR-related tasks, demonstrating superior handling of text-rich visual content.

Ablation studies confirm the importance of interleaved image-text data and multitask training for overall performance. The three-phase training pipeline yields the best results, and model merging across sequence lengths further improves accuracy and robustness to input resolution variations.

Figure 3: Box plots of accuracies for different models across varying input image resolutions, illustrating MindVL's robustness and superior median scores.

Figure 4: Evaluation results on MME with different input image resolutions, showing consistent performance across scaling parameters.

Figure 5: Evaluation results on OCRBench with different input image resolutions, highlighting the impact of resolution search on OCR accuracy.

Figure 6: Evaluation results on DocVQA with different input image resolutions, demonstrating stable performance for high-resolution inputs.

Figure 7: Evaluation results on InfoVQA with different input image resolutions, confirming optimal inference settings for infographic tasks.

Training Framework Validation

Loss curves on public and in-house datasets validate the precision and stability of MindSpeed-MLLM. The framework achieves near-perfect overlap in loss values with MindSpeed-MM when data loading is standardized, and maintains error margins within ±1.5 points on general benchmarks compared to LLaMA-Factory.

Figure 8: Comparison of loss values among various training tools and platforms on the COCO dataset, demonstrating alignment of MindSpeed-MLLM with established frameworks.

Figure 9: Loss decline trend on the in-house slow thinking dataset, confirming training stability.

Figure 10: Loss difference between MindSpeed-MLLM and MindSpeed-MM, showing minimal discrepancies attributable to fused operator randomness.
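
For reference, the kind of discrepancy metric behind these comparisons can be expressed as below. This is a generic sketch of mean absolute and mean relative error between two aligned loss curves, not the paper's exact evaluation script.

```python
# Generic loss-curve comparison (illustrative).
import numpy as np

def loss_curve_errors(loss_a, loss_b):
    """Mean absolute error and mean relative error between two aligned loss curves."""
    a, b = np.asarray(loss_a, dtype=float), np.asarray(loss_b, dtype=float)
    diff = np.abs(a - b)
    mae = diff.mean()
    mre = (diff / np.maximum(np.abs(b), 1e-8)).mean()
    return mae, mre
```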

Limitations and Future Directions

The current MindVL implementation is constrained by computational resources, limiting exploration of smaller and larger model variants and Mixture-of-Experts architectures. Data limitations preclude trillion-token scale experiments and comprehensive evaluation across grounding, reasoning, agent-based, and language-focused tasks. Preliminary results with MindVL-DeepSeek-V3 (671B MoE backbone) indicate promising scaling efficiency, with strong performance using only 26B tokens.

Future work will focus on scaling model size and data, expanding evaluation scope, and further enhancing multimodal capabilities, particularly for resource-constrained and edge deployment scenarios.

Conclusion

MindVL demonstrates that high-performance multimodal LLMs can be efficiently trained on Ascend NPUs, achieving competitive results with significantly reduced data budgets. The integration of native-resolution ViT architectures, advanced data curation, model weight averaging, and test-time resolution search collectively contribute to its robustness and versatility. MindSpeed-MLLM provides a validated, high-performance training framework for the Ascend ecosystem, enabling broader accessibility and scalability of MLLMs beyond NVIDIA hardware. The findings have practical implications for cost-effective model development and deployment in diverse hardware environments, and set the stage for future research in efficient multimodal learning and hardware-aware optimization.
