MindSpeed-MLLM: Distributed Multimodal Training

Updated 13 February 2026

MindSpeed-MLLM is a distributed multimodal training framework that integrates hybrid parallelism, operator equivalence, and advanced data packaging for efficient vision-language model development.
It employs a modular structure with dedicated libraries (MindSpeed-Core, MindSpeed-LLM, MindSpeed-MM) to manage text and image data streams using tensor, pipeline, and data parallelism.
The framework achieves up to 2.2× faster throughput and strong scaling on Ascend NPUs, ensuring memory-efficient pre-training and accurate supervised tuning.

MindSpeed-MLLM is a distributed multimodal training framework developed specifically for efficient and accurate training of large-scale vision-language models on Ascend NPUs, as exemplified by its central role in the MindVL multimodal LLM (MLLM). By integrating hybrid parallelism, operator-equivalent replacements for Ascend hardware, multimodal data packaging, and system-level optimizations, MindSpeed-MLLM provides a foundation for high-throughput, memory-efficient pre-training and supervised instruction tuning of models that process both text and visually dense content. Its design draws heavily on Megatron-LM but extends and adapts the core methodologies to fit the federated, heterogeneous compute environment of Ascend 910B devices (Chen et al., 15 Sep 2025).

1. System Architecture and Component Hierarchy

MindSpeed-MLLM is structured as a modular multimodal layer interfacing three specialized libraries:

MindSpeed-Core (derived from Megatron-LM): backbone for hybrid parallelism, memory partitioning, and scheduling.
MindSpeed-LLM: language modeling optimizations, including distributed transformer and optimizer adaptations.
MindSpeed-MM: vision-specific model optimizations tailored for image patching and embedding.

Key architectural components:

Distributed Multi-Modal Data Loader: Synchronizes partitioned text/visual data streams for efficient parallel input.
Hybrid Parallel Engine (Megatron 3D): Integrates tensor parallelism, pipeline parallelism, and data parallelism across NPUs.
Operator Fusion & Equivalence Layer: Implements hardware-specific operator replacements to maintain numerical correctness and efficient compute on Ascend hardware.
System Scheduler & Runtime Tweaks: Dynamically adapts compute, memory, and communication strategies based on real-time profiling.

The logical hierarchy of these components was depicted in a diagram (text description) that is not reproduced here.

A central memory-partitioning constraint governs allocation per device:

$M_{\text{model\_shard}} = \frac{M_{\text{total\_params}}}{P_{\text{tensor}}} + M_{\text{activations\_per\_microbatch}}$

where $P_{\text{tensor}}$ is the tensor-parallel degree (written $G$ in Section 2) and $P$ is the total number of devices (NPUs).
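As a rough illustration of the constraint above, the following sketch evaluates the per-device shard size for one configuration. The function name `model_shard_bytes` and all numbers (8e9 parameters in BF16, tensor-parallel degree 4, 6 GiB of activations) are assumptions chosen for illustration, not figures from the source.

```python
def model_shard_bytes(total_param_bytes: int, tensor_degree: int,
                      activation_bytes_per_microbatch: int) -> float:
    """Per-device memory: parameters divided across the tensor-parallel
    group, plus the activations of one microbatch (not divided here)."""
    if tensor_degree < 1:
        raise ValueError("tensor_degree must be >= 1")
    return total_param_bytes / tensor_degree + activation_bytes_per_microbatch


GIB = 1024 ** 3
# Hypothetical numbers: 8e9 parameters stored in BF16 (2 bytes each),
# tensor-parallel degree 4, and 6 GiB of activations per microbatch.
param_bytes = 8_000_000_000 * 2
shard = model_shard_bytes(param_bytes, tensor_degree=4,
                          activation_bytes_per_microbatch=6 * GIB)
print(f"{shard / GIB:.2f} GiB per device")
```

Doubling the tensor-parallel degree halves only the parameter term; the activation term is unchanged in this simplified model.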

2. Hybrid Parallelism Methodologies

MindSpeed-MLLM extensively utilizes 3D hybrid parallelism, following Megatron-LM paradigms:

Tensor Parallelism: Model weight matrices are partitioned across $G$ devices. Each device holds $\frac{1}{G}$ of a weight matrix and computes partial results, synchronized with All-Reduce:

$\Delta W = \sum_{g=1}^G \Delta W_g,\quad \text{Comm}_{\text{tensor}} \approx (G-1)\,\frac{|W|}{G}$

Pipeline Parallelism: The $L$ transformer layers are split into $S$ pipeline stages. Microbatches flow through a 1-forward-1-backward (1F1B) schedule, incurring communication of

$\text{Comm}_{\text{pipe}} = (S-1)\times B_m\times A_{\text{hidden}}$

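per microbatch of size $B_m$, where $A_{\text{hidden}}$ denotes the size of the hidden activations passed between adjacent stages.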

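Data Parallelism: The full model is replicated across $D$ data-parallel groups, which synchronize the gradients of the trainable parameters via HCCL All-Reduce. [The source's expressions for the All-Reduce update and the per-device local memory usage are not recoverable.]

These degrees satisfy $G \times S \times D = P$ (total device count).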
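The communication estimates above can be collected into a small helper, sketched below. The class `ParallelLayout` and its method names are hypothetical. The product rule in `world_size` assumes the three degrees multiply to the total device count, as is standard for Megatron-style 3D parallelism. The layout (4, 8, 32) is an illustrative choice, not a configuration reported by the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    tensor: int    # G: tensor-parallel degree
    pipeline: int  # S: number of pipeline stages
    data: int      # D: number of data-parallel replicas

    def world_size(self) -> int:
        # Assumption: each device has exactly one (G, S, D) coordinate.
        return self.tensor * self.pipeline * self.data

    def tensor_comm(self, weight_elems: int) -> float:
        # Comm_tensor ~ (G - 1) * |W| / G, as stated in the text.
        return (self.tensor - 1) * weight_elems / self.tensor

    def pipeline_comm(self, microbatch_size: int, hidden_elems: int) -> int:
        # Comm_pipe = (S - 1) * B_m * A_hidden, per microbatch.
        return (self.pipeline - 1) * microbatch_size * hidden_elems


# Hypothetical layout for 1024 devices.
layout = ParallelLayout(tensor=4, pipeline=8, data=32)
print(layout.world_size())
print(layout.tensor_comm(weight_elems=4096 * 4096))
print(layout.pipeline_comm(microbatch_size=2, hidden_elems=8192 * 4096))
```

The values are element counts; multiplying by the bytes per element (2 for BF16) gives an approximate traffic volume.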

3. Operator Equivalence and Hardware Adaptation

To ensure functionality on Ascend NPUs (CANN stack), MindSpeed-MLLM applies operator replacements:

FlashAttention is replaced by an NPU fusion attention kernel (acl_op "AttentionFusion").
Sparse/Dynamic Masking uses "mask compression," a variable-length fused call compatible with NPU specifications.
Conv2d/Conv3d operations are recast as MatMul: convolution is implemented as im2col followed by GEMM, ensuring equivalent numerical results:

$\text{Conv}(X, W) = \text{reshape}(W) \cdot \text{im2col}(X)$

Precision Management: Activations and weights use BF16; master copies and gradients reside in FP32.

These measures maintain accuracy parity with CUDA-based kernels while leveraging Ascend-specific optimizations.
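To make the convolution-as-GEMM replacement concrete, here is a minimal pure-Python sketch for a single-channel 2D input without padding. It is not the Ascend kernel. The function names are hypothetical, and the example only checks that the im2col path reproduces a direct sliding-window convolution.

```python
def im2col(image, kh, kw, stride=1):
    """Unfold a 2D image (list of rows) into one flat row per patch."""
    h, w = len(image), len(image[0])
    cols = []
    for top in range(0, h - kh + 1, stride):
        for left in range(0, w - kw + 1, stride):
            cols.append([image[top + i][left + j]
                         for i in range(kh) for j in range(kw)])
    return cols  # shape: (num_patches, kh * kw)


def conv2d_direct(image, kernel, stride=1):
    """Reference sliding-window convolution (cross-correlation, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    out = []
    for top in range(0, h - kh + 1, stride):
        row = []
        for left in range(0, w - kw + 1, stride):
            row.append(sum(image[top + i][left + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out


def conv2d_gemm(image, kernel, stride=1):
    """Same convolution as im2col followed by a matrix-vector product."""
    kh, kw = len(kernel), len(kernel[0])
    flat_kernel = [v for row in kernel for v in row]
    cols = im2col(image, kh, kw, stride)
    flat_out = [sum(a * b for a, b in zip(col, flat_kernel)) for col in cols]
    out_w = (len(image[0]) - kw) // stride + 1
    return [flat_out[i:i + out_w] for i in range(0, len(flat_out), out_w)]


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
kernel = [[1, 0], [0, -1]]
assert conv2d_gemm(image, kernel, stride=2) == conv2d_direct(image, kernel, stride=2)
```

With several output channels the flattened kernels stack into a matrix, so the product becomes a full GEMM. In floating point the two paths can differ by rounding, because the summation order may change.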

4. Multimodal Data Packaging Strategies

MindSpeed-MLLM deploys an online multimodal data packaging procedure to efficiently mix variable-length visual and textual content within fixed-length sequences:

Image preprocessing: Resize to the nearest multiple of 28; split into patches with a 14-pixel stride; group every four spatially adjacent patches and project them via a 2-layer MLP to the LLM embedding dimension.
Text processing: Tokenize, pad/truncate to fixed length.
Online sequence packing (per DP group): Combine samples up to a maximum sample count, subject to caps on cumulative text length and visual embedding length; then pad and concatenate.
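The image preprocessing arithmetic can be worked through in a few lines. The sketch below assumes that each side is resized independently, that ties round up, and that "four spatially adjacent patches" means a 2×2 block. None of these details is confirmed by the source, and the function names are hypothetical.

```python
def round_to_multiple(value: int, base: int = 28) -> int:
    """Nearest multiple of `base` (ties round up), never below `base`."""
    return max(base, (value + base // 2) // base * base)


def visual_token_count(height: int, width: int, patch: int = 14, merge: int = 2):
    """Return (resized_h, resized_w, num_patches, num_tokens).

    Assumption: every merge x merge block of patches becomes one token.
    """
    base = patch * merge  # 28 for 14-pixel patches merged 2 x 2
    h = round_to_multiple(height, base)
    w = round_to_multiple(width, base)
    patches = (h // patch) * (w // patch)
    return h, w, patches, patches // (merge * merge)


# Hypothetical 1000 x 750 input image.
print(visual_token_count(1000, 750))
```

Because both sides are multiples of 28, the patch grid always has even dimensions and divides cleanly into 2×2 groups.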

In the source pseudocode (not reproduced here), PAD_AND_FORMAT standardizes sequence lengths and modality concatenation.
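A greedy version of online packing can be sketched as follows. This is not the source's pseudocode. The names `Sample`, `Pack`, `pack_online`, and `padding_needed`, the separate caps for text and visual lengths, and the policy of skipping samples that can never fit are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Tuple


@dataclass
class Sample:
    text_len: int    # number of text tokens
    visual_len: int  # number of visual embeddings


@dataclass
class Pack:
    samples: List[Sample] = field(default_factory=list)
    text_len: int = 0
    visual_len: int = 0

    def fits(self, s: Sample, max_samples: int, max_text: int, max_visual: int) -> bool:
        return (len(self.samples) < max_samples
                and self.text_len + s.text_len <= max_text
                and self.visual_len + s.visual_len <= max_visual)

    def add(self, s: Sample) -> None:
        self.samples.append(s)
        self.text_len += s.text_len
        self.visual_len += s.visual_len


def pack_online(stream: Iterable[Sample], max_samples: int,
                max_text: int, max_visual: int) -> Iterator[Pack]:
    """Greedily fill one pack at a time, in arrival order."""
    if max_samples < 1:
        raise ValueError("max_samples must be >= 1")
    current = Pack()
    for s in stream:
        if s.text_len > max_text or s.visual_len > max_visual:
            continue  # assumed policy: drop samples that can never fit
        if not current.fits(s, max_samples, max_text, max_visual):
            yield current
            current = Pack()
        current.add(s)
    if current.samples:
        yield current


def padding_needed(pack: Pack, max_text: int, max_visual: int) -> Tuple[int, int]:
    """Padding required to bring a pack up to the fixed lengths."""
    return max_text - pack.text_len, max_visual - pack.visual_len


stream = [Sample(3000, 1000), Sample(4000, 2000), Sample(2000, 500), Sample(1000, 100)]
packs = list(pack_online(stream, max_samples=3, max_text=8192, max_visual=4096))
print([(p.text_len, p.visual_len) for p in packs])
```

Greedy first-come packing keeps the loader streaming but can leave gaps that a bin-packing heuristic would fill. The amount of padding per pack is a direct measure of that waste.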

5. Three-Phase Training Workflow and Hyperparameterization

MindSpeed-MLLM divides training into three sequential phases to incrementally build model capability:

Stage	Tokens Budget	Seq. Length	Trainable	Batch Size	Max LR	Min LR	LR Warmup Ratio
Warm-up	256B	8192	MLP adaptor	1024	2e-4	2e-5	0.1
Multitask	179B	8192	All parameters	1024	2e-5	1e-5	0.1
SFT	12B	2048/4096/8192	All parameters	512	1e-5	0	0.1
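The table gives peak and floor learning rates and a warm-up ratio, but it does not state the shape of the schedule. The sketch below assumes linear warm-up followed by cosine decay, which is a common choice and not a confirmed detail of MindSpeed-MLLM. The function name and the step count in the example are likewise hypothetical.

```python
import math


def learning_rate(step: int, total_steps: int, max_lr: float,
                  min_lr: float, warmup_ratio: float) -> float:
    """Linear warm-up to max_lr, then cosine decay to min_lr (assumed shape)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Warm-up stage values from the table: max 2e-4, min 2e-5, ratio 0.1.
# The total of 1000 steps is a placeholder.
for step in (0, 99, 100, 550, 1000):
    print(step, learning_rate(step, 1000, 2e-4, 2e-5, 0.1))
```

With a warm-up ratio of 0.1, the first tenth of the steps ramp up to the peak rate and the remainder decay toward the floor.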

Warm-up: Aligns the vision encoder and MLP adaptor to the LLM embedding space using image-caption, OCR, visual grounding, and STEM data; updates only the adaptor weights.
Multitask pre-training: All parameters are trainable; the data comprises a mix of image-text, QA, OCR, chart/table, video, and pure-text samples. The per-task losses are aggregated into a single multitask loss.

Supervised instruction tuning (SFT): Human-annotated instructions mixed with language-only data (DeepSeek R1, ratio ≈ 1:1). Within-sample loss averaging balances short/long response distributions.

6. Performance Scaling and Empirical Outcomes

MindSpeed-MLLM demonstrates efficient scaling and throughput:

On 1024 Ascend 910B devices:
- MFU ≈ 40% on both 8B and 32B models.
- For the 8B model with 8192-token sequences and batch size 1024, end-to-end sample and token throughput is described as substantial, but exact figures are not reported.
- Near-linear strong scaling to 512 cards; weak scaling efficiency >90% up to 1024 cards.
Relative to baselines (MindSpeed-MM or LlamaFactory + CANN):
- 1.8×–2.2× faster end-to-end training throughput.
- Convergence-matched loss curves (within 1% MAE/MRE of baseline curves).

These outcomes confirm the effectiveness of MindSpeed-MLLM in large-scale multimodal training on Ascend hardware (Chen et al., 15 Sep 2025).

7. Advanced Evaluation and Model Optimization Techniques

Two advanced post-training procedures further optimize MindVL and related models:

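Test-time Resolution Search: A grid search over pairs of lower and upper bounds on the image pixel count, written here as $(p_{\min}, p_{\max})$:
- Images are upsampled if their pixel count is below $p_{\min}$ or downsampled if it exceeds $p_{\max}$.
- For each pair, validation accuracy $\text{Acc}(p_{\min}, p_{\max})$ is measured.
- The optimal range is selected as:
$(p_{\min}^{*}, p_{\max}^{*}) = \arg\max_{(p_{\min},\, p_{\max})} \text{Acc}(p_{\min}, p_{\max})$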
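Model Weight Averaging: Training checkpoints from SFT runs at sequence lengths $\{2048, 4096, 8192\}$ are averaged:

$\bar{\theta} = \frac{1}{K}\sum_{k=1}^{K} \theta_k$

Optionally, exponential moving average (EMA) is applied:

$\theta_{\text{EMA}}^{(t)} = \beta\,\theta_{\text{EMA}}^{(t-1)} + (1-\beta)\,\theta^{(t)}$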

These strategies refine generalization performance and accuracy consistency across diverse input resolutions and sequence lengths.
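The test-time resolution search can be sketched as a plain grid search over pixel-count bounds. The names `rescale_to_range` and `search_resolution` are hypothetical, as are the aspect-ratio-preserving rescaling rule and the toy `evaluate` callback. In practice `evaluate` would run the model on a validation set.

```python
import math
from itertools import product


def rescale_to_range(height, width, min_pixels, max_pixels):
    """Resize so that height * width falls within [min_pixels, max_pixels],
    keeping the aspect ratio approximately unchanged."""
    pixels = height * width
    if pixels < min_pixels:  # upsample
        scale = math.sqrt(min_pixels / pixels)
        return math.ceil(height * scale), math.ceil(width * scale)
    if pixels > max_pixels:  # downsample
        scale = math.sqrt(max_pixels / pixels)
        return max(1, math.floor(height * scale)), max(1, math.floor(width * scale))
    return height, width


def search_resolution(min_candidates, max_candidates, evaluate):
    """Return the (min_pixels, max_pixels) pair with the highest accuracy."""
    best_pair, best_acc = None, float("-inf")
    for lower, upper in product(min_candidates, max_candidates):
        if lower > upper:
            continue  # not a valid range
        acc = evaluate(lower, upper)
        if acc > best_acc:
            best_pair, best_acc = (lower, upper), acc
    return best_pair, best_acc


# Toy accuracy function that peaks at (4, 16); a stand-in for real validation.
toy_eval = lambda lower, upper: -abs(lower - 4) - abs(upper - 16)
best = search_resolution([2, 4, 8], [8, 16, 32], toy_eval)
print(best)
print(rescale_to_range(100, 100, 40_000, 1_000_000))
```

The search costs one validation pass per candidate pair, so the grid is usually kept small.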

The MindSpeed-MLLM framework combines distributed multimodal optimization, precision alignment, hardware-specific operator equivalence, and efficient data packaging. Together these features support data- and compute-efficient training of large vision-language models on Ascend NPUs, matching or improving on contemporary GPU-centric pipelines while achieving comparable accuracy with a fraction of the training data (Chen et al., 15 Sep 2025).

References (1)

MindVL: Towards Efficient and Effective Training of Multimodal Large Language Models on Ascend NPUs (2025)