MindSpeed-MLLM: Distributed Multimodal Training
- MindSpeed-MLLM is a distributed multimodal training framework that integrates hybrid parallelism, operator equivalence, and advanced data packaging for efficient vision-language model development.
- It employs a modular structure with dedicated libraries (MindSpeed-Core, MindSpeed-LLM, MindSpeed-MM) to manage text and image data streams using tensor, pipeline, and data parallelism.
- The framework achieves up to 2.2× faster throughput and strong scaling on Ascend NPUs, ensuring memory-efficient pre-training and accurate supervised tuning.
MindSpeed-MLLM is a distributed multimodal training framework developed specifically for efficient and accurate training of large-scale vision-language models on Ascend NPUs, as exemplified by its central role in training the MindVL multimodal LLM (MLLM). By integrating hybrid parallelism, operator-equivalent replacements for Ascend hardware, multimodal data packaging, and system-level optimizations, MindSpeed-MLLM provides a foundation for high-throughput, memory-efficient pre-training and supervised instruction tuning of models that process both text and visually dense content. Its design draws heavily from Megatron-LM but extends and adapts core methodologies to fit the federated, heterogeneous compute environment of Ascend 910B devices (Chen et al., 15 Sep 2025).
1. System Architecture and Component Hierarchy
MindSpeed-MLLM is structured as a modular multimodal layer interfacing three specialized libraries:
- MindSpeed-Core (derived from Megatron-LM): backbone for hybrid parallelism, memory partitioning, and scheduling.
- MindSpeed-LLM: language modeling optimizations, including distributed transformer and optimizer adaptations.
- MindSpeed-MM: vision-specific model optimizations tailored for image patching and embedding.
Key architectural components:
- Distributed Multi-Modal Data Loader: Synchronizes partitioned text/visual data streams for efficient parallel input.
- Hybrid Parallel Engine (Megatron 3D): Integrates tensor parallelism, pipeline parallelism, and data parallelism across NPUs.
- Operator Fusion & Equivalence Layer: Implements hardware-specific operator replacements to maintain numerical correctness and efficient compute on Ascend hardware.
- System Scheduler & Runtime Tweaks: Dynamically adapts compute, memory, and communication strategies based on real-time profiling.
The following diagram depicts the logical hierarchy (text description):
```text
┌──────────────────────────────────────────────┐
│ MindSpeed-MLLM                               │
│ ├─ Distributed Multi-Modal Data Loader       │
│ ├─ Hybrid Parallel Engine (Megatron 3D)      │
│ │   ├─ Tensor Parallelism (MindSpeed-Core)   │
│ │   ├─ Pipeline Parallelism (MindSpeed-Core) │
│ │   └─ Data Parallelism (HCCL)               │
│ ├─ Operator Fusion & Equivalence Layer       │
│ └─ System Scheduler & Runtime Tweaks         │
├──────────────────────────────────────────────┤
│ MindSpeed-LLM          MindSpeed-MM          │
└──────────────────────────────────────────────┘
```
A central memory-partitioning constraint governs allocation per device as a function of $N$, the total number of devices (NPUs).
2. Hybrid Parallelism Methodologies
MindSpeed-MLLM extensively utilizes 3D hybrid parallelism, following Megatron-LM paradigms:
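- Tensor Parallelism: Model weight matrices are partitioned across devices. With tensor-parallel degree $t$, each device holds $1/t$ of a weight matrix and computes a partial result; the partial results are combined with an All-Reduce, e.g. for a row-partitioned weight: $Y = \sum_{i=1}^{t} X_i W_i$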
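- Pipeline Parallelism: The transformer layers are split into $p$ pipeline stages. Microbatches flow through a one-forward-one-backward (1F1B) schedule, incurring point-to-point communication of $b \cdot s \cdot h$ activation elements per microbatch at each stage boundary ($b$: microbatch size, $s$: sequence length, $h$: hidden size).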
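- Data Parallelism: The model is replicated across $d$ data-parallel groups, and the gradients of the trainable parameters are synchronized via HCCL All-Reduce: $g = \frac{1}{d} \sum_{i=1}^{d} g_i$. Under plain replication, local memory usage per device is set by the tensor- and pipeline-parallel split rather than by $d$.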
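These degrees satisfy $t \cdot p \cdot d = N$, where $t$, $p$, and $d$ are the tensor-, pipeline-, and data-parallel degrees and $N$ is the total device count.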
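As a minimal sketch of the tensor-parallel pattern described above, the following pure-Python simulation splits a weight matrix row-wise across `t` simulated ranks and emulates the All-Reduce with an element-wise sum. The function names (`matmul`, `row_parallel_linear`) and the choice of a row-wise split are illustrative assumptions, not MindSpeed-Core APIs; a real run would place each shard on its own NPU and use HCCL collectives.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def row_parallel_linear(x, w, t):
    """Simulate Y = X @ W with W split row-wise over t tensor-parallel ranks.

    x: [batch, k] input, w: [k, n] weight (hypothetical toy shapes).
    Rank i holds rows [i*k/t, (i+1)*k/t) of w and the matching columns of x.
    """
    k = len(w)
    assert k % t == 0, "the inner dimension must divide evenly across ranks"
    shard = k // t
    partials = []
    for rank in range(t):
        lo, hi = rank * shard, (rank + 1) * shard
        x_i = [row[lo:hi] for row in x]   # input slice seen by this rank
        w_i = w[lo:hi]                    # weight shard held by this rank
        partials.append(matmul(x_i, w_i))
    # Emulated All-Reduce: sum the partial outputs element-wise.
    rows, cols = len(x), len(w[0])
    return [[sum(p[r][c] for p in partials) for c in range(cols)] for r in range(rows)]


x = [[1, 2, 3, 4], [5, 6, 7, 8]]
w = [[1, 0], [0, 1], [2, 1], [1, 3]]
# The sharded computation reproduces the unsharded product.
assert row_parallel_linear(x, w, t=2) == matmul(x, w)
```

Integer inputs are used so the comparison is exact; with floating-point data the summation order of the All-Reduce can change the result in the last bits.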
3. Operator Equivalence and Hardware Adaptation
To ensure functionality on Ascend NPUs (CANN stack), MindSpeed-MLLM applies operator replacements:
- FlashAttention is replaced by an NPU fusion attention kernel (acl_op "AttentionFusion").
- Sparse/Dynamic Masking uses "mask compression," a variable-length fused call compatible with NPU specifications.
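- Conv2d/Conv3d operations are recast as MatMul: convolution is implemented as im2col followed by GEMM, which yields numerically equivalent results: $Y = W_{\text{flat}} \cdot \operatorname{im2col}(X)$, where $W_{\text{flat}}$ is the kernel tensor flattened to one row per output channel.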
- Precision Management: Activations and weights use BF16; master copies and gradients reside in FP32.
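The reason for keeping FP32 master copies can be illustrated with a small, hedged sketch. It emulates BF16 rounding with the standard library (`to_bf16` and `train` are hypothetical helper names, and the learning rate and gradient are made-up toy values) and shows that small updates vanish when the weight itself is stored only in BF16.

```python
import struct


def to_fp32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x):
    """Round a finite float to the nearest bfloat16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias on the dropped 16 bits
    bits &= 0xFFFF0000                   # keep sign, exponent, top 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def train(steps, lr, grad, use_fp32_master):
    """Apply `steps` identical SGD updates to a weight that starts at 1.0."""
    master = 1.0              # FP32 master copy (used only if requested)
    weight = to_bf16(1.0)     # BF16 working copy used for compute
    for _ in range(steps):
        if use_fp32_master:
            master = to_fp32(master - lr * grad)  # accumulate in FP32
            weight = to_bf16(master)              # re-cast for the next step
        else:
            weight = to_bf16(weight - lr * grad)  # update rounded straight to BF16
    return weight


bf16_only = train(steps=100, lr=1e-3, grad=1.0, use_fp32_master=False)
with_master = train(steps=100, lr=1e-3, grad=1.0, use_fp32_master=True)
# Without a master copy every 1e-3 step rounds back to 1.0, so the weight never moves.
assert bf16_only == 1.0
assert abs(with_master - 0.9) < 0.004
```

Near 1.0 the spacing between adjacent BF16 values is larger than the 1e-3 update, which is why the BF16-only weight is stuck while the FP32 master accumulates the change.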
These measures maintain accuracy parity with CUDA-based kernels while leveraging Ascend-specific optimizations.
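The im2col-plus-GEMM recasting of convolution mentioned in this section can be checked on a toy case. The sketch below is a pure-Python illustration under simplifying assumptions (single input channel, stride 1, no padding); the function names are hypothetical and it does not reflect the actual Ascend kernels.

```python
def conv2d_direct(x, kernels):
    """Valid, stride-1 cross-correlation of a 2-D input with each kernel."""
    kh, kw = len(kernels[0]), len(kernels[0][0])
    out_h, out_w = len(x) - kh + 1, len(x[0]) - kw + 1
    return [[[sum(x[i + a][j + b] * k[a][b] for a in range(kh) for b in range(kw))
              for j in range(out_w)] for i in range(out_h)] for k in kernels]


def im2col(x, kh, kw):
    """Unfold every kh x kw window into one flat row (row-major window order)."""
    out_h, out_w = len(x) - kh + 1, len(x[0]) - kw + 1
    return [[x[i + a][j + b] for a in range(kh) for b in range(kw)]
            for i in range(out_h) for j in range(out_w)]


def conv2d_im2col(x, kernels):
    """Same convolution expressed as im2col followed by a matrix multiply."""
    kh, kw = len(kernels[0]), len(kernels[0][0])
    out_w = len(x[0]) - kw + 1
    cols = im2col(x, kh, kw)                                   # [out_h*out_w, kh*kw]
    w_flat = [[v for row in k for v in row] for k in kernels]  # [c_out, kh*kw]
    # GEMM: one dot product per (output channel, window) pair.
    flat = [[sum(w * c for w, c in zip(wf, col)) for col in cols] for wf in w_flat]
    # Reshape each flat output row back to [out_h, out_w].
    return [[row[r:r + out_w] for r in range(0, len(row), out_w)] for row in flat]


x = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
kernels = [[[1, 0], [0, -1]], [[1, 1], [1, 1]]]
assert conv2d_im2col(x, kernels) == conv2d_direct(x, kernels)
```

Both paths perform the same multiply-adds, only grouped differently, so the integer results match exactly; in floating point the two can differ by rounding in the accumulation order.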
4. Multimodal Data Packaging Strategies
MindSpeed-MLLM deploys an online multimodal data packaging procedure to efficiently mix variable-length visual and textual content within fixed-length sequences:
- Image preprocessing: Resize each image so that its sides are the nearest multiples of 28; split it into patches with a 14-pixel stride; group every four spatially adjacent patches and project each group via a 2-layer MLP to the LLM embedding dimension.
- Text processing: Tokenize, pad/truncate to fixed length.
- Online sequence packing (per DP group): Combine up to $B$ samples whose cumulative text length stays within $L_{\text{text}}$ and whose cumulative visual embedding length stays within $V_{\max}$, then pad and concatenate.
Pseudocode (excerpted):
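```python
def PACK_BATCH(dataset_shard, B, L_text, V_max):
    """Greedily pack samples from one data-parallel shard.

    tokenize, extract_patches and PAD_AND_FORMAT are defined elsewhere in the pipeline.
    """
    batch, cur_text_len, cur_vis_len = [], 0, 0
    for sample in dataset_shard:
        tkn = tokenize(sample.text)
        vis = extract_patches(sample.image)
        fits = (cur_text_len + len(tkn) <= L_text
                and cur_vis_len + len(vis) <= V_max)
        # Flush when the pack holds B samples or the next sample would exceed a
        # budget, so that the sample opens a new pack instead of being dropped.
        if batch and (len(batch) == B or not fits):
            yield PAD_AND_FORMAT(batch)
            # Start a fresh list so the batch already yielded is not mutated.
            batch, cur_text_len, cur_vis_len = [], 0, 0
        # A sample that exceeds a budget on its own still forms a one-sample pack.
        batch.append((tkn, vis))
        cur_text_len += len(tkn)
        cur_vis_len += len(vis)
    if batch:
        yield PAD_AND_FORMAT(batch)
```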
PAD_AND_FORMAT standardizes sequence lengths and modality concatenation.
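To make the image preprocessing arithmetic concrete, here is a small sketch that estimates how many visual embeddings one image contributes to a packed sequence. It assumes that "four spatially adjacent patches" means a 2×2 block and that ties in the resize step round upward; both are assumptions, as are the function names.

```python
def resize_to_multiple(side, base=28):
    """Round a side length to the nearest multiple of `base` (at least `base`)."""
    return max(base, (side + base // 2) // base * base)


def visual_token_count(height, width, patch=14, merge=2):
    """Return (resized_h, resized_w, tokens) for one image.

    patch: patch stride in pixels; merge: side of the assumed merge x merge
    patch group that is projected to a single embedding.
    """
    base = patch * merge                      # 28 for the defaults
    h = resize_to_multiple(height, base)
    w = resize_to_multiple(width, base)
    patches = (h // patch) * (w // patch)     # raw 14-pixel patches
    return h, w, patches // (merge * merge)   # one embedding per patch group


# A 300 x 500 image is resized to 308 x 504, giving 22 * 36 = 792 patches
# and 792 / 4 = 198 visual embeddings.
assert visual_token_count(300, 500) == (308, 504, 198)
```

The resulting count plays the role of `len(vis)` in the packing pseudocode, so high-resolution images consume the visual budget of a pack much faster than small ones.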
5. Three-Phase Training Workflow and Hyperparameterization
MindSpeed-MLLM divides training into three sequential phases to incrementally build model capability:
| Stage | Tokens Budget | Seq. Length | Trainable | Batch Size | Max LR | Min LR | LR Warmup Ratio |
|---|---|---|---|---|---|---|---|
| Warm-up | 256B | 8192 | MLP adaptor | 1024 | 2e-4 | 2e-5 | 0.1 |
| Multitask | 179B | 8192 | All parameters | 1024 | 2e-5 | 1e-5 | 0.1 |
| SFT | 12B | 2048/4096/8192 | All parameters | 512 | 1e-5 | 0 | 0.1 |
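The learning-rate columns of the table can be turned into a per-step schedule. The table gives only the maximum rate, minimum rate, and warm-up ratio, so the sketch below assumes linear warm-up followed by cosine decay; that decay shape, the function name `lr_at`, and the step count in the usage example are assumptions rather than details from the source.

```python
import math


def lr_at(step, total_steps, max_lr, min_lr, warmup_ratio):
    """Learning rate at `step` (0-based): linear warm-up, then cosine decay.

    The cosine shape is an assumption; only max_lr, min_lr and warmup_ratio
    come from the stage table.
    """
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Stage settings copied from the table: (max LR, min LR, warm-up ratio).
STAGES = {
    "warm-up": (2e-4, 2e-5, 0.1),
    "multitask": (2e-5, 1e-5, 0.1),
    "sft": (1e-5, 0.0, 0.1),
}

total_steps = 1000  # hypothetical step count for illustration
max_lr, min_lr, ratio = STAGES["warm-up"]
schedule = [lr_at(s, total_steps, max_lr, min_lr, ratio) for s in range(total_steps + 1)]
```

With these settings the rate peaks at the end of the first 10% of steps and decays toward the stage's minimum rate by the final step.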
- Warm-up: Aligns vision encoder and MLP adaptor to LLM embedding space using image-caption, OCR, visual grounding, and STEM data; updates only adapter weights.
- Multitask pre-training: All parameters are trainable; the data comprises a mix of image-text, QA, OCR, chart/table, video, and pure-text samples, trained under an aggregate multitask loss.
- Supervised instruction tuning (SFT): Human-annotated instructions mixed with language-only data (DeepSeek R1, ratio ≈ 1:1). Within-sample loss averaging balances short/long response distributions.
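One reading of the within-sample loss averaging used in SFT is that token losses are first averaged inside each sample and the per-sample means are then averaged across the batch. The sketch below contrasts this with plain token-level averaging; the function names and toy loss values are hypothetical, and the source does not spell out the exact formula.

```python
def token_level_loss(per_sample_token_losses):
    """Average over all response tokens in the batch (long samples dominate)."""
    total = sum(sum(losses) for losses in per_sample_token_losses)
    count = sum(len(losses) for losses in per_sample_token_losses)
    return total / count


def within_sample_loss(per_sample_token_losses):
    """Average inside each sample first, then across samples (equal sample weight)."""
    per_sample = [sum(losses) / len(losses) for losses in per_sample_token_losses if losses]
    return sum(per_sample) / len(per_sample)


# One short response (2 tokens, loss 1.0 each) and one long response
# (8 tokens, loss 0.5 each).
batch = [[1.0, 1.0], [0.5] * 8]
short_and_long_equal = within_sample_loss(batch)   # (1.0 + 0.5) / 2
long_dominates = token_level_loss(batch)           # (2.0 + 4.0) / 10
assert short_and_long_equal > long_dominates
```

Under token-level averaging the long response contributes 80% of the weight; averaging within samples first gives the short response the same influence as the long one.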
6. Performance Scaling and Empirical Outcomes
MindSpeed-MLLM demonstrates efficient scaling and throughput:
- On 1024 Ascend 910B devices:
  - MFU ≈ 40% on both 8B and 32B models.
  - For the 8B model with 8192-token sequences and batch size 1024, end-to-end sample and token throughput is described as substantial, but exact figures are not given.
  - Near-linear strong scaling to 512 cards; weak-scaling efficiency >90% up to 1024 cards.
- Relative to baselines (MindSpeed-MM or LlamaFactory + CANN):
  - 1.8×–2.2× faster end-to-end training throughput.
  - Convergence-matched loss curves (within 1% MAE/MRE of baseline curves).
These outcomes confirm the effectiveness of MindSpeed-MLLM in large-scale multimodal training on Ascend hardware (Chen et al., 15 Sep 2025).
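For readers who want to reproduce metrics of this kind, the sketch below shows how model FLOPs utilization (MFU) and weak-scaling efficiency are commonly computed. It uses the standard approximation of about 6 FLOPs per parameter per training token for a dense transformer, and every number in the example is made up for illustration; none is a measurement from the source, and the peak-throughput value is a placeholder rather than an Ascend 910B specification.

```python
def model_flops_utilization(tokens_per_second, num_params, num_devices, peak_flops_per_device):
    """MFU = achieved training FLOP/s divided by aggregate peak FLOP/s.

    Uses the common ~6 * params FLOPs-per-token estimate (forward + backward),
    which ignores attention FLOPs.
    """
    achieved = 6.0 * num_params * tokens_per_second
    return achieved / (num_devices * peak_flops_per_device)


def weak_scaling_efficiency(throughput, devices, base_throughput, base_devices):
    """Measured speed-up relative to ideal linear speed-up from the base run."""
    return (throughput / base_throughput) / (devices / base_devices)


# Hypothetical inputs chosen only to exercise the formulas.
mfu = model_flops_utilization(
    tokens_per_second=1.0e6,        # illustrative cluster-wide token rate
    num_params=8e9,                 # an 8B-parameter model
    num_devices=1000,               # illustrative device count
    peak_flops_per_device=1.2e14,   # placeholder peak, not a hardware spec
)
efficiency = weak_scaling_efficiency(
    throughput=760.0, devices=1024, base_throughput=100.0, base_devices=128
)
```

Plugging in measured token rates and the vendor's published peak throughput in place of the placeholders yields figures comparable to the MFU and scaling efficiencies quoted above.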
7. Advanced Evaluation and Model Optimization Techniques
Two advanced post-training procedures further optimize MindVL and related models:
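- Test-time Resolution Search: A grid search over pairs $(P_{\min}, P_{\max})$ that bound the image pixel count:
  - Images are upsampled if their pixel count is below $P_{\min}$ or downsampled if it exceeds $P_{\max}$.
  - For each $(P_{\min}, P_{\max})$ pair, validation accuracy is measured.
  - The optimal range is selected as: $(P_{\min}^{*}, P_{\max}^{*}) = \arg\max_{(P_{\min}, P_{\max})} \mathrm{Acc}_{\text{val}}(P_{\min}, P_{\max})$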
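- Model Weight Averaging: Training checkpoints $\theta_1, \dots, \theta_K$ from SFT runs at different sequence lengths (2048, 4096, and 8192 in the stage table) are averaged: $\theta_{\text{avg}} = \frac{1}{K} \sum_{k=1}^{K} \theta_k$. Optionally, an exponential moving average (EMA) is applied: $\bar{\theta}_n = \beta\,\bar{\theta}_{n-1} + (1 - \beta)\,\theta_n$, with decay $\beta \in [0, 1)$.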
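A minimal sketch of checkpoint averaging and an EMA update follows. Checkpoints are represented as plain dictionaries of float lists so the example stays self-contained; uniform weighting, the helper names, and the parameter values are assumptions made for illustration.

```python
def average_checkpoints(checkpoints):
    """Uniformly average parameters across checkpoints with identical keys and shapes."""
    k = len(checkpoints)
    return {
        name: [sum(vals) / k for vals in zip(*(ckpt[name] for ckpt in checkpoints))]
        for name in checkpoints[0]
    }


def ema_update(ema, params, beta=0.99):
    """One EMA step: ema <- beta * ema + (1 - beta) * params."""
    return {
        name: [beta * e + (1.0 - beta) * p for e, p in zip(ema[name], params[name])]
        for name in ema
    }


# Toy stand-ins for checkpoints from SFT runs at three sequence lengths.
ckpt_2048 = {"w": [1.0, 2.0]}
ckpt_4096 = {"w": [3.0, 4.0]}
ckpt_8192 = {"w": [5.0, 6.0]}
merged = average_checkpoints([ckpt_2048, ckpt_4096, ckpt_8192])
assert merged == {"w": [3.0, 4.0]}
```

In a real framework the same element-wise arithmetic would be applied to each tensor in the saved state, which requires all checkpoints to share one architecture.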
These strategies refine generalization performance and accuracy consistency across diverse input resolutions and sequence lengths.
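The test-time resolution search can be sketched as a plain grid search. The evaluation callback, the candidate pixel budgets, and the toy accuracy function below are all hypothetical placeholders; in practice `evaluate` would run the model on a validation set with images rescaled into the given range.

```python
from itertools import product


def target_pixels(num_pixels, min_pixels, max_pixels):
    """Pixel count after rescaling: upsample below the floor, downsample above the cap."""
    return max(min_pixels, min(num_pixels, max_pixels))


def search_resolution(min_candidates, max_candidates, evaluate):
    """Return the (min_pixels, max_pixels) pair with the highest validation accuracy."""
    best_pair, best_acc = None, float("-inf")
    for lo, hi in product(min_candidates, max_candidates):
        if lo > hi:
            continue  # skip empty ranges
        acc = evaluate(lo, hi)
        if acc > best_acc:
            best_pair, best_acc = (lo, hi), acc
    return best_pair, best_acc


def toy_accuracy(lo, hi):
    """Stand-in for a real validation run; peaks at (200, 1000) by construction."""
    return 1.0 - abs(lo - 200) / 1000 - abs(hi - 1000) / 10000


best_pair, best_acc = search_resolution([100, 200, 400], [500, 1000, 2000], toy_accuracy)
assert best_pair == (200, 1000)
```

Because each grid point needs a full validation pass, the cost grows with the product of the two candidate lists, so the grids are usually kept small.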
The MindSpeed-MLLM framework combines distributed multimodal optimization, precision alignment, hardware-specific operator equivalence, and efficient data packaging. These features collectively underpin data- and compute-efficient training of large vision-language models on Ascend NPUs, establishing parity with or improvement over contemporary GPU-centric pipelines while achieving comparable accuracy with a fraction of the training data (Chen et al., 15 Sep 2025).