
MindSpeed-MLLM: Distributed Multimodal Training

Updated 13 February 2026
  • MindSpeed-MLLM is a distributed multimodal training framework that integrates hybrid parallelism, operator equivalence, and advanced data packaging for efficient vision-language model development.
  • It employs a modular structure with dedicated libraries (MindSpeed-Core, MindSpeed-LLM, MindSpeed-MM) to manage text and image data streams using tensor, pipeline, and data parallelism.
  • The framework achieves up to 2.2× faster throughput and strong scaling on Ascend NPUs, ensuring memory-efficient pre-training and accurate supervised tuning.

MindSpeed-MLLM is a distributed multimodal training framework developed for efficient and accurate training of large-scale vision-language models on Ascend NPUs, as exemplified by its central role in training the MindVL multimodal LLM (MLLM). By integrating hybrid parallelism, operator-equivalent replacements for Ascend hardware, multimodal data packaging, and system-level optimizations, MindSpeed-MLLM provides a foundation for high-throughput, memory-efficient pre-training and supervised instruction tuning of models that process both text and visually dense content. Its design draws heavily on Megatron-LM but extends and adapts the core methodologies to fit the federated, heterogeneous compute environment of Ascend 910B devices (Chen et al., 15 Sep 2025).

1. System Architecture and Component Hierarchy

MindSpeed-MLLM is structured as a modular multimodal layer interfacing three specialized libraries:

  • MindSpeed-Core (derived from Megatron-LM): backbone for hybrid parallelism, memory partitioning, and scheduling.
  • MindSpeed-LLM: language modeling optimizations, including distributed transformer and optimizer adaptations.
  • MindSpeed-MM: vision-specific model optimizations tailored for image patching and embedding.

Key architectural components:

  • Distributed Multi-Modal Data Loader: Synchronizes partitioned text/visual data streams for efficient parallel input.
  • Hybrid Parallel Engine (Megatron 3D): Integrates tensor parallelism, pipeline parallelism, and data parallelism across NPUs.
  • Operator Fusion & Equivalence Layer: Implements hardware-specific operator replacements to maintain numerical correctness and efficient compute on Ascend hardware.
  • System Scheduler & Runtime Tweaks: Dynamically adapts compute, memory, and communication strategies based on real-time profiling.

The following diagram depicts the logical hierarchy (text description):

┌────────────────────────────────────────────────┐
│ MindSpeed-MLLM                                 │
│  ├─ Distributed Multi-Modal Data Loader        │
│  ├─ Hybrid Parallel Engine (Megatron 3D)       │
│  │    ├─ Tensor Parallelism (MindSpeed-Core)   │
│  │    ├─ Pipeline Parallelism (MindSpeed-Core) │
│  │    └─ Data Parallelism (HCCL)               │
│  ├─ Operator Fusion & Equivalence Layer        │
│  └─ System Scheduler & Runtime Tweaks          │
├────────────────────────────────────────────────┤
│ MindSpeed-LLM       MindSpeed-MM               │
└────────────────────────────────────────────────┘

A central memory-partitioning constraint governs allocation per device:

$$M_{\text{model\_shard}} = \frac{M_{\text{total\_params}}}{P_{\text{tensor}}} + M_{\text{activations\_per\_microbatch}}$$

where $P_{\text{tensor}}$ is the number of NPUs in the tensor-parallel group, i.e., the number of devices across which the parameters are sharded.
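As a quick illustration of the constraint above, the following minimal sketch evaluates the per-device footprint for a set of made-up numbers. The function name, the byte sizes, and the 8-billion-parameter BF16 example are assumptions chosen for illustration, not figures from the source.

```python
def model_shard_bytes(total_param_bytes: float,
                      p_tensor: int,
                      activation_bytes_per_microbatch: float) -> float:
    """Per-device footprint: parameters split over the tensor-parallel
    group plus the activations of one microbatch."""
    if p_tensor < 1:
        raise ValueError("p_tensor must be a positive integer")
    return total_param_bytes / p_tensor + activation_bytes_per_microbatch


GIB = 1024 ** 3

# Hypothetical inputs: 8e9 parameters stored in BF16 (2 bytes each)
# and 4 GiB of activations per microbatch.
param_bytes = 8e9 * 2
shard = model_shard_bytes(param_bytes, p_tensor=4,
                          activation_bytes_per_microbatch=4 * GIB)
print(f"{shard / GIB:.2f} GiB per device")
```

Doubling `p_tensor` halves only the parameter term; the activation term is unaffected in this simplified model.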

2. Hybrid Parallelism Methodologies

MindSpeed-MLLM extensively utilizes 3D hybrid parallelism, following Megatron-LM paradigms:

  • Tensor Parallelism: Model weight matrices are partitioned across $G$ devices. Each device holds $\frac{1}{G}$ of a weight matrix and computes partial results, synchronized with All-Reduce:

$$\Delta W = \sum_{g=1}^{G} \Delta W_g, \quad \text{Comm}_{\text{tensor}} \approx (G-1)\,\frac{|W|}{G}$$

  • Pipeline Parallelism: The $L$ transformer layers are split into $S$ pipeline stages. Microbatches flow through a 1-forward-1-backward (1F1B) schedule, incurring communication of

$$\text{Comm}_{\text{pipe}} = (S-1) \times B_m \times A_{\text{hidden}}$$

per microbatch of size $B_m$.

  • Data Parallelism: The full model is replicated across $D$ data-parallel groups, and the trainable parameters $\Theta$ are synchronized via HCCL All-Reduce:

$$\text{Comm}_{\text{data}} \approx (D-1) \times |\Theta|$$

with local memory usage per device:

$$M_{\text{local}} = \frac{|\Theta|}{G D} + \frac{|\text{acts}|}{S} + \text{overhead}_{\text{optimizer}}$$

These degrees satisfy $G \times S \times D = P$, where $P$ is the total device count.
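The cost expressions in this section can be collected into a small calculator. The sketch below is only a direct transcription of the formulas as stated here; the class name `ParallelPlan`, the function names, and the example sizes are hypothetical, and the units are arbitrary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelPlan:
    """Parallel degrees, using the symbols from the text."""
    G: int  # tensor-parallel degree
    S: int  # number of pipeline stages
    D: int  # number of data-parallel groups

    def check(self, total_devices: int) -> None:
        """Raise if G * S * D does not equal the device count P."""
        product = self.G * self.S * self.D
        if product != total_devices:
            raise ValueError(f"G*S*D = {product}, expected P = {total_devices}")


def comm_tensor(plan: ParallelPlan, weight_size: float) -> float:
    """Comm_tensor ~ (G - 1) * |W| / G."""
    return (plan.G - 1) * weight_size / plan.G


def comm_pipe(plan: ParallelPlan, microbatch_size: float, hidden_size: float) -> float:
    """Comm_pipe = (S - 1) * B_m * A_hidden, per microbatch."""
    return (plan.S - 1) * microbatch_size * hidden_size


def comm_data(plan: ParallelPlan, theta_size: float) -> float:
    """Comm_data ~ (D - 1) * |Theta|."""
    return (plan.D - 1) * theta_size


def local_memory(plan: ParallelPlan, theta_size: float, acts_size: float,
                 optimizer_overhead: float) -> float:
    """M_local = |Theta| / (G * D) + |acts| / S + optimizer overhead."""
    return theta_size / (plan.G * plan.D) + acts_size / plan.S + optimizer_overhead


# Hypothetical plan for P = 1024 devices.
plan = ParallelPlan(G=8, S=4, D=32)
plan.check(1024)
print(local_memory(plan, theta_size=8e9, acts_size=2e9, optimizer_overhead=1e9))
```

Calling `check` first makes an inconsistent choice of degrees fail early, before any cost estimate is computed.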

3. Operator Equivalence and Hardware Adaptation

To ensure functionality on Ascend NPUs (CANN stack), MindSpeed-MLLM applies operator replacements:

  • FlashAttention is replaced by an NPU fusion attention kernel (acl_op "AttentionFusion").
  • Sparse/Dynamic Masking uses "mask compression," a variable-length fused call compatible with NPU specifications.
  • Conv2d/Conv3d operations are recast as MatMul: convolution is implemented as im2col followed by GEMM, ensuring equivalent numerical results:

$$\text{Conv2d}(X, W) \equiv \text{MatMul}(\mathrm{im2col}(X),\, W_{\mathrm{reshaped}})$$

  • Precision Management: Activations and weights use BF16; master copies and gradients reside in FP32:

$$X_{\text{BF16}} \to X_{\text{FP32}}; \quad W_{\text{BF16}},\,\Delta W_{\text{BF16}} \to W_{\text{FP32}}$$
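To see why an FP32 master copy matters when working weights are held in BF16, the following sketch emulates BF16 rounding with the standard library. It is an illustration of the general mixed-precision idea, not the framework's code: the helper names `to_bf16` and `to_fp32` are hypothetical, the update size and step count are made up, and NaN/infinity inputs are not handled.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (double) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round to BF16 (upper 16 bits of FP32, round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


update = 1e-3   # hypothetical per-step weight update
steps = 100

# Variant A: the weight itself is stored in BF16 and updated in place.
w_bf16_only = to_bf16(1.0)
for _ in range(steps):
    w_bf16_only = to_bf16(w_bf16_only + update)

# Variant B: updates accumulate in an FP32 master copy; BF16 is only a cast.
w_master = to_fp32(1.0)
for _ in range(steps):
    w_master = to_fp32(w_master + update)
w_from_master = to_bf16(w_master)

print("BF16-only weight:  ", w_bf16_only)
print("FP32 master -> BF16:", w_from_master)
```

Near 1.0 the spacing between adjacent BF16 values is larger than the update, so in variant A every step rounds back to the starting value, while the FP32 master in variant B accumulates the updates before the cast.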

These measures maintain accuracy parity with CUDA-based kernels while leveraging Ascend-specific optimizations.
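The im2col-plus-GEMM identity used for the convolution replacement can be checked on a toy input. The sketch below is a plain-Python reference (stride 1, no padding, nested lists) written for illustration; it is not the Ascend kernel, and all function names are hypothetical.

```python
def im2col(x, kh, kw):
    """x[C][H][W] -> one row per output position, each row a flattened
    (C, kh, kw) patch. Stride 1, no padding."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    return [[x[c][i + di][j + dj]
             for c in range(C) for di in range(kh) for dj in range(kw)]
            for i in range(H - kh + 1) for j in range(W - kw + 1)]


def matmul(a, b):
    """(n x k) @ (k x m) on nested lists."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def conv2d_via_gemm(x, w):
    """w[O][C][kh][kw]; returns out[O][H_out][W_out] using im2col + GEMM."""
    O, C, kh, kw = len(w), len(w[0]), len(w[0][0]), len(w[0][0][0])
    H_out, W_out = len(x[0]) - kh + 1, len(x[0][0]) - kw + 1
    # Reshape the kernel to (C*kh*kw) x O, matching the im2col column order.
    w_reshaped = [[w[o][c][di][dj] for o in range(O)]
                  for c in range(C) for di in range(kh) for dj in range(kw)]
    flat = matmul(im2col(x, kh, kw), w_reshaped)  # (H_out*W_out) x O
    return [[[flat[i * W_out + j][o] for j in range(W_out)]
             for i in range(H_out)] for o in range(O)]


def conv2d_direct(x, w):
    """Reference sliding-window cross-correlation."""
    O, C, kh, kw = len(w), len(w[0]), len(w[0][0]), len(w[0][0][0])
    H_out, W_out = len(x[0]) - kh + 1, len(x[0][0]) - kw + 1
    return [[[sum(x[c][i + di][j + dj] * w[o][c][di][dj]
                  for c in range(C) for di in range(kh) for dj in range(kw))
              for j in range(W_out)] for i in range(H_out)] for o in range(O)]


x = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]   # one channel, 3x3
w = [[[[1, 0], [0, -1]]]]                 # one output channel, 2x2 kernel
assert conv2d_via_gemm(x, w) == conv2d_direct(x, w)
print(conv2d_via_gemm(x, w))
```

Integer inputs are used so that the two paths can be compared exactly; with floating-point data the comparison would need a tolerance because the summation order differs.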

4. Multimodal Data Packaging Strategies

MindSpeed-MLLM deploys an online multimodal data packaging procedure to efficiently mix variable-length visual and textual content within fixed-length sequences:

  • Image preprocessing: Resize to the nearest multiple of 28; split into 14-pixel stride patches; group every four spatially adjacent patches and project via a 2-layer MLP to $v \in \mathbb{R}^d$.
  • Text processing: Tokenize, pad/truncate to fixed length.
  • Online sequence packing (per DP group): Combine up to $B$ samples with cumulative text length $\leq L_{\text{text}}$ and visual embedding length $\leq V_{\text{max}}$, pad, and concatenate.

Pseudocode (excerpted):

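def PACK_BATCH(dataset_shard, B, L_text, V_max):
    # tokenize, extract_patches, and PAD_AND_FORMAT are helpers not shown in this excerpt.
    batch = []
    cur_text_len, cur_vis_len = 0, 0
    for sample in dataset_shard:
        tkn = tokenize(sample.text)
        vis = extract_patches(sample.image)
        fits = (cur_text_len + len(tkn) <= L_text
                and cur_vis_len + len(vis) <= V_max)
        # Flush when the batch is full or the next sample would exceed a budget,
        # so a sample that does not fit starts the next batch instead of being dropped.
        if batch and (len(batch) == B or not fits):
            yield PAD_AND_FORMAT(batch)
            batch = []  # new list, so the yielded batch is not mutated
            cur_text_len, cur_vis_len = 0, 0
        # A sample that alone exceeds a budget still forms its own batch here.
        batch.append((tkn, vis))
        cur_text_len += len(tkn)
        cur_vis_len += len(vis)
    if batch:
        yield PAD_AND_FORMAT(batch)

PAD_AND_FORMAT standardizes sequence lengths and modality concatenation.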
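The image preprocessing rule described at the start of this section (resize to a multiple of 28, 14-pixel patches, groups of four adjacent patches) determines how many visual embeddings an image contributes to the packing budget. A minimal sketch of that arithmetic follows, assuming the four patches form a 2×2 block and that ties round up; the function names are hypothetical and the MLP projection itself is not modeled.

```python
def round_to_multiple(value: int, base: int = 28) -> int:
    """Nearest multiple of `base` (ties round up), at least one multiple."""
    return max(base, (value + base // 2) // base * base)


def visual_token_count(height: int, width: int,
                       patch: int = 14, merge: int = 2) -> int:
    """Number of visual embeddings after patching and merge x merge grouping."""
    h = round_to_multiple(height, patch * merge)
    w = round_to_multiple(width, patch * merge)
    patches_h, patches_w = h // patch, w // patch
    return (patches_h // merge) * (patches_w // merge)


# Example: a 1000 x 700 image is resized to 1008 x 700,
# giving 72 x 50 patches and 36 x 25 merged groups.
print(visual_token_count(1000, 700))  # 900
```

The resulting count is what `len(vis)` would represent in the packing loop if `extract_patches` returned one entry per merged group, which is an assumption about that helper.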

5. Three-Phase Training Workflow and Hyperparameterization

MindSpeed-MLLM divides training into three sequential phases to incrementally build model capability:

| Stage | Token Budget | Seq. Length | Trainable | Batch Size | Max LR | Min LR | LR Warmup Ratio |
|---|---|---|---|---|---|---|---|
| Warm-up | 256B | 8192 | MLP adaptor | 1024 | 2e-4 | 2e-5 | 0.1 |
| Multitask | 179B | 8192 | All parameters | 1024 | 2e-5 | 1e-5 | 0.1 |
| SFT | 12B | 2048/4096/8192 | All parameters | 512 | 1e-5 | 0 | 0.1 |

  • Warm-up: Aligns the vision encoder and MLP adaptor to the LLM embedding space using image-caption, OCR, visual grounding, and STEM data; updates only the MLP adaptor weights.
  • Multitask pre-training: All parameters trainable; data comprises a mix of image-text, QA, OCR, chart/table, video, and pure text. Aggregate multitask loss:

$$\mathcal{L} = \sum_{i\in\text{tasks}} \alpha_i \mathcal{L}_i, \quad \mathcal{L}_i = -\sum \log p(y_i \mid x_i)$$

  • Supervised instruction tuning (SFT): Human-annotated instructions mixed with language-only data (DeepSeek R1, ratio ≈ 1:1). Within-sample loss averaging balances short/long response distributions.
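The two loss constructions mentioned in the list above can be sketched in a few lines. The source does not spell out the exact normalization used for within-sample averaging, so the version below is one common reading (average token losses inside each response, then average across responses); the function names and the toy loss values are hypothetical.

```python
def multitask_loss(task_losses, alphas):
    """Weighted sum over tasks: sum_i alpha_i * L_i."""
    return sum(alphas[name] * loss for name, loss in task_losses.items())


def token_level_mean(per_sample_token_losses):
    """Average over all tokens in the batch; long responses dominate."""
    total = sum(sum(tokens) for tokens in per_sample_token_losses)
    count = sum(len(tokens) for tokens in per_sample_token_losses)
    return total / count


def within_sample_mean(per_sample_token_losses):
    """Average inside each sample first, then across samples,
    so each response carries equal weight regardless of length."""
    sample_means = [sum(tokens) / len(tokens)
                    for tokens in per_sample_token_losses if tokens]
    return sum(sample_means) / len(sample_means)


# Toy batch: one short response with high loss, one long response with low loss.
batch = [[2.0, 2.0], [1.0] * 8]
print(token_level_mean(batch))    # 1.2
print(within_sample_mean(batch))  # 1.5
```

In the toy batch the short response contributes 2 of 10 tokens under token-level averaging but half of the total under within-sample averaging, which is the balancing effect the text refers to.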

6. Performance Scaling and Empirical Outcomes

MindSpeed-MLLM demonstrates efficient scaling and throughput:

  • On 1024 Ascend 910B devices:
    • MFU ≈ 40% on both 8B and 32B models.
    • For the 8B model with 8192-token sequences and batch size 1024, high end-to-end sample and token throughput is reported, but exact figures are not given in the source data.
    • Near-linear strong scaling to 512 cards; weak scaling efficiency >90% up to 1024 cards.
  • Relative to baselines (MindSpeed-MM or LlamaFactory + CANN):
    • 1.8×–2.2× faster end-to-end training throughput.
    • Convergence-matched loss curves (within 1% MAE/MRE of baseline curves).

These outcomes confirm the effectiveness of MindSpeed-MLLM in large-scale multimodal training on Ascend hardware (Chen et al., 15 Sep 2025).
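For readers who want to reproduce metrics of the kind listed above from their own measurements, the sketch below shows how MFU, weak-scaling efficiency, and relative speedup are commonly computed. It assumes the widely used estimate of about 6 FLOPs per parameter per training token for dense transformers; every number in the example is a placeholder and none is a published Ascend 910B specification or a result from the source.

```python
def mfu(params: float, tokens_per_second: float,
        num_devices: int, peak_flops_per_device: float) -> float:
    """Model FLOPs utilization with the ~6 * params * tokens approximation."""
    achieved_flops = 6.0 * params * tokens_per_second
    return achieved_flops / (num_devices * peak_flops_per_device)


def weak_scaling_efficiency(throughput_base: float, devices_base: int,
                            throughput_scaled: float, devices_scaled: int) -> float:
    """Measured throughput divided by ideal linearly scaled throughput."""
    ideal = throughput_base * devices_scaled / devices_base
    return throughput_scaled / ideal


def speedup(baseline_throughput: float, new_throughput: float) -> float:
    """Relative end-to-end throughput gain over a baseline."""
    return new_throughput / baseline_throughput


# Placeholder measurements (hypothetical).
print(f"MFU: {mfu(8e9, 2.5e6, 1024, 3e14):.2%}")
print(f"weak scaling: {weak_scaling_efficiency(100.0, 8, 190.0, 16):.2%}")
print(f"speedup: {speedup(1.0, 2.0):.1f}x")
```

The 6-FLOPs rule ignores attention and recomputation costs, so MFU values computed this way are approximate and only comparable when the same convention is used throughout.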

7. Advanced Evaluation and Model Optimization Techniques

Two advanced post-training procedures further optimize MindVL and related models:

  • Test-time Resolution Search: A grid search over $(m_{\min}, m_{\max})$, the allowed range of image pixel counts, e.g., $m_{\min} \in \{4, 16, 32, 64\} \times 28^2$ and $m_{\max} \in \{1280, 2048, 2560, 3072, 4096, 8192\} \times 28^2$:

    • Images are upsampled if their pixel count is below $m_{\min}$ and downsampled if it is above $m_{\max}$.
    • For each pair, validation accuracy $A(m_{\min}, m_{\max})$ is measured.
    • The optimal range is selected as:

    $$(m_{\min}^*, m_{\max}^*) = \arg\max_{m_{\min},\, m_{\max}} \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\{\hat{y}_i(m_{\min}, m_{\max}) = y_i\}$$

  • Model Weight Averaging: Training checkpoints from SFT runs at sequence lengths $\ell \in \{2\text{k}, 4\text{k}, 8\text{k}\}$ are averaged:

$$\theta_{\text{merged}} = \frac{1}{3}\sum_{\ell} \theta^{(\ell)}$$

Optionally, exponential moving average (EMA) is applied:

$$\theta_{\mathrm{EMA}}^{(t)} = \alpha\,\theta_{\mathrm{EMA}}^{(t-1)} + (1-\alpha)\,\theta^{(t)}, \quad \alpha \in [0, 1]$$

These strategies refine generalization performance and accuracy consistency across diverse input resolutions and sequence lengths.
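Both procedures in this section reduce to short routines. The sketch below is a schematic under stated assumptions: `evaluate` stands in for a user-supplied validation run that returns accuracy for a given pixel range, checkpoints are represented as dictionaries of plain floats, and all function names are hypothetical.

```python
import math
from itertools import product


def target_pixels(pixels: int, m_min: int, m_max: int) -> int:
    """Clamp an image's pixel count into [m_min, m_max]."""
    return min(max(pixels, m_min), m_max)


def resize_scale(height: int, width: int, m_min: int, m_max: int) -> float:
    """Uniform scale factor that brings height * width into the allowed range."""
    pixels = height * width
    return math.sqrt(target_pixels(pixels, m_min, m_max) / pixels)


def search_resolution(evaluate, min_factors, max_factors, unit=28 * 28):
    """Return the (m_min, m_max) pair with the highest validation accuracy.
    `evaluate(m_min, m_max)` is a caller-supplied callable (hypothetical)."""
    candidates = [(a * unit, b * unit)
                  for a, b in product(min_factors, max_factors) if a <= b]
    return max(candidates, key=lambda pair: evaluate(*pair))


def average_checkpoints(checkpoints):
    """Uniform average of parameter dicts that share the same keys."""
    n = len(checkpoints)
    return {key: sum(c[key] for c in checkpoints) / n for key in checkpoints[0]}


def ema_update(ema, current, alpha):
    """theta_EMA(t) = alpha * theta_EMA(t-1) + (1 - alpha) * theta(t)."""
    return {key: alpha * ema[key] + (1 - alpha) * current[key] for key in ema}


# Toy usage: three "checkpoints" for the 2k, 4k, and 8k SFT runs.
merged = average_checkpoints([{"w": 1.0}, {"w": 2.0}, {"w": 3.0}])
print(merged)  # {'w': 2.0}
```

In practice the parameter values would be tensors rather than floats, and each call to `evaluate` would involve a full validation pass, so the grid is usually kept small.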


The MindSpeed-MLLM framework combines distributed multimodal optimization, precision alignment, hardware-specific operator equivalence, and efficient data packaging. Together, these features underpin data- and compute-efficient training of large vision-language models on Ascend NPUs, matching or improving on contemporary GPU-centric pipelines while achieving comparable accuracy with a fraction of the training data (Chen et al., 15 Sep 2025).

References (1)
