D-MoLE: Continual Multimodal Instruction Tuning
- Continual multimodal instruction tuning (CMIT) is the setting of incrementally adapting multimodal language models to sequential tasks while mitigating catastrophic forgetting.
- D-MoLE employs dynamic, budget-aware allocation of LoRA experts to resolve task architecture conflicts and modality imbalances.
- Empirical evaluations demonstrate significant improvements in average accuracy and retention, outperforming fixed-parameter methods.
Continual Multimodal Instruction Tuning (CMIT) addresses the challenge of adapting Multimodal LLMs (MLLMs) to evolving sequences of vision-language tasks, aiming to incrementally acquire new capabilities while retaining prior knowledge. The Dynamic Mixture of Curriculum LoRA Experts (D-MoLE) framework (Ge et al., 13 Jun 2025) introduces a dynamic, architecture-evolving approach to CMIT that resolves enduring issues of task architecture conflict and modality imbalance through budget-aware expert allocation, inter-modal curricula, and conditional expert routing.
1. Problem Formulation and Technical Challenges
Continual multimodal instruction tuning extends traditional (joint) multimodal instruction tuning by exposing an MLLM to tasks sequentially rather than jointly. At each stage $t$, the model is required to minimize the new-task loss

$$\mathcal{L}_t(\theta) = \mathbb{E}_{(x, y) \sim \mathcal{D}_t}\left[-\log p_\theta(y \mid x)\right]$$

subject to retention of performance on the preceding tasks $1, \dots, t-1$, i.e., mitigating catastrophic forgetting.
D-MoLE targets two key issues that arise in this context:
- Task Architecture Conflict: Different tasks induce layer-wise discrepancies in gradient magnitudes; uniform parameter adaptation (such as inserting identical LoRA modules at every transformer block) wastes budget, as some layers are insensitive to a new task while others require substantial adaptation. Formally, for tasks $t_i$ and $t_j$, there exists a layer $l$ such that $g_l^{t_i} \gg g_l^{t_j}$ (with the ordering reversed at other layers), where $g_l^t$ denotes the gradient norm of layer $l$'s parameters under task $t$'s loss.
- Modality Imbalance: In multimodal tasks, the adaptation difficulty associated with the vision and language components can differ substantially from task to task. Fixed parameter splits across modalities lead to inefficient adaptation, overfitting some components while underfitting others.
2. D-MoLE Architecture and Dynamic Expert Allocation
D-MoLE introduces a mechanism for dynamic allocation of LoRA-based adapters ("experts") tailored to each task and layer, subject to a total parameter budget $\mathcal{B}$. The architectural formulation is:
- For each transformer layer $l$, the forward pass for task $t$ is

$$h_l = W_l x + \sum_{k=1}^{t} \alpha_{k,l}\, \gamma_{k,l}(x)\, \Delta W_{k,l}\, x,$$

where:
  - $W_l$ is the frozen pretrained weight,
  - $\alpha_{k,l} \in \{0, 1\}$ indicates allocation of a task-$k$ LoRA expert at layer $l$,
  - $\gamma_{k,l}(x)$ is a dynamic gate determining whether the $k$-th expert is active for input $x$,
  - $\Delta W_{k,l} = B_{k,l} A_{k,l}$ is the low-rank update with rank $r$ for task $k$.
- Layer Sensitivity Estimation: Prior to adaptation, a "zero-cost proxy" selects a small fraction (1%) of task samples to compute initial gradient norms

$$g_l^t = \left\lVert \nabla_{W_l} \mathcal{L}_t \right\rVert_2$$

for each layer $l$, quantifying its sensitivity to the new task.
- Module (Modality) Difficulty: The cumulative gradient norm per modality (LLM, vision) defines a difficulty score:

$$d_m^t = \sum_{l \in m} g_l^t, \quad m \in \{\text{LLM}, \text{Vision}\}.$$

The available parameter budget $\mathcal{B}$ is split in the ratio $d_{\text{LLM}}^t : d_{\text{Vision}}^t$, so each modality receives $\mathcal{B}_m = \mathcal{B} \cdot d_m^t / (d_{\text{LLM}}^t + d_{\text{Vision}}^t)$.
- Sparse Layer Allocation: Within each module, the top-$k$ layers by $g_l^t$ are selected for new expert insertion, up to the modality's budget. This results in a sparse, task-specific LoRA pattern that concentrates adaptation capacity where actual task demands require it.
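The allocation procedure above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code: `allocate_experts` and its inputs are hypothetical names, and the proportional rounding is a simplification (in general, rounding may need adjustment to hit the budget exactly).

```python
# Hypothetical sketch of D-MoLE's budget-aware expert allocation. Given
# per-layer gradient norms measured on ~1% of new-task data, we (1) split
# the expert budget across modalities in proportion to their cumulative
# gradient norms, and (2) insert LoRA experts only at the top-k most
# sensitive layers within each modality.

def allocate_experts(grad_norms_by_modality, total_budget):
    """grad_norms_by_modality: {modality: [g_l for each layer]}.
    total_budget: total number of LoRA experts available for this task.
    Returns {modality: sorted layer indices that receive a new expert}."""
    # Difficulty score d_m = sum of per-layer gradient norms in modality m.
    difficulty = {m: sum(g) for m, g in grad_norms_by_modality.items()}
    total_difficulty = sum(difficulty.values())

    allocation = {}
    for m, g in grad_norms_by_modality.items():
        # Modality budget proportional to its difficulty score.
        budget_m = round(total_budget * difficulty[m] / total_difficulty)
        # Pick the budget_m most sensitive layers (largest gradient norm).
        ranked = sorted(range(len(g)), key=lambda l: g[l], reverse=True)
        allocation[m] = sorted(ranked[:budget_m])
    return allocation

# Example: the language module dominates the gradient signal for this
# task, so it receives most of the budget.
grads = {
    "llm":    [0.9, 0.1, 0.8, 0.05, 0.7, 0.6],
    "vision": [0.2, 0.05, 0.15, 0.1],
}
alloc = allocate_experts(grads, total_budget=4)
```

Note how the sparse pattern falls out directly: layers with negligible gradient norm simply never rank into the budget.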
3. Expert Routing and Knowledge Transfer
D-MoLE enhances knowledge retention and transfer using dynamic routing via per-task autoencoders:
- For each task $t$, a 2-layer MLP autoencoder $\phi_t$ is trained on pooled multimodal features $z$ to minimize the reconstruction loss $\mathcal{L}_{\text{rec}}^t = \lVert z - \phi_t(z) \rVert_2^2$.
- At training and inference, for a given input $x$, the reconstruction losses of all previous-task autoencoders on $x$'s pooled features are computed; the autoencoders with the lowest reconstruction errors are deemed most relevant.
- The router activates the top-2 most relevant past-task experts (those with the lowest reconstruction loss on $x$) via their dynamic gates, enabling soft knowledge reuse. The current task's own experts are always active at their allocated layers.
This routing mechanism allows D-MoLE to preserve and selectively transfer knowledge in a data-driven, input-adaptive fashion without explicit regularization or rehearsal.
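The routing decision itself reduces to a small selection step. The sketch below is illustrative (`route_experts` and its inputs are hypothetical names, and the per-task reconstruction losses are assumed to be precomputed by the autoencoders):

```python
# Hypothetical sketch of D-MoLE's autoencoder-based expert routing. Each
# past task keeps an autoencoder; the tasks whose autoencoders reconstruct
# the current input's pooled features best are treated as most related,
# and their experts are gated on alongside the current task's experts.

def route_experts(rec_losses, current_task, top_k=2):
    """rec_losses: {past_task_id: reconstruction loss of that task's
    autoencoder on the current input's features}.
    Returns the set of task IDs whose experts are active: the current
    task plus the top_k past tasks with the lowest loss."""
    most_relevant = sorted(rec_losses, key=rec_losses.get)[:top_k]
    return {current_task, *most_relevant}

# Example: tasks 1 and 3 reconstruct the input best, so their experts are
# reused while training/serving task 5.
active = route_experts({1: 0.12, 2: 0.80, 3: 0.25, 4: 0.60}, current_task=5)
```

Because relevance is computed per input, different inputs within the same task can reuse different past experts, which is what makes the transfer input-adaptive.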
4. Gradient-Based Inter-Modal Continual Curriculum
D-MoLE integrates a curriculum strategy to automatically adapt the proportion of adaptation resources allocated to each modality for each new task:
- Measurement: For a new task $t$, compute the modality difficulty scores $d_{\text{LLM}}^t$ and $d_{\text{Vision}}^t$ (cumulative gradient norms per modality) using partial gradients on initial task data.
- Curriculum Schedule: The proportion of budget per modality is set by the ratio of these scores, and the allocated experts are then distributed to the most sensitive layers within each modality as described previously.
The resulting approach ensures that subtasks with higher adaptation difficulty—objectively measured by gradient magnitudes—receive commensurately more adaptation resources, addressing the modality imbalance problem across diverse continual tasks.
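The effect of the curriculum over a task stream can be illustrated with a toy example. The function and the difficulty values below are assumptions for illustration, not figures from the paper:

```python
# Illustrative sketch of the inter-modal continual curriculum: for each
# incoming task, the modality budget split is re-derived from that task's
# measured gradient-based difficulty, so language-heavy and vision-heavy
# tasks receive different splits.

def modality_split(d_llm, d_vision, total_budget):
    """Split total_budget proportionally to the two difficulty scores."""
    share_llm = d_llm / (d_llm + d_vision)
    b_llm = round(total_budget * share_llm)
    return b_llm, total_budget - b_llm

# Toy task stream of (d_llm, d_vision) pairs, each measured on ~1% of
# that task's data: language-heavy, vision-heavy, then balanced.
stream = [(3.0, 1.0), (1.0, 3.0), (2.0, 2.0)]
splits = [modality_split(dl, dv, total_budget=8) for dl, dv in stream]
```

The split is recomputed from scratch at every stage, so no fixed modality ratio is ever assumed across the stream.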
5. Overall Training Objective and Resource Considerations
At each stage, only the newly introduced LoRA parameters at the selected layers (and the task-specific autoencoder) are optimized; the base model and all past-task LoRA parameters remain frozen. No explicit regularizer (such as EWC) or experience replay is required: task-specific experts and routing implicitly retain prior knowledge.
Reported implementation details include LoRA rank 8 per expert across the 48 transformer layers, 1% of task data for the sensitivity proxy, and 2-layer MLP autoencoders with hidden size 128.
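A back-of-the-envelope calculation shows why rank-8 experts keep the per-task cost small. The hidden size of 2048 below is an assumed value for a ~2B-parameter backbone, not a figure from the paper:

```python
# Per-expert parameter cost of a rank-r LoRA on a d_out x d_in weight:
# the expert adds two low-rank factors, A (r x d_in) and B (d_out x r).

def lora_params(d_in, d_out, rank=8):
    return rank * d_in + d_out * rank

# Assumed hidden size 2048 for illustration: one rank-8 expert on a
# square projection adds 2 * 8 * 2048 = 32768 trainable parameters.
per_expert = lora_params(2048, 2048)
```

Even if every one of the 48 layers received an expert, the added parameters would remain a small fraction of the frozen backbone, and sparse allocation shrinks this further.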
6. Empirical Performance and Comparative Analysis
D-MoLE was evaluated on a nine-task CMIT benchmark (5 VQA, 3 captioning, 1 grounding) with InternVL2-2B as backbone. Key metrics are averaged over tasks:
| Method | AVG | Last | BWT |
|---|---|---|---|
| O-LoRA | 58.8% | 62.0% | −21.3% |
| D-MoLE | 73.9% | 82.2% | −1.5% |
Here, AVG denotes mean accuracy/CIDEr/IoU over all tasks after each stage, Last is performance on each task after the final stage, and BWT (Backward Transfer) quantifies forgetting (less negative is better). D-MoLE thus achieves a +15.1pp gain in AVG, +20.2pp in Last, and nearly eliminates forgetting relative to baselines, while being computationally efficient due to sparse backpropagation through selected LoRA inserts.
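The three metrics above can be computed from the standard stage-by-task performance matrix. The sketch below uses a toy 3-task matrix with made-up numbers, purely to pin down the definitions:

```python
# Continual-learning metrics on a performance matrix R, where R[i][j] is
# performance on task j after finishing training stage i (for j <= i).

def cl_metrics(R):
    T = len(R)
    # AVG: mean, over stages, of mean performance on all tasks seen so far.
    avg = sum(sum(R[i][: i + 1]) / (i + 1) for i in range(T)) / T
    # Last: mean performance on every task after the final stage.
    last = sum(R[T - 1]) / T
    # BWT: mean change on each old task between learning it and the final
    # stage (less negative means less forgetting).
    bwt = sum(R[T - 1][j] - R[j][j] for j in range(T - 1)) / (T - 1)
    return avg, last, bwt

# Toy matrix: slight degradation on tasks 1 and 2 as later tasks arrive.
R = [
    [0.80, 0.00, 0.00],
    [0.78, 0.70, 0.00],
    [0.76, 0.69, 0.75],
]
avg, last, bwt = cl_metrics(R)
```

Under these definitions, a method that never forgets has BWT of zero, and Last converges to AVG when performance is stable across stages.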
Budget usage visualizations demonstrate that D-MoLE’s allocation patterns adaptively concentrate experts in distinct layers per task. The router effectively gates both the expert added for the current task and the related expert(s) from prior tasks, suggesting successful data-driven knowledge sharing.
7. Theoretical and Practical Significance
D-MoLE (Ge et al., 13 Jun 2025) is notable as the first approach to continual multimodal instruction tuning that explores architectural dynamism under strict parameter budgets. It contrasts with methods such as LLaVA-c (Liu et al., 10 Jun 2025), which work within a fixed shared-parameter regime and employ techniques like spectral-aware consolidation and unsupervised inquiry regularization. Whereas LLaVA-c consolidates and regularizes a unified parameter core, D-MoLE evolves the model architecture itself, introducing sparse, task- and modality-adaptive experts and conditional reuse.
A plausible implication is that architectural approaches like D-MoLE may be increasingly favored as the number and diversity of tasks facing MLLMs grows, particularly in scenarios where fixed-parameter, core-centric retention methods encounter diminishing returns in capacity or fundamental coverage limitations. Direct head-to-head benchmarking on identical task suites remains an open direction.
8. Limitations and Future Directions
D-MoLE is currently evaluated with LoRA insertion under a fixed budget and relies on gradient magnitude as the sole proxy for adaptation need. Router autoencoder thresholds are chosen based on in-task distributions; while performance is reported as robust to scaling, generalization across backbone architectures and more diverse modalities (e.g., video, audio) remains untested.
Future research may extend D-MoLE’s dynamic expert management to larger backbones, lifelong task streams, and richer inter-task relationships. Architecturally, avenues include more expressive routing (e.g., transformer-based routers) or mixture-of-experts approaches that weight soft rather than binary expert participation. The interaction of curriculum and expert allocation may also benefit from finer-grained, potentially task-driven annealing or adaptive schedule learning.