
D-MoLE: Continual Multimodal Instruction Tuning

Updated 16 November 2025
  • Continual Multimodal Instruction Tuning is a framework that incrementally adapts multimodal language models to sequential tasks while mitigating catastrophic forgetting.
  • D-MoLE employs dynamic, budget-aware allocation of LoRA experts to resolve task architecture conflicts and modality imbalances.
  • Empirical evaluations demonstrate significant improvements in average accuracy and retention, outperforming fixed-parameter methods.

Continual Multimodal Instruction Tuning (CMIT) addresses the challenge of adapting Multimodal LLMs (MLLMs) to evolving sequences of vision-language tasks, aiming to incrementally acquire new capabilities while retaining prior knowledge. The Dynamic Mixture of Curriculum LoRA Experts (D-MoLE) framework (Ge et al., 13 Jun 2025) introduces a dynamic, architecture-evolving approach to CMIT that resolves enduring issues of task architecture conflict and modality imbalance through budget-aware expert allocation, inter-modal curricula, and conditional expert routing.

1. Problem Formulation and Technical Challenges

Continual multimodal instruction tuning extends traditional (joint) multimodal instruction tuning by exposing an MLLM to tasks $\mathcal{T}_1, \mathcal{T}_2, \ldots, \mathcal{T}_T$ sequentially rather than jointly. At each stage $t$, the model is required to minimize the new-task loss

$$\mathcal{L}_{\rm task}(\theta) = \mathbb{E}_{(x,y)\in \mathcal{D}_t}[\ell(f_\theta(x), y)]$$

subject to retention of performance on preceding tasks, i.e., mitigating catastrophic forgetting.
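The sequential setting above can be sketched as a minimal training loop, where only the current task's data is visible at each stage; all names here are illustrative assumptions, not from the paper's code.

```python
import torch

def continual_tune(model, task_loaders, make_optimizer, steps_per_task=100):
    """Adapt `model` to tasks T_1..T_T in order, minimizing the new-task loss
    at each stage; retention of earlier tasks must come from the method itself."""
    for t, loader in enumerate(task_loaders):
        opt = make_optimizer(model)  # e.g., an optimizer over new adapters only
        for step, (x, y) in zip(range(steps_per_task), loader):
            loss = torch.nn.functional.cross_entropy(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```

Methods differ in what `make_optimizer` exposes: naive fine-tuning trains everything (and forgets), while D-MoLE trains only newly allocated experts.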

D-MoLE targets two key issues that arise in this context:

  • Task Architecture Conflict: Different tasks induce layer-wise discrepancies in gradient magnitudes; uniform parameter adaptation (such as inserting identical LoRA modules at every transformer block) is suboptimal for budget utilization, as some layers are insensitive to new tasks while others require substantial adaptation. Formally, for tasks $\mathcal{T}_A$ and $\mathcal{T}_B$, there exists a layer $l^*$ such that

$$\left\| \mathbb{E}_{\mathcal{T}_A}[\nabla_{W_{l^*}} \mathcal{L}] \right\|_2 \ne \left\| \mathbb{E}_{\mathcal{T}_B}[\nabla_{W_{l^*}} \mathcal{L}] \right\|_2.$$

  • Modality Imbalance: In multimodal tasks, the adaptation difficulty associated with the vision and language components can differ substantially from task to task. Fixed parameter splits across modalities lead to inefficient adaptation, overfitting some components while underfitting others.
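As an illustrative diagnostic for the architecture-conflict point, one can compare per-layer gradient-norm profiles of two tasks on the same frozen backbone; the function and variable names below are assumptions, not the paper's code.

```python
import torch

def layer_grad_norms(model, batch):
    """Gradient L2-norm per named parameter for one batch of a task."""
    x, y = batch
    model.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    return {name: p.grad.norm().item()
            for name, p in model.named_parameters() if p.grad is not None}

# Layers where the two tasks' profiles diverge most are exactly the ones a
# uniform LoRA placement over- or under-serves.
```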

2. D-MoLE Architecture and Dynamic Expert Allocation

D-MoLE introduces a mechanism for dynamic allocation of LoRA-based adapters ("experts") tailored to each task and layer, subject to a total parameter budget $B_{\rm total}$. The architectural formulation is:

  • For each transformer layer $l$, the forward pass for task $t$ is

$$f_l^t(x) = W_l^0 x + \sum_{k=1}^{t-1} I_l^k\, g_l^k(x)\, \Delta W_l^k x + I_l^t \Delta W_l^t x,$$

where:
  • $W_l^0$ is the frozen pretrained weight,
  • $I_l^k \in \{0,1\}$ indicates allocation of a task-$k$ LoRA expert at layer $l$,
  • $g_l^k(x)$ is a dynamic gate determining whether the $k$-th expert is active for input $x$,
  • $\Delta W_l^k$ is the low-rank update with rank $r$ for task $k$.

  • Layer Sensitivity Estimation: Prior to adaptation, a "zero-cost proxy" selects $1\%$ of task samples to compute initial gradient norms

$$G(l,t) = \left\| \nabla_{W_l^0} \mathcal{L}(\mathcal{D}_{{\rm sub},t}) \right\|_2$$

for each layer $l$, quantifying its sensitivity to the new task.

  • Module (Modality) Difficulty: The cumulative gradient norm per modality (LLM, Vision) defines a difficulty score:

$$\operatorname{Score}_M^t = \left\| \nabla_{W_M^0} \mathcal{L}(\mathcal{D}_{{\rm sub},t}) \right\|_2, \quad M \in \{{\rm LLM}, {\rm Vision}\}.$$

The available parameter budget $B_{\rm total}$ is split by the ratio $r_M^t = \frac{\operatorname{Score}_M^t}{\operatorname{Score}_{\rm LLM}^t + \operatorname{Score}_{\rm Vision}^t}$, so each modality receives $B_M^t = r_M^t B_{\rm total}$.

  • Sparse Layer Allocation: Within each module, the top $B_M^t$ layers by $G(l,t)$ are selected for new expert insertion. This results in a sparse, task-specific LoRA pattern, concentrating adaptation capacity where needed according to actual task demands.
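The budget split and top-layer selection above can be sketched as follows, using the stated formulas; the names and the simple rounding of per-modality budgets are illustrative assumptions.

```python
def allocate_experts(scores, layer_sens, b_total):
    """Budget-aware expert allocation sketch.

    scores: {'LLM': float, 'Vision': float} cumulative gradient-norm scores.
    layer_sens: {modality: {layer_index: G(l, t)}} per-layer sensitivities.
    Returns {modality: set of layer indices receiving a new LoRA expert}.
    """
    total = sum(scores.values())
    alloc = {}
    for m, sens in layer_sens.items():
        budget = round(b_total * scores[m] / total)   # B_M^t = r_M^t * B_total
        ranked = sorted(sens, key=sens.get, reverse=True)
        alloc[m] = set(ranked[:budget])               # top-B_M^t layers by G(l,t)
    return alloc
```

A task whose vision tower produces larger gradient norms automatically receives more of the shared budget, and within each tower only the most sensitive layers get a new expert.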

3. Expert Routing and Knowledge Transfer

D-MoLE enhances knowledge retention and transfer using dynamic routing via per-task autoencoders:

  • For each task $k$, a 2-layer MLP autoencoder is trained on pooled multimodal features $z$ to minimize the reconstruction loss $\mathcal{L}_{\rm rec}^k(z) = \|z - \hat{z}^k\|_2^2$.
  • At training and inference, for a given input $x$, the reconstruction losses of all previous task autoencoders on $z$ are computed; the autoencoders with the lowest reconstruction errors are deemed most relevant.
  • The router activates the top-2 most relevant past-task experts (those with the lowest $\mathcal{L}_{\rm rec}^k$) via $g_l^k(x)$ for soft knowledge reuse. New-task experts ($\Delta W_l^t$) are always active at their allocated layers.

This routing mechanism allows D-MoLE to preserve and selectively transfer knowledge in a data-driven, input-adaptive fashion without explicit regularization or rehearsal.
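A minimal sketch of this reconstruction-based routing, assuming the 2-layer MLP autoencoder shape reported in the paper; the class and function names, and the feature dimension, are illustrative assumptions.

```python
import torch

class TaskAE(torch.nn.Module):
    """2-layer MLP autoencoder over pooled multimodal features z."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, dim))

    def rec_loss(self, z):
        # per-sample squared reconstruction error ||z - z_hat||^2
        return ((self.net(z) - z) ** 2).mean(dim=-1)

def route_top2(autoencoders, z):
    """Indices of the two past tasks whose autoencoders reconstruct z best."""
    losses = torch.stack([ae.rec_loss(z).mean() for ae in autoencoders])
    k = min(2, len(autoencoders))
    return torch.topk(-losses, k=k).indices.tolist()
```

The selected indices drive the gates $g_l^k(x)$: only the returned past-task experts contribute to the forward pass for this input.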

4. Gradient-Based Inter-Modal Continual Curriculum

D-MoLE integrates a curriculum strategy to automatically adapt the proportion of adaptation resources allocated to each modality for each new task:

  • Measurement: For a new task $t$, compute $\operatorname{Score}_{\rm LLM}^t$ and $\operatorname{Score}_{\rm Vision}^t$ using partial gradients on initial task data.
  • Curriculum Schedule: The proportion of budget per modality is set by the ratio of these scores, and the allocated experts are then distributed to the most sensitive layers within each modality as described previously.

The resulting approach ensures that subtasks with higher adaptation difficulty—objectively measured by gradient magnitudes—receive commensurately more adaptation resources, addressing the modality imbalance problem across diverse continual tasks.

5. Overall Training Objective and Resource Considerations

At each stage, only the newly introduced LoRA parameters at selected layers (and the task-specific autoencoder) are optimized:

$$\min_{\{\mathbf{A}_l^t, \mathbf{B}_l^t\},\, AE^t}\; \mathcal{L}_{\rm task}(\mathcal{D}_t) + \lambda\,\mathcal{L}_{\rm rec}^t \quad \text{s.t. } I_l^t \in \{0,1\},\; \sum_l I_l^t \le B_{\rm total}.$$

The base model and past-task LoRA parameters are frozen. No explicit regularizer (such as EWC) or experience replay is required, as task-specific experts and routing implicitly retain prior knowledge.

Implementation details include LoRA rank 8 per expert, $B_{\rm total} = 24$ across 48 layers, 1% data usage for sensitivity proxies, and 2-layer MLP autoencoders with hidden size 128.
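The per-stage freezing scheme implied by this objective can be sketched as follows; the attribute names (`lora_experts`, `autoencoders`) are illustrative assumptions, not the paper's API.

```python
import torch

def trainable_params_for_stage(backbone, lora_experts, autoencoders, t):
    """Freeze the backbone and all past-task experts; return only the new
    task-t LoRA parameters plus its autoencoder for the optimizer."""
    for p in backbone.parameters():
        p.requires_grad_(False)          # frozen pretrained weights W^0
    for k, expert in enumerate(lora_experts):
        for p in expert.parameters():
            p.requires_grad_(k == t)     # only task-t experts train
    params = list(lora_experts[t].parameters())
    params += list(autoencoders[t].parameters())
    return params                        # pass to e.g. torch.optim.AdamW
```

Because gradients never flow into frozen experts, backpropagation touches only the sparse set of newly inserted modules, which is the source of the efficiency claim in Section 6.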

6. Empirical Performance and Comparative Analysis

D-MoLE was evaluated on a nine-task CMIT benchmark (5 VQA, 3 captioning, 1 grounding) with InternVL2-2B as backbone. Key metrics are averaged over tasks:

| Method | AVG | Last | BWT |
|--------|-------|-------|--------|
| O-LoRA | 58.8% | 62.0% | −21.3% |
| D-MoLE | 73.9% | 82.2% | −1.5% |

Here, AVG denotes mean accuracy/CIDEr/IoU over all tasks after each stage, Last is performance on each task after the final stage, and BWT (Backward Transfer) quantifies forgetting (less negative is better). D-MoLE thus achieves a +15.1pp gain in AVG, +20.2pp in Last, and nearly eliminates forgetting relative to baselines, while being computationally efficient due to sparse backpropagation through selected LoRA inserts.

Budget usage visualizations demonstrate that D-MoLE’s allocation patterns adaptively concentrate experts in distinct layers per task. The router effectively gates both the expert added for the current task and the related expert(s) from prior tasks, suggesting successful data-driven knowledge sharing.

7. Theoretical and Practical Significance

D-MoLE (Ge et al., 13 Jun 2025) is notable as the first approach to continual multimodal instruction tuning that explores architectural dynamism under strict parameter budgets. It contrasts with methods such as LLaVA-c (Liu et al., 10 Jun 2025), which work within a fixed shared-parameter regime and employ techniques like spectral-aware consolidation and unsupervised inquiry regularization. Whereas LLaVA-c consolidates and regularizes a unified parameter core, D-MoLE evolves the model architecture itself, introducing sparse, task- and modality-adaptive experts and conditional reuse.

A plausible implication is that architectural approaches like D-MoLE may be increasingly favored as the number and diversity of tasks facing MLLMs grow, particularly in scenarios where fixed-parameter, core-centric retention methods encounter diminishing returns in capacity or run into fundamental coverage limitations. Direct head-to-head benchmarking on identical task suites remains an open direction.

8. Limitations and Future Directions

D-MoLE currently evaluates LoRA insertion with a fixed budget and relies on gradient magnitude as the sole proxy for adaptation need. Thresholds for router autoencoders are chosen based on in-task distribution; while performance is reported as robust to scaling, universality across model signatures and more diverse modalities (e.g., video, audio) is untested.

Future research may extend D-MoLE’s dynamic expert management to larger backbones, lifelong task streams, and richer inter-task relationships. Architecturally, avenues include more expressive routing (e.g., transformer-based routers) or mixture-of-experts approaches that weight soft rather than binary expert participation. The interaction of curriculum and expert allocation may also benefit from finer-grained, potentially task-driven annealing or adaptive schedule learning.
