
BranchLoRA: Efficient MoE-LoRA for MCIT

Updated 25 February 2026
  • BranchLoRA is an asymmetric Mixture-of-Experts Low-Rank Adaptation architecture that employs a shared trunk and specialized branch matrices to address parameter inefficiency.
  • It uses task-specific routers and a top-k expert selection mechanism to optimize parameter utilization in multimodal continual instruction tuning.
  • The design integrates flexible tuning-freezing to preserve previously learned knowledge while effectively adapting to new tasks.

BranchLoRA is an asymmetric Mixture-of-Experts (MoE) Low-Rank Adaptation (LoRA) architecture designed to improve parameter efficiency and mitigate catastrophic forgetting in Multimodal Continual Instruction Tuning (MCIT) for Multimodal Large Language Models (MLLMs). It shares a single "trunk" matrix across all tasks while maintaining specialized "branch" matrices for subsets of tasks, which removes the parameter redundancy observed in previous MoE-LoRA approaches. It also introduces flexible tuning-freezing, task-specific routers, and automatic task selection to support continual learning across diverse sequential vision–language tasks (Zhang et al., 31 May 2025).

1. Multimodal Continual Instruction Tuning and Motivation

MCIT addresses the challenge of incrementally adapting a frozen MLLM to a series of new multimodal tasks $t_1,\dots,t_T$, each defined by a natural language instruction, a training set, and a test set. The objective is to produce a single model $\mathcal{M}$ that, after sequential tuning on $\mathcal{D}_{t_1}^{\mathrm{train}},\dots,\mathcal{D}_{t_T}^{\mathrm{train}}$, retains high performance on every $\mathcal{D}_{t_j}^{\mathrm{test}}$ for $j\leq T$.

The MoE-LoRA framework has been widely adopted in MCIT. In it, each feed-forward (FF) layer's weight update is decomposed into $N$ independent low-rank LoRA adapters ("experts"), and a router determines the mixture allocation per input. This isolates task-specific parameters and partially prevents catastrophic forgetting (CF). However, empirical analysis reveals that in MoE-LoRA the expert input matrices ($\bm{A}_j$) tend to collapse into a shared subspace, while the output matrices ($\bm{B}_j$) capture task diversity. This redundancy results in parameter inefficiency: each expert carries its own rank-$r/N$ factor for both $\bm{A}$ and $\bm{B}$, although only $\bm{B}$ genuinely reflects per-task variation.

2. Architecture and Mechanisms of BranchLoRA

BranchLoRA addresses this redundancy by making the adapter itself asymmetric, in contrast to the symmetric per-expert ($\bm{A}_j$, $\bm{B}_j$) pairs of MoE-LoRA. It introduces the following key design changes:

  • Shared trunk matrix: A single matrix $\bm{A}\in\mathbb{R}^{d_{\mathrm{in}}\times r/N}$ is shared across all tasks, capturing task-invariant computation.
  • Branch matrices: $N$ matrices $\{\bm{B}_1,\ldots,\bm{B}_N\}$, each $\bm{B}_j\in\mathbb{R}^{r/N\times d_{\mathrm{out}}}$, that specialize to subsets of tasks.
  • Per-task routers: For each task $t_i$, a task-specific router parameter $\bm{W}_r^{t_i}$ dispatches activations to the most relevant branches, producing a sparse distribution via softmax over the top-$k$ pre-activations.
  • Top-$k$ expert selection: Only the $k$ branches with the highest scores participate per input, increasing parameter utilization and specialization.

The forward computation at each FF layer is

$$\bm{h} = \bm{x} \bm{W}_f + \frac{\alpha}{r}\sum_{j\in \mathrm{Top}\,k} R^{t_i}(\bm{x})_j \,(\bm{x} \bm{A} \bm{B}_j)$$

where $\bm{W}_f$ is the frozen baseline weight, $r$ is the LoRA rank, $\alpha$ is the LoRA scaling factor, and $R^{t_i}(\bm{x})$ is the routing distribution specific to task $t_i$.
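The forward computation above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function name `branchlora_forward`, the array layouts, and the choice of a purely linear router (`x @ W_r`) are assumptions made for the example; the source only states that the routing distribution comes from a softmax over the top-$k$ pre-activations.

```python
import numpy as np

def branchlora_forward(x, W_f, A, B, W_r, alpha, r, k):
    """Sketch of one BranchLoRA FF layer (row-vector convention).

    x:   (batch, d_in)       input activations
    W_f: (d_in, d_out)       frozen base weight
    A:   (d_in, r_e)         shared trunk, r_e = r / N
    B:   (N, r_e, d_out)     stacked branch matrices
    W_r: (d_in, N)           router for the current task (assumed linear)
    """
    scores = x @ W_r                                   # (batch, N) pre-activations
    topk = np.argsort(-scores, axis=1)[:, :k]          # indices of the k best branches
    top_scores = np.take_along_axis(scores, topk, axis=1)
    top_scores = top_scores - top_scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(top_scores)
    weights /= weights.sum(axis=1, keepdims=True)      # softmax over the top-k only

    z = x @ A                                          # trunk projection, computed once
    h = x @ W_f                                        # frozen path
    for slot in range(k):
        idx = topk[:, slot]                            # branch chosen per sample
        branch_out = np.einsum("br,bro->bo", z, B[idx])
        h = h + (alpha / r) * weights[:, slot:slot + 1] * branch_out
    return h, topk, weights
```

Because the trunk is shared, `x @ A` is computed once per input and reused by every selected branch, so only the branch multiplications scale with $k$.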

This asymmetric structure advances parameter efficiency without sacrificing the expressiveness required to learn task-specific behaviors.
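As a rough illustration of where the savings come from, the adapter parameters of a single layer can be counted directly from the matrix shapes given above. The count below ignores router parameters and assumes a square layer ($d_{\mathrm{in}} = d_{\mathrm{out}} = d$) for the final ratio, so it is a back-of-the-envelope sketch and is not expected to reproduce the model-level totals reported in the source.

```latex
\begin{align*}
P_{\text{MoE-LoRA}}
  &= N\left(d_{\mathrm{in}}\cdot\tfrac{r}{N} + \tfrac{r}{N}\cdot d_{\mathrm{out}}\right)
   = r\,(d_{\mathrm{in}} + d_{\mathrm{out}}) \\
P_{\text{BranchLoRA}}
  &= d_{\mathrm{in}}\cdot\tfrac{r}{N} + N\cdot\tfrac{r}{N}\cdot d_{\mathrm{out}}
   = r\left(\tfrac{d_{\mathrm{in}}}{N} + d_{\mathrm{out}}\right) \\
\frac{P_{\text{BranchLoRA}}}{P_{\text{MoE-LoRA}}}\bigg|_{d_{\mathrm{in}}=d_{\mathrm{out}}=d}
  &= \frac{1 + 1/N}{2}
   \;\overset{N=8}{=}\; \frac{9}{16} \approx 0.56
\end{align*}
```

Under these simplifying assumptions, the per-layer saving depends only on $N$ and on the ratio of $d_{\mathrm{in}}$ to $d_{\mathrm{out}}$; it grows when $d_{\mathrm{in}}$ is large relative to $d_{\mathrm{out}}$, since the trunk is the part that is no longer replicated.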

3. Flexible Tuning–Freezing and Branch Specialization

To further counteract catastrophic forgetting, BranchLoRA employs a tuning-freezing mechanism. After training on task $t_i$, activation statistics from the router over $\mathcal{D}_{t_i}^{\mathrm{train}}$ identify the top-$k$ most-utilized branches ($\bm{B}_j$), which are then frozen for subsequent tasks. This ensures that knowledge pertinent to previous tasks is maintained, while allowing unfrozen branches to remain adaptable.

During training on $t_{i+1}$, the update rule enforces

$$\nabla_{\bm{B}_j} = \begin{cases} 0, & \text{if } j\in \text{Frozen} \\ \text{standard gradient}, & \text{otherwise} \end{cases}$$

while $\bm{A}$ and $\bm{W}_r^{t_{i+1}}$ remain fully trainable. This mechanism facilitates both stability of past knowledge and adaptability to new tasks, and enables selective reactivation for knowledge transfer across tasks.
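The freezing step can be sketched as two small helpers: one that picks the most-utilized branches from routing records, and one that applies the gradient mask. The helper names are hypothetical, and using raw top-$k$ selection counts as the "activation statistic" is an assumption for illustration; the source does not specify the exact statistic.

```python
import numpy as np

def select_branches_to_freeze(topk_indices, num_branches, k):
    """Return the k branches selected most often over a task's training data.

    topk_indices: integer array of shape (num_samples, k) holding the branch
    indices the router picked for each sample (assumed to be logged during training).
    """
    counts = np.bincount(np.asarray(topk_indices).ravel(), minlength=num_branches)
    order = np.argsort(-counts, kind="stable")   # most-used first, ties by index
    return set(order[:k].tolist()), counts

def mask_branch_gradients(grad_B, frozen):
    """Zero the gradient of every frozen branch; grad_B has shape (N, r_e, d_out)."""
    masked = grad_B.copy()
    for j in frozen:
        masked[j] = 0.0                          # frozen branches receive no update
    return masked
```

In an autograd framework the same effect is usually obtained by disabling gradients on the frozen parameters rather than masking after the fact; the explicit mask is used here only because it mirrors the piecewise rule above.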

4. Inference-Time Task Selection and Automatic Routing

At inference, BranchLoRA operates without access to explicit task identity. To address this, it learns, for each task $t_i$, paired keys $\bm{k}_{\mathrm{img}}^{t_i}$ and $\bm{k}_{\mathrm{txt}}^{t_i}$ that are aligned to the image and text modality embeddings using a cosine-alignment loss

$$\mathcal{L}_{\mathrm{align}} = \sum_{j} \left[\left(1-\mathrm{Cos}(\bm{e}_{j,\mathrm{img}}^{t_i},\bm{k}_{\mathrm{img}}^{t_i})\right) + \left(1-\mathrm{Cos}(\bm{e}_{j,\mathrm{txt}}^{t_i},\bm{k}_{\mathrm{txt}}^{t_i})\right)\right]$$

Given a test sample's image and text embeddings $(\bm{e}_{\mathrm{img}},\bm{e}_{\mathrm{txt}})$, the inference pipeline computes

$$i^* = \arg\max_{i} \left[\mathrm{Cos}(\bm{e}_{\mathrm{img}}, \bm{k}_{\mathrm{img}}^{t_i}) + \mathrm{Cos}(\bm{e}_{\mathrm{txt}}, \bm{k}_{\mathrm{txt}}^{t_i}) \right]$$

and routes the input through the corresponding task-specific router $\bm{W}_r^{t_{i^*}}$, selecting its top-$k$ branches for the remainder of the computation. This automatic routing obviates the need for explicit task labels or oracle knowledge at inference.
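Both the alignment loss and the task-selection rule reduce to cosine similarities between embeddings and per-task keys. A minimal NumPy sketch follows; the function names and the small `eps` guard against zero-norm vectors are illustrative assumptions, and the embeddings are taken as given (how they are extracted from the MLLM is not covered here).

```python
import numpy as np

def cosine(u, v, eps=1e-12):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def alignment_loss(img_embs, txt_embs, k_img, k_txt):
    """Sum over samples of (1 - cos) for the image key and the text key of one task."""
    return sum(
        (1.0 - cosine(e_img, k_img)) + (1.0 - cosine(e_txt, k_txt))
        for e_img, e_txt in zip(img_embs, txt_embs)
    )

def select_task(e_img, e_txt, keys_img, keys_txt):
    """Pick the task whose key pair is most similar to the test sample's embeddings."""
    scores = [
        cosine(e_img, k_img) + cosine(e_txt, k_txt)
        for k_img, k_txt in zip(keys_img, keys_txt)
    ]
    return int(np.argmax(scores)), scores
```

The index returned by `select_task` would then be used to look up the matching task-specific router before the layer-level top-$k$ routing is applied.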

5. Experimental Evaluation

BranchLoRA was empirically benchmarked on the CoIN continual-instruction-tuning suite, comprising eight sequential multimodal datasets: ScienceQA, TextVQA, ImageNet classification, GQA, VizWiz, RefCOCO family grounding, VQAv2, and OCR-VQA. Using LLaVA-1.5 as the backbone model (7B and 13B parameter variants), with LoRA rank $r=128$, $N=8$ experts, and $k=2$ active branches per layer, the following results were achieved:

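| Model | ACC (%) | MAA (%) | BWT (%) | Trainable Params | Train Time/Batch (ms) |
| --- | --- | --- | --- | --- | --- |
| MoE-LoRA (7B) | 37.13 | 42.76 | -25.91 | 350M | 62 |
| BranchLoRA (7B) | 44.20 | 49.94 | -20.98 | 222M | 51 |
| MoE-LoRA (13B) | 42.51 | | | | |
| BranchLoRA (13B) | 49.27 | | | | |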

Here, ACC is the mean accuracy after all tasks have been learned, MAA is the mean accuracy over the training trajectory, and BWT (backward transfer) measures forgetting (a negative value indicates loss). For the 7B model, BranchLoRA reduced trainable LoRA parameters by approximately 37% (350M to 222M) and decreased training time per batch from 62 ms to 51 ms. Ablation studies confirmed that the shared trunk design, dynamic sparse expert selection, tuning-freezing, and task-specific routers each contributed incrementally to performance, with the complete stack showing the strongest resistance to catastrophic forgetting (Zhang et al., 31 May 2025).
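The three metrics can be computed from a matrix of accuracies recorded after each training stage. The sketch below uses common continual-learning definitions, which is an assumption: the source describes the metrics only informally, and the exact normalization used by the CoIN benchmark (for example, whether BWT averages over $T$ or $T-1$ tasks) may differ. The function name and the example numbers are illustrative and are not taken from the paper.

```python
import numpy as np

def continual_metrics(acc):
    """Compute (ACC, MAA, BWT) from a T x T accuracy matrix, T >= 2.

    acc[i, j] is the accuracy on task j measured after training on tasks 0..i.
    Only the lower triangle (j <= i) is used.
    """
    acc = np.asarray(acc, dtype=float)
    T = acc.shape[0]
    # ACC: mean accuracy over all tasks after the final stage.
    final_acc = acc[T - 1].mean()
    # MAA: average, over stages, of the mean accuracy on tasks seen so far.
    maa = np.mean([acc[i, : i + 1].mean() for i in range(T)])
    # BWT: average change on earlier tasks between "just learned" and "end of sequence".
    bwt = np.mean([acc[T - 1, j] - acc[j, j] for j in range(T - 1)])
    return final_acc, maa, bwt

# Toy three-task trajectory (made-up numbers).
toy = [[80, 0, 0],
       [70, 60, 0],
       [60, 50, 90]]
final_acc, maa, bwt = continual_metrics(toy)
```

With these definitions, a negative `bwt` means that accuracy on earlier tasks dropped by the end of the sequence, which matches the sign convention used in the table.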

6. Implications, Limitations, and Future Directions

BranchLoRA exposes and exploits the representational asymmetry between input ($\bm{A}$) and output ($\bm{B}$) adapters, resulting in a more parameter-efficient architecture for sequential task adaptation. Empirical findings indicate up to a 20% reduction in catastrophic forgetting (for the 7B model, BWT improves from -25.91 to -20.98, a relative reduction of about 19%) and substantial improvement in continual-learning metrics relative to MoE-LoRA.

Current limitations include evaluation restricted to the CoIN benchmark and a fixed instruction-tuning regime. Areas for further exploration include extension to non-multimodal settings, application of advanced model-merging techniques, and validation across longer task sequences (Zhang et al., 31 May 2025). A plausible implication is that such architectural asymmetries may generalize to other domains within continual learning and adapter-based transfer learning frameworks.
