
How to Teach Large Multimodal Models New Skills (2510.08564v1)

Published 9 Oct 2025 in cs.AI, cs.CV, and cs.LG

Abstract: How can we teach large multimodal models (LMMs) new skills without erasing prior abilities? We study sequential fine-tuning on five target skills while monitoring general ability on eight held-out benchmarks across three model families. We observe that apparent "forgetting" on held-out tasks after narrow fine-tuning can partly recover at later stages. We trace this behavior to a measurable shift in the output token distribution, manifested through a simple counting-bias probe that co-varies with forgetting. Guided by this picture, we identify two simple, robust tuning recipes that learn strongly while limiting drift: (i) updating only the self-attention projection layers, and (ii) updating only the MLP Gate&Up while freezing the Down projection. Across models and tasks, these choices deliver strong target gains while largely preserving held-out performance. Code is available at https://github.com/jessemelpolio/LMM_CL

Summary

  • The paper demonstrates that selective fine-tuning of self-attention and MLP components effectively adds new skills to LMMs while mitigating catastrophic forgetting.
  • It shows that tuning only the SA Proj. or MLP Gate & Up components largely preserves performance on eight held-out benchmarks, whereas full-model updates do not.
  • The study analyzes output-distribution shifts and compares selective tuning with mitigation techniques such as LwF, LoRA, and WiSE-FT for balancing new learning with retention of prior capabilities.

Teaching Large Multimodal Models New Skills: Strategies for Effective Sequential Fine-Tuning

Introduction

The research on "How to Teach Large Multimodal Models New Skills" tackles the challenge of enhancing Large Multimodal Models (LMMs), such as LLaVA and Qwen2.5-VL, with new capabilities while preserving existing knowledge. Sequential fine-tuning often leads to catastrophic forgetting, where a model's proficiency on previously learned tasks diminishes. This paper evaluates fine-tuning methodologies on various components of LMMs, identifying strategies that enable effective learning without significant performance degradation on held-out tasks.

Methodologies and Findings

Component-Level Tuning

The paper explores multiple component-specific tuning strategies for LMMs, aiming to enhance target task performance while minimizing forgetting. Key findings include:

  • Self-Attention Projections (SA Proj.): This approach achieves significant learning on new tasks with negligible forgetting. Updating only SA Proj. components maintained performance on eight held-out benchmarks with minimal degradation.
  • MLP Gate & Up Tuning: By tuning MLP's Gate and Up projections while keeping the Down projection frozen, this method provides excellent target task gains and limited forgetting. This balance is due to reduced bias in output token distribution, which preserves existing model capabilities.
  • Full-Model Tuning: Although this approach yields substantial learning gains, it causes severe forgetting on both target and held-out tasks; of all strategies evaluated, full-model tuning produces the largest declines in held-out performance.

The results emphasize the advantage of SA Proj. and MLP Gate&Up tuning strategies over traditional full-model fine-tuning, highlighting their ability to effectively update LMMs without erasing prior knowledge.
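
To make the two recipes concrete, below is a minimal PyTorch sketch of the selective-tuning setups. It assumes a LLaMA/Qwen-style decoder whose blocks expose q_proj/k_proj/v_proj/o_proj and gate_proj/up_proj/down_proj submodules; these module names are conventions of those model families, not the paper's released code.

```python
import torch.nn as nn

def freeze_for_recipe(model: nn.Module, recipe: str) -> None:
    """Freeze everything, then unfreeze only the chosen component family.

    recipe = "sa_proj": self-attention projections (W_Q, W_K, W_V, W_O)
    recipe = "gate_up": MLP Gate & Up projections; the Down projection
                        stays frozen to limit output-distribution drift.
    """
    targets = {
        "sa_proj": ("q_proj", "k_proj", "v_proj", "o_proj"),
        "gate_up": ("gate_proj", "up_proj"),
    }[recipe]

    for param in model.parameters():
        param.requires_grad = False          # freeze the whole model first
    for name, param in model.named_parameters():
        if any(t in name for t in targets):  # re-enable only selected weights
            param.requires_grad = True

# Usage (hypothetical handle): tune only the language model's SA projections.
# freeze_for_recipe(lmm.language_model, "sa_proj")
```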

Output Distribution Analysis

An innovative aspect of this research is the examination of output distribution shifts during fine-tuning. The paper employs a counting-bias probe to reveal that tuning MLP components tends to increase the likelihood of emitting numeric tokens, correlating with a decrease in held-out task performance. In contrast, SA Proj. maintains a stable output distribution, minimizing shifts and interference.
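
The paper's probe tracks a number-token bias (NTB); the sketch below shows one way such a measurement could be implemented, assuming a Hugging Face-style causal LM and tokenizer. The numeric-token filter is illustrative, not the paper's exact implementation.

```python
import torch

@torch.no_grad()
def number_token_bias(model, tokenizer, input_ids: torch.Tensor) -> float:
    """Next-token probability mass assigned to purely numeric tokens.

    Measured on prompts where numbers are NOT the expected answer, a rising
    value signals the output-distribution shift the paper describes.
    """
    # Precompute once in practice: ids whose decoded form is a bare number.
    numeric_ids = [
        i for i in range(tokenizer.vocab_size)
        if tokenizer.decode([i]).strip().isdigit()
    ]
    logits = model(input_ids=input_ids).logits[:, -1, :]  # last position
    probs = torch.softmax(logits, dim=-1)
    return probs[:, numeric_ids].sum(dim=-1).mean().item()
```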

Forgetting Mitigation Techniques

Various strategies were explored to mitigate forgetting in LMMs, including:

  • Learning Without Forgetting (LwF): This involves distilling model knowledge from prior checkpoints, significantly reducing output shift-driven forgetting during MLP tuning.
  • LoRA and WiSE-FT: Parameter-efficient adapters such as LoRA curb forgetting but also limit learning and add implementation complexity. WiSE-FT strikes a reliable balance by interpolating base and fine-tuned weights.

The experiments consistently demonstrate that selective tuning (SA Proj. or MLP Gate&Up) yields results comparable to or better than these specialized mitigation strategies.
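
For reference, here are minimal sketches of the two mitigation ideas in their common formulations: an LwF-style KL penalty toward a frozen previous checkpoint, and WiSE-FT weight interpolation. The temperature, λ, and α values are illustrative defaults, not the paper's reported settings.

```python
import copy
import torch
import torch.nn.functional as F

def lwf_loss(student_logits: torch.Tensor,
             teacher_logits: torch.Tensor,
             T: float = 2.0) -> torch.Tensor:
    """KL(teacher || student) on temperature-softened next-token distributions."""
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

def wise_ft(base_model, tuned_model, alpha: float = 0.5):
    """WiSE-FT: interpolate weights, (1 - alpha) * base + alpha * tuned.

    Assumes both models share an architecture with floating-point weights.
    """
    merged = copy.deepcopy(base_model)
    state = merged.state_dict()
    tuned = tuned_model.state_dict()
    for key in state:
        state[key] = (1 - alpha) * state[key] + alpha * tuned[key]
    merged.load_state_dict(state)
    return merged

# At stage k, the LwF variant adds a drift penalty toward the frozen
# stage-(k-1) checkpoint:  loss = task_ce + lam * lwf_loss(student, teacher)
```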

Implications and Future Work

The findings from this research offer pathways for continually and efficiently enhancing LMMs' capabilities. By focusing on component-specific tuning rather than wholesale model updates, it is possible to significantly reduce the environmental and financial burden of retraining large models, a growing concern as these models continue to scale.

Future research avenues include exploring alternative architectures that could benefit from these findings and extending these methodologies to include other modalities like audio processing. Additionally, understanding how privacy and security concerns intersect with sequential fine-tuning remains an open question requiring further investigation.

Conclusion

The paper effectively documents and addresses the complexities of teaching LMMs new skills by minimizing forgetting through selective component tuning. By analyzing output distribution shifts and leveraging different tuning recipes, this paper provides robust methodologies for augmenting LMM capabilities without compromising existing knowledge. Such insights are crucial as AI models continue to scale and diversify in their applications, offering a way forward for efficient and stable continual learning.


Explain it Like I'm 14

What This Paper Is About (In Simple Terms)

This paper studies how to teach big AI models that see images and read text (called large multimodal models, or LMMs) new skills—like counting objects or reading clocks—without making them forget what they already know. The authors test different ways to fine-tune these models so they learn new tasks while keeping their general abilities.

Quick idea

Think of an LMM like a very smart student who can look at pictures, read questions, and write answers. The challenge: how do we teach the student a new subject (say, medical questions) without them forgetting other subjects (like reading charts or everyday image questions)?


What Questions Did the Paper Ask?

  • Which parts of the model should we update so it learns new skills well but doesn’t forget old ones?
  • Does “forgetting” really mean the model has lost knowledge—or is something else going on?
  • Can we reduce forgetting with simple, reliable training tricks?
  • Do these findings hold across different model families?

How They Tested It (Methods, Explained Simply)

The researchers trained and tested on real tasks that an LMM might face:

  • New skills to learn (trained one after another, like school classes):
    • Bird species recognition (fine-grained classification)
    • Counting objects in pictures
    • Medical visual Q&A
    • Reading text in images (OCR)
    • Reading analog clock time
  • General abilities held out for checking “forgetting”:
    • Eight well-known benchmarks (like charts, diagrams, documents, science questions, etc.)

They tried fine-tuning different parts of the model:

  • Vision side:
    • Vision encoder (turns image into features)
    • Projector (adapts vision features to the LLM)
  • Language side (the “brain” that decides what to say):
    • Self-attention “projection” layers (tell the model where to look and how to mix information)
    • MLP layers (like a memory that pushes the model toward certain words)
    • A special MLP setting: only update the “Gate & Up” parts and freeze the “Down” part (explained below)

They measured four things after training the five tasks in sequence (a small code sketch of these metrics follows the list):

  • Target learning: how much the model improves on the current new task
  • Target forgetting: how much it forgets previously learned target tasks by the end
  • Target overall: the final average across all five new tasks
  • Held-out forgetting: how much general ability changes on the eight benchmarks
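
Here is one plausible formulation of these four metrics in code, built from an accuracy matrix acc[i][j] (score on target task j after training stage i); the paper's exact definitions may differ in detail.

```python
def sequence_metrics(acc, base_target, base_held_out, final_held_out):
    """One plausible formulation of the four sequence-level metrics.

    acc[i][j]            : score on target task j after training stage i
    base_target[j]       : zero-shot score on task j before any tuning
    base/final_held_out  : mean held-out score before / after the sequence
    """
    n, last = len(acc), len(acc) - 1
    # Gain on each task right after it is learned, vs. the untuned model.
    target_learning = sum(acc[j][j] - base_target[j] for j in range(n)) / n
    # Drop from just-learned to end-of-sequence scores (earlier tasks only).
    target_forgetting = sum(acc[j][j] - acc[last][j] for j in range(n - 1)) / (n - 1)
    # Final average gain across all target tasks.
    target_overall = sum(acc[last][j] - base_target[j] for j in range(n)) / n
    # Change in general ability on the held-out benchmarks.
    held_out_forgetting = base_held_out - final_held_out
    return target_learning, target_forgetting, target_overall, held_out_forgetting
```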

They also used a simple probe to look for “output bias”:

  • While training on counting, they checked how often the model started preferring number words even on unrelated tasks. If the model starts saying numbers everywhere, that’s a sign its “output preferences” have shifted.

Optional extra (sometimes used): a “teacher” method (distillation) that nudges the new model to stay close to the previous version’s behavior, which can reduce forgetting.


Key Terms (Quick, Friendly Explanations)

  • Self-attention: like directing your eyes—deciding where to look in the text and image and how to combine those clues.
  • MLP (feed-forward layers): like the model’s memory of patterns that often lead to certain words; it strongly influences what words the model prefers to say next.
  • Output distribution shift: the model changes its “speaking style,” for example, becoming more likely to say numbers after being trained on counting—even when it shouldn’t.
  • “Gate & Up” vs. “Down” in the MLP (sketched in code after this list):
    • Gate & Up: parts that detect and activate concepts (like “there are many objects”).
    • Down: the part that “writes” those activated concepts back into the model’s main stream and heavily affects the final word choices.
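
The gated MLP described above is typically a SwiGLU-style block, y = W_down(SiLU(W_gate x) · W_up x). Below is a minimal sketch with the Gate & Up recipe's freezing choice marked in comments; it is an illustration of the standard block, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedMLP(nn.Module):
    """SwiGLU-style MLP block: y = W_down( SiLU(W_gate x) * W_up x )."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_hidden, bias=False)  # tuned
        self.up_proj = nn.Linear(d_model, d_hidden, bias=False)    # tuned
        # Down writes activations back into the residual stream; the paper's
        # recipe freezes it to limit shifts in the output token distribution.
        self.down_proj = nn.Linear(d_hidden, d_model, bias=False)  # frozen

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```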

What They Found (Main Results)

  1. You don’t have to update everything to learn well.
    • Updating the LLM is crucial for learning new skills.
    • Updating the vision parts alone helps little and can even hurt general ability.
  2. Two simple tuning recipes work great with minimal forgetting:
    • Update only self-attention projection layers (SA Proj.):
      • Strong learning on new tasks
      • Very little forgetting on general tasks
    • Update only MLP Gate & Up while freezing MLP Down:
      • Almost as strong as full MLP learning
      • Much less forgetting than full MLP
  3. Why does forgetting happen? It’s often “style drift,” not true amnesia.
    • When training on counting, the model starts liking number words too much—even on tasks where numbers don’t belong.
    • This “number bias” rises when updating full MLPs and correlates with drops on general benchmarks.
    • Updating self-attention (SA Proj.) doesn’t push the model toward number words, so the general performance stays stable.
    • Because it’s a style shift, not erased knowledge, performance can partly recover later when the model is tuned on another, different new task.
  4. Simple ways to limit drift reduce forgetting.
    • Freezing MLP Down (and only tuning Gate & Up) limits how much the model’s word preferences shift.
    • Using a “teacher” loss (distillation to the previous checkpoint) also helps keep output preferences stable.
  5. The findings hold across different model families.
    • Tested on LLaVA-OneVision, LLaVA-NeXT, and Qwen2.5-VL.
    • Same big picture: SA Proj. and MLP Gate & Up give the best balance between learning new skills and keeping old ones.
  6. Compared to popular methods (LoRA, MoE, WiSE-FT, Learning without Forgetting):
    • The two simple recipes (SA Proj., MLP Gate & Up) match or outperform these methods in the learn–forget trade-off without extra modules or tricky settings.

Why This Matters

  • Training a brand-new large model is expensive and bad for the environment.
  • These results show we can efficiently add new skills to existing models without retraining from scratch and without wiping out what they already know.
  • The paper suggests a practical rule of thumb:
    • If you want stability: tune self-attention projections.
    • If you want near-maximum learning with low forgetting: tune MLP Gate & Up and freeze MLP Down.

Simple Takeaway

Teaching big vision–LLMs new tricks doesn’t have to cause “catastrophic forgetting.” Much of the problem is the model changing how it talks (its output preferences), not losing knowledge. By carefully choosing which parts to update—especially self-attention projections or only the “Gate & Up” parts of the MLP—we can help the model learn new skills while keeping its general abilities intact.


Possible Impact and Next Steps

  • Faster, cheaper, greener updates for AI models in specialized areas like medicine, counting, or OCR.
  • More reliable continual learning in real-world systems that must keep improving over time.
  • Future work: try longer training sequences, bigger models, more types of data (like audio), and study safety, privacy, and societal effects.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise list of concrete gaps and unresolved questions that future work could address:

  • Causality vs. correlation of output-distribution shift: Does reducing token-distribution drift causally reduce forgetting, or are both effects driven by a third factor? Design controlled interventions (e.g., direct logit calibration, synthetic bias injections) to establish causality.
  • Beyond counting bias: The numeric-token probe is task-specific. Develop general, task-agnostic drift monitors (e.g., token-class–level KL divergence to a base model across broad corpora) and task-specific probes for OCR (alpha-numeric tokens), medical VQA (domain terms), yes/no biases, etc.
  • Predictive monitoring: Can early drift indicators predict downstream forgetting reliably enough to trigger adaptive training decisions (e.g., swapping which layers are tunable, switching to distillation, or early stopping)?
  • Curriculum and order sensitivity: How do different task orders, interleaved schedules, or curriculum strategies affect both learning and the “recovery” phenomenon? Can we optimize task ordering using gradient alignment or interference measures?
  • Longer and more realistic continual streams: Do the findings (limited drift with SA Proj. or MLP Gate&Up and partial recovery) hold over much longer sequences (10–50+ tasks), with interleaved or non-stationary distributions?
  • Data-size imbalance effects: How do target dataset size, balance across stages, and sampling strategies influence output drift and forgetting (e.g., large PixmoCount vs. smaller TimeClock)?
  • Layerwise granularity: Which layers contribute most to learning vs. forgetting for each skill? Can adaptive, per-layer tuning schedules (e.g., only later MLP layers or mid-layer attention) further improve the learning–stability trade-off?
  • Head-level selectivity: Does tuning only a subset of attention heads (chosen by importance, sensitivity, or attribution) match SA Proj. performance while further reducing drift?
  • Role of W_down and write-back: The hypothesis that W_down chiefly drives output-distribution shift needs deeper mechanistic and causal validation (e.g., interventions that isolate W_down effects, linear-logit-lens analyses across layers/tasks).
  • Frozen components assumptions: What is the effect of unfreezing the LM head U, token embeddings, or layer norms on learning and forgetting? Are there beneficial partial-unfreeze recipes beyond Gate&Up?
  • Objective design: How do alternative objectives (contrastive learning, DPO/RLHF, sequence-level rewards, label smoothing) affect drift and forgetting relative to next-token cross-entropy?
  • Distillation design space: What are best practices for LwF (temperature T, token-position subsampling S(y), λ weighting) to optimize the learning–stability trade-off across tasks and backbones?
  • Inference-time mitigation: Can post-hoc calibration (e.g., temperature scaling, class-conditional bias corrections, constrained decoding) mitigate drift-induced errors on held-out tasks without retraining?
  • Replay synergy: How do small exemplar buffers or synthetic replay (e.g., self-synthesized rehearsal) combine with SA Proj./Gate&Up to further reduce forgetting at minimal memory cost?
  • Parameter-efficient methods breadth: The paper applies LoRA/MoE primarily to MLP. Do attention-side LoRA, orthogonal LoRA, or hybrid adapters on SA Proj. match or surpass selective tuning baselines?
  • Cross-architecture generalization: Do findings hold for cross-attention–based LMMs, encoder–decoder architectures, Perceiver-style resamplers, or mixture-of-experts LMMs?
  • Scaling laws: How do the observed behaviors change with model scale (e.g., 1–3B, 13B, 70B+) and context length? Are larger models inherently more/less robust to output drift?
  • Tokenization and multilinguality: How do vocabulary size, subword schemes, and multilingual tokenization influence drift measurement and mitigation (e.g., numeric or script-specific biases across languages)?
  • Domain-shifted vision: For heavy visual domain shifts (e.g., medical, satellite, documents), can vision-side adapters or alternative projectors achieve target gains without harming general ability (contrary to current negative results for vision updates)?
  • Projector design: Are more expressive projector interfaces (e.g., Perceiver Resampler, learned visual prompts) a viable alternative to LM tuning for adding skills with minimal forgetting?
  • Video and temporal inputs: Do the SA Proj./Gate&Up recipes hold for video or multi-image tasks where temporal attention or memory is key?
  • Safety and alignment: How does selective tuning affect harmful content, stereotypes, and jailbreak susceptibility? Does output-distribution drift degrade safety alignment, and can mitigation preserve it?
  • Privacy risks: Does sequential fine-tuning increase membership-inference or data-extraction risks on target or held-out domains, and do distillation or constrained tuning reduce leakage?
  • Calibration and robustness: How do methods affect calibration (ECE, Brier), OOD robustness, and adversarial prompt resilience on held-out tasks?
  • Efficiency accounting: Quantify wall-clock time, tuned parameter counts, memory footprint, and energy/CO2 savings for selective tuning vs. full fine-tuning to substantiate efficiency claims.
  • Reproducibility and sensitivity: Provide hyperparameter sensitivity analyses (LR, batch size, schedulers, weight decay, steps), optimizer choices, and seed variability to assess robustness of conclusions.
  • Positive transfer vs. interference: Characterize when specialization helps held-out tasks (positive transfer) vs. hurts (interference), using gradient similarity or Fisher overlap to predict cross-task interactions.
  • Recovery conditions: Under what conditions does “recovery” of held-out performance occur after subsequent tasks? Can we proactively schedule tasks that counteract harmful drifts?
  • Evaluation granularity: Held-out performance is averaged across diverse benchmarks; per-task breakdowns and category-level drift analyses (e.g., reasoning vs. OCR vs. diagram) could reveal hidden regressions.
  • Metric alignment: ANLS and accuracy are combined; investigate whether metric heterogeneity masks specific error modes tied to distribution shifts.
  • Alternative decoding and scoring: Many evaluations rely on greedy decoding; test nucleus/beam decoding and constrained answer normalization to disentangle modeling vs. decoding effects on forgetting.
  • Concept retention probes: Beyond output distributions, probe intermediate representations (e.g., linear probes on residual stream) to verify whether “concepts” are preserved when forgetting appears.
  • Compositional generalization: After learning multiple skills, does the model compose them on novel combinations (e.g., count fine-grained categories with OCR)? Test compositional OOD benchmarks.
  • Data contamination checks: Rigorously audit training/held-out overlap across backbones to ensure observed recovery/forgetting is not confounded by pretraining exposure.

Glossary

  • Adapters: Lightweight trainable modules inserted into a pretrained network for parameter-efficient fine-tuning. "including full-model tuning, adapters, LoRA, and prompt tuning."
  • ANLS: Average Normalized Levenshtein Similarity; a soft string-matching metric in [0,1] often used for VQA evaluation. "InfoVQA and DocVQA use ANLS; since ANLS is in [0, 1], we average it with accuracies from the other held-out tasks when reporting the mean held-out score."
  • Catastrophic forgetting: The degradation of previously learned capabilities when a model is fine-tuned on new tasks. "fine-tuning is known to cause catastrophic forgetting, such that a model previously proficient on many tasks becomes a narrow expert on the new one."
  • Class-incremental learning: A continual learning setting where new classes are introduced over time without access to full past data. "Zhou et al. (2025) design task-specific projection layers and cross-modal fusion modules for vision-language models in class-incremental learning."
  • CLIP: A vision-language model trained with contrastive learning to align images and texts. "Vision-text contrastive models, such as CLIP (Radford et al., 2021), are trained to align images and texts for open-vocabulary image classification and retrieval."
  • Continual learning: Training over a sequence of tasks while retaining prior knowledge. "Continual learning, also known as lifelong learning (Aljundi et al., 2017; Chen & Liu, 2018; Chaudhry et al., 2019), aims to train models on a sequence of tasks or data streams without forgetting previously acquired knowledge."
  • Counting-bias probe: A diagnostic that measures increased preference for numeric tokens to detect output-distribution shift during counting fine-tuning. "a simple counting-bias probe that identifies the shift co-varies with forgetting."
  • Decoder-only transformer: A transformer architecture composed solely of decoder blocks that autoregressively generate tokens. "The LLM is a pre-norm, decoder-only transformer with L identical blocks."
  • Elastic Weight Consolidation (EWC): A regularization method that constrains changes to parameters deemed important to previous tasks. "Xiang et al. (2023) propose to use regularization strategies such as Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017) and hierarchical importance-based penalties"
  • Exemplar replay: Mitigating forgetting by storing and replaying a subset of samples from previous tasks during new training phases. "exemplar replay methods build a reservoir of samples from old training rounds ... and replay them in successive training phases as a way of recalling past knowledge;"
  • Induction heads: Attention heads that implement copying patterns enabling in-context learning. "analyses of 'induction heads,' which implement a simple copying algorithm and closely track the emergence of in-context learning during training (Olsson et al., 2022)."
  • KL-divergence: A measure of difference between probability distributions, used to align current outputs to a teacher model during distillation. "we can enforce a KL-divergence constraint between the outputs of the current model at stage k with a frozen teacher model"
  • Learning without Forgetting (LwF): A distillation-based training strategy to retain prior behavior without access to old data. "Learning-without-Forgetting (optional)."
  • LM head: The final linear mapping from hidden representations to vocabulary logits in an LLM. "We use U ∈ ℝ^{d×|V|} for the LM head and denote the final block output as r^{(L)}, so logits are z = Uᵀ r^{(L)}."
  • LoRA: Low-Rank Adaptation; a parameter-efficient fine-tuning method that learns low-rank updates to weight matrices. "LoRA adapters are wrapped only on the MLP layers."
  • Mixture-of-Experts (MoE): An architecture that routes inputs to one or more expert subnetworks to improve capacity and specialization. "MoE is also applied to the MLP layers."
  • MLP (Gate&Up): A selective MLP-tuning scheme updating only the gate and up projections while freezing the down projection to limit write-back bias. "MLP (Gate&Up) delivers near-maximal Target Learning +30.5 and the highest Target Overall +27.1 while keeping forgetting small (Target -4.2, Held-out -2.1);"
  • Number-token bias (NTB): The tendency of a model to overproduce numeric tokens, used as a proxy for output-distribution shift during counting adaptation. "we track a simple number-token bias (NTB) during counting adaptation:"
  • Open-vocabulary image classification: Recognizing categories beyond a fixed label set by leveraging aligned image-text representations. "Vision-text contrastive models, such as CLIP (Radford et al., 2021), are trained to align images and texts for open-vocabulary image classification and retrieval."
  • Output token distribution shift: A change in the model’s next-token probabilities that can cause apparent forgetting without loss of underlying knowledge. "forgetting is highly related to output token distribution shift and methods that prevents shifts mitigate forgetting (Sec. 5.2 and 5.4);"
  • Pre-norm: A transformer variant applying layer normalization before attention and MLP sublayers, aiding optimization stability. "The LLM is a pre-norm, decoder-only transformer with L identical blocks."
  • Projector: The module that maps visual features to the LLM’s embedding width for joint processing. "The projector maps to the language representation width,"
  • Residual stream: The running hidden state updated by adding attention and MLP outputs across layers in a transformer. "Residual stream. Let X_text ∈ ℝ^{S_t×d} be the text embeddings and I_vis ∈ ℝ^{S_v×d} the projected visual tokens."
  • Residual-to-logit analysis: An attribution analysis tracing how residual stream contributions affect final logits. "a layer-wise residual-to-logit analysis shows that most of the shift is written by late MLP blocks, not by self-attention."
  • Self-attention projection (SA Proj.): The set of attention projection matrices (W_Q, W_K, W_V, W_O) that control routing and write-back in attention. "SA Proj.: Update W_Q, W_K, W_V, W_O in all blocks (routing + write-back for attention)."
  • SiLU: Sigmoid Linear Unit; an activation function used for gated MLPs. "With input x and gating nonlinearity σ = SiLU,"
  • Teacher forcing: Training technique that feeds ground-truth previous tokens to predict the next token in sequence modeling. "We use next-token cross-entropy on the current target dataset with teacher forcing."
  • Transformer Circuits framework: A mechanistic interpretability framework analyzing transformer components and their roles. "this view is formalized in the Transformer Circuits framework and supported by analyses of 'induction heads,'"
  • Visual tokens: Discrete embeddings produced by a vision encoder to represent images for an LLM. "which are converted to visual tokens by the vision encoder,"
  • Weight-space ensembling (WiSE-FT): Combining models by interpolating their weights to balance specialization and generalization. "with common forgetting mitigation methods: Learning without Forgetting (LwF) (Li & Hoiem, 2017), LoRA, Mixture-of-Experts (MoE), and weight-space ensembling (WiSE-FT) (Wortsman et al., 2022)."
  • Zero-shot performance: Performance on a task without any task-specific fine-tuning or training data. "Zheng et al. (2023) use knowledge distillation (Li & Hoiem, 2017) on CLIP to maintain zero-shot performance."

Practical Applications

The following bullet-point applications distill real-world uses that follow directly from the paper’s findings on selective layer tuning (self-attention projections and MLP Gate&Up), output-distribution shift diagnosis, and simple forgetting mitigations across three LMM families.

Immediate Applications

  • Selective fine-tuning to add niche skills without broad regressions
    • Sectors: healthcare (medical VQA on institutional data), retail/logistics (OCR on receipts, shelf/inventory counting), education (reading analogue clocks, diagram Q&A), finance (DocVQA, chart QA), robotics (object counting, signage OCR).
    • Tools/products/workflows: “Selective Tuning” SDK or training script that updates only SA Proj or MLP Gate&Up; curriculum-based sequential fine-tuning for five skills; deployment-ready adapters for LLaVA/LLaVA-NeXT/Qwen2.5-VL.
    • Assumptions/dependencies: access to model weights and licensing that permits fine-tuning; task data quality; architectures similar to LLaVA/Qwen2.5-VL; compute for short fine-tunes; domain evaluation to confirm held-out stability.
  • Output-distribution drift monitoring in MLOps
    • Sectors: software/MLOps for any organization updating multimodal assistants.
    • Tools/products/workflows: a “Drift Probe” that generalizes the counting-bias probe to monitor token-family likelihoods during/after fine-tuning (a probe sketch follows the Immediate Applications list); dashboards tracking the paper’s sequence-level metrics (Target Learning, Target Forgetting, Target Overall, Held-out Forgetting).
    • Assumptions/dependencies: representative held-out data, probe vocabulary sets per domain (e.g., numerals, dates, legal terms), logging of next-token distributions; thresholds calibrated to business impact.
  • Safer, carbon- and cost-efficient model adaptation
    • Sectors: energy/sustainability, IT procurement, enterprise AI.
    • Tools/products/workflows: training policy that defaults to SA Proj for high stability or MLP Gate&Up for strong learning with limited forgetting; avoidance of full-model tuning unless necessary; reporting reduced GPU hours and CO2 per skill added.
    • Assumptions/dependencies: measurement of energy usage; willingness to adopt layer-selective policies; comparable accuracy targets vs. full fine-tunes.
  • Continual learning playbooks for product teams
    • Sectors: software, robotics, customer support assistants.
    • Tools/products/workflows: stage-wise adaptation pipelines that (1) tune SA Proj for routing updates, (2) use MLP Gate&Up when output changes are acceptable, (3) optionally apply distillation-to-previous-checkpoint to curb drift; automatic rollbacks if held-out metrics dip.
    • Assumptions/dependencies: reliable held-out suites; versioned checkpoints; change-management process; data governance.
  • Compliance-ready adaptation for regulated domains
    • Sectors: healthcare, finance, public sector.
    • Tools/products/workflows: a “Forgetting Monitor” that logs output drift and held-out stability as part of change-control; documented recipes that freeze MLP Down during sensitive updates; audit trails of per-stage performance and drift metrics.
    • Assumptions/dependencies: regulatory acceptance of layer-selective tuning as a risk mitigation; privacy-preserving fine-tuning data pipelines.
  • Academic benchmarking and methodology adoption
    • Sectors: academia and research labs.
    • Tools/products/workflows: adoption of the paper’s task suite and sequence-level metrics; controlled component-level ablations (Vision encoder vs. projector vs. LLM; SA Proj vs. MLP Gate&Up); public replication on additional backbones.
    • Assumptions/dependencies: accessible datasets and code; comparable LMM architectures; compute availability.
  • Packaged “skill plug-ins” for LMMs
    • Sectors: software platforms, model marketplaces.
    • Tools/products/workflows: pre-trained SA Proj or MLP Gate&Up deltas for common skills (counting, OCR, time reading) deliverable as lightweight patches; simple integration APIs to apply patches per model family.
    • Assumptions/dependencies: compatibility across minor model versions; licensing for redistribution of tuned weights; validation sets for each plug-in.
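
As a sketch of what the generalized “Drift Probe” above could look like, the snippet below compares next-token probability mass on configurable token families between a base and a tuned model over a fixed probe set. The interface and family definitions are assumptions for illustration, not a shipped tool.

```python
import torch

@torch.no_grad()
def family_mass(model, input_ids: torch.Tensor, family_ids: list) -> torch.Tensor:
    """Next-token probability mass on one token family (e.g. numerals)."""
    probs = torch.softmax(model(input_ids=input_ids).logits[:, -1, :], dim=-1)
    return probs[:, family_ids].sum(dim=-1)

@torch.no_grad()
def drift_report(base_model, tuned_model, probe_batches, families: dict) -> dict:
    """Average per-family mass shift (tuned minus base) over a fixed probe set."""
    report = {}
    for name, ids in families.items():
        deltas = [
            (family_mass(tuned_model, b, ids) - family_mass(base_model, b, ids)).mean()
            for b in probe_batches
        ]
        report[name] = torch.stack(deltas).mean().item()
    return report  # e.g. {"numerals": 0.07, "dates": -0.01}
```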

Long-Term Applications

  • AutoML for selective layer tuning and drift control
    • Sectors: software/MLOps platforms.
    • Tools/products/workflows: an “Orchestrator” that automatically chooses SA Proj vs. MLP Gate&Up per task, tunes hyperparameters, and monitors output drift; integrates optional distillation or weight ensembling when drift exceeds budget.
    • Assumptions/dependencies: robust meta-learning over many tasks/backbones; standardized drift budgets and policies; expanded probes beyond numerals.
  • Federated and on-device continual learning with minimal forgetting
    • Sectors: mobile, edge robotics, IoT.
    • Tools/products/workflows: on-device updates that prefer SA Proj for stability and minimal compute/memory; periodic server-side distillation to reconcile local updates; privacy-preserving local datasets.
    • Assumptions/dependencies: efficient edge training primitives; secure aggregation for federated setups; energy constraints; licensing for on-device tuning.
  • Cross-modal extension to audio/video and embodied agents
    • Sectors: media analytics, autonomous systems.
    • Tools/products/workflows: selective tuning of attention/MLP gate&up equivalents in audio/text and video encoders to add new multimodal skills (e.g., sound event counting, temporal OCR) without degrading general capabilities.
    • Assumptions/dependencies: architectural parity of selective-write components in new modalities; new drift probes for modality-specific tokens/features; larger-scale validation.
  • Standardized regulatory frameworks for adaptive models
    • Sectors: policy/governance.
    • Tools/products/workflows: guidelines mandating drift monitoring and held-out stability checks during model updates; reporting of energy/CO2 savings when using selective tuning; audit-ready performance dashboards.
    • Assumptions/dependencies: consensus on metrics and thresholds; sector-specific held-out suites; mechanisms for independent verification.
  • Enterprise lifecycle management for LMM skills
    • Sectors: enterprise AI operations.
    • Tools/products/workflows: “Skill Registry” tracking per-skill patches, dependencies, and drift profiles; automated regression testing across business-critical held-out suites; scheduled refresh to recover transient forgetting as subsequent tasks are added.
    • Assumptions/dependencies: integration with CI/CD and model registries; cost-benefit analyses; robust rollback and versioning.
  • Hardware/software co-design for selective tuning
    • Sectors: semiconductors, cloud providers.
    • Tools/products/workflows: kernels and memory layouts optimized for SA Proj/MLP Gate&Up updates; scheduler support for fast layer-selective training; cost-tiered cloud offerings for continual tuning workloads.
    • Assumptions/dependencies: demand for fine-grained training; model-architecture standardization; ecosystem support.
  • Research programs on output-distribution anchoring
    • Sectors: academia/industry labs.
    • Tools/products/workflows: new regularizers and objectives that explicitly anchor output distributions across tasks (beyond KL distillation), adaptive token-family priors, and layer-wise write-back control strategies.
    • Assumptions/dependencies: broader empirical validation across very large models; theoretical frameworks connecting routing vs. writing to generalization; shared benchmarks.
  • Sector-specific adaptive assistants with safety guardrails
    • Sectors: healthcare, finance, education, public safety.
    • Tools/products/workflows: assistants that accumulate highly specialized skills over time while enforcing drift bounds and safety checks; human-in-the-loop review for updates in critical domains; explainability via probe visualizations.
    • Assumptions/dependencies: robust safety tooling; vetted domain datasets; clear escalation paths for anomalous drift.

These applications hinge on specific findings of the paper: tuning only self-attention projections or only MLP Gate&Up yields strong target learning with minimal forgetting; forgetting often reflects output-distribution shift rather than lost concepts; simple probes and distillation constraints can detect and mitigate drift. Feasibility depends on access to modifiable LMMs, data and licenses, compute budgets, and the maturity of monitoring and governance practices.
