MemLoRA-V: On-Device Vision Memory

Updated 7 December 2025
  • MemLoRA-V is a vision-augmented memory system that equips small vision-language models with specialized LoRA adapters for on-device memory retrieval and direct visual reasoning.
  • It employs a modular architecture with adapters for extraction, update, and generation, integrating fine-grained visual features alongside textual data.
  • The system achieves competitive performance on text QA and VQA benchmarks compared to larger cloud-based models while operating under strict resource constraints.

MemLoRA-V is a vision-augmented memory system that equips small vision-LLMs (SVLMs) with specialized adapter modules for efficient, on-device memory-augmented generation and native visual understanding. It extends the MemLoRA methodology—which replaces large cloud-based LLM memory systems with compact small LLMs (SLMs) and LoRA adapters—to the multimodal domain, enabling both persistent memory retrieval and direct visual reasoning in local, resource-constrained settings while maintaining competitive performance with cloud-scale baselines (Bini et al., 4 Dec 2025).

1. System Motivation and Context

Traditional memory-augmented LLMs achieve extended, context-aware reasoning by persisting conversation history or extracted knowledge as external memory. However, these systems are costly and impractical for on-device deployment, both due to compute and memory requirements and because they lack native support for multimodal (vision+language) input. Caption-based approaches for multimodal memory are suboptimal, as they discard fine-grained spatial and visual information crucial for direct image reasoning.

MemLoRA addresses these limitations for text-based dialogue by leveraging a compact SLM augmented with three task-specific Low-Rank Adaptation (LoRA) adapters, providing on-device persistent memory functionality. MemLoRA-V generalizes this pipeline by introducing SVLMs (e.g., InternVL3-1B/2B) and a fourth, vision-specific adapter, enabling end-to-end memory-augmented visual understanding and question answering without reliance on large cloud-hosted models or external captioners (Bini et al., 4 Dec 2025).

2. Architectural Overview

MemLoRA-V comprises the following primary components:

  • Base SVLM ($f_{\theta_{SV}}$): A frozen, 1–2B-parameter multimodal transformer integrating a visual encoder for image patch embeddings and a text encoder, fused by late cross-modal attention.
  • Memory Adapters: Four LoRA adapters, each specializing in a distinct operation:
    • $L_e$ (Extraction): Extracts candidate factual memories from input text/image.
    • $L_u$ (Update): Applies ADD/UPDATE/DELETE operations to the persistent memory $\mathcal{M}$.
    • $L_g$ (Generation): Handles text-based memory retrieval and response generation.
    • $L_g^V$ (Vision Generation): Extends $L_g$ for direct visual question answering (VQA) by jointly conditioning on textual, retrieved-memory, and visual embeddings.
  • Memory Store ($\mathcal{M}$): A local, persistent data structure holding factual strings, updated and retrieved as needed per user session.

A high-level data flow is as follows: the input text $x$ (tokenized) and, if present, image $I$ (embedded by the visual encoder) are processed in a multi-stage pipeline, with the active adapter loaded depending on the current task (extraction, update, or generation). For queries requiring visual reasoning, $\tilde v$ (the projected image embedding) is fused with the text and memory representations in $L_g^V$ to produce the autoregressive output.
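To make the adapter-dispatch flow concrete, the following is a minimal Python sketch of the control loop, not the authors' implementation: the `run_adapter` callable, the adapter names, and the prompt handling are hypothetical stand-ins for loading one LoRA adapter onto the frozen backbone and decoding.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class MemLoRAVPipeline:
    """Adapter-dispatch loop around a frozen SVLM backbone (illustrative only).

    `run_adapter(name, prompt, image)` is a hypothetical stand-in for
    activating one LoRA adapter ("extraction", "update", "generation", or
    "vision_generation") on the frozen backbone and decoding greedily.
    """
    run_adapter: Callable[[str, str, Optional[bytes]], str]
    memory: List[str] = field(default_factory=list)   # persistent store M

    def answer(self, text: str, image: Optional[bytes] = None) -> str:
        # 1) L_e: propose candidate factual memories from the turn (+ image).
        candidates = self.run_adapter("extraction", text, image)
        # 2) L_u: emit ADD/UPDATE/DELETE events against the current store;
        #    decoding and applying those events to `self.memory` is elided
        #    here (a concrete update sketch appears in Section 3).
        self.run_adapter("update", "\n".join(self.memory + [candidates]), None)
        # 3) L_g (text-only) or L_g^V (vision): generate the final answer.
        adapter = "vision_generation" if image is not None else "generation"
        return self.run_adapter(adapter, text, image)
```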

3. Formal Memory Operations and Adapter Training

The core memory-augmented operations in MemLoRA-V are as follows:

  1. Knowledge Extraction

$$\Omega = E(\mathcal{M}, x, v) = \arg\max_\omega p(\omega \mid x;\ \theta_S, L_e)$$

Here, $\Omega$ represents the set of candidate facts extracted, leveraging both textual and visual input.
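A sketch of how the extraction call might be wrapped around a greedy-decoding backbone with $L_e$ active; the prompt template and the newline-separated output format are illustrative assumptions, not the prompts used in the paper.

```python
from typing import Callable, List, Optional

def extract_facts(
    generate: Callable[[str, Optional[bytes]], str],  # greedy decode with L_e active
    memory: List[str],
    turn: str,
    image: Optional[bytes] = None,
) -> List[str]:
    """Sketch of the extraction step Omega = E(M, x, v)."""
    # Assumed prompt layout: existing memories, then the new turn, then an
    # instruction to emit one candidate fact per line.
    prompt = (
        "Existing memories:\n"
        + "\n".join(f"- {m}" for m in memory)
        + "\n\nDialogue turn:\n"
        + turn
        + "\n\nNew factual memories, one per line:"
    )
    raw = generate(prompt, image)  # argmax (temperature-0) decoding
    return [line.lstrip("- ").strip() for line in raw.splitlines() if line.strip()]
```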

  2. Memory Update

$$\mathcal{M}' = U(\mathcal{M}, x, \Omega) = \left\{ m_i,\,\text{event}_i \right\}_{i=1}^{N} \cup \left\{\text{new ADD entries from } \Omega \right\}$$

The adapter $L_u$ merges new knowledge with existing memory entries using structured update events.
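The structured update events can be thought of as small records applied to the store; the event schema below (operation, target index, replacement text) is an assumed illustration of the ADD/UPDATE/DELETE semantics rather than the paper's exact format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MemoryEvent:
    """One update event emitted by L_u (assumed schema)."""
    op: str            # "ADD", "UPDATE", or "DELETE"
    target_id: int     # index of the affected entry; ignored for ADD
    text: str = ""     # new or replacement fact text

def apply_events(memory: List[str], events: List[MemoryEvent]) -> List[str]:
    """Produce M' from M by applying structured ADD/UPDATE/DELETE events."""
    updated = list(memory)
    for ev in events:
        if ev.op == "ADD":
            updated.append(ev.text)
        elif ev.op == "UPDATE" and 0 <= ev.target_id < len(updated):
            updated[ev.target_id] = ev.text
        elif ev.op == "DELETE" and 0 <= ev.target_id < len(updated):
            updated.pop(ev.target_id)
    return updated
```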

  3. Memory-Augmented Generation

$$y = G(\mathcal{M}', x, v) = \arg\max_{y} p(y \mid x, \Omega';\ \theta_S, L_g)$$

For multimodal queries, $L_g^V$ attends to the concatenation of the text encoding ${\rm TextEnc}(x)$, retrieved memory embeddings ${\rm Emb}(\Omega')$, and the projected image $\tilde v$.
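In embedding space, this conditioning can be sketched as a simple concatenation of the three sequences; the exact ordering and any special image/memory delimiter tokens are model-specific details not specified here.

```python
from typing import Optional

import torch

def build_decoder_context(
    text_emb: torch.Tensor,                    # [T_text, D]: TextEnc(x)
    memory_emb: torch.Tensor,                  # [T_mem, D]:  Emb(Omega')
    image_emb: Optional[torch.Tensor] = None,  # [P, D]:      projected patches v~
) -> torch.Tensor:
    """Concatenate [TextEnc(x); Emb(Omega'); v~] as input to L_g / L_g^V."""
    parts = [text_emb, memory_emb]
    # For text-only queries (L_g), image_emb is None and v~ is omitted.
    if image_emb is not None:
        parts.append(image_emb)
    return torch.cat(parts, dim=0)             # [T_text + T_mem (+ P), D]
```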

All adapters are trained with next-token cross-entropy objectives, utilizing teacher outputs from large models for extraction and update, and gold targets for generation and VQA. No knowledge distillation via KL or logit matching is applied. The formulation for each adapter's loss is:

$$\mathcal{L}_\text{extr} = -\sum_{t=1}^{|\Omega|}\log p(\omega_t \mid x), \qquad \mathcal{L}_\text{upd} = -\sum_{t=1}^{E}\log p(e_t \mid \mathcal{M}, x), \qquad \mathcal{L}_\text{gen} = -\sum_{t=1}^{L}\log p(y_t \mid x, \Omega').$$

Hyperparameters such as the learning rate and batch size are chosen for efficient convergence, with early stopping used to terminate training.
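A minimal PyTorch version of the shared next-token objective, assuming the usual convention of masking prompt positions with label `-100` so that only the target span (extracted facts, update events, or answer tokens) is scored:

```python
import torch
import torch.nn.functional as F

def adapter_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy used (in spirit) for all four adapters.

    logits: [B, T, V] from the frozen backbone + active LoRA adapter.
    labels: [B, T] target token ids, with prompt positions set to -100 so
            only the target span contributes to the loss (the -100 masking
            convention is assumed from standard causal-LM fine-tuning).
    """
    # Shift so position t predicts token t+1, as in standard causal LM training.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )
```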

4. Visual Modality Integration

The SVLM backbone fuses vision and language using late cross-modal attention. The visual encoder outputs patch embeddings $v = \mathrm{VEnc}(I) \in \mathbb{R}^{P \times D}$. These are linearly projected to the model's hidden dimension:

$$\tilde v = W_v v + b_v$$

Within each cross-modal layer, textual queries $Q$ attend jointly to key/value pairs synthesized from both text and projected visual embeddings, computed per attention head as:

$$\alpha_{ij} = \frac{\exp(Q_i K_j^\top / \sqrt{d})}{\sum_{j'} \exp(Q_i K_{j'}^\top / \sqrt{d})}, \qquad \text{Attention}_i = \sum_j \alpha_{ij} V_j$$

This architecture allows MemLoRA-V to perform direct, spatially-aware visual reasoning, which is infeasible for caption-based memory systems.
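The projection and fusion can be illustrated with a single-head PyTorch module; real SVLM backbones use multi-head attention, residual connections, and positional handling that are omitted here, and the layer sizes below are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalFusion(nn.Module):
    """Single-head sketch of the fusion described above (illustrative only)."""

    def __init__(self, d_vis: int, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_vis, d_model)   # v~ = W_v v + b_v
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)

    def forward(self, text_h: torch.Tensor, patches: torch.Tensor) -> torch.Tensor:
        # text_h: [T, D] text hidden states; patches: [P, d_vis] patch embeddings
        vis = self.proj(patches)                 # [P, D] projected visual tokens
        kv = torch.cat([text_h, vis], dim=0)     # keys/values span both modalities
        q, k, v = self.q(text_h), self.k(kv), self.v(kv)
        attn = F.softmax(q @ k.T / (q.size(-1) ** 0.5), dim=-1)   # alpha_ij
        return attn @ v                          # [T, D] fused text representation
```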

5. Inference Workflow and System Constraints

The MemLoRA-V pipeline for a multimodal query proceeds as follows:

  1. Tokenize $x$ and encode it via ${\rm TextEnc}$; encode the image $I$ via the visual encoder to obtain $\tilde v$.
  2. Use $L_e$ for knowledge extraction: $\Omega = E(\mathcal{M}_\text{old}, x, \tilde v)$.
  3. Use $L_u$ to update persistent memory: $\mathcal{M}_\text{new} = U(\mathcal{M}_\text{old}, x, \Omega)$; store locally.
  4. Retrieve the top-$k$ relevant memory entries $\Omega' \subset \mathcal{M}_\text{new}$ using, e.g., FAISS over memory embeddings (see the retrieval sketch after this list).
  5. For generation, activate either $L_g$ (text-only) or $L_g^V$ (vision), and condition the decoder on the concatenation $[{\rm TextEnc}(x);\ {\rm Emb}(\Omega');\ \tilde v]$.
  6. Generate the output $y$ autoregressively with greedy decoding (temperature $T = 0$).
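Steps 1–3 and 5–6 mirror the earlier sketches; step 4 can be approximated with a plain NumPy cosine-similarity lookup (a FAISS index would replace this at larger memory sizes). The `embed` function is an assumed sentence-embedding callable, not part of the paper's interface.

```python
from typing import Callable, List

import numpy as np

def retrieve_top_k(
    embed: Callable[[List[str]], np.ndarray],  # texts -> [N, D] float embeddings
    memory: List[str],
    query: str,
    k: int = 5,
) -> List[str]:
    """Top-k memory retrieval by cosine similarity (NumPy stand-in for FAISS)."""
    if not memory:
        return []
    mem_vecs = embed(memory)                                   # [N, D]
    q_vec = embed([query])[0]                                  # [D]
    mem_vecs = mem_vecs / np.linalg.norm(mem_vecs, axis=1, keepdims=True)
    q_vec = q_vec / np.linalg.norm(q_vec)
    scores = mem_vecs @ q_vec                                  # cosine similarities
    top = np.argsort(-scores)[:k]                              # best k entries
    return [memory[i] for i in top]
```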

The complete system occupies 1–2 GB for the SLM/SVLM backbone, with the LoRA adapters contributing <1% parameter overhead. The memory store typically comprises tens to hundreds of short text entries (a few MB). The implementation achieves a throughput of ≈47 tokens/sec and ≈0.7 s latency per answer on a single A100 GPU. Memory lookups and updates are conducted entirely on-device.
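As a back-of-the-envelope check on the stated <1% adapter overhead, the following assumes a rank-16 LoRA on the four attention projections of a 28-layer, 2048-dimensional, ~2B-parameter backbone; none of these numbers are taken from the paper.

```python
# Back-of-the-envelope check of the "<1% parameter overhead" claim.
# All numbers below (rank, layer count, hidden size, adapted projections)
# are assumed for illustration, not taken from the paper.
rank = 16                 # assumed LoRA rank
hidden = 2048             # assumed hidden size of a ~2B-parameter SVLM
layers = 28               # assumed number of transformer layers
adapted_per_layer = 4     # assume q/k/v/o projections are adapted

params_per_matrix = rank * (hidden + hidden)      # A: r x d_in, B: d_out x r
lora_params = layers * adapted_per_layer * params_per_matrix
base_params = 2e9

print(f"LoRA params per adapter: {lora_params / 1e6:.1f}M "
      f"({100 * lora_params / base_params:.2f}% of the backbone)")
# -> roughly 7.3M parameters, i.e. well under 1% of a 2B backbone
```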

6. Empirical Evaluation

Experiments utilize the LoCoMo benchmark, which comprises ten long multi-session dialogues for text QA, extended with a challenging VQA benchmark containing single-word counting, color, and unusual-object queries that are automatically generated and validated by a large vision-LLM (InternVL3-78B).

Performance metrics include:

  • Composite score $L$ (aggregate of ROUGE-1, METEOR, BERTScore, and SentenceBERT)
  • LLM-as-a-Judge score $J$ (binary CORRECT/WRONG from GPT-OSS-120B)
  • VQA accuracy $V$ (one-word exact match)
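The VQA metric reduces to a normalized one-word exact match; the normalization below (lower-casing, stripping whitespace and trailing periods) is an assumption about details the summary does not specify.

```python
from typing import List

def vqa_accuracy(predictions: List[str], answers: List[str]) -> float:
    """One-word exact-match VQA accuracy V, in percent (normalization assumed)."""
    normalize = lambda s: s.strip().lower().rstrip(".")
    correct = sum(normalize(p) == normalize(a) for p, a in zip(predictions, answers))
    return 100.0 * correct / max(len(answers), 1)
```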

Results summary:

Model and Config                      | L    | J    | V    | Size (GB)
Mem0 (Gemma2-27B, caption)            | —    | 39.1 | 23.7 | ≈50
GPT-OSS-120B                          | 48.9 | 22.0 | —    | ≈60
MemLoRA-V (InternVL3-2B + 4 adapters) | 44.6 | 40.3 | 81.3 | 4.9

MemLoRA-V outperforms caption-based approaches on VQA by 57.6 percentage points (V = 81.3 vs. 23.7), while matching or exceeding Gemma2-27B on the LLM-as-a-Judge text QA metric, despite being 10–20× smaller and substantially faster per inference (Bini et al., 4 Dec 2025).

7. Significance and Prospects

MemLoRA-V demonstrates that native visual memory integration with small, locally-deployable models is feasible and efficient, achieving parity or better on both text QA and VQA benchmarks relative to much larger cloud-scale competitors. Direct feature-level visual-text fusion circumvents the limitations of caption-based modalities, preserving spatial structure and fine-grained image content in downstream memory-augmented reasoning. This capability enables privacy-preserving, multimodal assistants and memory systems for mobile and edge devices without the computational burdens of LLM-scale models or external visual APIs.

A plausible implication is that LoRA-based adapter modularity provides a scalable path for extending memory-augmented reasoning to additional modalities beyond vision, all while respecting strict compute and storage constraints (Bini et al., 4 Dec 2025).
