MemLoRA-V: On-Device Vision Memory

Updated 7 December 2025
  • MemLoRA-V is a vision-augmented memory system that equips small vision-language models with specialized LoRA adapters for on-device memory retrieval and direct visual reasoning.
  • It employs a modular architecture with adapters for extraction, update, and generation, integrating fine-grained visual features alongside textual data.
  • The system achieves competitive performance on text QA and VQA benchmarks compared to larger cloud-based models while operating under strict resource constraints.

MemLoRA-V is a vision-augmented memory system that equips small vision-LLMs (SVLMs) with specialized adapter modules for efficient, on-device memory-augmented generation and native visual understanding. It extends the MemLoRA methodology—which replaces large cloud-based LLM memory systems with compact small LLMs (SLMs) and LoRA adapters—to the multimodal domain, enabling both persistent memory retrieval and direct visual reasoning in local, resource-constrained settings while maintaining competitive performance with cloud-scale baselines (Bini et al., 4 Dec 2025).

1. System Motivation and Context

Traditional memory-augmented LLMs achieve extended, context-aware reasoning by persisting conversation history or extracted knowledge as external memory. However, these systems are costly and impractical for on-device deployment, both due to compute and memory requirements and because they lack native support for multimodal (vision+language) input. Caption-based approaches for multimodal memory are suboptimal, as they discard fine-grained spatial and visual information crucial for direct image reasoning.

MemLoRA addresses these limitations for text-based dialogue by leveraging a compact SLM augmented with three task-specific Low-Rank Adaptation (LoRA) adapters, providing on-device persistent memory functionality. MemLoRA-V generalizes this pipeline by introducing SVLMs (e.g., InternVL3-1B/2B) and a fourth, vision-specific adapter, enabling end-to-end memory-augmented visual understanding and question answering without reliance on large cloud-hosted models or external captioners (Bini et al., 4 Dec 2025).

2. Architectural Overview

MemLoRA-V comprises the following primary components:

  • Base SVLM ($f_{\theta_{SV}}$): A frozen, 1–2B-parameter multimodal transformer integrating a visual encoder for image patch embeddings and a text encoder, fused by late cross-modal attention.
  • Memory Adapters: Four LoRA adapters, each specializing in a distinct operation:
    • $L_e$ (Extraction): Extracts candidate factual memories from input text/image.
    • $L_u$ (Update): Applies ADD/UPDATE/DELETE operations to the persistent memory $\mathcal{M}$.
    • $L_g$ (Generation): Handles text-based memory retrieval and response generation.
    • $L_g^V$ (Vision Generation): Extends $L_g$ for direct visual question answering (VQA) by jointly conditioning on textual, retrieved-memory, and visual embeddings.
  • Memory Store ($\mathcal{M}$): A local, persistent data structure holding factual strings, updated and retrieved as needed per user session.

A high-level data flow is as follows: the input text $x$ (tokenized) and, if present, image $I$ (embedded by the visual encoder) are processed in a multi-stage pipeline, with the active adapter loaded depending on the current task (extraction, update, or generation). For queries requiring visual reasoning, $\tilde v$ (the projected image embedding) is fused with the text and memory representations in $L_g^V$ to produce the autoregressive output.
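To make the adapter-dispatch flow concrete, the following is a minimal Python sketch of the control loop, not the authors' implementation: the `run_adapter` callable, the adapter names, and the prompt handling are hypothetical stand-ins for loading one LoRA adapter onto the frozen backbone and decoding.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class MemLoRAVPipeline:
    """Adapter-dispatch loop around a frozen SVLM backbone (illustrative only).

    `run_adapter(name, prompt, image)` is a hypothetical stand-in for
    activating one LoRA adapter ("extraction", "update", "generation", or
    "vision_generation") on the frozen backbone and decoding greedily.
    """
    run_adapter: Callable[[str, str, Optional[bytes]], str]
    memory: List[str] = field(default_factory=list)   # persistent store M

    def answer(self, text: str, image: Optional[bytes] = None) -> str:
        # 1) L_e: propose candidate factual memories from the turn (+ image).
        candidates = self.run_adapter("extraction", text, image)
        # 2) L_u: emit ADD/UPDATE/DELETE events against the current store;
        #    decoding and applying those events to `self.memory` is elided
        #    here (a concrete update sketch appears in Section 3).
        self.run_adapter("update", "\n".join(self.memory + [candidates]), None)
        # 3) L_g (text-only) or L_g^V (vision): generate the final answer.
        adapter = "vision_generation" if image is not None else "generation"
        return self.run_adapter(adapter, text, image)
```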

3. Formal Memory Operations and Adapter Training

The core memory-augmented operations in MemLoRA-V are as follows:

  1. Knowledge Extraction

$$\Omega = E(\mathcal{M}, x, v) = \arg\max_\omega p(\omega \mid x;\ \theta_S, L_e)$$

Here, $\Omega$ represents the set of candidate facts extracted, leveraging both textual and visual input.
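A sketch of how the extraction call might be wrapped around a greedy-decoding backbone with $L_e$ active; the prompt template and the newline-separated output format are illustrative assumptions, not the prompts used in the paper.

```python
from typing import Callable, List, Optional

def extract_facts(
    generate: Callable[[str, Optional[bytes]], str],  # greedy decode with L_e active
    memory: List[str],
    turn: str,
    image: Optional[bytes] = None,
) -> List[str]:
    """Sketch of the extraction step Omega = E(M, x, v)."""
    # Assumed prompt layout: existing memories, then the new turn, then an
    # instruction to emit one candidate fact per line.
    prompt = (
        "Existing memories:\n"
        + "\n".join(f"- {m}" for m in memory)
        + "\n\nDialogue turn:\n"
        + turn
        + "\n\nNew factual memories, one per line:"
    )
    raw = generate(prompt, image)  # argmax (temperature-0) decoding
    return [line.lstrip("- ").strip() for line in raw.splitlines() if line.strip()]
```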

  2. Memory Update

$$\mathcal{M}' = U(\mathcal{M}, x, \Omega) = \left\{ m_i,\,\text{event}_i \right\}_{i=1}^{N} \cup \left\{\text{new ADD entries from } \Omega \right\}$$

The adapter $L_u$ merges new knowledge with existing memory entries using structured update events.
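The structured update events can be thought of as small records applied to the store; the event schema below (operation, target index, replacement text) is an assumed illustration of the ADD/UPDATE/DELETE semantics rather than the paper's exact format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MemoryEvent:
    """One update event emitted by L_u (assumed schema)."""
    op: str            # "ADD", "UPDATE", or "DELETE"
    target_id: int     # index of the affected entry; ignored for ADD
    text: str = ""     # new or replacement fact text

def apply_events(memory: List[str], events: List[MemoryEvent]) -> List[str]:
    """Produce M' from M by applying structured ADD/UPDATE/DELETE events."""
    updated = list(memory)
    for ev in events:
        if ev.op == "ADD":
            updated.append(ev.text)
        elif ev.op == "UPDATE" and 0 <= ev.target_id < len(updated):
            updated[ev.target_id] = ev.text
        elif ev.op == "DELETE" and 0 <= ev.target_id < len(updated):
            updated.pop(ev.target_id)
    return updated
```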

  3. Memory-Augmented Generation

$$y = G(\mathcal{M}', x, v) = \arg\max_{y} p(y \mid x, \Omega';\ \theta_S, L_g)$$

For multimodal queries, $L_g^V$ attends to the concatenation of the text encoding ${\rm TextEnc}(x)$, retrieved memory embeddings ${\rm Emb}(\Omega')$, and the projected image $\tilde v$.
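In embedding space, this conditioning can be sketched as a simple concatenation of the three sequences; the exact ordering and any special image/memory delimiter tokens are model-specific details not specified here.

```python
from typing import Optional

import torch

def build_decoder_context(
    text_emb: torch.Tensor,                    # [T_text, D]: TextEnc(x)
    memory_emb: torch.Tensor,                  # [T_mem, D]:  Emb(Omega')
    image_emb: Optional[torch.Tensor] = None,  # [P, D]:      projected patches v~
) -> torch.Tensor:
    """Concatenate [TextEnc(x); Emb(Omega'); v~] as input to L_g / L_g^V."""
    parts = [text_emb, memory_emb]
    # For text-only queries (L_g), image_emb is None and v~ is omitted.
    if image_emb is not None:
        parts.append(image_emb)
    return torch.cat(parts, dim=0)             # [T_text + T_mem (+ P), D]
```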

All adapters are trained with next-token cross-entropy objectives, utilizing teacher outputs from large models for extraction and update, and gold targets for generation and VQA. No knowledge distillation via KL or logit matching is applied. The formulation for each adapter's loss is:

$$\mathcal{L}_\text{extr} = -\sum_{t=1}^{|\Omega|}\log p(\omega_t \mid x), \qquad \mathcal{L}_\text{upd} = -\sum_{t=1}^{E}\log p(e_t \mid \mathcal{M}, x), \qquad \mathcal{L}_\text{gen} = -\sum_{t=1}^{L}\log p(y_t \mid x, \Omega').$$

Hyperparameters such as the learning rate and batch size are chosen for efficient convergence, with early stopping used to terminate training.
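A minimal PyTorch version of the shared next-token objective, assuming the usual convention of masking prompt positions with label `-100` so that only the target span (extracted facts, update events, or answer tokens) is scored:

```python
import torch
import torch.nn.functional as F

def adapter_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy used (in spirit) for all four adapters.

    logits: [B, T, V] from the frozen backbone + active LoRA adapter.
    labels: [B, T] target token ids, with prompt positions set to -100 so
            only the target span contributes to the loss (the -100 masking
            convention is assumed from standard causal-LM fine-tuning).
    """
    # Shift so position t predicts token t+1, as in standard causal LM training.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )
```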

4. Visual Modality Integration

The SVLM backbone fuses vision and language using late cross-modal attention. The visual encoder outputs patch embeddings $v = \mathrm{VEnc}(I) \in \mathbb{R}^{P \times D}$. These are linearly projected to the model's hidden dimension:

$$\tilde v = W_v v + b_v$$

Within each cross-modal layer, textual queries $Q$ attend jointly to key/value pairs synthesized from both text and projected visual embeddings, computed per attention head as:

$$\alpha_{ij} = \frac{\exp(Q_i K_j^\top / \sqrt{d})}{\sum_{j'} \exp(Q_i K_{j'}^\top / \sqrt{d})}, \qquad \text{Attention}_i = \sum_j \alpha_{ij} V_j$$

This architecture allows MemLoRA-V to perform direct, spatially-aware visual reasoning, which is infeasible for caption-based memory systems.
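The projection and fusion can be illustrated with a single-head PyTorch module; real SVLM backbones use multi-head attention, residual connections, and positional handling that are omitted here, and the layer sizes below are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalFusion(nn.Module):
    """Single-head sketch of the fusion described above (illustrative only)."""

    def __init__(self, d_vis: int, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_vis, d_model)   # v~ = W_v v + b_v
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)

    def forward(self, text_h: torch.Tensor, patches: torch.Tensor) -> torch.Tensor:
        # text_h: [T, D] text hidden states; patches: [P, d_vis] patch embeddings
        vis = self.proj(patches)                 # [P, D] projected visual tokens
        kv = torch.cat([text_h, vis], dim=0)     # keys/values span both modalities
        q, k, v = self.q(text_h), self.k(kv), self.v(kv)
        attn = F.softmax(q @ k.T / (q.size(-1) ** 0.5), dim=-1)   # alpha_ij
        return attn @ v                          # [T, D] fused text representation
```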

5. Inference Workflow and System Constraints

The MemLoRA-V pipeline for a multimodal query proceeds as follows:

  1. Tokenize $x$ and encode it via ${\rm TextEnc}$; encode the image $I$ via the visual encoder to obtain $\tilde v$.
  2. Use $L_e$ for knowledge extraction: $\Omega = E(\mathcal{M}_\text{old}, x, \tilde v)$.
  3. Use $L_u$ to update persistent memory: $\mathcal{M}_\text{new} = U(\mathcal{M}_\text{old}, x, \Omega)$; store locally.
  4. Retrieve the top-$k$ relevant memory entries $\Omega' \subset \mathcal{M}_\text{new}$ using, e.g., FAISS over memory embeddings (see the retrieval sketch after this list).
  5. For generation, activate either $L_g$ (text-only) or $L_g^V$ (vision), and condition the decoder on the concatenation $[{\rm TextEnc}(x);\ {\rm Emb}(\Omega');\ \tilde v]$.
  6. Generate the output $y$ autoregressively with greedy decoding (temperature $T = 0$).
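Steps 1–3 and 5–6 mirror the earlier sketches; step 4 can be approximated with a plain NumPy cosine-similarity lookup (a FAISS index would replace this at larger memory sizes). The `embed` function is an assumed sentence-embedding callable, not part of the paper's interface.

```python
from typing import Callable, List

import numpy as np

def retrieve_top_k(
    embed: Callable[[List[str]], np.ndarray],  # texts -> [N, D] float embeddings
    memory: List[str],
    query: str,
    k: int = 5,
) -> List[str]:
    """Top-k memory retrieval by cosine similarity (NumPy stand-in for FAISS)."""
    if not memory:
        return []
    mem_vecs = embed(memory)                                   # [N, D]
    q_vec = embed([query])[0]                                  # [D]
    mem_vecs = mem_vecs / np.linalg.norm(mem_vecs, axis=1, keepdims=True)
    q_vec = q_vec / np.linalg.norm(q_vec)
    scores = mem_vecs @ q_vec                                  # cosine similarities
    top = np.argsort(-scores)[:k]                              # best k entries
    return [memory[i] for i in top]
```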

The complete system occupies 1–2 GB for the SLM/SVLM backbone, with the LoRA adapters contributing <1% parameter overhead. The memory store typically comprises tens to hundreds of short text entries (a few MB). The implementation achieves a throughput of ≈47 tokens/sec and ≈0.7 s latency per answer on a single A100 GPU. Memory lookups and updates are conducted entirely on-device.
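As a back-of-the-envelope check on the stated <1% adapter overhead, the following assumes a rank-16 LoRA on the four attention projections of a 28-layer, 2048-dimensional, ~2B-parameter backbone; none of these numbers are taken from the paper.

```python
# Back-of-the-envelope check of the "<1% parameter overhead" claim.
# All numbers below (rank, layer count, hidden size, adapted projections)
# are assumed for illustration, not taken from the paper.
rank = 16                 # assumed LoRA rank
hidden = 2048             # assumed hidden size of a ~2B-parameter SVLM
layers = 28               # assumed number of transformer layers
adapted_per_layer = 4     # assume q/k/v/o projections are adapted

params_per_matrix = rank * (hidden + hidden)      # A: r x d_in, B: d_out x r
lora_params = layers * adapted_per_layer * params_per_matrix
base_params = 2e9

print(f"LoRA params per adapter: {lora_params / 1e6:.1f}M "
      f"({100 * lora_params / base_params:.2f}% of the backbone)")
# -> roughly 7.3M parameters, i.e. well under 1% of a 2B backbone
```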

6. Empirical Evaluation

Experiments utilize the LoCoMo benchmark, which comprises ten long multi-session dialogues for text QA, extended with a challenging VQA benchmark containing single-word counting, color, and unusual-object queries that are automatically generated and validated by a large vision-LLM (InternVL3-78B).

Performance metrics include:

  • Composite score $L$ (aggregate of ROUGE-1, METEOR, BERTScore, and SentenceBERT)
  • LLM-as-a-Judge score $J$ (binary CORRECT/WRONG from GPT-OSS-120B)
  • VQA accuracy $V$ (one-word exact match)
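The VQA metric reduces to a normalized one-word exact match; the normalization below (lower-casing, stripping whitespace and trailing periods) is an assumption about details the summary does not specify.

```python
from typing import List

def vqa_accuracy(predictions: List[str], answers: List[str]) -> float:
    """One-word exact-match VQA accuracy V, in percent (normalization assumed)."""
    normalize = lambda s: s.strip().lower().rstrip(".")
    correct = sum(normalize(p) == normalize(a) for p, a in zip(predictions, answers))
    return 100.0 * correct / max(len(answers), 1)
```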

Results summary:

Model and Config                      | L    | J    | V    | Size (GB)
Mem0 (Gemma2-27B, caption)            | —    | 39.1 | 23.7 | ≈50
GPT-OSS-120B                          | 48.9 | 22.0 | —    | ≈60
MemLoRA-V (InternVL3-2B + 4 adapters) | 44.6 | 40.3 | 81.3 | 4.9

MemLoRA-V outperforms caption-based approaches on VQA by 57.6 percentage points (V = 81.3 vs. 23.7), while matching or exceeding Gemma2-27B on the LLM-as-a-Judge text QA metric, despite being 10–20× smaller and substantially faster per inference (Bini et al., 4 Dec 2025).

7. Significance and Prospects

MemLoRA-V demonstrates that native visual memory integration with small, locally-deployable models is feasible and efficient, achieving parity or better on both text QA and VQA benchmarks relative to much larger cloud-scale competitors. Direct feature-level visual-text fusion circumvents the limitations of caption-based modalities, preserving spatial structure and fine-grained image content in downstream memory-augmented reasoning. This capability enables privacy-preserving, multimodal assistants and memory systems for mobile and edge devices without the computational burdens of LLM-scale models or external visual APIs.

A plausible implication is that LoRA-based adapter modularity provides a scalable path for extending memory-augmented reasoning to additional modalities beyond vision, all while respecting strict compute and storage constraints (Bini et al., 4 Dec 2025).
