MemOCR: Visual Memory in Long-Horizon LLMs
- MemOCR is a multimodal memory management approach that transforms linear text into a two-dimensional visual layout using Markdown formatting to encode information priority.
- It adaptively compresses memory by allocating larger font sizes to critical content and downsampling less salient details, enabling efficient context utilization.
- Reinforcement learning is integrated to optimize drafting and reading policies, resulting in superior reasoning accuracy and up to 8× improvement in token efficiency compared to traditional methods.
MemOCR is a multimodal memory management approach for long-horizon, agentic reasoning tasks in LLM agents. In place of traditional text-only, token-limited memory compression, it recasts memory as a two-dimensional visual layout whose information density adapts to semantic salience. MemOCR maintains a structured rich-text memory, applies Markdown-based layout priorities, and ultimately renders this memory as an image. This enables more robust retention of crucial information under tight context budgets, leveraging layout and visual prominence to preserve key evidence while aggressively compressing low-value details. MemOCR integrates context-budget-aware reinforcement learning to optimize its drafting and reading policies, achieving superior reasoning accuracy under context limitations compared to text-based baselines (Shi et al., 29 Jan 2026).
1. Motivations and Problem Formulation
Long-horizon agentic reasoning in autonomous LLM-based systems faces a fundamental memory bottleneck: the Transformer backbone offers a fixed context window, while histories in real deployments quickly exceed tens or hundreds of thousands of tokens. Traditional remedies—such as iteratively summarizing history into a condensed textual memory or retrieving fixed-length snippets—exhibit uniform information density, where every token consumes equal budget regardless of its contribution to downstream performance. This uniform token cost forces an undesirable tradeoff: preserving critical evidence necessitates discarding supporting detail, causing a steep drop in task performance as context budgets shrink.
MemOCR addresses this limitation by introducing visually structured memory. It transforms the agent’s memory stream from a one-dimensional text buffer into a two-dimensional visual "canvas." Here, font size, formatting (e.g., headings, boldface), and spatial placement encode information priority. Under aggressive compression—operationalized by downsampling—the most salient content remains legible, while extraneous details collapse into visual noise. This approach creates a nonuniform allocation of information budget, decoupling semantic value from raw token count (Shi et al., 29 Jan 2026).
2. System Architecture and Memory Workflow
MemOCR operates in two principal stages:
- Memory Drafting (Text Domain): At each time step $t$, the agent’s drafting policy $\pi_{\text{draft}}$ updates a persistent Markdown-based rich-text memory using the previous memory and the newly observed context: $M_t = \pi_{\text{draft}}(M_{t-1}, o_t)$, where $o_t$ is the observation at step $t$.
The drafting process selects both the content to retain and its Markdown formatting (headings, subheadings, bold text, indentation), with each formatting decision encoding a visual priority.
- Memory Reading (Vision Domain): Once the full interaction history is processed, the final memory $M_T$ is rendered deterministically to an image $I$ via a Markdown-to-HTML-to-screenshot pipeline. Given a context budget $B$, the image is downsampled so the agent’s vision encoder receives at most $B$ image patches. The agent then conditions on the downsampled image and the query $q$ to generate its answer: $a = \pi_{\text{read}}(\mathrm{Downsample}_B(I), q)$.
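The drafting stage's formatting-as-priority idea can be sketched in a few lines: map each Markdown construct to a font scale. This is an illustrative sketch only; the scale values, the line-based parser, and the example memory below are assumptions, not details from the paper.

```python
import re

# Hypothetical priority map: Markdown formatting -> font scale.
# The actual scales used by MemOCR are not specified in this summary.
SCALES = {"h1": 3.0, "h2": 2.0, "bold": 1.5, "body": 1.0}

def layout_priorities(markdown_memory: str):
    """Assign each memory line a font scale based on its Markdown
    formatting, mirroring how drafting decisions encode visual priority."""
    out = []
    for line in markdown_memory.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("## "):
            out.append((stripped[3:], SCALES["h2"]))
        elif stripped.startswith("# "):
            out.append((stripped[2:], SCALES["h1"]))
        elif re.fullmatch(r"\*\*.+\*\*", stripped):
            out.append((stripped.strip("*"), SCALES["bold"]))
        else:
            out.append((stripped, SCALES["body"]))
    return out

memory = """# Gene MacLellan
## Songwriting
**Wrote 'Snowbird' (1970)**
Also covered by other artists.
"""
pri = layout_priorities(memory)  # [(text, font_scale), ...] in reading order
```

Crucial evidence (the H1 header) ends up with the largest scale, so it will occupy the most downsampling-robust canvas region at render time.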
Crucially, the canvas area of a text segment of length $\ell$ rendered at font scale $s$ scales as $\ell \cdot s^2$, enabling higher-priority content to occupy larger and more downsampling-robust regions of the canvas (Shi et al., 29 Jan 2026).
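The patch budget and the quadratic area law can be sketched numerically. This is an illustrative model only: the 14-pixel patch size, the base glyph size, and the uniform-rescale policy are assumptions, not details from the paper.

```python
import math

PATCH = 14  # assumed ViT patch size (not specified in the paper summary)

def downsample_scale(width: int, height: int, budget: int, patch: int = PATCH) -> float:
    """Largest uniform linear scale factor s <= 1 such that the scaled
    image fits within `budget` vision-encoder patches."""
    full_patches = math.ceil(width / patch) * math.ceil(height / patch)
    if full_patches <= budget:
        return 1.0
    # Patch count shrinks roughly quadratically with the linear scale.
    return math.sqrt(budget * patch * patch / (width * height))

def segment_area(length: int, font_scale: float, base_glyph: float = 10.0) -> float:
    """Canvas area of a text segment: `length` glyphs at ~(base*s)^2 px
    each, so area grows linearly in length, quadratically in font scale."""
    return length * (base_glyph * font_scale) ** 2

# A 1000x2000 canvas squeezed into a 256-patch budget:
scale = downsample_scale(1000, 2000, 256)  # ≈ 0.158 linear scale
```

Doubling a segment's font scale quadruples its area, which is exactly the lever MemOCR's drafting policy uses to buy robustness for crucial evidence.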
3. Adaptive Information Density and Visual Compression
MemOCR’s innovation lies in moving from uniform, token-based budgeting to adaptive, spatial budgeting. High-value information segments (“crucial evidence”) receive large font scales, ensuring survivability through aggressive visual downsampling, while auxiliary details are assigned small font scales and occupy less visual real estate. Upon compression, the total information preserved is governed by the compression ratio $\rho = B / B_{\text{full}}$, the fraction of the full-resolution patch count retained under the context budget $B$. Analysis shows that, under low-resolution rendering, large-font segments stay above the vision encoder’s legibility threshold while small-font segments degrade first. This nonuniform (adaptive) allocation allows MemOCR to robustly preserve important content even with a fixed or limited context budget $B$, in contrast to traditional summaries, in which all details degrade equally with compression (Shi et al., 29 Jan 2026).
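A toy model makes the graceful-degradation argument concrete: treat a glyph as legible only if its rendered height survives downsampling above a fixed threshold. The font scales, pixel sizes, and threshold below are invented for illustration and do not come from the paper.

```python
def legible(font_scale: float, linear_scale: float,
            base_px: float = 10.0, min_px: float = 8.0) -> bool:
    """A glyph survives downsampling if its rendered height stays at or
    above the smallest size the encoder can still read (min_px, assumed)."""
    return base_px * font_scale * linear_scale >= min_px

memory = [
    {"text": "Gene MacLellan wrote the song", "salience": "high"},
    {"text": "released as a single in 1970",  "salience": "low"},
]
adaptive = {"high": 3.0, "low": 0.7}   # illustrative nonuniform layout
uniform  = {"high": 1.0, "low": 1.0}   # every segment costs the same

def surviving(layout, linear_scale):
    """Segments still legible after downsampling by `linear_scale`."""
    return [m["text"] for m in memory if legible(layout[m["salience"]], linear_scale)]

# At 0.35x linear resolution the adaptive layout keeps its crucial
# evidence, while the uniform layout loses everything:
print(surviving(adaptive, 0.35))  # ['Gene MacLellan wrote the song']
print(surviving(uniform, 0.35))   # []
```

Under the uniform layout all segments cross the legibility threshold at the same compression level; the adaptive layout sacrifices the auxiliary detail early so the crucial evidence survives far deeper compression.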
4. Reinforcement Learning and Budget-Aware Training
MemOCR employs a budget-aware reinforcement learning protocol based on Group Relative Policy Optimization (GRPO) to prioritize information under varying contexts. Three specialized QA tasks guide the agent's drafting and reading policies:
- Standard QA: Moderate budget, optimizing global correctness.
- QA with Augmented Memory: Extreme extra downsampling of the rendered memory along each spatial dimension, forcing robustness to visual compression by promoting evidence survival in high-priority regions.
- QA with Augmented Question: Detail-focused subquestions answered against uncompressed memory, encouraging recoverability of auxiliary information.
For each task $k \in \{\text{std}, \text{mem}, \text{q}\}$, task-specific rewards are collected and converted to group-relative advantages $A_k$, and the drafting policy’s aggregate advantage is the weighted sum $A = \lambda_{\text{std}} A_{\text{std}} + \lambda_{\text{mem}} A_{\text{mem}} + \lambda_{\text{q}} A_{\text{q}}$, with fixed per-task weights $\lambda_k$. The agent thus learns a unified layout policy balancing robustness under both tight and relaxed budgets (Shi et al., 29 Jan 2026).
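The aggregation step can be sketched as follows. The group-relative normalization is the standard GRPO recipe; the task names and weight values here are hypothetical stand-ins, since the paper's specific settings are not given in this summary.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each rollout's reward against
    its group's mean and standard deviation."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def aggregate_advantage(task_rewards, weights):
    """Weighted sum of per-task group-relative advantages for one rollout
    group; `weights` are illustrative, not the paper's values."""
    per_task = {t: group_relative_advantages(rs) for t, rs in task_rewards.items()}
    n = len(next(iter(per_task.values())))
    return [sum(weights[t] * per_task[t][i] for t in per_task) for i in range(n)]

# Four rollouts scored on the three auxiliary QA tasks (binary rewards):
task_rewards = {
    "qa_std": [1.0, 0.0, 1.0, 0.0],   # standard QA
    "qa_mem": [0.0, 0.0, 1.0, 0.0],   # compressed-memory QA
    "qa_q":   [1.0, 1.0, 1.0, 0.0],   # detail-subquestion QA
}
weights = {"qa_std": 0.5, "qa_mem": 0.3, "qa_q": 0.2}
adv = aggregate_advantage(task_rewards, weights)
```

Rollout 2, the only one that also survives extreme compression, receives the largest aggregate advantage, which is precisely the pressure that teaches the drafter to place evidence in high-priority regions.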
5. Experimental Evaluation and Comparative Results
MemOCR is evaluated on long-context versions of both multi-hop (HotpotQA, 2WikiMultiHopQA) and single-hop (Natural Questions, TriviaQA) QA benchmarks, with context lengths up to 100,000 tokens. The principal baselines are:
- Raw History Memory: E.g., Qwen2.5-Instruct, Qwen2.5-1M-Instruct.
- Textual Summary Memory: Systems such as Mem0, Mem, and MemAgent.
All text baselines use Qwen2.5-7B-Instruct, while MemOCR employs Qwen2.5-VL-7B-Instruct to support vision encoding.
Performance is measured primarily by subword exact-match accuracy under varying context budgets:
- With the full budget, MemOCR achieves 74.6% average accuracy at 10k context, versus 67.8% for the best text baseline.
- Under severe compression to a small fixed patch budget, text-based methods collapse by more than 50% relative (e.g., MemAgent falls from 67.8% to 31.6%), while MemOCR declines by only ~16.6% relative (74.6% to 62.2%), yielding up to an 8× improvement in effective token efficiency.
Performance curves confirm that MemOCR exhibits significantly more graceful degradation as the memory budget decreases. Oracle injection studies demonstrate that allocating evidence in high-visibility canvas regions yields disproportionately higher gains in compressed settings than comparable allocations in low-salience regions (Shi et al., 29 Jan 2026).
6. Ablation Analyses and Qualitative Observations
Ablation studies indicate that removing the "QA with Augmented Memory" () task severely compromises robustness at extremely low budgets. Omitting "QA with Augmented Question" () exerts a milder effect, primarily on recall for low-visibility details. Qualitative analyses illustrate MemOCR’s strengths (e.g., resilience of H1 headers such as “Gene MacLellan” under severe compression) and limitations (difficulty with side-by-side comparative reasoning in uniform font or when overall memory grows so large that individual font sizes become unreadably small) (Shi et al., 29 Jan 2026).
7. Future Directions and Limitations
MemOCR demonstrates that layout-aware, adaptive information density via visual rendering constitutes a highly effective method for context management in long-horizon LLM agents, outperforming text-based approaches in both overall accuracy and efficiency under constraint. Future research directions include extending visual memory to additional modalities such as planning logs, tool inventories, and personalized dialogs, exploring richer markup (HTML/CSS), and enhancing robustness to OCR artifacts or vision model limitations. An acknowledged limitation is the computational overhead associated with vision encoders and potential dependence on OCR reliability for full downstream utilization (Shi et al., 29 Jan 2026).