LongCodeOCR: Visual Code Compression

Updated 7 February 2026
  • LongCodeOCR is a visual code compression method that renders entire code contexts into 2D images, preserving global structure and dependencies.
  • The framework uses optical tokenization and multimodal Transformers to enable near-lossless reasoning over ultra-long codebases.
  • It significantly reduces preprocessing time and maintains high context coverage compared to traditional textual compression methods.

LongCodeOCR is a visual code compression framework for long-context code understanding using vision-language models (VLMs). It addresses the loss of dependency closure and the semantic fragmentation inherent in conventional textual filtering-based code compression by rendering entire code contexts as compressed two-dimensional image sequences, which are then processed by VLMs. This approach enables near-lossless global structure preservation at high compression ratios and allows vision-language systems to reason over codebases with context lengths that exceed the window limitations of standard LLMs (Zhong et al., 31 Jan 2026).

1. Motivation and Problem Setting

Traditional LLMs exhibit significant context window constraints, leading to the truncation or aggressive textual compression of large codebases. Existing textual code compression frameworks, such as LongCodeZip, rely on block-wise filtering using mutual information heuristics and knapsack optimization. While effective for reducing token budgets, these methods fragment code dependencies and incur high preprocessing overhead: at a 1M-token input, LongCodeZip requires approximately 15,415 s (~4.3 h) and 221.8M LLM tokens just for the compression step. Critically, important inter-block and cross-file semantic relations are often disrupted, causing "semantic fragmentation" and undermining global code understanding (Zhong et al., 31 Jan 2026). LongCodeOCR circumvents these limitations by shifting from token selection to dense 2D rendering, thus preserving all information for downstream reasoning without pruning.

2. Visual Compression Pipeline

The LongCodeOCR pipeline comprises four principal stages:

  1. Context Bottleneck: The raw code context $C_{\text{context}}$, often exceeding window limitations, is processed as a whole without textual truncation.
  2. Visual Rendering: The entire codebase is split into $n$ image “pages” $\mathcal{V} = \{v_i\}_{i=1}^{n}$, each of resolution $(H_i, W_i)$. These are produced such that code indentation, file hierarchy, and symbol recurrence are visually preserved.
  3. Optical Tokenization and Encoding: Each page $v_i$ is encoded by a vision backbone (e.g., Qwen3-VL-8B or the specialized Glyph) into $\tau(v_i)$ visual tokens, where $\tau(v_i) = H_i W_i / (16^2 \cdot 4)$ for Qwen3-VL-8B and $\tau(v_i) = H_i W_i / (14^2 \cdot 4)$ for Glyph. The total number of visual tokens is $|C_{\text{visual}}| = \sum_{i=1}^{n} \tau(v_i)$.
  4. Cross-Modal Processing: A multimodal Transformer receives a prompt (e.g., “Summarize this module”) concatenated with the visual token sequence, enabling joint textual and cross-modal reasoning for target tasks (summarization, Q&A, or completion).

This 2D “optical” bottleneck ensures all code dependencies, formatting, and structure are retained, overcoming the irreversible loss from filtering-based compressors (Zhong et al., 31 Jan 2026).
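The per-page token budget implied by the formulas above can be sketched as follows. The patch sizes (16 for Qwen3-VL-8B, 14 for Glyph) and the 4× merge factor come from the text; the page dimensions and the `visual_tokens` helper are illustrative assumptions, not the paper's implementation:

```python
# Estimate visual token counts for rendered code pages.
# tau(v_i) = H_i * W_i / (patch^2 * merge), per the formulas in the text.

def visual_tokens(height: int, width: int, patch: int, merge: int = 4) -> int:
    """Number of visual tokens produced for one rendered page."""
    return (height * width) // (patch * patch * merge)

# Hypothetical page resolutions (H_i, W_i) for a three-page rendering.
pages = [(1024, 768), (1024, 768), (512, 768)]

total_qwen = sum(visual_tokens(h, w, patch=16) for h, w in pages)   # Qwen3-VL-8B
total_glyph = sum(visual_tokens(h, w, patch=14) for h, w in pages)  # Glyph

print(total_qwen, total_glyph)
```

Note that the smaller Glyph patch size yields more tokens per page at the same resolution, so Glyph's higher effective compression comes from denser glyph rendering, not from a coarser patch grid.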

3. Formal Metrics: Coverage, Fidelity, and Compression Ratio

LongCodeOCR introduces and evaluates several core metrics:

  • Compression Ratio ($r$):
    • Visual: $r_{\text{visual}} = \frac{|C_{\text{context}}|}{|C_{\text{visual}}|}$
    • Textual (e.g., LongCodeZip): $r_{\text{textual}} = \frac{|C_{\text{context}}|}{|C_{\text{compressed}}|}$
  • Coverage ($C$): Fraction of the original context available post-compression. For visual compression, $C \approx 1$ due to lossless rendering; for textual compression, $C \approx 1/r_{\text{textual}}$.
  • Fidelity ($F$): Degree to which the downstream model can recover exact code symbols. Assessed by Exact Match (EM) and Edit Similarity (ES) for completion, and by CompScore for summarization (referee-LLM-based pairwise preference aggregation): $\mathrm{CompScore} = \frac{1}{2}\left[ P(s_o \succ \hat{s}) + (1 - P(\hat{s} \succ s_o)) \right]$

Visual code compression offers maximal context coverage but, at extreme compression, introduces symbol-level noise due to small glyph artifacts (“fidelity bottleneck”). Textual methods preserve symbol fidelity for retained tokens but risk losing global dependencies as context is aggressively pruned (Zhong et al., 31 Jan 2026).

4. Architecture and Implementation

While visual compression is VLM-agnostic, LongCodeOCR is evaluated with:

  • Qwen3-VL-8B: A general-purpose VLM, employing patching and pooling for optical tokenization.
  • Glyph (9B parameters): A vision-LLM specialized for ultra-long context compression. Glyph performs LLM-guided font and layout search to maximize rendered token density (choosing among monospaced, multi-column, etc.), and is trained on a custom curriculum aligning dense code glyphs to target outputs. Glyph can efficiently process up to 1M-token code contexts, completing page rendering and encoding in ~70 s (vs. ~4.3 h for LongCodeZip), with zero LLM tokens spent in preprocessing (Zhong et al., 31 Jan 2026).

The pipeline proceeds as follows:

  1. Tokenize and render all code to images.
  2. Encode images to visual tokens.
  3. Compose the multimodal input (instruction + visual tokens).
  4. Perform inference using the VLM decoder.
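The four steps above can be sketched end to end with the VLM encoder and decoder replaced by stubs. The function names, page geometry, and fixed lines-per-page rendering are assumptions for illustration, not the paper's implementation:

```python
# Minimal sketch of the LongCodeOCR pipeline stages with stubbed-out
# rendering and encoding. Real rendering produces pixel pages; here a
# Page just records its nominal resolution and text content.

from dataclasses import dataclass

@dataclass
class Page:
    height: int
    width: int
    text: str  # stand-in for rendered pixels

def render_to_pages(code: str, lines_per_page: int = 3) -> list:
    # Step 1: split the raw context into image "pages" (fixed geometry here).
    lines = code.splitlines()
    return [
        Page(height=16 * lines_per_page, width=768,
             text="\n".join(lines[i:i + lines_per_page]))
        for i in range(0, len(lines), lines_per_page)
    ]

def encode_page(page: Page, patch: int = 16, merge: int = 4) -> list:
    # Step 2: "optical tokenization" stub -- emit tau(v) placeholder token ids.
    n = (page.height * page.width) // (patch * patch * merge)
    return [0] * n

def compose_input(instruction: str, pages: list) -> dict:
    # Step 3: instruction text concatenated with the visual token sequence.
    visual = [tok for p in pages for tok in encode_page(p)]
    return {"instruction": instruction, "visual_tokens": visual}

code = "\n".join(f"def f{i}(): pass" for i in range(7))
batch = compose_input("Summarize this module", render_to_pages(code))
# Step 4 would hand `batch` to the VLM decoder for the target task.
print(len(batch["visual_tokens"]))
```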

Comparison with DeepSeek-OCR (Wei et al., 21 Oct 2025) shows conceptual parallels—optical mapping for compression, vision transformers for encoding, and high-throughput scalability—though DeepSeek-OCR targets document OCR rather than code. Both frameworks employ aggressive compression ratios (often 10× or more) while demonstrating competitive fidelity compared to specialized textual approaches.

5. Empirical Results

Key results from large-scale evaluation across four code understanding tasks:

| Task | Ratio ($r$) | LongCodeZip (Textual) | LongCodeOCR (Qwen3-VL-8B) | LongCodeOCR (Glyph, 9B) |
|---|---|---|---|---|
| Summarization (CompScore) | ~1.7× | 50.40 | 87.25 (+36.85) | — |
| Repo-level Completion (ES) | ~2.0× | 39.48 | 50.22 (+10.74) | 60.68 (+21.20) |
| File-level Completion (LCC/ES) | ~2.0× | 36.93 | 42.21 (+5.28) | 45.90 |
| File-level Completion (LCC/EM) | ~2.0× | 10.60 | 12.00 (+1.40) | 13.60 |
| Code QA (32k–128k) | ~1.6× | 67.61% | 70.46% | — |
| Code QA (0.5–1M, ultra-long) | 3–12× | 64.0% | 72.0% (8.1×) | 70.0% (12.2×) |
| Compression overhead (1M tokens) | — | 4.3 h, 221.8M tokens | 70 s, 0 tokens | 70 s, 0 tokens |

On Long Module Summarization, LongCodeOCR surpasses text compressors by +36.85 CompScore at a matched ratio. For repo-level completion, the visual approach improves Edit Similarity by up to +21.20. For multi-file and ultra-long tasks (up to 1M tokens), LongCodeOCR reliably outperforms LongCodeZip at much higher compression ratios, while reducing preprocessing latency by two orders of magnitude (Zhong et al., 31 Jan 2026).
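For reference, the completion-fidelity metrics in the table can be computed as below. Defining Edit Similarity as 1 minus the length-normalized Levenshtein distance is a common convention; the paper's exact normalization may differ:

```python
# Completion metrics: Exact Match (EM) and Edit Similarity (ES),
# with ES taken as 1 - Levenshtein(pred, ref) / max(len(pred), len(ref)).

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def exact_match(pred: str, ref: str) -> float:
    return float(pred.strip() == ref.strip())

def edit_similarity(pred: str, ref: str) -> float:
    if not pred and not ref:
        return 1.0
    return 1.0 - levenshtein(pred, ref) / max(len(pred), len(ref))

print(exact_match("return x + 1", "return x + 1"))               # 1.0
print(round(edit_similarity("return x + 1", "return x + 2"), 3))
```

ES rewards near-misses that EM scores as zero, which is why the table's ES gaps between methods are larger in absolute terms than the EM gaps.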

6. Analysis: Coverage–Fidelity Trade-Off

The principal findings concerning the theoretical limits and trade-offs inherent in visual code compression are:

  • Global Coverage: Visual code rendering ensures full structural and relational code context is available to the VLM, preserving all linguistic and semantic dependencies. This proves critical for tasks emphasizing holistic understanding, such as summarization and cross-file completion.
  • Symbol Fidelity: At aggressive compression (small glyphs), minor losses in symbol-level information emerge, affecting strict EM metrics. This suggests a task-dependent trade-off: prioritizing visual methods for structure-centric tasks and textual filtering for symbol-critical applications.
  • Graceful Degradation: As code length increases, visual methods degrade more gradually than textual ones, which must severely prune context, leading to fragmented and incoherent reasoning (Zhong et al., 31 Jan 2026).
  • Computational Efficiency: Rendering-based compression is dramatically faster (~70 s vs. ~4.3 h at 1M tokens), consumes no LLM inference tokens during preprocessing, and scales efficiently to million-token regimes.

7. Limitations, Extensions, and Future Directions

Visual compression struggles at extreme densities: tiny glyphs introduce OCR artifacts, harming exact symbol recovery (EM). Rendering choices (e.g., variable font size for critical regions) and hybrid pipelines (visual for global intake, textual for local fidelity) are under active investigation. External validation with proprietary LLMs remains pending. A plausible implication is that integration of adaptive rendering and target-aware hybrid methods could offer improved coverage-fidelity balance for diverse code intelligence tasks (Zhong et al., 31 Jan 2026).

Future work may involve multi-stage pipelines combining visual and textual selection, automated font/layout adaptation, and broad benchmarking across closed, proprietary VLMs for in situ robustness assessment.
