
Unlimited OCR Works

Published 22 Jun 2026 in cs.CV and cs.CL | (2606.23050v1)

Abstract: Recently, end-to-end OCR models, exemplified by DeepSeek OCR, have once again thrust OCR into the spotlight. A widely held view is that employing an LLM as the decoder allows the model to leverage the prior distribution of language, leading to improved OCR performance. However, the downside is equally evident: as the output sequence lengthens, the accumulated KV cache drives up memory consumption and progressively slows down generation. This stands in stark contrast to humans, who exhibit no such decline in efficiency during long-horizon copying tasks. In this technical report, we propose Unlimited OCR, a model designed to emulate human parsing working memory. Taking DeepSeek OCR as the baseline, we replace all attention layers in the decoder with our proposed Reference Sliding Window Attention (R-SWA), which reduces attention computation costs while maintaining a constant KV cache throughout the entire decoding process. By combining the high compression rate of DeepSeek OCR's encoder with our constant KV cache design, Unlimited OCR can transcribe dozens of pages of documents in a single forward pass under a standard maximum length of 32K. More importantly, R-SWA is a general-purpose parsing attention mechanism - beyond OCR, it is equally applicable to tasks such as ASR, translation, etc. Code and model weights are publicly available at http://github.com/baidu/Unlimited-OCR.

Summary

  • The paper presents a novel Reference Sliding Window Attention (R-SWA) mechanism that decouples memory usage from sequence length.
  • It replaces standard decoder attention with R-SWA, enabling one-shot, high-throughput parsing across multi-page, multimodal documents.
  • Experimental results demonstrate state-of-the-art accuracy with improved table extraction and constant inference speed regardless of document length.

Unlimited OCR: Advancing Efficient Long-Horizon Parsing with Reference Sliding Window Attention

Introduction

"Unlimited OCR Works" (2606.23050) addresses a persistent bottleneck in end-to-end Optical Character Recognition (OCR): sustaining high efficiency during long-horizon parsing of documents spanning dozens of pages in a single inference pass. While recent models leverage LLM-based decoders to integrate powerful linguistic priors, such architectures exhibit prohibitive memory growth and latency due to unbounded Key-Value (KV) cache expansion in vanilla attention schemes when processing long output sequences. This contrasts sharply with human copying, where working memory supports fluid, context-aware transcription without slowing down as the text grows. The paper introduces Unlimited OCR, which overcomes these scalability constraints through a novel Reference Sliding Window Attention (R-SWA) mechanism, enabling efficient one-shot parsing across multimodal, multi-page settings.

Methodology

Network Architecture

Unlimited OCR builds on DeepSeek OCR’s high-compression DeepEncoder and replaces every decoder attention layer with R-SWA. The vision encoder, a cascade of window and global attention ViTs, attains up to a 16× token compression rate. The decoder is a 3B-parameter Mixture-of-Experts (MoE) LLM, with just 0.5B active parameters per pass, maximizing inference speed and resource efficiency.
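The token arithmetic behind the encoder's compression can be sketched as follows, taking a 1024×1024 page as the example. The 16×16-pixel patch size, the reading of 32K as 32,768 tokens, and the idea of reserving part of the context for output tokens are assumptions made for illustration; the function names are hypothetical and not part of the released code.

```python
def visual_tokens_per_page(side_px: int = 1024, patch_px: int = 16,
                           compression: int = 16) -> int:
    """Visual tokens left after patchifying a square page and compressing 16x."""
    patches = (side_px // patch_px) ** 2   # assumed 16 px patches: 64 * 64 = 4096
    return patches // compression          # 16x compression: 4096 / 16 = 256


def max_pages(context_len: int = 32_768, prompt_tokens: int = 0,
              reserved_output_tokens: int = 0, tokens_per_page: int = 256) -> int:
    """Upper bound on pages whose visual tokens fit in the context budget.

    How the real model splits its 32K budget between prefix and output is
    not stated here, so the reservation is left as a free parameter.
    """
    budget = context_len - prompt_tokens - reserved_output_tokens
    return max(budget, 0) // tokens_per_page


print(visual_tokens_per_page())                    # tokens for one 1024x1024 page
print(max_pages())                                 # ceiling with nothing reserved
print(max_pages(reserved_output_tokens=16_384))    # half the context kept for output
```

Under these assumptions a page costs 256 visual tokens, so the ceiling is 128 pages with nothing reserved and 64 pages if half the context is kept for output, which is in line with the "dozens of pages" figure.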

Reference Sliding Window Attention (R-SWA)

R-SWA introduces a two-segment attention window. Decoding tokens always attend globally to the prefix tokens (the static visual tokens and the document prompt) while attending only locally within a sliding window of width $n$ (default: 128) over the preceding outputs. Unlike vanilla Multi-Head Attention, whose KV cache grows with sequence length ($L_m + T$), R-SWA’s design bounds the cache at $L_m + n$, where $L_m$ is the (constant) prefix length and $T$ is the number of tokens decoded so far. This sharply truncates memory demand and enables constant inference latency regardless of output length.
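The two cache-size expressions can be compared with a few lines of arithmetic. This is a minimal sketch: the helper name and the 2,560-token prefix are hypothetical values chosen for illustration, and only the default window of 128 comes from the text.

```python
from typing import Optional


def kv_cache_entries(prefix_len: int, decoded: int,
                     window: Optional[int] = None) -> int:
    """Cached key/value positions per layer after `decoded` output tokens.

    window=None models full attention (prefix + every decoded token);
    an integer window models R-SWA (prefix + at most `window` decoded tokens).
    """
    if window is None:
        return prefix_len + decoded
    return prefix_len + min(decoded, window)


PREFIX = 2560   # hypothetical prefix length, e.g. ten pages of 256 visual tokens
WINDOW = 128    # default window size quoted in the text

for decoded in (100, 1_000, 10_000):
    full = kv_cache_entries(PREFIX, decoded)
    rswa = kv_cache_entries(PREFIX, decoded, WINDOW)
    print(f"decoded={decoded:>6}  full={full:>6}  r-swa={rswa:>6}")
```

With these illustrative numbers the R-SWA cache stops growing at 2,688 entries, while the full-attention cache reaches 12,560 entries after 10,000 decoded tokens.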

The formalized attention slice for each token $t$ in the decode region is:

  • $N(t) = P \cup D_n(t)$, where $P$ denotes the prefix tokens, and $D_n(t)$ is the local output token window.
  • The resulting attention distribution matches human-like "soft-forgetting," maintaining fresh attention to immediate outputs while preserving full reference accessibility.
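The attention set of a decode token, the full prefix plus a causal window over recent outputs, can be written out explicitly. This is a sketch under assumptions: positions are 0-based with the prefix first, and the window is taken to include the current token itself, which the summary does not specify. The function name is hypothetical.

```python
from typing import List


def attended_positions(t: int, prefix_len: int, window: int) -> List[int]:
    """Positions visible to the decode token at absolute index `t`.

    Positions 0..prefix_len-1 form the prefix; positions >= prefix_len are
    decode tokens. The local window covers the last `window` decode positions
    up to and including `t` (an assumption about the exact boundary).
    """
    if t < prefix_len:
        raise ValueError("t must index a decode token")
    start = max(prefix_len, t - window + 1)   # never reach back into the prefix
    return [*range(prefix_len), *range(start, t + 1)]


# Tiny example: 4 prefix tokens, window of 3.
print(attended_positions(5, prefix_len=4, window=3))    # early: window not yet full
print(attended_positions(10, prefix_len=4, window=3))   # later: older outputs dropped
```

Whatever the value of `t`, the returned list never holds more than `prefix_len + window` positions, which is the bound that keeps the cache constant.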

Distinctively, the visual tokens do not participate in state transitions, avoiding the progressive degradation ("blurring") that would erode feature fidelity in linear or fully recurrent attention architectures.

Inference and Training

Expanding on DeepSeek OCR, Unlimited OCR is fine-tuned on 2 million document OCR samples (single and multi-page) with 32K output context, parallelized across 8×16 A800s using Megatron-LM and leveraging DeepEP for efficient expert parallelism. During inference, custom R-SWA cache management ensures throughput (TPS) and GPU memory remain stable even as sequence length grows.
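One way to picture the cache management described here is a store with a frozen prefix and a bounded queue for decode tokens. This is a toy sketch and not the released implementation: the class name is hypothetical, and strings stand in for per-layer key/value tensors.

```python
from collections import deque


class ReferenceWindowCache:
    """Toy KV store: a frozen prefix plus a bounded window of decode entries."""

    def __init__(self, prefix_kv, window: int):
        self.prefix = tuple(prefix_kv)        # written once at prefill, never evicted
        self.recent = deque(maxlen=window)    # oldest decode entry drops automatically

    def append(self, kv) -> None:
        """Add the newest decode entry, evicting the oldest if the window is full."""
        self.recent.append(kv)

    def visible(self) -> list:
        """Entries the next attention step may read: prefix first, then the window."""
        return list(self.prefix) + list(self.recent)

    def __len__(self) -> int:
        return len(self.prefix) + len(self.recent)


cache = ReferenceWindowCache(prefix_kv=[f"v{i}" for i in range(4)], window=3)
for step in range(10):
    cache.append(f"d{step}")

print(len(cache))        # 7: four prefix entries plus three recent ones
print(cache.visible())   # prefix v0..v3 followed by d7, d8, d9
```

Because eviction only ever touches the decode queue, the cache length stays at the prefix length plus the window size however many tokens are generated.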

Experimental Results

OmniDocBench Evaluation

Unlimited OCR achieves substantial performance improvements over both pipeline-based and prior end-to-end models:

| Model | Params | OmniDocBench v1.5 Overall | Text Edit Dist. | Table TEDS (%) | Inference Speed (TPS) |
|---|---|---|---|---|---|
| DeepSeek OCR | 3B | 87.01 | 0.073 | 84.97 | 4951 |
| DeepSeek OCR 2 | 3B | 89.17 | 0.049 | 85.60 | - |
| Unlimited OCR | 3B | 93.23 | 0.038 | 90.93 | 5580 |

Unlimited OCR attains a 6.22-point improvement in overall score (93.23 vs. 87.01) and a 5.96-point gain in table TEDS (90.93 vs. 84.97) over DeepSeek OCR. On the latest v1.6 benchmark, Unlimited OCR scores 93.92, further supporting the efficacy and generality of R-SWA.
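The quoted gains are absolute differences between the DeepSeek OCR and Unlimited OCR rows of the table, which a quick check confirms. The relative throughput figure below is derived from the TPS column and is not a number the summary itself reports.

```python
# Values copied from the OmniDocBench v1.5 table above.
baseline = {"overall": 87.01, "teds": 84.97, "tps": 4951}    # DeepSeek OCR
unlimited = {"overall": 93.23, "teds": 90.93, "tps": 5580}   # Unlimited OCR

# Absolute differences, i.e. percentage points rather than relative percent.
overall_gain = round(unlimited["overall"] - baseline["overall"], 2)
teds_gain = round(unlimited["teds"] - baseline["teds"], 2)

# Relative throughput change derived from the TPS column.
tps_speedup_pct = round(100 * (unlimited["tps"] / baseline["tps"] - 1), 1)

print(overall_gain, teds_gain, tps_speedup_pct)
```

The differences come out to 6.22 and 5.96 points, and the TPS column implies roughly 12.7% higher throughput in this benchmark setting.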

Long-horizon Parsing

Unlimited OCR demonstrates consistent performance as the number of pages increases, with edit distances remaining below 0.11 even at >40 pages and >96% distinct-n for long n-grams, indicating effective preservation of information and context tracking across extensive outputs.
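Distinct-n, the ratio of unique n-grams to all n-grams in the generated text, is simple to compute and drops sharply when a model falls into a repetition loop. A minimal sketch, assuming whitespace tokenization (the tokenization used in the paper is not stated here):

```python
from typing import Sequence


def distinct_n(tokens: Sequence[str], n: int) -> float:
    """Ratio of unique n-grams to total n-grams; 0.0 if the text is too short."""
    total = len(tokens) - n + 1
    if total <= 0:
        return 0.0
    ngrams = {tuple(tokens[i:i + n]) for i in range(total)}
    return len(ngrams) / total


varied = "the quick brown fox jumps over the lazy dog".split()
looping = "total total total total total".split()

print(distinct_n(varied, 2))    # every bigram is different
print(distinct_n(looping, 2))   # one bigram repeated four times
```

A value near 1 for long n-grams therefore suggests the model is not duplicating passages as the output grows, which is what the reported >96% is meant to show.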

TPS remains constant as output length increases, whereas DeepSeek OCR exhibits an inevitable decline in throughput due to linear cache growth—a 35% speed gap emerges by 6,000 tokens.

Subcategory Analysis

Across diverse document structures (PPTs, academic papers, books, magazines, notes), Unlimited OCR consistently outperforms baselines in both text edit distance and reading order, indicating that R-SWA’s benefits generalize beyond homogeneous layouts.

Implications and Future Directions

Unlimited OCR demonstrates that attention mechanisms emulating human working memory—persistent references to source data and dynamic, narrow output context—yield scalable, high-fidelity long-horizon document parsing. This architecture decouples output latency and memory usage from sequence length, a property immediately valuable for OCR, but theoretically applicable to other reference-based sequence transduction tasks such as ASR and machine translation.

The primary limitation remains the fixed maximum context for visual/prefix tokens (e.g., 32K). Surmounting this by further increasing context length or introducing dynamic prefix retrieval—akin to human page flipping—could extend applicability to truly unbounded document analysis.

Conclusion

Unlimited OCR presents R-SWA as a practical substitute for full-sequence attention in large-scale multimodal document parsing, with no accuracy loss on the reported benchmarks. With state-of-the-art accuracy, improved throughput, and constant resource usage as outputs grow within the supported context, this work extends the scalability of end-to-end OCR architectures and suggests a direction for long-horizon sequence modeling in multimodal LLMs (2606.23050).

Explain it Like I'm 14

Overview

This paper is about making computers much better at reading long documents, like entire books or long PDFs, all at once. The authors introduce a new OCR system called “Unlimited OCR” that can transcribe many pages in one go without slowing down or running out of memory. The key idea is a new way of “paying attention” while writing the output, called Reference Sliding Window Attention (R-SWA), which copies how people focus when they copy text by hand.

Key questions the paper asks

Here are the simple questions the researchers wanted to answer:

  • How can we make a computer read and transcribe many pages in a row without getting slower or using too much memory?
  • Can we keep accuracy high while doing this?
  • Is there a general trick (not just for OCR) that helps with long tasks like speech-to-text or translation?

How the method works

When a modern OCR model writes down text, it uses a part called a “decoder” that decides what the next character or word should be. Traditional decoders look back at everything they have already written, which becomes slower and more memory-hungry as the text gets longer.

To understand why that’s a problem, think of two ideas:

  • Attention: This is how the model decides what to look at before writing the next word.
  • KV cache: This is the model’s short-term memory of what it has seen and written so far. In normal systems, this memory grows and grows as the output gets longer, which eats RAM and slows everything down.

The big idea: Reference Sliding Window Attention (R-SWA)

R-SWA changes what the decoder looks at:

  • It always sees all the “reference” tokens. In OCR, these are the image features (the picture of the page) and the prompt. Think of this like keeping the original book open in front of you the whole time.
  • It only looks at a small, recent window of what it just wrote (for example, the last 128 tokens). Think of this like glancing at the last few words you wrote to stay on track, instead of rereading the entire page every time.

Why this helps:

  • The model’s memory (KV cache) stops growing with output length. It becomes “constant size,” like having a fixed-size notepad instead of a notebook that gets heavier and heavier.
  • Because the reference image tokens are never “updated” as the model writes, their details don’t get blurred or lost. The model sees the picture clearly the whole time.

The full system: Unlimited OCR

Unlimited OCR combines two parts:

  • DeepEncoder: A visual encoder that compresses the input images a lot (e.g., a 1024×1024 page can be turned into just 256 “visual tokens”). This keeps the image side small but rich in detail.
  • An LLM decoder with R-SWA: An LLM that writes the text while looking at the image tokens and a short window of recent output. It uses a mixture-of-experts design to be fast while staying accurate.

Together, these let the system read dozens of pages in one pass under a typical 32K token limit, without slowing down as the output gets longer.

What they did to test it

The team trained the model on about 2 million OCR examples (mostly single-page, some multi-page stitched together) and then evaluated it on a standard benchmark called OmniDocBench, which checks many skills:

  • Reading text
  • Reading math formulas
  • Understanding table structures
  • Predicting the correct reading order

They also tested long documents (2, 5, 10, 20, and 40+ pages) to see if the system stays accurate and fast even when the output is very long.

Main findings

Here are the most important results:

  • Higher accuracy: On OmniDocBench v1.5, the overall score reached about 93%, beating the baseline model (DeepSeek OCR) by roughly 6 percentage points. It also performed at state-of-the-art levels on v1.6.
  • Faster and steadier: As the output gets longer, normal models slow down because their memory keeps growing. Unlimited OCR stays fast because its memory stays constant. In some tests with long outputs, it was up to about 35% faster.
  • Works on many pages at once: It can transcribe many pages in a single pass, keeping accuracy solid even with 20+ pages, and still doing well at 40+ pages.
  • No trade-off in quality: Replacing standard attention with R-SWA didn’t hurt accuracy; it actually improved it for text, tables, and reading order in many document types.

Why this matters:

  • Reading long documents typically means splitting them into pages and starting fresh each time, which loses context and is slow. Unlimited OCR handles long documents in one flow, like a person copying a book without breaking concentration.

Implications and impact

This work suggests a new, more “human-like” way for AI to handle long tasks:

  • For OCR: It means one-shot parsing of big PDFs and books, with consistent speed and memory use.
  • For other tasks: The same R-SWA idea can be used for speech-to-text (ASR), translation, and other long “reference-based” jobs where the model should always see the source (audio, text, image) plus just a little of what it has just produced.
  • In practice: This can lower costs (less memory), speed up processing, and make large-scale document understanding more reliable.

A simple caveat:

  • It’s not truly “unlimited.” The input (image tokens) still has to fit into a maximum context length. The authors plan to train for longer contexts (like 128K) and design a “prefill pool” to load parts of the reference as needed—like flipping through pages when necessary.

Bottom line

Unlimited OCR introduces a smarter way to focus attention: always look at the source, and only glance at the last bit of what you wrote. This keeps memory and speed steady, even for very long documents, and it improves accuracy. It’s a practical step toward AI systems that can handle long, continuous tasks more like humans do.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a consolidated list of what remains missing, uncertain, or unexplored in the paper, phrased to guide concrete follow-up research.

  • Lack of controlled ablations isolating R-SWA’s impact: Train a baseline with standard MHA under identical continued-training data/schedule and compare to R-SWA to attribute gains beyond additional fine-tuning.
  • Sensitivity to the sliding window size n: Systematically vary n (e.g., 32–512), and study the accuracy–latency trade-off, drift behavior, and error types across document lengths and layouts.
  • Per-layer/per-head windowing strategies: Evaluate heterogeneous R-SWA configurations (e.g., larger n in higher layers or selected heads) to identify minimal compute for maximal accuracy.
  • Comparison against alternative efficient attention schemes: Benchmark R-SWA against Mistral-style sliding window, chunked attention, Transformer-XL recurrence, SSMs, hybrid cross-attention, and KV-eviction variants under identical training.
  • Scaling with large prefix m: Quantify per-token latency and memory as m grows (pages×tokens), not only across decode steps but also across prefix sizes (e.g., 1k–50k visual tokens); provide speed–memory curves vs m.
  • Practical limits of “dozens of pages”: Establish the maximum number of pages supported at various resolutions within 32K and beyond (e.g., 64K/128K), and measure degradation curves in accuracy and speed.
  • Generalization beyond OCR remains untested: Implement and evaluate R-SWA on ASR (including streaming) and machine translation to validate the “general-purpose parsing” claim.
  • Long-horizon evaluation breadth: Move beyond edit distance and Distinct-n to document-level metrics—duplication/omission rates, cross-page consistency, global reading order across pages, and cross-references (e.g., section/figure citations).
  • Real multi-page vs synthetic multi-page training: Replace concatenated single pages with native multi-page documents to assess whether training on realistic inter-page dependencies yields different behavior.
  • Dataset and benchmark transparency: Release the in-house long-horizon dataset or provide detailed descriptions and seeds to enable reproducible evaluation of “40+ pages” claims.
  • Risk of training–test contamination: Audit training data to ensure no overlap with OmniDocBench test sets and document measures taken to prevent leakage.
  • Robustness to non-PDF and difficult inputs: Evaluate on camera-captured documents, low-light/blur, skewed/curved pages, non-Latin scripts, right-to-left/vertical text, handwriting, and unusual fonts.
  • Encoder compression vs fine detail: Quantify how the 16× compression affects small text and fine structures; explore multi-scale encoders, adaptive compression, or partial encoder fine-tuning to mitigate small-text errors.
  • Resolution strategy for multi-page: Study dynamic per-page resolution (“Gundam” mode) in multi-page settings, including scheduling/prefetch policies and their impact on throughput and accuracy.
  • Positional encoding and extrapolation: Specify the positional encoding (e.g., RoPE) and test extrapolation to sequences longer than 32K; evaluate RoPE scaling or alternative schemes under R-SWA.
  • Interaction with MoE routing: Analyze how sliding-window constraints affect expert activation patterns, stability, and load balancing; explore EP settings and routing regularization.
  • Decoding strategies beyond greedy: Assess compatibility and performance with beam search, constrained decoding, and temperature sampling under KV-eviction and R-SWA masks.
  • Page-level memory management beyond 32K: Prototype the proposed “prefill pool” (page-wise KV chunking/fetching) and design policies for fetching/eviction that preserve alignment and speed.
  • Subtoken/tokenizer effects: Examine how tokenizer choices (e.g., formula-specific vocabularies) influence formula CDM and table TEDS under R-SWA; compare general vs domain-specific tokenizers.
  • Mixing attention types: Test hybrids where some layers retain full attention (or cross-attention to prefix) while others use R-SWA to check if partial full-attention restores global consistency for edge cases.
  • Long-range semantic consistency: Create tasks probing cross-page references (TOC-to-section alignment, figure/table references, citations/footnotes) to test if limited decode history (n) harms global coherence.
  • Drift detection and mitigation: Instrument runs to detect position drift/looping over very long outputs and explore adaptive n, memory “reminders,” or periodic summary tokens to prevent degradation.
  • Efficiency on diverse hardware and loads: Report absolute memory footprints and wall-clock throughput across GPUs, batch sizes, concurrencies, and varying m,n; include non-ideal concurrency scenarios.
  • Real-time/streaming OCR: Evaluate whether R-SWA supports incremental page ingestion and streaming outputs (chunked prefix updates) without re-prefill, including latency jitter and stability.
  • Handling extremely large m: Investigate saliency-based or learned selection of reference tokens to cap effective m (e.g., top-k visual tokens, hierarchical references) and measure impact on accuracy.
  • Structural tasks across pages: Benchmark cross-page tables/figures, multi-page tables of contents, and threading of reading order through section boundaries; current metrics focus mostly on single-page structure.
  • Training regime limitations: Only 4k continued-training steps with frozen encoder—evaluate from-scratch training with R-SWA and joint encoder–decoder tuning to understand data/compute requirements.
  • Window adaptation to layout complexity: Explore dynamic n conditioned on local difficulty (e.g., denser layouts, rotated text) with policies learned during training.
  • Robustness to page order errors: Test resilience to shuffled/missing pages and the model’s ability to recover or signal ordering issues under limited decode memory.
  • Non-autoregressive or chunk-parallel decoding: Investigate whether R-SWA can enable block-wise decoding (e.g., per region/page) while maintaining global consistency via reference tokens.
  • Reproducibility and code details: Provide kernel/implementation specifics (masking, precision, FlashAttention configs), unit tests for KV-eviction correctness, and cross-engine equivalence checks.
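As an illustration of the KV-eviction correctness check mentioned in the last item, the sketch below compares two ways of computing one attention step: masking a cache that keeps everything, and reading a bounded cache that evicts old entries. It is a single-head, pure-Python toy with random vectors; all names are hypothetical, and a real test would compare actual kernel outputs instead.

```python
import math
import random
from collections import deque


def attend(query, keys, values):
    """Single-head softmax attention for one query over parallel key/value lists."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)                                  # subtract max for stability
    weights = [math.exp(s - peak) for s in scores]
    norm = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / norm
            for j in range(len(values[0]))]


def rand_vec(rng, dim):
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def max_eviction_error(prefix_len=5, steps=40, window=4, dim=8, seed=0):
    """Largest output gap between masked full history and the evicting cache."""
    rng = random.Random(seed)
    prefix = [(rand_vec(rng, dim), rand_vec(rng, dim)) for _ in range(prefix_len)]
    history = []                      # every decode (key, value) pair, never evicted
    recent = deque(maxlen=window)     # bounded cache with automatic eviction
    worst = 0.0
    for _ in range(steps):
        query = rand_vec(rng, dim)
        kv = (rand_vec(rng, dim), rand_vec(rng, dim))
        history.append(kv)
        recent.append(kv)
        masked = prefix + history[-window:]    # full cache with the window mask
        cached = prefix + list(recent)         # what the bounded cache holds
        out_masked = attend(query, [k for k, _ in masked], [v for _, v in masked])
        out_cached = attend(query, [k for k, _ in cached], [v for _, v in cached])
        worst = max(worst, max(abs(a - b) for a, b in zip(out_masked, out_cached)))
    return worst


print(max_eviction_error())   # expected to be zero if eviction bookkeeping is right
```

The check only exercises the bookkeeping (which entries survive eviction and in what order); it says nothing about numerical differences between attention kernels or inference engines.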

Practical Applications

Immediate Applications

The following items can be deployed with the paper’s released code/weights and standard GPU inference stacks (Transformers/SGLang), leveraging R-SWA and the DeepEncoder to enable stable-latency, long-horizon OCR.

  • One-shot, multi-page enterprise OCR at scale — Sectors: finance, insurance, legal, government
    • What it enables: End-to-end extraction of long document bundles (e.g., KYC/loan packages, claims files, contracts) in a single pass with consistent throughput and memory usage.
    • Tools/workflows: Wrap Unlimited OCR as a microservice; replace page-by-page loops in existing RPA/document pipelines with one-shot prefill+decode; export to JSON/HTML/Markdown for downstream systems (DMS, ERP).
    • Assumptions/dependencies: GPU inference (A100/A800-class) or cloud GPUs; 32K max sequence context limits total prefills (dozens of pages with DeepEncoder compression); small-font accuracy may require higher-resolution “Gundam” mode.
  • Reading-order–aware PDF-to-accessible-text conversion — Sectors: public sector, education, accessibility
    • What it enables: Produce tagged PDF/HTML/EPUB with correct reading order for screen readers and Section 508/WCAG compliance; consistent results on long reports without page re-initialization artifacts.
    • Tools/workflows: Batch conversion service for agencies/universities; reading-order outputs drive auto-tagging of PDFs.
    • Assumptions/dependencies: Layouts unseen in training may require light post-processing; compliance validation workflows remain necessary.
  • Scientific/technical document conversion with tables and formulas — Sectors: academia, publishing, pharma R&D
    • What it enables: High-quality extraction of tables (TEDS gains) and LaTeX-like math from long PDFs (theses, reports, regulatory submissions) in one pass; fewer stitching errors across pages.
    • Tools/workflows: PDF-to-XML/JSON or LaTeX pipelines; data lake ingestion for literature mining and RAG.
    • Assumptions/dependencies: Some formula/table corner cases may need domain post-processing; ensure licensing for large-scale corpus conversion.
  • Archive and records digitization with stable throughput — Sectors: libraries, cultural heritage, public records
    • What it enables: Digitize books/newspapers/archives across dozens of pages per job with constant TPS and memory; reduces job orchestration and operator overhead.
    • Tools/workflows: Integrate with scanning/MFP fleets and OCR servers; output searchable text and structured metadata.
    • Assumptions/dependencies: Image quality varies; low-resolution scans may need re-scan or higher-res encoder settings.
  • Cloud OCR SaaS with predictable latency and cost — Sectors: software/SaaS, cloud platforms
    • What it enables: Offer SLAs for long documents (flat per-token latency beyond 256 tokens); better capacity planning and autoscaling.
    • Tools/workflows: Deploy with SGLang + FlashAttention v3; concurrency tuning to maintain steady TPS.
    • Assumptions/dependencies: GPU availability; request batching strategies; monitoring for prefill-length ceilings.
  • Confidential on-prem OCR for regulated industries — Sectors: healthcare (EHR scanning), finance (audits), defense
    • What it enables: Process sensitive multi-page documents in-house on modest GPU servers due to bounded KV cache; reduces memory pressure compared to full-attention decoders.
    • Tools/workflows: Secure on-prem deployment, air-gapped pipelines, audit logging.
    • Assumptions/dependencies: Security reviews; data retention policies; verify domain accuracy for specialized forms.
  • RAG/content ingestion preprocessing — Sectors: enterprise AI, software
    • What it enables: Reliable, reading-order–correct text extraction from long PDFs prior to chunking/indexing, improving downstream retrieval quality.
    • Tools/workflows: Plug into ETL jobs (Airflow/Beam); produce clean, de-duplicated text with <page> delimiters.
    • Assumptions/dependencies: Existing retrievers may need chunking tuned to R-SWA’s page-level continuity; handle tables/formulas as structured artifacts as needed.
  • Dataset creation and weak supervision for document AI research — Sectors: academia, ML/AI tooling
    • What it enables: Generate high-coverage annotations (text + block coordinates) at scale from PDFs to bootstrap or augment training datasets.
    • Tools/workflows: Use Unlimited OCR outputs as labels for distillation/active learning; cross-validate with smaller detectors.
    • Assumptions/dependencies: Annotation noise in complex layouts; consider human-in-the-loop verification for gold sets.
  • Workflow modernization (reduce orchestration/loop overhead) — Sectors: BPO/shared services, IT
    • What it enables: Replace brittle per-page loops and external schedulers with single-pass parsing, simplifying code and reducing latency variance.
    • Tools/workflows: Refactor pipelines to prefill all pages then decode; standardize logging/metrics around constant-window decoding.
    • Assumptions/dependencies: Sequence-length budget planning; backpressure handling for very large documents.

Long-Term Applications

These opportunities require further research, scaling, or engineering (e.g., 128K contexts, prefill-pool mechanism, domain adaptation, or extending R-SWA beyond OCR).

  • Hour-scale long-form ASR with constant memory — Sectors: media, call centers, legal compliance
    • What it enables: Transcribe podcasts, hearings, earnings calls, and multi-hour meetings without KV growth, enabling consistent real-time throughput.
    • Tools/products: R-SWA decoder in speech models; streaming ASR services with bounded caches.
    • Assumptions/dependencies: Training ASR with R-SWA; robust acoustic modeling; domain noise handling.
  • Book-length machine translation — Sectors: localization, publishing, education
    • What it enables: Translate long documents while attending to full source (reference) and a local output window; reduces compute costs vs full attention.
    • Tools/products: R-SWA applied to encoder–decoder NMT; publishing workflows for continuous document translation.
    • Assumptions/dependencies: Parallel corpora and training; evaluation on long-form discourse phenomena.
  • “Truly unlimited” OCR via prefill pools and 128K+ contexts — Sectors: archives, enterprise records
    • What it enables: Flip-through/page-fetch mechanism to handle hundreds to thousands of pages by dynamically swapping prefill KV chunks.
    • Tools/products: Prefill-cache manager; memory-mapped KV pools; hierarchical pagination strategies.
    • Assumptions/dependencies: Model training to learn prefill fetching; longer context hardware/software support.
  • Multimodal live meeting assist (slides + audio + transcripts) — Sectors: enterprise productivity, education
    • What it enables: Joint parsing of slide decks (as reference) and audio captions with R-SWA for near-zero drift over long sessions.
    • Tools/products: Meeting assistants with synchronized slide-aware ASR; lecture capture systems.
    • Assumptions/dependencies: Fusion training for audio+vision; latency constraints in real-time.
  • Edge/on-device long-horizon parsers — Sectors: mobile, IoT, MFP/scanners
    • What it enables: Smaller R-SWA models on NPUs/edge GPUs to OCR multi-page scans locally (privacy-preserving).
    • Tools/products: Quantized R-SWA models; device firmware integration.
    • Assumptions/dependencies: Model compression/quantization; memory-constrained KV implementations.
  • Deep document understanding to knowledge graphs — Sectors: finance (10-K/ESG), healthcare (clinical trials), policy analysis
    • What it enables: Cross-section linking (e.g., figures, tables, references) over entire reports/books, preserving long-range coherence for structured extraction.
    • Tools/products: Doc-to-graph pipelines; compliance/explainability dashboards.
    • Assumptions/dependencies: Additional IE components (NER, relation extraction); training on cross-page link tasks.
  • Energy- and cost-efficient LLM inference at long horizons — Sectors: cloud/infra, sustainability
    • What it enables: Flat memory/latency for long sequences reduces energy per token and cloud costs; supports “green AI” initiatives and ESG reporting.
    • Tools/products: R-SWA kernels optimized for vendors; autoscalers tuned for constant TPS.
    • Assumptions/dependencies: Vendor kernel support; standardized energy metering.
  • Robotics/operations agents that “read manuals” — Sectors: robotics, manufacturing, field service
    • What it enables: Agents that consult long manuals as reference while keeping a small working memory window to execute multi-step procedures.
    • Tools/products: R-SWA-enabled multimodal policy modules; maintenance copilots.
    • Assumptions/dependencies: Safety validation; integration with perception/control stacks.
  • Legal e-discovery and investigations at massive scale — Sectors: legal tech, compliance
    • What it enables: Parse and index millions of pages with predictable compute, enabling timely discovery and consistent performance across heterogeneous corpora.
    • Tools/products: Discovery platforms embedding R-SWA OCR; provenance/audit trails.
    • Assumptions/dependencies: Cluster orchestration; accuracy guardrails; chain-of-custody requirements.
  • Standards and benchmarking for long-horizon parsing — Sectors: research, policy, standards bodies
    • What it enables: New evaluation protocols beyond per-page metrics (e.g., drift, repetition, long-range coherence) and guidance for public-sector digitization.
    • Tools/products: Open benchmarks for >100-page parsing; procurement/spec templates for agencies.
    • Assumptions/dependencies: Community adoption; reproducibility infrastructure.

Cross-cutting assumptions and dependencies

  • Context length is the primary limiter today (32K), constraining total “one-shot” pages; planned 128K contexts and prefill-pool mechanisms will expand this.
  • Visual fidelity depends on DeepEncoder resolution; small text or dense layouts may need higher-res modes or domain-specific fine-tuning.
  • R-SWA window size n (default 128) may require task/domain tuning to balance coherence and compute.
  • Reliable deployment assumes GPU availability and optimized kernels (e.g., FlashAttention v3) with SGLang/Transformers support.
  • Compliance-sensitive sectors should retain human-in-the-loop validation and maintain auditability.
  • Multilingual and highly specialized domains may require additional training data and evaluation before production use.

Glossary

  • ASR: Automatic Speech Recognition; converting spoken language into text. "beyond OCR, it is equally applicable to tasks such as ASR, translation, etc."
  • Causal sliding window: A bounded, forward-only attention span over recent tokens that shifts as decoding progresses. "D_n(t) denotes the causal sliding window of width n over the decode region."
  • CDM (Formula CDM): A metric for evaluating mathematical formula recognition quality. "Formula CDM (CDM↑), which evaluates the quality of mathematical formula recognition;"
  • CLIP-ViT: The Vision Transformer backbone from CLIP used for image encoding. "cascades SAM-ViT [15] with CLIP-ViT [25]"
  • DeepEncoder: A high-compression visual encoder that reduces image tokens for efficient decoding. "DeepEncoder is originally introduced in DeepSeek OCR [34]."
  • DeepEP: An expert-parallel training system for Mixture-of-Experts models to support long sequences. "To support 32K training, we adopt DeepEP [18], with expert parallelism (EP) set to 4."
  • Distinct-n: A diversity metric measuring the ratio of unique n-grams to all n-grams in generated text. "Distinct-n is the ratio of the number of unique n-grams to the total number of n-grams in the generated text."
  • Expert parallelism (EP): A parallelization strategy that distributes MoE experts across devices. "with expert parallelism (EP) set to 4."
  • Flash Attention v3: A high-performance GPU attention kernel that reduces memory and latency. "Figure 1 | The latency of the Flash Attention v3 kernel as decoding length increases."
  • Global attention: An attention mechanism where tokens can attend to all others without locality constraints. "global attention is reserved exclusively for the compressed tokens."
  • KV cache: Stored key/value tensors from past tokens used to speed autoregressive attention during decoding. "the accumulated KV cache drives up memory consumption and progressively slows down generation."
  • KV cache eviction: Removing the oldest key/value entries to keep the cache at a fixed size during generation. "the KV corresponding to the (m + 1)-th token in the queue is evicted"
  • Megatron-LM: A distributed training framework for LLMs with model/data parallelism. "The entire training pipeline is built on the Megatron-LM [27] framework."
  • Mixture-of-Experts (MoE): A model architecture that routes inputs to a subset of specialized expert networks per token. "a Mixture-of-Experts (MoE) architecture that enjoys 3B total and 500M activated parameters."
  • Multimodal LLMs (MLLMs): LLMs that process multiple input modalities (e.g., text and images). "to explore how multimodal LLMs (MLLMs) [8, 14, 22, 28] can handle simple long-horizon parsing tasks"
  • OmniDocBench: A benchmark suite for evaluating document parsing across text, formulas, tables, and reading order. "We select OmniDocBench [23] as the main benchmark for evaluating foundational document OCR capabilities, and test the Unlimited OCR on both v1.5 and v1.6 versions."
  • One-shot parsing: Parsing an entire multi-page document in a single forward pass without page-by-page loops. "Unlimited OCR not only enables one-shot parsing of an entire book"
  • Paddle OCR: An OCR toolkit used for annotating and preparing training data. "we use Paddle OCR [11] for annotation"
  • Prefill: The initial inference phase that processes the prefix (e.g., visual/prompt) tokens and fills the KV cache before token-by-token generation begins. "the prefill length is fixed at 10"
  • Reading Order Edit Distance: A metric that measures how accurately a model predicts the sequence in which content should be read. "Reading Order Edit Distance (Edit ↓), which quantifies the correctness of predicted reading sequences."
  • Reference Sliding Window Attention (R-SWA): The proposed attention mechanism that attends to all reference (prefix) tokens and a fixed-width causal window of recent outputs. "We introduce Reference Sliding Window Attention (R-SWA), illustrated in Figure 2."
  • SAM-ViT: The Vision Transformer backbone from Segment Anything used as part of the image encoder. "It cascades SAM-ViT [15] with CLIP-ViT [25]"
  • SGLang inference engine: An inference runtime used to deploy and optimize the model’s decoding with R-SWA. "along with corresponding support and optimizations in the SGLang inference engine."
  • Sliding Window Attention (SWA): An attention pattern that restricts attention to a moving window over recent tokens. "Compared to vanilla SWA, it preserves visual token fidelity"
  • Table TEDS: A table-structure evaluation metric based on tree-edit distance including content recognition. "Table TEDS (TEDS ↑)"
  • Table TEDS-S: A variant of TEDS that assesses table structure without content recognition. "Table TEDS-S (TEDS-S ↑)"
  • Text Edit Distance: A character-level accuracy metric for text recognition tasks. "Text Edit Distance (Edit ↓), which measures character-level accuracy for text recognition;"
  • Token compression ratio: The reduction factor from raw image tokens to compressed tokens produced by the encoder. "the encoder's token compression ratio is insufficient"
  • Tokens per second (TPS): A throughput metric indicating how many tokens are generated per second. "Unlimited OCR achieves 5580 TPS (tokens/s/512 concurrency) compared to DeepSeek OCR's 4951 TPS"
  • Vision-language models (VLMs): Models that jointly process images and text for tasks like OCR and parsing. "With the advancement of vision-language models (VLMs) [6, 8, 14, 16, 32]"
  • Window attention: Attention confined to local spatial or token windows, commonly used in ViTs for efficiency. "relies entirely on window attention to process the original image tokens"
  • Working memory: A bounded memory mechanism inspired by human cognition that keeps recent context while softly forgetting the distant past. "a model designed to emulate human parsing working memory."
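
The Distinct-n definition quoted above can be written as a short function. This is a minimal sketch under stated assumptions: the function name `distinct_n` is hypothetical, tokens are obtained by whitespace splitting (the paper's tokenization is not specified here), and returning 0.0 for text shorter than n is a convention chosen for this example.

```python
def distinct_n(tokens, n):
    """Ratio of unique n-grams to total n-grams in a token sequence."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # Collect every contiguous run of n tokens.
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # assumption: sequences shorter than n score 0.0
    return len(set(ngrams)) / len(ngrams)


# Usage: a repetitive output scores low, a varied one scores high.
repetitive = "a b a b a b".split()
varied = "the quick brown fox".split()
print(distinct_n(repetitive, 2))  # 2 unique bigrams out of 5 -> 0.4
print(distinct_n(varied, 2))      # 3 unique bigrams out of 3 -> 1.0
```

A low score on long outputs is a sign of repetition loops, which is why a diversity metric of this kind is useful when checking long-horizon generation.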
