Nemotron-Parse-1.1: Lightweight OCR & Doc Parser

Updated 26 November 2025
  • Nemotron-Parse-1.1 is a lightweight Vision-Language Transformer integrating OCR, markdown, and layout analysis for complex document parsing.
  • It employs an 885M parameter encoder–decoder architecture with efficient vision token compression to process dense documents at high throughput.
  • The model achieves competitive benchmark results in reading order, OCR accuracy, table extraction, and multilingual document understanding.

Nemotron-Parse-1.1 is a lightweight, end-to-end Vision-Language Transformer for document parsing and Optical Character Recognition (OCR), designed to process complex documents with dense content such as tables, diagrams, and rich formatting. The model advances the capabilities of its predecessor, Nemoretriever-Parse-1.0, by integrating tasks including plain OCR, markdown and LaTeX formatting, structured table extraction, bounding box detection, and multilingual dense text recognition. Built with an 885 million parameter encoder–decoder architecture, Nemotron-Parse-1.1 achieves competitive accuracy across public document understanding benchmarks while maintaining a tractable model size suitable for high-throughput production deployment. It is publicly released, including model weights, an optimized container, and a substantial subset of training data as part of the Nemotron-VLM-v2 dataset (Chumachenko et al., 25 Nov 2025).

1. Model Structure and Architecture

Nemotron-Parse-1.1 employs an encoder–decoder transformer architecture with a total of 885M parameters, partitioned as follows: 657M in the 40-layer ViT-H/16-based vision encoder, ~256M in a customized 10-layer mBART-derived language decoder, and <1M in the convolutional “neck” and multi-token heads. The vision encoder, initialized from RADIO v2.5, processes color images $I \in \mathbb{R}^{3 \times H \times W}$, splitting them into $16 \times 16$ non-overlapping patches and encoding them via a standard transformer pipeline with hidden dimension $d = 1024$. A lightweight neck composed of 1×4 horizontal convolutional kernels reduces the horizontal spatial dimension by 4×, compressing the sequence length (e.g., from ≈8,384 to 3,200 tokens for $1648 \times 2048$ images) and yielding the encoder output $Z_{\text{enc}} = \mathcal{N}(E(I)) \in \mathbb{R}^{N \times d}$, with $N = 3{,}200 + 1$ including a summary token.
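As an illustrative check of the compression arithmetic, the sketch below applies a 1×4 strided convolution that quarters the horizontal patch axis. The grid size is a toy assumption chosen so the output lands exactly on 3,200 tokens; the true patch grid for a 1648×2048 page differs slightly.

import torch
import torch.nn as nn

# Sketch of the conv-neck: a 1x4 kernel with stride (1, 4) leaves the
# vertical patch axis intact and shrinks the horizontal axis by 4x.
d = 1024
neck = nn.Conv2d(d, d, kernel_size=(1, 4), stride=(1, 4))

x = torch.randn(1, d, 100, 128)          # assumed 100x128 patch grid (12,800 patches)
z = neck(x)                              # -> (1, 1024, 100, 32)
tokens = z.flatten(2).transpose(1, 2)    # -> (1, 3200, 1024) vision tokens
print(tokens.shape)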

The language decoder employs weight tying in self-attention and feed-forward blocks and omits explicit 1D positional embeddings (“NoPE”), relying solely on the causal attention mask for ordering. Each layer contains standard self-attention, cross-attention, and feed-forward networks, with attention formulated as:

$$\mathrm{SA}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$

$$\mathrm{CA}(Q, K_{\text{enc}}, V_{\text{enc}}) = \mathrm{softmax}\!\left(\frac{Q W_q \left(K_{\text{enc}} W_k\right)^{T}}{\sqrt{d_k}}\right) V_{\text{enc}} W_v$$

Multi-token inference heads accelerate training by enabling parallel prediction of up to $m$ tokens per step via additional linear projections.
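A minimal sketch of such heads follows; the layer names and the choice of $m$ are assumptions, not the released architecture. Each head is an independent linear projection applied to the same decoder hidden states, so $m$ future tokens are predicted in one forward pass.

import torch
import torch.nn as nn

class MultiTokenHead(nn.Module):
    """Sketch of multi-token inference heads: m linear projections that
    predict the next m tokens in parallel from each decoder hidden state."""
    def __init__(self, d_model: int, vocab_size: int, m: int = 2):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab_size) for _ in range(m))

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model) -> logits: (batch, seq, m, vocab)
        return torch.stack([head(hidden) for head in self.heads], dim=2)

logits = MultiTokenHead(1024, 50_000, m=2)(torch.randn(1, 16, 1024))
print(logits.shape)  # torch.Size([1, 16, 2, 50000])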

2. Integrated OCR and Parsing Methodologies

Nemotron-Parse-1.1 unifies multiple document analysis subtasks through a prompt-conditioned interface. Three independent prompt tokens permit dynamic runtime control:

  • Text formatting: <output_markdown>, <output_plain>, <output_no_text>
  • Bounding box prediction: <predict_bbox>, <no_bbox>
  • Semantic class labeling: <predict_classes>, <no_classes>

The maximal-information prompt <output_markdown><predict_bbox><predict_classes> enables full-featured output, including markdown/LaTeX-formatted text, bounding boxes scaled to a $1024 \times 1280$ canonical grid, and semantic class predictions. For each semantic block $i$, the decoder predicts four normalized coordinates $(x_{1i}, y_{1i}, x_{2i}, y_{2i}) \in [0,1]$. The model outputs tables in LaTeX tabular format and can extract embedded text from diagrams or charts by processing these regions as semantic blocks.
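Because the coordinates are normalized to [0, 1], recovering pixel boxes on the source page is a plain rescale. The helper below is hypothetical and assumes the normalization is relative to the full page; in that case the 1024×1280 canonical grid divides out.

def bbox_to_pixels(box, img_w, img_h):
    """Map a normalized (x1, y1, x2, y2) box in [0, 1] back to pixel
    coordinates of the original image."""
    x1, y1, x2, y2 = box
    return (x1 * img_w, y1 * img_h, x2 * img_w, y2 * img_h)

# A block covering the top-left quadrant of a 1648x2048 page:
print(bbox_to_pixels((0.0, 0.0, 0.5, 0.5), 1648, 2048))  # (0.0, 0.0, 824.0, 1024.0)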

The bounding box regression loss, while not explicitly specified, typically adopts an L1 or IoU loss:

$$\mathcal{L}_{\text{box}}(b, \hat{b}) = \|b - \hat{b}\|_1 \quad \text{or} \quad 1 - \mathrm{IoU}(b, \hat{b})$$
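A sketch of both candidate losses in PyTorch; as hedged above, the released training recipe may use either, so this is a direct translation of the formulas rather than the confirmed objective.

import torch

def l1_box_loss(b, b_hat):
    # b, b_hat: (N, 4) boxes as (x1, y1, x2, y2); summed coordinate error, averaged
    return (b - b_hat).abs().sum(dim=-1).mean()

def iou_box_loss(b, b_hat, eps=1e-7):
    # 1 - IoU for axis-aligned boxes in (x1, y1, x2, y2) form
    ix1 = torch.max(b[:, 0], b_hat[:, 0])
    iy1 = torch.max(b[:, 1], b_hat[:, 1])
    ix2 = torch.min(b[:, 2], b_hat[:, 2])
    iy2 = torch.min(b[:, 3], b_hat[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area = lambda x: (x[:, 2] - x[:, 0]) * (x[:, 3] - x[:, 1])
    union = area(b) + area(b_hat) - inter
    return (1 - inter / (union + eps)).mean()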

Inline and display formulas are formatted in LaTeX, and outputs are convertible to alternative markup or structured content as required for downstream processing.

3. Long-Sequence Handling and Token Compression

Nemotron-Parse-1.1 is distinguished by several innovations enabling efficient handling of visually dense or long documents:

  • NoPE Decoder: By forgoing explicit positional embeddings in the decoder, the model generalizes to long contexts based on the monotonic signal from causal masking.
  • Vision Token Compression: The standard “conv-neck” reduces image patches to 3,200 vision tokens plus a summary token. The “TC” (Token Compression) variant applies a PixelShuffle operation, compressing the token count further from 3,200 to 833 (a ×16 reduction relative to the raw patch count), enabling faster inference; see the sketch below.
  • Multi-token Decoding: During training, the architecture predicts multiple tokens in parallel per step, increasing data throughput for dense outputs; at inference, single-token decoding remains available for maximal precision.

These features facilitate direct parsing of complex multi-page PDFs with upwards of 3,000 tokens per context.
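A minimal sketch of this style of token compression, using a space-to-depth rearrangement (PyTorch's PixelUnshuffle, the downsampling counterpart of PixelShuffle) followed by a projection back to the decoder width. The grid and compression factor here are assumptions for illustration; the released model maps 3,200 tokens to 833, a different factor than this toy example.

import torch
import torch.nn as nn

d, r = 1024, 2
unshuffle = nn.PixelUnshuffle(r)   # (B, d, H, W) -> (B, d*r*r, H/r, W/r)
proj = nn.Linear(d * r * r, d)     # fold 2x2 neighborhoods into channels, re-project

x = torch.randn(1, d, 100, 32)                 # assumed post-neck feature grid
z = unshuffle(x)                               # -> (1, 4096, 50, 16)
tokens = proj(z.flatten(2).transpose(1, 2))    # -> (1, 800, 1024) compressed tokens
print(tokens.shape)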

4. Benchmark Performance and Empirical Results

Nemotron-Parse-1.1 demonstrates competitive accuracy across several document parsing, OCR, and layout benchmarks:

| Method | WER↓ | F1↑ |
|---|---|---|
| Kosmos-2.5 (ocr) | 0.195 | 0.937 |
| Kosmos-2.5 (md) | 0.249 | 0.890 |
| GOT (ocr) | 0.302 | 0.818 |
| GOT (md) | 0.259 | 0.879 |
| Nemotron-Parse | 0.109 | 0.958 |
| Nemotron-Parse-TC | 0.111 | 0.953 |

For reading order, Nemotron-Parse-1.1 achieves a WER of 0.109 and F1 of 0.958, outperforming prior end-to-end and modular baselines (Chumachenko et al., 25 Nov 2025). On the GOT benchmark, Nemotron-Parse-1.1 reports OCR F1 of 0.9785 (Parse) and 0.9755 (TC), reading-order edit distances of 0.014, and BLEU scores of 0.9623 (Parse) and 0.9582 (TC).

On OmniDocBench v1.0 (English subset), reported as edit distances (lower is better):

| Model | Tokens | Overall | Text | Formula | Table | Order |
|---|---|---|---|---|---|---|
| SmolDocling | 392 | 0.493 | 0.262 | 0.753 | 0.729 | 0.227 |
| Nemotron-Parse | 3201 | 0.131 | 0.052 | 0.288 | 0.118 | 0.066 |
| Nemotron-Parse-TC | 833 | 0.129 | 0.055 | 0.295 | 0.121 | 0.048 |

For table extraction (TEDS / S-TEDS), Nemotron-Parse-1.1 attains 86.2/79.9 (RD-TableBench) and 81.3/94.0 (PubTabNet), indicating strong performance on structural and semantic table metrics.

Multilingual performance across English, German, French, Spanish, Chinese, and Japanese documents shows WER between 0.03 and 0.06, and F1 between 0.96 and 0.98.

5. Token Compression Variant: Nemotron-Parse-1.1-TC

Nemotron-Parse-1.1-TC leverages a PixelShuffle operation for vision token reduction (3,200→833), providing a roughly 20% throughput gain on NVIDIA H100 GPUs in bf16 precision (3,800→4,500 tok/s). The quality trade-off is minimal: F1 and overall metrics degrade by only 0.003–0.005, WER increases by 0.002, and BLEU diminishes by 0.004. This variant is intended for high-throughput settings where inference speed is at a premium, with only minor reductions in output fidelity (Chumachenko et al., 25 Nov 2025).

6. Pretraining Data and Model Release

Nemotron-Parse-1.1 is pretrained on approximately 22 million document images and pages sourced from:

  • NVpdftex (8.3M arXiv pages with precise LaTeX–image alignment)
  • SynthTabNet (480K synthetic tables)
  • DocLayNet (56K layout samples)
  • Common Crawl human-labeled pages (255K)
  • Synthetic multilingual dense text (≈3.5M)
  • Wikipedia OCR data (9.5M)
  • Additional sets: PubTabNet, FinTabNet, TabRecSet

A substantial portion of the synthetic and human-labeled data is released in the Nemotron-VLM-Dataset-v2 on HuggingFace. Model weights for both the full and TC variants are provided, along with an optimized NIM container for inference on NVIDIA hardware.
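A short sketch of pulling the released data with the `datasets` library; the repository id is inferred from the dataset name above and should be verified on HuggingFace, as should the available splits and configs.

from datasets import load_dataset

# Hypothetical repo id; streaming avoids downloading the full corpus up front.
ds = load_dataset("nvidia/Nemotron-VLM-Dataset-v2", split="train", streaming=True)
for sample in ds.take(1):
    print(sample.keys())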

7. Inference, Deployment, and Integration

Nemotron-Parse-1.1 performs optimally on A100/H100 GPUs in bf16, with cost-effective deployment facilitated by fp16 or (forthcoming) INT8 quantization. The model supports multi-token decoding and processes long-context, visually dense documents at approximately 4 pages/s (Parse) and 5 pages/s (TC). Integration with vLLM enables high-throughput batched inference, and inference acceleration can be achieved through NVIDIA TensorRT or FasterTransformer, which provide sub-millisecond cross-attention steps.

A typical prompt for maximal parsing functionality is <output_markdown><predict_bbox><predict_classes>. The following Python example sketches usage via HuggingFace Transformers; the repository id is taken from the model release, while the exact processor and generation arguments are illustrative and should be checked against the model card:

from transformers import AutoProcessor, AutoModelForSeq2SeqLM
from PIL import Image
import torch

proc = AutoProcessor.from_pretrained("nvidia/NVIDIA-Nemotron-Parse-v1.1")
model = AutoModelForSeq2SeqLM.from_pretrained(
    "nvidia/NVIDIA-Nemotron-Parse-v1.1", torch_dtype=torch.bfloat16).to("cuda")

image = Image.open("page.png").convert("RGB")  # render PDF pages to images first
prompt = "<output_markdown><predict_bbox><predict_classes>"
# The task prompt is passed through the processor; generate() has no prefix argument.
inputs = proc(images=image, text=prompt, return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=2048)
result = proc.batch_decode(out, skip_special_tokens=True)[0]

In summary, Nemotron-Parse-1.1 consolidates core document understanding tasks—OCR, layout analysis, structural extraction, multilingual content parsing—within a lightweight transformer framework, with optional token compression for efficient large-scale deployment (Chumachenko et al., 25 Nov 2025).

References

  • Chumachenko et al., 25 Nov 2025.