Nemotron-Parse-1.1-TC: Optimized OCR & Parsing
- Nemotron-Parse-1.1-TC is an optimized document parsing and OCR model that leverages token compression to reduce vision token length and boost throughput.
- It features an encoder-decoder architecture with 885M parameters, integrating a Vision Encoder, a convolutional Vision Neck, and a 10-layer mBART Language Decoder.
- Achieving up to a 20% inference speedup and less than 1 percentage point accuracy loss across tasks, it is ideal for high-throughput document ingestion.
Nemotron-Parse-1.1-TC is an optimized document parsing and OCR model designed to deliver high throughput with minimal quality trade-offs relative to the standard Nemotron-Parse-1.1 backbone. As a variant leveraging token compression via reduced vision token length, it offers significant computational speedups while preserving the ability to extract structured content—including markdown-formatted text, table structure, bounding boxes, and semantic classes—from visually dense and complex documents (Chumachenko et al., 25 Nov 2025).
1. Model Composition and Architecture
Nemotron-Parse-1.1-TC is built on an encoder-decoder backbone totaling 885 million parameters. The architecture consists of:
- Vision Encoder: RADIO (ViT-H/16), comprising 657M parameters, responsible for initial visual feature extraction.
- Vision Neck: a horizontally oriented convolutional sequence (1×4 convolutions, stride 1×4) that transforms the patch sequence produced by the encoder into a reduced set of 3200 tokens, to which RADIO's summary token is appended.
- Language Decoder: a 10-layer mBART model with tied weights, totaling 256M parameters, which decodes the reduced visual tokens into text, markup, or structured outputs.
- Token Compression (TC) Modification: after the neck, a pixel-shuffle operation downsamples the 3200-token sequence to a compressed 833-token representation. Except for this token reduction step, all model components, including the multi-token inference heads, NoPE decoder, and prompt/output handlers, remain unaltered (a sketch of the neck operation follows this list).
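To make the Vision Neck concrete, the sketch below shows the generic pattern of a 1×4 convolution with stride 1×4 collapsing the encoder's patch grid horizontally by a factor of four. The grid size and channel widths are illustrative assumptions, not the released model's actual configuration; only the 3200-token output count is taken from this section.

```python
import torch
import torch.nn as nn

# Illustrative sketch of a horizontally oriented convolutional neck:
# a 1x4 convolution with stride 1x4 merges every four horizontally adjacent
# ViT patch tokens into one neck token. Grid size and channel widths are
# assumptions for demonstration, not the released model's configuration.
class HorizontalConvNeck(nn.Module):
    def __init__(self, in_channels: int = 1280, out_channels: int = 1280):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels,
                              kernel_size=(1, 4), stride=(1, 4))

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, channels, grid_h, grid_w) -> (batch, channels, grid_h, grid_w // 4)
        return self.conv(patches)

neck = HorizontalConvNeck()
vit_patches = torch.randn(1, 1280, 80, 160)   # hypothetical 80 x 160 = 12800 patch grid
neck_tokens = neck(vit_patches)               # -> (1, 1280, 80, 40) = 3200 neck tokens
print(neck_tokens.flatten(2).shape[-1])       # 3200
```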
2. Vision Token Compression Mechanism
Token compression in Nemotron-Parse-1.1-TC is achieved via a pixel-shuffle downsampling applied after feature extraction and neck reduction:
- The token sequence entering the compression step has length $N_{\text{full}} = 3200$ (plus RADIO's summary token).
- Pixel-shuffle reduces this to $N_{\text{TC}} = 833$, such that $N_{\text{TC}}/N_{\text{full}} \approx 0.26$.
- This operation reorganizes spatially contiguous channel dimensions to form lower-resolution token positions, compressing the post-neck RADIO ViT sequence by roughly $4\times$ and yielding approximately $74\%$ fewer tokens than the uncompressed model (a minimal sketch of this pattern follows this list).
- All inference heads and decoding mechanisms remain unchanged, maintaining backward compatibility in prompt and output interface design.
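The snippet below illustrates the generic pixel-shuffle (space-to-depth) pattern on a token sequence: spatially adjacent tokens are folded into the channel dimension. The grid shape, shuffle factor, and channel width are assumptions for demonstration; the released model's exact configuration, and its handling of the extra tokens that bring the count to 833, are not shown.

```python
import torch

def pixel_shuffle_tokens(x: torch.Tensor, h: int, w: int, r: int = 2) -> torch.Tensor:
    """Fold r x r neighborhoods of vision tokens into single, wider tokens.

    x: (batch, h*w, c) token sequence laid out on an h x w grid.
    Returns (batch, (h//r) * (w//r), c * r * r).
    Generic VLM-style pixel shuffle; the production grid shape, shuffle factor,
    and handling of extra tokens (e.g. the summary token) may differ.
    """
    b, n, c = x.shape
    assert n == h * w and h % r == 0 and w % r == 0
    x = x.view(b, h // r, r, w // r, r, c)           # split the grid into r x r blocks
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()     # (b, h/r, w/r, r, r, c)
    return x.view(b, (h // r) * (w // r), r * r * c)

# Hypothetical 80 x 40 = 3200-token grid compressed 4x to 800 tokens; the released
# model reports 833 tokens, so a small number of extra tokens fall outside this view.
tokens = torch.randn(1, 3200, 1280)
compressed = pixel_shuffle_tokens(tokens, h=80, w=40, r=2)
print(compressed.shape)   # torch.Size([1, 800, 5120])
```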
3. Comparative Performance and Throughput
Nemotron-Parse-1.1-TC preserves the core accuracy profile of the full model while achieving notable inference speed gains. Key metrics, as reported in direct comparison, are summarized below.
| Metric/Benchmark | Full Model | TC Variant | Absolute Δ | Δ (%) |
|---|---|---|---|---|
| OCR WER (no mask) | 0.109 | 0.111 | +0.002 | +1.8% |
| OCR F1 (no mask) | 0.958 | 0.953 | -0.005 | -0.5% |
| GOT OCR F1 | 0.9785 | 0.9755 | -0.0030 | -0.31% |
| RO EditDist. | 0.014 | 0.014 | 0.000 | 0.0% |
| PubTabNet TEDS | 81.3 | 80.9 | -0.4 | -0.49% |
| OmniDocBench TEDS | 82.68 | 84.73 | +2.05 | +2.48% |
| Throughput (tok/s, H100, bf16) | 3800 | 4500 | +700 | +18.4% |
Across all tasks—including OCR, reading order, table parsing, and markdown formatting—the TC variant consistently exhibits less than 1 percentage point (pp) absolute accuracy degradation. Notably, on English OmniDocBench, TC outperforms the full model in both TEDS and S-TEDS, with reading order “order” error reduced from 0.066 to 0.048.
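As a quick consistency check, the Δ (%) column can be recomputed from the reported absolute values; the short script below does exactly that, using the figures copied from the table and the relative change (TC − full) / full.

```python
# Recompute the relative deltas of the comparison table from the reported values.
rows = {
    "OCR WER (no mask)":        (0.109, 0.111),
    "OCR F1 (no mask)":         (0.958, 0.953),
    "GOT OCR F1":               (0.9785, 0.9755),
    "PubTabNet TEDS":           (81.3, 80.9),
    "OmniDocBench TEDS":        (82.68, 84.73),
    "Throughput (tok/s, H100)": (3800, 4500),
}

for name, (full, tc) in rows.items():
    delta = tc - full
    rel = delta / full * 100
    print(f"{name:26s} delta = {delta:+9.4f}   ({rel:+.2f}%)")
```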
4. Quality Retention Across Subtasks
Quality impact from token compression remains minimal, with degradations predominantly below actionable thresholds for most deployment scenarios:
- Negligible loss (a few tenths of a pp at most): OCR F1, METEOR, BLEU, S-TEDS.
- Moderate impact (up to 1.0 pp): masked OCR F1, RD-TableBench TEDS.
- Stability: reading order and multilingual OCR (F1 near 0.95 across seven languages) show comparable performance between the TC and full variants; OmniDocBench scores show an improvement in structure parsing by the TC model.
- Speedup: TC attains approximately a 20% increase in throughput and reduced memory requirements, with single-GPU throughput rising from 3800 to 4500 tokens/sec (bf16, H100).
Tasks most stable under compression are those involving reading order and structure, with markdown-mode F1 reduction limited to 0.8 pp. A plausible implication is that high-level document structure is robust to moderate reductions in token granularity.
5. Public Release and Deployment Considerations
Nemotron-Parse-1.1-TC model weights are distributed in both fp32 and bf16 precision and are accessible for research and deployment via Huggingface at https://huggingface.co/nvidia/NVIDIA-Nemotron-Parse-v1.1-TC, with support for vLLM inference acceleration. Both the full model and the token-compressed TC variant integrate into end-user pipelines through the maximal-information prompt:

`<output_markdown><predict_bbox><predict_classes>`
This enables extraction of markdown-formatted text, bounding box coordinates, and semantic class assignments in a unified output. The optimized NIM container is available for the full model (https://build.nvidia.com/nvidia/nemotron-parse). The training corpus, released as a subset of Nemotron-VLM-Dataset-v2, encompasses synthetic tables, multilingual OCR data, NVpdftex pipeline outputs, and public layout datasets.
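For experimentation outside an optimized serving stack such as vLLM or the NIM container, a plain transformers workflow can be sketched as follows. This is a hedged illustration, not the official usage example: the loading classes, preprocessing, and generation arguments are assumptions, and the Huggingface model card remains the authoritative reference for the supported API.

```python
# Minimal inference sketch. Assumptions: the checkpoint exposes a standard
# transformers image-to-text interface (AutoProcessor / AutoModelForVision2Seq
# with trust_remote_code); the exact classes, preprocessing, and generation
# settings may differ from the official usage example on the model card.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

MODEL_ID = "nvidia/NVIDIA-Nemotron-Parse-v1.1-TC"
PROMPT = "<output_markdown><predict_bbox><predict_classes>"   # maximal-information prompt

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,        # bf16 weights are published alongside fp32
    trust_remote_code=True,
).to("cuda").eval()

image = Image.open("page.png").convert("RGB")                 # any document page image
inputs = processor(images=image, text=PROMPT, return_tensors="pt")
inputs = {k: v.to("cuda") for k, v in inputs.items()}
if "pixel_values" in inputs:                                  # match the model's bf16 weights
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)

with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=4096)

# The decoded string interleaves markdown text, bounding boxes, and class tags.
print(processor.batch_decode(output_ids, skip_special_tokens=False)[0])
```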
6. Applications and Suitability
Nemotron-Parse-1.1-TC is appropriate when low latency and reduced hardware footprint are prioritized without substantial sacrifice of accuracy. Its deployment is particularly suited to scenarios involving high-throughput document ingestion, visually dense input, and large-scale batch inference. The ≈ 20% speedup relative to the full variant, with less than 1 pp average accuracy loss, positions it as an effective trade-off in time- or resource-constrained environments (Chumachenko et al., 25 Nov 2025).