Nemotron-VLM-v2 Dataset Overview
- The Nemotron-VLM-v2 Dataset is a comprehensive multi-modal repository with 8M samples spanning document images, graphics, videos, and text sequences.
- It uses robust annotation schemas and preprocessing pipelines, including OCR verification and synthetic augmentation, to support reliable vision-language model training.
- Its balanced data splits and integrations from public crawls and established corpora enable advanced evaluations in tasks like image captioning, VQA, and document extraction.
The Nemotron-VLM-v2 Dataset is the open-sourced constituent of the training data used for developing the Nemotron Nano V2 VL vision-LLM suite. Designed to facilitate document understanding, long video comprehension, and reasoning across vision and text modalities, this dataset comprises 8 million meticulously curated samples—a scale and scope aimed at robust multi-domain performance in large vision-language transformers. Nemotron-VLM-v2 covers a diverse array of modalities and task categories, reflecting both established benchmarks and synthetic data pipelines. Its composition, annotation strategies, and preprocessing methodologies enable a high degree of compatibility with advanced architectures utilizing hybrid Mamba-Transformer LLMs and innovative token reduction techniques (NVIDIA et al., 6 Nov 2025).
1. Dataset Composition and Modalities
Nemotron-VLM-v2 consists of training samples spanning four principal modalities:
- Document images (scanned pages, charts, tables, GUI screenshots): ~45% (≈3.6 million)
- Web screenshots & graphics (infographics, diagrams): ~30% (≈2.4 million)
- Short videos (instructional, egocentric, recipes): ~20% (≈1.6 million)
- Text-only sequences (OCR transcripts, code/math Q&A): ~5% (≈0.4 million)
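The stated shares are mutually consistent with the 8-million-sample total. A minimal arithmetic check in plain Python (the label names are illustrative, not dataset field names):

```python
import math

# Consistency check: do the stated modality shares reproduce the 8M total?
TOTAL = 8_000_000
shares = {
    "document_images": 0.45,  # ~3.6M scanned pages, charts, tables, GUI shots
    "web_graphics":    0.30,  # ~2.4M screenshots, infographics, diagrams
    "short_videos":    0.20,  # ~1.6M instructional/egocentric/recipe clips
    "text_only":       0.05,  # ~0.4M OCR transcripts, code/math Q&A
}

assert math.isclose(sum(shares.values()), 1.0)
counts = {name: round(frac * TOTAL) for name, frac in shares.items()}
assert sum(counts.values()) == TOTAL
print(counts)
# {'document_images': 3600000, 'web_graphics': 2400000,
#  'short_videos': 1600000, 'text_only': 400000}
```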
Task coverage spans both classic and contemporary vision-language evaluation suites and problem domains:
- Image Captioning: e.g., OpenImages, TextCaps, PixMo-cap, TextVQA
- Visual Question Answering: e.g., VQAv2, OK-VQA, GQA, ScienceQA, OCR-VQA, DocVQA, ChartQA, InfoVQA, among others
- Visual Grounding: e.g., RefCOCO, Visual7W, ScreenQA
- OCR and Document Extraction: e.g., SynthDog-en, TextOCR, DocLayNet, WebSight, TabRecSet, FinTabNet, PubTables-1M
- Chart/Table Reasoning: e.g., ChartQA, PlotQA, DVQA, TabMWP, SimChart9K, AI2D, UniChart
- Video Captioning & QA: e.g., YouCook2, VaTeX, Localized Narratives, TVQA, TVQA+, CLEVRER, LLaVA-Video-178K
- Code/Math Reasoning: e.g., GPQA, LiveCodeBench, MATH-500, SciCode
This heterogeneous coverage is intended to ensure comprehensive training and evaluation for real-world document and video understanding tasks, as well as code and mathematical reasoning.
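To make the task coverage concrete, the sketch below shows what a single training record could look like for a document-VQA sample with grounding annotations. The field names and conversation format are assumptions for illustration only; they are not the dataset's published schema.

```python
# Hypothetical record layout (field names are assumptions, not the actual schema).
example_record = {
    "id": "docvqa_000123",
    "task": "visual_question_answering",
    "media": {"type": "image", "path": "images/docvqa/000123.png"},
    "conversations": [
        {"role": "user", "content": "<image>\nWhat is the invoice total?"},
        {"role": "assistant", "content": "$1,240.00"},
    ],
    # Optional grounding/structure annotations (OCR words, table cells, boxes)
    "annotations": [
        {"text": "TOTAL",     "bbox": [412, 880, 470, 902]},
        {"text": "$1,240.00", "bbox": [480, 880, 560, 902]},
    ],
}
```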
2. Data Sources and Collection Methodology
The dataset compiles samples from multiple sources, integrating both organic and synthetic data streams:
- Public Web-Scale Crawls:
- CommonCrawl PDF samples undergo human-verified OCR (via NVPDFTex) and multilingual translation (via mBART).
- Wikimedia dumps contribute sourced text transcripts.
- Established Vision-Language Corpora:
- Image data from OpenImages, TextCaps, TextVQA, PixMo-cap
- Visual QA from VQAv2, OK-VQA, GQA, CLEVR, ScienceQA, OCR-VQA, etc.
- Document/Table datasets: DocLayNet, SynthTabNet, TextOCR, TabRecSet, PubTables-1M, FinTabNet
- Chart: ChartQA, InfoVQA, FigureQA, PlotQA, AI2D, UniChart, SimChart9K
- GUI: ScreenQA, WaveUI-25K
- Video and Multi-Image Sources:
- Samples incorporated from benchmarks such as YouCook2, VaTeX, Localized Narratives, EgoExoLearn, TVQA, CLEVRER, Perception Test, Ego4D, and ActivityNet.
- Synthetic and Model-Augmented Contributions:
- Synthetic table and data visualizations via pipelines (e.g., SynthDog-en, NVPDFTex)
- Synthetic QA pairs generated by LLMs (Qwen2.5-VL, GLM-4.x, Qwen3) for under-labeled domains
This pipeline balances established human-annotated benchmarks with semi-automated synthetic generation that expands coverage.
3. Annotation and Preprocessing Protocols
Multiple annotation schemas and preprocessing routines are applied to standardize and optimize the multi-modal data:
- Annotation Schema:
- OCR: bounding boxes with transcript text
- Chart/Table: cell coordinates with structure graphs
- QA: (question, answer) pairs using both MCQ and open formats
- Visual Grounding: referring expressions paired with target bounding boxes
- Human Verification:
- Spot-checks for OCR and synthetic pipeline outputs
- Sanity filters eliminate low-confidence or corrupted records
- Synthetic Augmentation:
- LLM-generated step-by-step ("chain-of-thought") traces for STEM
- Template-driven QA generation
- Preprocessing Pipelines:
- Images: sliced into 512×512 tiles (maximum 12 tiles), plus a 512×512 thumbnail; pixel-shuffle downsampling from 1024→256 tokens/tile (see the sketch after this list)
- Videos: sampled at 2 fps, capped at 128 frames, with uniform sampling if duration > 64 seconds
- Text: SentencePiece/BPE with 32K vocabulary; online sequence packing via balance-aware buffered strategy
- SFT: Loss square-averaging to address sequence length bias
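The image-side preprocessing above can be sketched as follows. This is a simplified reading of the stated numbers (512×512 tiles, at most 12 tiles plus a thumbnail, pixel-shuffle from 1024 to 256 tokens per tile) using PIL and PyTorch; the grid-selection heuristic and function names are assumptions, not the released pipeline.

```python
from typing import List

import torch
from PIL import Image

TILE = 512        # tile side in pixels
MAX_TILES = 12    # cap on content tiles; a global thumbnail is appended

def tile_image(img: Image.Image) -> List[Image.Image]:
    """Split an image into <= MAX_TILES tiles of TILE x TILE plus a thumbnail."""
    w, h = img.size
    cols, rows = max(1, round(w / TILE)), max(1, round(h / TILE))
    while cols * rows > MAX_TILES:        # shrink the grid until it fits the cap
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    canvas = img.resize((cols * TILE, rows * TILE))
    tiles = [
        canvas.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
        for r in range(rows) for c in range(cols)
    ]
    tiles.append(img.resize((TILE, TILE)))  # 512x512 global thumbnail
    return tiles

def pixel_shuffle_tokens(feats: torch.Tensor, factor: int = 2) -> torch.Tensor:
    """Space-to-depth reduction of visual tokens: 32x32 (1024) -> 16x16 (256).

    feats: (num_tiles, 32, 32, hidden) -> (num_tiles, 16, 16, hidden * factor**2)
    """
    n, h, w, c = feats.shape
    feats = feats.reshape(n, h // factor, factor, w // factor, factor, c)
    feats = feats.permute(0, 1, 3, 2, 4, 5).contiguous()
    return feats.reshape(n, h // factor, w // factor, c * factor * factor)
```

With a 32×32 grid of patch features per 512×512 tile, the factor-2 shuffle yields 256 tokens per tile at four times the channel width, matching the stated 1024→256 reduction.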
4. Dataset Splits and Stratification
Official splits allocate data as follows:
| Split | Count |
|---|---|
| Training | 7,600,000 |
| Validation | 200,000 |
| Test | 200,000 |
Split stratification preserves modality ratios and task category breadth in each partition, with additional domain-specific holdouts:
- 10% of charts (ChartQA, PlotQA) reserved for a "chart-domain" test subset
- 10% of CommonCrawl PDF samples held out specifically for long-context document evaluation
This split structure supports generalization analyses both globally and within challenging content types.
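The exact stratification procedure is not published; the sketch below shows one generic way to produce modality- and task-preserving splits whose ratios (95% / 2.5% / 2.5%) match the reported 7.6M / 0.2M / 0.2M counts. The grouping key and function name are assumptions.

```python
import random
from collections import defaultdict

def stratified_split(records, key=lambda r: (r["modality"], r["task"]),
                     ratios=(0.95, 0.025, 0.025), seed=0):
    """Split records into train/val/test while preserving per-group proportions."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec)          # bucket by (modality, task)

    train, val, test = [], [], []
    for bucket in groups.values():
        rng.shuffle(bucket)
        n_train = int(len(bucket) * ratios[0])
        n_val = int(len(bucket) * ratios[1])
        train += bucket[:n_train]
        val += bucket[n_train:n_train + n_val]
        test += bucket[n_train + n_val:]
    return train, val, test
```

The domain-specific holdouts (the 10% chart and long-context PDF subsets) would be carved out before such a step so they never enter the training pool.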
5. Statistical Summaries
Detailed statistical characterization includes:
| Modality | Approximate Samples |
|---|---|
| Document images | 3.6 million |
| Graphics | 2.4 million |
| Video clips | 1.6 million |
| Text-only | 0.4 million |
Additional reported properties include the average document sequence length (in tokens), the average video duration and its standard deviation (in seconds), and the average number of frames per video.
The split counts sum to the dataset total:

$$
N_{\text{train}} + N_{\text{val}} + N_{\text{test}} = 7{,}600{,}000 + 200{,}000 + 200{,}000 = 8{,}000{,}000
$$
6. Usage Guidelines and Licensing
- License: CC-BY-NC-4.0, restricting use to non-commercial research and fine-tuning.
- Recommended Preprocessing:
- Resize images/tiles to 512×512 before processing with the vision encoder
- Sample videos at 2 fps, cap at 128 frames, and consider EVS pruning at 70–80% for efficient inference (a frame-sampling sketch follows these guidelines)
- Apply the provided SentencePiece (32K vocab) tokenizer and pixel-shuffle token reduction
- Best Practices:
- Utilize online sequence packing with balance-aware batch strategies for fine-tuning or evaluation
- Retain bounding box/segmentation annotations for grounding or structure extraction tasks
- Aggregate OCR outputs via Nemo Retriever Parse for multi-page or long-document tasks, maintaining original order
- Employ loss square-averaging to avoid disproportionate influence of atypical sequence lengths during supervised fine-tuning
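As a companion to the recommended video preprocessing above, here is a minimal sketch of a frame-index selector implementing the stated policy: 2 fps up to the 128-frame cap, with uniform sampling across the full clip for durations above 64 seconds. The function name and signature are illustrative, not part of the released tooling.

```python
from typing import List

def select_frame_indices(duration_s: float, native_fps: float,
                         sample_fps: float = 2.0, max_frames: int = 128) -> List[int]:
    """Pick frame indices: 2 fps for clips <= 64 s, else 128 uniform frames."""
    total_frames = max(1, int(duration_s * native_fps))
    if duration_s <= max_frames / sample_fps:           # <= 64 s: fixed-rate sampling
        step = native_fps / sample_fps
        n = max(1, int(duration_s * sample_fps))
        return [min(int(i * step), total_frames - 1) for i in range(n)]
    # > 64 s: cap at max_frames, spread uniformly over the clip
    return [int(i * (total_frames - 1) / (max_frames - 1)) for i in range(max_frames)]

# Example: a 30 s clip at 30 fps yields 60 indices (2 fps);
# a 10-minute clip yields exactly 128 uniformly spaced indices.
```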
These guidelines allow downstream users of the Nemotron-VLM-v2 Dataset to reproduce the data handling strategies underlying the Nemotron models, and keep benchmarking or fine-tuning studies consistent with the published baseline (NVIDIA et al., 6 Nov 2025).