Nemotron-VLM-v2 Dataset Overview
- The Nemotron-VLM-v2 Dataset is a comprehensive multi-modal repository with 8M samples spanning document images, graphics, videos, and text sequences.
- It uses robust annotation schemas and preprocessing pipelines, including OCR verification and synthetic augmentation, to support reliable vision-language model training.
- Its balanced data splits and integrations from public crawls and established corpora enable advanced evaluations in tasks like image captioning, VQA, and document extraction.
The Nemotron-VLM-v2 Dataset is the open-sourced constituent of the training data used for developing the Nemotron Nano V2 VL vision-LLM suite. Designed to facilitate document understanding, long video comprehension, and reasoning across vision and text modalities, this dataset comprises 8 million meticulously curated samples—a scale and scope aimed at robust multi-domain performance in large vision-language transformers. Nemotron-VLM-v2 covers a diverse array of modalities and task categories, reflecting both established benchmarks and synthetic data pipelines. Its composition, annotation strategies, and preprocessing methodologies enable a high degree of compatibility with advanced architectures utilizing hybrid Mamba-Transformer LLMs and innovative token reduction techniques (NVIDIA et al., 6 Nov 2025).
1. Dataset Composition and Modalities
Nemotron-VLM-v2 consists of training samples spanning four principal modalities:
- Document images (scanned pages, charts, tables, GUI screenshots): ~45% (≈3.6 million)
- Web screenshots & graphics (infographics, diagrams): ~30% (≈2.4 million)
- Short videos (instructional, egocentric, recipes): ~20% (≈1.6 million)
- Text-only sequences (OCR transcripts, code/math Q&A): ~5% (≈0.4 million)
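The stated shares are mutually consistent with the 8-million-sample total. A minimal arithmetic check in plain Python (the label names are illustrative, not dataset field names):

```python
import math

# Consistency check: do the stated modality shares reproduce the 8M total?
TOTAL = 8_000_000
shares = {
    "document_images": 0.45,  # ~3.6M scanned pages, charts, tables, GUI shots
    "web_graphics":    0.30,  # ~2.4M screenshots, infographics, diagrams
    "short_videos":    0.20,  # ~1.6M instructional/egocentric/recipe clips
    "text_only":       0.05,  # ~0.4M OCR transcripts, code/math Q&A
}

assert math.isclose(sum(shares.values()), 1.0)
counts = {name: round(frac * TOTAL) for name, frac in shares.items()}
assert sum(counts.values()) == TOTAL
print(counts)
# {'document_images': 3600000, 'web_graphics': 2400000,
#  'short_videos': 1600000, 'text_only': 400000}
```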
Task coverage spans both classic and contemporary vision-language evaluation suites and problem domains:
- Image Captioning: e.g., OpenImages, TextCaps, PixMo-cap, TextVQA
- Visual Question Answering: e.g., VQAv2, OK-VQA, GQA, ScienceQA, OCR-VQA, DocVQA, ChartQA, InfoVQA, among others
- Visual Grounding: e.g., RefCOCO, Visual7W, ScreenQA
- OCR and Document Extraction: e.g., SynthDog-en, TextOCR, DocLayNet, WebSight, TabRecSet, FinTabNet, PubTables-1M
- Chart/Table Reasoning: e.g., ChartQA, PlotQA, DVQA, TabMWP, SimChart9K, AI2D, UniChart
- Video Captioning & QA: e.g., YouCook2, VaTeX, Localized Narratives, TVQA, TVQA+, CLEVRER, LLaVA-Video-178K
- Code/Math Reasoning: e.g., GPQA, LiveCodeBench, MATH-500, SciCode
This heterogeneous coverage is intended to ensure comprehensive training and evaluation for real-world document and video understanding tasks, as well as code and mathematical reasoning.
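To make the task coverage concrete, the sketch below shows what a single training record could look like for a document-VQA sample with grounding annotations. The field names and conversation format are assumptions for illustration only; they are not the dataset's published schema.

```python
# Hypothetical record layout (field names are assumptions, not the actual schema).
example_record = {
    "id": "docvqa_000123",
    "task": "visual_question_answering",
    "media": {"type": "image", "path": "images/docvqa/000123.png"},
    "conversations": [
        {"role": "user", "content": "<image>\nWhat is the invoice total?"},
        {"role": "assistant", "content": "$1,240.00"},
    ],
    # Optional grounding/structure annotations (OCR words, table cells, boxes)
    "annotations": [
        {"text": "TOTAL",     "bbox": [412, 880, 470, 902]},
        {"text": "$1,240.00", "bbox": [480, 880, 560, 902]},
    ],
}
```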
2. Data Sources and Collection Methodology
The dataset compiles samples from multiple sources, integrating both organic and synthetic data streams:
- Public Web-Scale Crawls:
- CommonCrawl PDF samples undergo human-verified OCR (via NVPDFTex) and multilingual translation (via mBART).
- Wikimedia dumps contribute sourced text transcripts.
- Established Vision-Language Corpora:
- Image data from OpenImages, TextCaps, TextVQA, PixMo-cap
- Visual QA from VQAv2, OK-VQA, GQA, CLEVR, ScienceQA, OCR-VQA, etc.
- Document/Table datasets: DocLayNet, SynthTabNet, TextOCR, TabRecSet, PubTables-1M, FinTabNet
- Chart: ChartQA, InfoVQA, FigureQA, PlotQA, AI2D, UniChart, SimChart9K
- GUI: ScreenQA, WaveUI-25K
- Video and Multi-Image Sources:
- Samples incorporated from benchmarks such as YouCook2, VaTeX, Localized Narratives, EgoExoLearn, TVQA, CLEVRER, Perception Test, Ego4D, and ActivityNet.
- Synthetic and Model-Augmented Contributions:
- Synthetic table and data visualizations via pipelines (e.g., SynthDog-en, NVPDFTex)
- Synthetic QA pairs generated by LLMs (Qwen2.5-VL, GLM-4.x, Qwen3) for under-labeled domains
This pipeline balances established human-annotated benchmarks with semi-automated synthetic generation that expands coverage.
3. Annotation and Preprocessing Protocols
Multiple annotation schemas and preprocessing routines are applied to standardize and optimize the multi-modal data:
- Annotation Schema:
- OCR: bounding boxes with transcript text
- Chart/Table: cell coordinates with structure graphs
- QA: (question, answer) pairs using both MCQ and open formats
- Visual Grounding: referring expressions paired with target bounding boxes
- Human Verification:
- Spot-checks for OCR and synthetic pipeline outputs
- Sanity filters eliminate low-confidence or corrupted records
- Synthetic Augmentation:
- LLM-generated step-by-step ("chain-of-thought") traces for STEM
- Template-driven QA generation
- Preprocessing Pipelines:
- Images: sliced into 512×512 tiles (maximum 12 tiles), plus a 512×512 thumbnail; pixel-shuffle downsampling from 1024→256 tokens/tile (see the sketch after this list)
- Videos: sampled at 2 fps, capped at 128 frames, with uniform sampling if duration > 64 seconds
- Text: SentencePiece/BPE with 32K vocabulary; online sequence packing via balance-aware buffered strategy
- SFT: Loss square-averaging to address sequence length bias
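The image-side preprocessing above can be sketched as follows. This is a simplified reading of the stated numbers (512×512 tiles, at most 12 tiles plus a thumbnail, pixel-shuffle from 1024 to 256 tokens per tile) using PIL and PyTorch; the grid-selection heuristic and function names are assumptions, not the released pipeline.

```python
from typing import List

import torch
from PIL import Image

TILE = 512        # tile side in pixels
MAX_TILES = 12    # cap on content tiles; a global thumbnail is appended

def tile_image(img: Image.Image) -> List[Image.Image]:
    """Split an image into <= MAX_TILES tiles of TILE x TILE plus a thumbnail."""
    w, h = img.size
    cols, rows = max(1, round(w / TILE)), max(1, round(h / TILE))
    while cols * rows > MAX_TILES:        # shrink the grid until it fits the cap
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    canvas = img.resize((cols * TILE, rows * TILE))
    tiles = [
        canvas.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
        for r in range(rows) for c in range(cols)
    ]
    tiles.append(img.resize((TILE, TILE)))  # 512x512 global thumbnail
    return tiles

def pixel_shuffle_tokens(feats: torch.Tensor, factor: int = 2) -> torch.Tensor:
    """Space-to-depth reduction of visual tokens: 32x32 (1024) -> 16x16 (256).

    feats: (num_tiles, 32, 32, hidden) -> (num_tiles, 16, 16, hidden * factor**2)
    """
    n, h, w, c = feats.shape
    feats = feats.reshape(n, h // factor, factor, w // factor, factor, c)
    feats = feats.permute(0, 1, 3, 2, 4, 5).contiguous()
    return feats.reshape(n, h // factor, w // factor, c * factor * factor)
```

With a 32×32 grid of patch features per 512×512 tile, the factor-2 shuffle yields 256 tokens per tile at four times the channel width, matching the stated 1024→256 reduction.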
4. Dataset Splits and Stratification
Official splits allocate data as follows:
| Split | Count |
|---|---|
| Training | 7,600,000 |
| Validation | 200,000 |
| Test | 200,000 |
Split stratification preserves modality ratios and task category breadth in each partition, with additional domain-specific holdouts:
- 10% of charts (ChartQA, PlotQA) reserved for a "chart-domain" test subset
- 10% of CommonCrawl PDF samples held out specifically for long-context document evaluation
This split structure supports generalization analyses both globally and within challenging content types.
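The exact stratification procedure is not published; the sketch below shows one generic way to produce modality- and task-preserving splits whose ratios (95% / 2.5% / 2.5%) match the reported 7.6M / 0.2M / 0.2M counts. The grouping key and function name are assumptions.

```python
import random
from collections import defaultdict

def stratified_split(records, key=lambda r: (r["modality"], r["task"]),
                     ratios=(0.95, 0.025, 0.025), seed=0):
    """Split records into train/val/test while preserving per-group proportions."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec)          # bucket by (modality, task)

    train, val, test = [], [], []
    for bucket in groups.values():
        rng.shuffle(bucket)
        n_train = int(len(bucket) * ratios[0])
        n_val = int(len(bucket) * ratios[1])
        train += bucket[:n_train]
        val += bucket[n_train:n_train + n_val]
        test += bucket[n_train + n_val:]
    return train, val, test
```

The domain-specific holdouts (the 10% chart and long-context PDF subsets) would be carved out before such a step so they never enter the training pool.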
5. Statistical Summaries
Detailed statistical characterization includes:
| Modality | Approximate Samples |
|---|---|
| Document images | 3.6 million |
| Graphics | 2.4 million |
| Video clips | 1.6 million |
| Text-only | 0.4 million |
Additional reported properties include the average document sequence length (in tokens), the average video duration and its standard deviation (in seconds), and the average number of frames per video.
The split counts sum to the dataset total:

$$
N_{\text{train}} + N_{\text{val}} + N_{\text{test}} = 7{,}600{,}000 + 200{,}000 + 200{,}000 = 8{,}000{,}000
$$
6. Usage Guidelines and Licensing
- License: CC-BY-NC-4.0, restricting use to non-commercial research and fine-tuning.
- Recommended Preprocessing:
- Resize images/tiles to 512×512 before processing with the vision encoder
- Sample videos at 2 fps, cap at 128 frames, and consider EVS pruning at 70–80% for efficient inference (a frame-sampling sketch follows these guidelines)
- Apply the provided SentencePiece (32K vocab) tokenizer and pixel-shuffle token reduction
- Best Practices:
- Utilize online sequence packing with balance-aware batch strategies for fine-tuning or evaluation
- Retain bounding box/segmentation annotations for grounding or structure extraction tasks
- Aggregate OCR outputs via Nemo Retriever Parse for multi-page or long-document tasks, maintaining original order
- Employ loss square-averaging to avoid disproportionate influence of atypical sequence lengths during supervised fine-tuning
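As a companion to the recommended video preprocessing above, here is a minimal sketch of a frame-index selector implementing the stated policy: 2 fps up to the 128-frame cap, with uniform sampling across the full clip for durations above 64 seconds. The function name and signature are illustrative, not part of the released tooling.

```python
from typing import List

def select_frame_indices(duration_s: float, native_fps: float,
                         sample_fps: float = 2.0, max_frames: int = 128) -> List[int]:
    """Pick frame indices: 2 fps for clips <= 64 s, else 128 uniform frames."""
    total_frames = max(1, int(duration_s * native_fps))
    if duration_s <= max_frames / sample_fps:           # <= 64 s: fixed-rate sampling
        step = native_fps / sample_fps
        n = max(1, int(duration_s * sample_fps))
        return [min(int(i * step), total_frames - 1) for i in range(n)]
    # > 64 s: cap at max_frames, spread uniformly over the clip
    return [int(i * (total_frames - 1) / (max_frames - 1)) for i in range(max_frames)]

# Example: a 30 s clip at 30 fps yields 60 indices (2 fps);
# a 10-minute clip yields exactly 128 uniformly spaced indices.
```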
These guidelines allow downstream users of the Nemotron-VLM-v2 Dataset to reproduce the data handling strategies underlying the Nemotron models, and keep benchmarking or fine-tuning studies consistent with the published baseline (NVIDIA et al., 6 Nov 2025).