
FineVision Dataset for VLM Research

Updated 27 November 2025
  • FineVision is a large-scale, curated open-access corpus unifying 24.3 million samples from over 200 datasets into 185 standardized subsets for VLM research.
  • It employs a semi-automated, human-in-the-loop pipeline for schema standardization, cleaning, de-duplication, and contamination filtering, ensuring high data quality.
  • Models trained on FineVision show significant benchmark improvements, highlighting the benefits of scale, stringent data hygiene, and human-verified annotations.

FineVision is a large-scale, open-access corpus specifically designed to address data fragmentation and contamination in vision–language model (VLM) research. Comprising 24.3 million samples unified from more than 200 public datasets into 185 standardized subsets, FineVision establishes a single, rigorously curated foundation for training and evaluating VLMs across conversational, grounding, document QA, scientific/technical, chart/table, and agentic GUI-automation tasks. Curation is enforced via a semi-automated, human-in-the-loop pipeline emphasizing schema standardization, comprehensive cleaning, de-duplication, and de-contamination against public test sets. Models trained on FineVision consistently exceed prior open-mixture baselines in benchmark performance, demonstrating the empirical benefits of scale, data hygiene, and human-verified annotation (Wiedmann et al., 20 Oct 2025).

1. Corpus Architecture and Composition

FineVision unifies diverse modalities and task types under a common schema:

  • Scale & Statistics:

| Metric                                      | Value        |
|---------------------------------------------|--------------|
| Total samples                               | 24.3 million |
| Total images                                | 17.3 million |
| Total user/assistant turns (conversations)  | 88.9 million |
| Total answer tokens                         | 9.5 billion  |
| Unique public sources                       | >200         |
| Unified canonical subsets                   | 185          |

  • Modalities and Tasks:
  1. Image–text conversations: Encompasses visual question answering (VQA), captioning, referring expression grounding, document and OCR QA, chart/table interpretation, science QA, and general image classification tasks.
  2. Agentic/GUI automation trajectories: Includes datasets for mobile and desktop UI automation.
  • Data Schema: Each entry is a JSON object comprising images (URLs or byte-encoded), an ordered list of texts ({role: "user"/"assistant", content}), original source identifiers, and rich metadata (including bounding boxes, confidence, original annotations, and quality scores).
  • Agentic Action-Space: All GUI automation tasks are mapped to a single, typed Python-style API including actions such as click(x: float, y: float), long_press(x: float, y: float), swipe(from_xy: (float,float), to_xy: (float,float)), type(text: str), open_app(app_name: str), navigate_back(), and wait(seconds: int).
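The unified action space above can be pictured as a set of Python type stubs. The following is a minimal sketch: the signatures mirror the published action list, while the docstring-free stub bodies and the `Point` alias are illustrative assumptions rather than the released schema.

```python
from typing import Tuple

Point = Tuple[float, float]  # (x, y) screen coordinates; the alias is an assumed convention

def click(x: float, y: float) -> None: ...
def long_press(x: float, y: float) -> None: ...
def swipe(from_xy: Point, to_xy: Point) -> None: ...
def type(text: str) -> None: ...          # shadows the builtin; kept to match the published action name
def open_app(app_name: str) -> None: ...
def navigate_back() -> None: ...
def wait(seconds: int) -> None: ...
```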

2. Curation Pipeline and Methodology

The FineVision pipeline is structured into four principal stages, all supervised by human review at critical checkpoints:

  1. Bulk Ingestion & Canonicalization: Acquisition of raw data from Hugging Face, GitHub, institutional repositories, and project websites. Dataset-specific extractor scripts transform and extract images plus annotations, subject to reviewer audit.
  2. Schema Mapping & Conversion:
    • Utilization of LLMs (e.g., Claude) for reverse-engineering source annotation semantics.
    • Application of six core conversational templates across task families.
    • Conversion scripts generate dry-runs; reviewers audit for semantic fidelity and diversity, remediating discovered issues iteratively.
  3. Cleaning & Validation:
    • Images: Robust decoding, elimination of corrupted/zero-byte files, EXIF orientation normalization, RGB conversion, and capping at a maximum side length of 2048 px (a minimal sketch of these operations follows this list).
    • Text: UTF-8 enforcement, control character stripping, punctuation normalization, collapsing repeated tokens, and capping turn length at 8192 tokens.
    • Metadata: Comprehensive retention of coordinates, confidence, and licensing.
  4. De-duplication & Contamination Filtering:
    • All images are embedded via SSCD descriptors (Pizzi et al., CVPR ’22), with cosine similarity clustering (threshold τ = 0.95) for intra-corpus duplicate grouping.
    • Cross-dataset de-contamination: 66 public benchmark test sets are embedded, and any training image whose cosine similarity to a test instance is $\geq \tau$ is flagged for removal.
    • Reviewer oversight covers all flagged clusters and potentially contaminated subsets.
  • Turn-level Quality Assessment: Automated scoring of (question, answer) pairs via LLM/VLM judges (Qwen3-32B for textual, Qwen2.5VL-32B for vision); criteria include formatting, relevance, visual dependency, and image-question correspondence on a 1–5 scale. Reviewer spot-checks ensure calibration of scripts.
  • Human Effort: Over 200 custom converter scripts reviewed; each dataset typically underwent 1–2 dedicated audit cycles. The full process for all 185 subsets spanned less than 6 months of cumulative cross-functional labor.
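The image-cleaning operations in step 3 can be sketched with Pillow as shown below. This is a minimal sketch under stated assumptions: the function name, error handling, and resampling choice are illustrative, not the pipeline's actual implementation; only the cleaning rules themselves (decoding check, EXIF normalization, RGB conversion, 2048 px side cap) come from the source.

```python
from PIL import Image, ImageOps

MAX_SIDE = 2048  # maximum allowed side length from the cleaning spec

def clean_image(path: str) -> Image.Image | None:
    """Return a cleaned RGB image, or None if the file is corrupted or unreadable."""
    try:
        img = Image.open(path)
        img.load()                       # force a full decode to catch truncated files
    except (OSError, ValueError):
        return None                      # corrupted or zero-byte file: drop the sample
    img = ImageOps.exif_transpose(img)   # normalize EXIF orientation
    img = img.convert("RGB")
    if max(img.size) > MAX_SIDE:         # cap the longer side at 2048 px, keep aspect ratio
        scale = MAX_SIDE / max(img.size)
        new_size = (round(img.width * scale), round(img.height * scale))
        img = img.resize(new_size, Image.Resampling.LANCZOS)
    return img
```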

3. De-duplication and Benchmark De-contamination

FineVision enforces dataset hygiene through systematic duplication and contamination removal, crucial for valid benchmarking and generalization:

  • Descriptor and Similarity Functions: Each image is embedded using SSCD, and clusters are formed by thresholded cosine similarity ($\tau = 0.95$).
    • Intra-Dataset Deduplication (pseudocode):

embeddings = {i: SSCD_embed(i) for i in finevision_images}
for idx, i in enumerate(finevision_images):
    for j in finevision_images[idx + 1:]:                   # each unordered pair once
        if cosine(embeddings[i], embeddings[j]) >= tau:     # tau = 0.95
            merge_clusters(i, j)   # assign i and j to the same duplicate cluster

  • Cross-Benchmark De-contamination (pseudocode):

benchmark_embeddings = [SSCD_embed(t) for B_k in benchmarks for t in B_k]   # 66 public test sets
for i in train_images:
    e_i = SSCD_embed(i)
    if max(cosine(e_i, e_t) for e_t in benchmark_embeddings) >= tau:        # tau = 0.95
        flag_contaminated(i)   # flagged images go to reviewer oversight before removal

  • Empirical Contamination Rates:

| Dataset      | Test-set overlap (%) | Score drop after decontaminated retraining (pp) |
|--------------|----------------------|--------------------------------------------------|
| The Cauldron | 3.05                 | 2.8                                              |
| LLaVA-Vision | 2.15                 | 2.7                                              |
| Cambrian-7M  | 2.29                 | 3.7                                              |
| FineVision   | 1.02                 | 1.6                                              |

These rates indicate FineVision exhibits lower contamination relative to prior open mixtures, a factor confirmed by smaller evaluation drops under controlled decontamination (Wiedmann et al., 20 Oct 2025).

4. Safety, Diversity, and Quality Control

Strict data governance underpins FineVision’s curation philosophy:

  • Safety and Licensing: All unsafe or NSFW samples are filtered. The original dataset license is strictly preserved for every subset, disallowing re-licensing.
  • Diversity Metrics:
    • Subsets are divided into nine supercategories (Captioning, Grounding, General VQA, Charts, OCR QA, Science, Math, Text-only, Chart/Table).
    • Visual diversity is quantified using spectral statistics of the SSCD embeddings:
      • Effective rank: $r_{\mathrm{eff}} = \exp(H(p))$, where $H(p)$ is the entropy of the normalized eigenvalue spectrum.
      • Participation ratio: $\mathrm{PR} = \left(\sum_i \lambda_i\right)^2 / \sum_i \lambda_i^2$, where $\lambda_i$ are the eigenvalues.

With $r_{\mathrm{eff}} = 359.22$ and $\mathrm{PR} = 182.52$, FineVision matches or exceeds prior open mixtures in embedding-space breadth and balance: The Cauldron (324, 129), LLaVA-OneVision (267, 87), and Cambrian-7M (359.7, 152).
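A minimal numpy sketch of the two spectral statistics, computed from a matrix of SSCD descriptors; taking the eigenvalues from the centered embedding covariance is an assumption about the exact construction.

```python
import numpy as np

def diversity_stats(embeddings: np.ndarray) -> tuple[float, float]:
    """embeddings: (n_images, d) SSCD descriptors; returns (effective rank, participation ratio)."""
    X = embeddings - embeddings.mean(axis=0)                 # center the descriptors
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))    # eigenvalue spectrum of the covariance
    eigvals = np.clip(eigvals, 0.0, None)                    # guard against tiny negative values
    p = eigvals / eigvals.sum()                              # normalized spectrum
    entropy = -(p[p > 0] * np.log(p[p > 0])).sum()
    r_eff = float(np.exp(entropy))                           # effective rank = exp(H(p))
    pr = float(eigvals.sum() ** 2 / (eigvals ** 2).sum())    # participation ratio
    return r_eff, pr
```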

  • Quality Control:
    • Quality scores per turn (1–5) show that 97.2% of turns score $\geq 4$ for formatting and 85% score $\geq 4$ for relevance (a judging sketch follows this list).
    • After ablation, visual-dependency and image–question-correspondence scores are retained for analysis rather than used for aggressive filtering.
    • At least 100 samples per subset are manually audited, with additional inspection for agentic datasets (including execution fidelity).
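The turn-level scoring described above (four 1–5 criteria judged by Qwen3-32B for text and Qwen2.5VL-32B for vision) might be orchestrated roughly as follows. The prompt wording, the `call_judge` helper, and the JSON response contract are hypothetical, not the released scripts; only the criteria names and the 1–5 scale come from the source.

```python
import json

CRITERIA = ["formatting", "relevance", "visual_dependency", "image_question_correspondence"]

def score_turn(question: str, answer: str, call_judge) -> dict[str, int]:
    """Ask an LLM/VLM judge (hypothetical call_judge(prompt) -> str) for 1-5 ratings per criterion."""
    prompt = (
        "Rate the following question/answer pair on a 1-5 scale for each criterion: "
        + ", ".join(CRITERIA) + ".\n"
        + f"Question: {question}\nAnswer: {answer}\n"
        + 'Reply with JSON only, e.g. {"formatting": 5, "relevance": 4, ...}.'
    )
    ratings = json.loads(call_judge(prompt))
    return {c: int(ratings[c]) for c in CRITERIA}
```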

5. Evaluation Methodology and Results

FineVision’s utility is assessed via rigorous comparative experiments:

  • Model Training: SmolVLM 460M (SmolLM2-360M text backbone, SigLIP2-Base-512 visual encoder), trained for 20,000 steps with batch size 512 on 32 × H100 GPUs (1 epoch, ≈20 hours); a configuration sketch follows this list.
  • Baselines: Compared against The Cauldron (2M images), LLaVA-OneVision (2.5–3.9M), and Cambrian-7M (5.4–7.1M).
  • Benchmark Suite: 11 lmms-eval datasets: AI2D, ChartQA, DocVQA, InfoVQA, MME, MMMU, ScienceQA, MMStar, OCRBench, TextVQA, SEED-Bench.
  • Performance:
    • FineVision-trained models outperform all baselines after ~1 epoch.
    • End-of-training mean improvement margins:
    • +12.7 pp over The Cauldron,
    • +5.1 pp over Cambrian-7M,
    • +14.3 pp over LLaVA-OneVision.
    • Under contamination-controlled re-training, FineVision’s performance drop (1.6 pp) remains smaller than that of all baselines (2.7–3.7 pp).
  • Agentic / GUI Evaluation (ScreenSpot V2 & Pro, zero-shot and fine-tuned conditions for Smol-2B, Smol-0.5B, FV-0.5B):
    • Zero-shot: All models score 0% on ScreenSpot-Pro.
    • After fine-tuning: FV-0.5B reaches 6% on ScreenSpot-Pro (vs. 7% for Smol-2B and 1% for Smol-0.5B) and 48% on ScreenSpot-V2 (vs. 41% and 24%).
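For reference, the comparison-training setup in the Model Training bullet above can be summarized as a configuration sketch. Only the values are taken from the source; the key names are illustrative.

```python
# Hedged summary of the ablation training setup; key names are illustrative assumptions.
FINEVISION_ABLATION_CONFIG = {
    "model": {
        "name": "SmolVLM-460M",
        "text_backbone": "SmolLM2-360M",
        "vision_encoder": "SigLIP2-Base-512",
    },
    "training": {
        "steps": 20_000,
        "global_batch_size": 512,
        "epochs": 1,
        "hardware": "32x NVIDIA H100",
        "approx_wall_clock_hours": 20,
    },
}
```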

6. Access, Distribution, and Licensing

  • Availability: Released as an open-access corpus; all 185 unified subsets are publicly distributed via the Hugging Face Hub.
  • Data Format: JSON-lines, fields for images, texts (with roles and content), and metadata; action-space schema in Python stubs.
  • Licensing: Each subset retains its original license terms; redistribution requires compliance with source licensing.
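A minimal sketch of reading one JSON-lines shard with the schema described above; the shard file name, the top-level field names, and the tuple layout are illustrative assumptions, since the exact distribution layout is not specified here.

```python
import json

def iter_samples(jsonl_path: str):
    """Yield (images, conversation turns, metadata) from a JSON-lines shard (assumed field names)."""
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            sample = json.loads(line)
            images = sample["images"]                                  # URLs or byte-encoded images
            turns = [(t["role"], t["content"]) for t in sample["texts"]]
            yield images, turns, sample.get("metadata", {})

for images, turns, meta in iter_samples("finevision_subset.jsonl"):   # hypothetical shard name
    ...
```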

Full documentation, including schema, system prompts, and quality rating definitions, is provided in the appendix of the primary reference (Wiedmann et al., 20 Oct 2025).

By consolidating 185 source subsets into a thoroughly de-duplicated, de-contaminated, and quality-controlled corpus at unprecedented scale and diversity, FineVision establishes a new standard for open research in data-centric vision–language and agentic model development.

References

1. Wiedmann et al., 20 Oct 2025. (Primary reference introducing the FineVision corpus.)