Papers
Topics
Authors
Recent
Search
2000 character limit reached

GPIC: A Giant Permissive Image Corpus for Visual Generation

Published 28 May 2026 in cs.CV and cs.AI | (2605.30341v1)

Abstract: Studying scalable methods for visual generative modeling requires large, accessible, and stable datasets. We introduce GPIC, a Giant Permissive Image Corpus of approximately 28 trillion pixels. GPIC comprises diverse internet images captioned by a state-of-the-art vision-LLM, including 100M training, 200K validation, and 1M test examples. Moreover, all GPIC images are permissively licensed for both research and commercial use. GPIC is safety-filtered, deduplicated, and centrally hosted on Hugging Face. We provide a benchmarking protocol for generative modeling on GPIC. Finally, we provide a reference baseline for pixel-space flow matching on GPIC. Our dataset, benchmark, and models are available at https://huggingface.co/datasets/stanford-vision-lab/gpic. Evaluation toolkit and code are available at https://gpic.stanford.edu

Summary

  • The paper introduces GPIC, a fully permissive image corpus with 100M training images to enhance reproducibility and benchmarking in generative modeling.
  • The methodology employs a rigorous four-stage pipeline with aggressive licensing filters, VLM-driven captioning, and conservative deduplication to ensure data quality and legal compliance.
  • Benchmark evaluations leverage FD-DINOv2, aligning closely with human perception to avoid metric saturation seen in legacy measures like FID.

GPIC: A Giant Permissive Image Corpus for Visual Generation

Motivation and Benchmark Landscape

The paper addresses persistent challenges in visual generative modeling research focused on reproducibility, data accessibility, and benchmarking rigor. Legacy benchmarks like ImageNet-1K have produced saturation effects in standard metrics (notably FID), driven by evolutionary "Goodharting" where models optimize for the metric rather than perceptual quality. Proprietary datasets exacerbate reproducibility barriers, and datasets distributed as URL indices suffer severe instability due to link rot. The GPIC corpus is introduced to satisfy four critical attributes essential for contemporary benchmarking: permissive licensing, dataset stability, scale, and accessibility.

Dataset Construction Pipeline and Quality Assurance

GPIC consists of 28 trillion pixels from 100M training, 200K validation, and 1M test image-caption pairs, each carefully curated for permissible research and commercial usage. Images are crawled from Flickr and Wikimedia, with strict filtering to retain only fully permissive licenses (CC BY, CC0, Public Domain, and No-Known-Restrictions). Attribution strings, provenance, and license information are maintained as explicit metadata.

A four-stage pipeline is developed for GPIC construction:

  1. Source Pooling and Licensing: Aggressive licensing filtration ensures downstream legal compliance for all dataset artifacts.
  2. Image Filtering: VLM-based quality filters eliminate poor-resolution, blank, or semantically vacuous images and apply conservative safety screening.
  3. Deduplication: Approximate nearest-neighbor search in SSCD feature space is used to remove clear duplicates and large near-copy clusters, but the thresholding is purposely conservative due to the cost of permissive data acquisition and the visual diversity in large-scale web datasets (Figure 1). Figure 1

    Figure 1: Four-stage GPIC dataset construction pipeline, including licensed image acquisition, quality filtering, deduplication, and VLM-driven captioning.

Deduplication rules are stratified: only images in clusters with similarity 0.9625\geq 0.9625, or clusters with at least five members above $0.9$, are removed. The process is validated by exhaustive manual inspection and byte-level SHA-256 collision checks.

Captioning Methodology and Microbenchmark Evaluation

Captioning is performed using Qwen3-VL-4B-Instruct, selected via microbenchmark analysis to maximize throughput-quality tradeoff, assessed over five axes (summary quality, counting, spatial relations, attribute binding, OCR) and multiple caption lengths. GPIC captions are generated in four formats (tag, short, medium, long), with distribution set at 1%, 45%, 45%, and 9% respectively. The model achieves near-parity with the largest VLMs in microbenchmark quality scores but offers substantially higher throughput, enabling large-scale corpus generation (Figure 2). Figure 2

Figure 2: Captioning quality and throughput comparison for Qwen3-VL-Instruct variants on GPIC microbenchmark.

All prompts and judging criteria are released, facilitating reproducible benchmarking and transparent caption quality assessment. Human refinement and annotation are employed to develop robust reference captions for microbenchmarking.

Benchmarking Protocol and Metric Redesign

The evaluation protocol is explicitly designed to avoid metric saturation and overfitting artifacts. The primary metric is FD-DINOv2, the Fréchet Distance computed in DINOv2 feature space, which remains unsaturated on current image synthesis models unlike the traditional FID. FD-DINOv2 is strongly aligned with human perceptual judgments and is computed against a held-out 1M-image test set—not the train set—providing stricter memorization detection (Figure 3). Figure 3

Figure 3: FID saturation on ImageNet-1K contrasted with sustained discriminative power of FD-DINOv2.

Secondary metrics include Precision, Density, Recall, and Coverage. FD-DINOv2 variants are discussed, with ablations across backbone size and register token inclusion demonstrating shifts in metric scale but strong inter-variant model ordering consistency.

Protocol compliance guidelines are stringent: use of FD-DINOv2-aligned objectives in model training is discouraged and must be clearly reported, as such optimization targets the metric directly rather than generative performance. Auxiliary models trained on datasets outside GPIC (e.g., DINOv3, SigLIP) should be explicitly disclosed. Prompt rewriting, alternative captioning, or text embedding deviations are also to be reported for interpretability.

Baseline Experiments

A reference baseline is established with JiT-T2I (PixGen-XXL/16 1.1B), trained for one epoch on GPIC-Full. The model architecture is a pixel-space flow-matching Transformer, conditioned using Qwen3-1.7B embeddings. Experiment details are rigorously described, with random cropping, square resizing, and classifier-free guidance sampling. The baseline achieves a best FD-DINOv2 of 76.25 at CFG=6.25 on the held-out test captions, with coverage and recall metrics improving with higher CFG. This baseline, though not performance-optimized, is released to anchor future comparative studies (Figure 4). Figure 4

Figure 4: Qualitative JiT-T2I generated samples across object-centric and scene-level prompts, juxtaposed with real test images and corresponding text prompts.

Implications, Limitations, and Future Directions

GPIC enables transparent, reproducible benchmarking for visual generative models at scale, addressing critical gaps left by proprietary or unstable datasets. The permissive licensing and centralized hosting remove legal and practical barriers to broad adoption for academia and industry. The corpus is also highly extensible to multimodal and text-driven tasks, supporting the broader trajectory of vision-and-language modeling.

Safety and societal risk mitigation measures are explicitly embedded in dataset construction: image and caption filtering, metadata retention, and frozen shard packaging reduce toxic content release, memorization risks, and dataset drift. Some residual near-duplicates remain, but their estimated prevalence is low given the deduplication strategy.

Theoretically, GPIC positions the community to reevaluate calibration and discrimination of generative metrics, particularly considering evolving model capabilities and the potential for metric-specific optimization. The non-saturation property of FD-DINOv2 provides a more robust evaluative landscape versus legacy Inception-based metrics.

Future work should explore further human-alignment studies for metric variants, the impact of external auxiliary representations on benchmarking comparability, and applications beyond text-to-image—such as video generation and multimodal control.

Conclusion

GPIC is a rigorously constructed, fully permissive 100M-image text-caption corpus designed to advance reproducible and scalable research in visual generative modeling. Through robust filtering, deduplication, VLM-based captioning, and carefully designed evaluative metrics, GPIC overcomes significant prior limitations in data accessibility and benchmark rigor. The corpus, protocol, and reference models are publicly released for open comparative research. GPIC is positioned as a foundational resource for evolving generative modeling methodologies and benchmarking standards.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

Explain it Like I'm 14

What is this paper about?

This paper introduces GPIC (Giant Permissive Image Corpus), a huge collection of 100 million images with matching text descriptions (captions). It’s built to help researchers fairly train and test image‑generation AI models (the kind that turn text prompts into pictures). GPIC is designed to be legal to use, stable over time, big enough for modern models, and easy to download.

What questions were the authors trying to answer?

In simple terms, the authors asked:

  • How can we build a modern, legal, and stable image dataset that anyone (schools, labs, companies) can use to train and compare image‑generating AIs?
  • Can we provide better, clearer ways to judge these AIs so scores really reflect image quality and variety—not just “teaching to the test”?
  • What does a transparent, realistic training + testing setup look like today?

How did they build GPIC?

To make GPIC, the team followed a careful pipeline, like organizing a massive photo album that’s clean, legal, and well‑labeled.

Finding legally usable images

They collected images from sites like Flickr and Wikimedia that allow reuse (licenses like CC BY, CC0, and public domain). “Permissive” means the images can be used for both research and commercial projects. They kept info like the license and who to credit for each image.

Cleaning and safety filtering

They removed images that were too small, too blurry, nearly blank, or otherwise poor quality. They also used an AI to flag images that might be unsafe, and removed those. Think of this like tossing out photos that are too dark, out of focus, or inappropriate.

Removing duplicates

Lots of online images are copies or near‑copies (like taking several shots of the same scene). The team used a “copy detection” method to find similar images and remove clear duplicates—like using a smart “find similar photos” feature on your phone. They did this carefully to keep truly different images (even if they’re similar), while cutting out obvious repeats.

Writing captions for every image

Most web images don’t have good descriptions. So the team used a strong vision‑language AI (Qwen3‑VL‑4B) to write captions for every image. They used four styles:

  • Tags (short keyword lists)
  • Short captions
  • Medium captions
  • Long, more detailed captions Most captions are short or medium so they’re fast and useful for training.

Splits and packaging (making it easy to use)

They split the dataset into training (100M images), validation (200K), and test (1M). They also made smaller versions:

  • GPIC‑Nano: 1M images
  • GPIC‑Lite: 10M images Everything is hosted on Hugging Face in balanced chunks (“shards”), so you can download and stream them easily without special tools.

How do you judge image‑generating models fairly?

The authors propose a better evaluation method:

  • Instead of using the old FID score (which many models have already “maxed out” on older datasets), they use a newer approach based on DINOv2 features. Think of DINOv2 as a modern way for a computer to “look” at images more like a human would.
  • Their main score is FD‑DINOv2 (a distance between real and generated images in DINOv2’s “vision space”). Lower is better.
  • They also report Precision/Recall and Density/Coverage to capture both faithfulness (images look real) and diversity (images aren’t all the same).
  • Importantly, they compare generated images to a separate, held‑out test set (1M images), not the training set. That helps catch cheating or memorization.

To make comparisons fair, they provide:

  • A fixed set of 50K test captions: everyone runs their model on the same prompts.
  • Precomputed test statistics and a public evaluation toolkit.

What did they find, and why does it matter?

Here are the main takeaways:

  • GPIC meets four goals at once: it’s permissive (legal), stable (doesn’t change behind your back), large (100M images ≈ 28 trillion pixels), and accessible (easy to download and use).
  • Captions are high‑quality and varied (from simple tags to detailed descriptions), making the dataset better for today’s text‑to‑image models.
  • Their evaluation method (FD‑DINOv2) stays informative, unlike older scores that are now “saturated.” It also lines up better with what humans think looks good.
  • They release a simple, transparent baseline model trained on GPIC. Its results show GPIC is challenging and useful for tracking improvements.
  • Everything is public: the dataset, the evaluation code, and the baseline. That encourages fair, reproducible science instead of secret, ever‑changing data.

Why does this work matter?

  • For students and researchers: GPIC makes it possible to train and compare image‑generating models on a large, stable, legal dataset without guessing what others used.
  • For companies: The permissive licenses reduce legal risk and make it easier to build on this work.
  • For the field: Better benchmarks lead to better models. By avoiding outdated tests and by using a held‑out test set, the results are more meaningful.
  • For safety and transparency: The dataset includes safety filtering, clear attribution, and a frozen release (not just a list of URLs that might disappear). That reduces data drift and hidden changes.

In short, GPIC is like a well‑organized, openly shared “mega‑library” of images and captions—designed for today’s AI image generation—plus a fair grading system and a starter model. It should help the community build better, safer, and more reliable image‑generating AIs.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, concrete list of what remains missing, uncertain, or unexplored in the paper, phrased to guide future research.

  • Dataset representativeness and bias: No quantitative audit of geographic, demographic, cultural, or category distributions (e.g., people, objects, scenes), nor analysis of source-platform (Flickr/Wikimedia) biases and their downstream impact on generative models.
  • Train/val/test leakage via near-duplicates: Deduplication is intentionally conservative and details do not explicitly confirm cross-split near-duplicate removal; the paper does not quantify residual near-duplicates across splits or their effect on evaluation.
  • Post-dedup residual duplicate rate: No corpus-level measurement (manual or automated) of remaining near-duplicate prevalence after the two-tier SSCD rule, nor ablations of thresholds vs. diversity/quality trade-offs.
  • License accuracy and “No-Known-Restrictions”: No audit of mislabelled licenses at this scale or clarity on the legal basis and due diligence for the “No-Known-Restrictions” category across jurisdictions.
  • Attribution compliance for CC BY: No guidance on how users should satisfy CC BY attribution at 100M scale (e.g., aggregated attributions, downstream model releases), or whether derived models incur attribution obligations.
  • Safety filtering scope and validity: Safety removal (~0.35%) lacks a disclosed taxonomy, thresholds, or precision/recall estimates; no human validation of false positives/negatives; unclear coverage of categories (e.g., sexual content, violence).
  • Privacy/PII risks: No explicit detection/removal policy for faces (including minors), identity documents, license plates, or other PII in images or captions (including OCR-extracted text).
  • Caption faithfulness at scale: Captions are generated by one VLM (Qwen3-VL-4B) with only a 1,520-image microbenchmark and LLM-as-a-judge scoring; no large-scale human audit or error-rate estimates (hallucinations, miscounts, spatial errors) across 100M images.
  • Caption style homogeneity: Reliance on a single captioner risks stylistic and linguistic homogeneity; no ablations with multiple VLMs or human-mixed captions to assess generalization and overfitting to caption style.
  • Caption-type mixture choice: The 1%/45%/45%/9% tag/short/medium/long mix is heuristic; no experiments probing how this ratio affects model learning, prompt faithfulness, or sample diversity.
  • Content coverage statistics: Absent taxonomic breakdowns (objects, scenes, activities, long tail) to guide targeted data curation or interpret model failures by domain.
  • Multilingual support: Captions appear English-only; no multilingual variants or evaluation, limiting research on multilingual text-to-image generation.
  • Watermarks and low-quality artifacts: No reported prevalence of watermarks, heavy compression, overlays, or manipulated images post-filtering; no watermark detection or corruption statistics.
  • Evaluation ignores text–image alignment: Primary metrics (FD-DINOv2, Precision/Recall, Density/Coverage) are marginal distribution metrics and do not evaluate prompt fidelity; no CLIPScore/TIFA/human studies for conditional alignment.
  • Human correlation on GPIC: No user studies validating that FD-DINOv2 correlates with human judgments of quality and faithfulness for GPIC’s text-conditioned setting.
  • Metric contamination risk: DINOv2 may have been trained on overlapping web data; no empirical overlap audit with GPIC Test-1M; the degree to which familiarity inflates FD-DINOv2 remains unquantified.
  • Adversarial optimization of the metric: Beyond discouragement/disclosure, no safeguards, secondary primary metric, or leaderboard policy to mitigate optimization to DINOv2 features.
  • Baseline scope and scaling: Only a single pixel-space flow model trained for one epoch at 256×256; no comparisons to latent diffusion/DiT/AR models, no scaling laws, no Nano/Lite vs. Full ablations, and no higher-resolution studies.
  • Aspect ratio handling: Baseline uses square crops at 256×256; no analysis of aspect-ratio preservation or resolution scaling vs. performance relative to GPIC’s resolution distribution.
  • Overlap with other datasets/benchmarks: No measurement of GPIC’s overlap with common training/evaluation sets (e.g., LAION, ImageNet) to assess contamination and generalization.
  • Test-set integrity: Full 1M-image test set is public; no mechanism to prevent training on test images (e.g., hidden test servers), risking benchmark gaming.
  • Reproducibility of captioning: Caption generation took ~1,500 H100-hours, but it is unclear if pinning seeds/versions yields identical captions; no release of n-best candidates/logits or a deterministic reproduction protocol.
  • Dataset governance and lifecycle: No versioning/change-log plan, takedown policy (e.g., for mislicensed or sensitive items), or mirroring strategy beyond a single hosting provider.
  • Provenance accountability vs. stability: Removal of source URLs aids stability but impedes independent license verification and subject notification; no alternative mechanism for provenance checks or updates.
  • Legal jurisdiction nuances: CC BY/PD interpretations vary by country; the paper does not discuss cross-jurisdictional risks for commercial deployment using GPIC.
  • Broader multimodal benchmarks: Although GPIC is positioned for broader multimodal research, no standardized tasks or baselines are provided for retrieval, captioning, or VQA on GPIC.
  • Environmental and cost considerations: Limited analysis of compute/storage costs and practical guidance for low-resource users beyond Nano/Lite tiers (e.g., streaming bandwidth, shard access patterns).

Practical Applications

Immediate Applications

Below are concrete, deployable uses that leverage GPIC’s dataset, pipeline, and evaluation protocol today.

  • Train and fine‑tune license‑clean text‑to‑image models for commercial products
    • Sector: software, media/entertainment, advertising, e‑commerce
    • How: Stream GPIC shards (Hugging Face) into existing training stacks; use provided captions (tag/short/medium/long) for flexible conditioning; start on GPIC‑Nano (1M) or GPIC‑Lite (10M), scale to GPIC‑Full (100M)
    • Tools/workflows: Hugging Face datasets streaming; shards (8,000 tars); CFG sampling; reference JiT‑T2I config for a starting point
    • Assumptions/dependencies: Sufficient compute/storage (≈12.9 TB for GPIC‑Full); performance at higher resolutions may require further training; attribution handling for CC BY content if needed downstream; dataset biases from Flickr/Wikimedia
  • Reproducible benchmarking and model selection using FD‑DINOv2 on a held‑out test set
    • Sector: AI labs, academia, enterprise AI procurement
    • How: Generate 50K images from fixed GPIC test captions; compute FD‑DINOv2, Precision/Recall, Density/Coverage against released 1M‑test statistics
    • Tools/workflows: gpic‑eval (PyTorch evaluation suite); DINOv2 features; published test statistics on Hugging Face
    • Assumptions/dependencies: Avoid optimizing directly on DINOv2 features per protocol; disclose deviations; compute for generating 50K samples per model
  • Rapid prototyping and ablations with tiered datasets
    • Sector: startups, research groups
    • How: Use GPIC‑Nano/Lite for quick iteration, hyperparameter tuning, and architectural comparisons; swap shard ranges to scale experiments consistently
    • Tools/workflows: Shard‑based tiering; identical source/caption distributions across tiers
    • Assumptions/dependencies: Consistent shard selection for reproducibility; storage/bandwidth for tier downloads
  • License‑aware data curation for internal corpora
    • Sector: data engineering/platforms
    • How: Replicate GPIC’s pipeline—license filtering, VLM‑based quality/safety filtering, and SSCD+FAISS deduplication—for proprietary or domain datasets
    • Tools/workflows: Qwen3‑VL‑4B filters; SSCD descriptors; FAISS; conservative two‑tier dedup rules (0.90 graph, >0.9625 pair removal; keep only highest‑res in connected components ≥5)
    • Assumptions/dependencies: Access to VLM inference (vLLM/SGLang) and GPU capacity; threshold calibration for specific domains; quality/safety policy alignment
  • Caption enrichment for existing image assets
    • Sector: e‑commerce (product catalogs), DAMs, education
    • How: Apply the paper’s short/medium/long/tag prompting templates with Qwen3‑VL‑4B to generate multi‑granularity captions for search, retrieval, and training
    • Tools/workflows: Provided caption prompts; VLM‑inference stack; optional LLM‑as‑judge scoring
    • Assumptions/dependencies: Caption quality varies by domain; English‑centric captions by default; GPU hours and batching strategy affect throughput
  • Synthetic data augmentation for discriminative models
    • Sector: retail (classification/search), content moderation, mapping
    • How: Train or fine‑tune generators on GPIC and synthesize class/attribute‑balanced images using tags/short captions to improve minority coverage or stress‑test classifiers
    • Tools/workflows: GPIC‑trained generators; diversity metrics (Recall/Coverage) to verify augmentation utility
    • Assumptions/dependencies: Gains depend on domain match; risk of reinforcing dataset biases; ensure labeling alignment
  • Safer dataset release and training operations
    • Sector: MLOps/data governance
    • How: Adopt “frozen tar shards” release pattern to prevent silent drift and poisoning; maintain per‑sample licensing/attribution metadata
    • Tools/workflows: Sharded packaging; attribution strings; SHA‑256 duplicate checks
    • Assumptions/dependencies: Organizational processes for provenance retention; legal review for dataset redistribution
  • Standardized model reporting for publications and model cards
    • Sector: academia, AI vendors
    • How: Report FD‑DINOv2 on GPIC with 50K fixed captions and held‑out statistics; document auxiliary model usage per protocol
    • Tools/workflows: gpic‑eval; published “oracle reference” real‑vs‑real distances to contextualize results
    • Assumptions/dependencies: Community compliance; disclosure of DINOv2‑aligned training if used
  • Teaching and skills development
    • Sector: education
    • How: Use GPIC‑Nano/Lite in courses for data pipelines, generative modeling, evaluation, and reproducibility
    • Tools/workflows: Shards + evaluation suite; caption prompts/microbenchmark
    • Assumptions/dependencies: Storage and bandwidth availability for students/institutions

Long‑Term Applications

These opportunities require further research, scaling, domain adaptation, or ecosystem adoption.

  • Sector‑specific permissive datasets using the GPIC pipeline
    • Sector: healthcare (non‑PII medical imagery), manufacturing (inspection), cultural heritage, geospatial
    • What: Build licensed, stable, domain‑aligned corpora with VLM‑based quality/safety filters and conservative dedup
    • Tools/workflows: Domain‑tuned safety filters; specialized VLMs; license vetting; shard packaging
    • Assumptions/dependencies: Availability of permissively licensed domain data; stronger safety/taxonomy requirements; expert annotation for high‑risk domains
  • Video and 3D/multimodal corpora with analogous permissive, stable design
    • Sector: film/VFX, robotics, AR/VR, simulation
    • What: Extend GPIC principles to video/temporal data and 3D assets; enable training/evaluation of generative video or 3D models with held‑out metrics
    • Tools/workflows: License‑vetted source pools; temporal safety filters; scalable captioning/storyboarding VLMs
    • Assumptions/dependencies: Video licensing complexities; much higher storage/compute; evaluation metric maturation
  • Open, certifiable standards for training‑data compliance and audits
    • Sector: policy/regulation, enterprise compliance
    • What: “Trained on permissive corpus” certifications; lineage tracking that proves license scope and provenance at scale
    • Tools/workflows: Dataset manifests, per‑sample metadata, audit APIs, internal dashboards
    • Assumptions/dependencies: Regulator and industry adoption; third‑party auditors; harmonized license interpretations
  • Next‑gen open‑weight generators trained entirely on permissive data
    • Sector: enterprise AI, creative tools
    • What: Foundation T2I/T2V models that are legally uncomplicated to deploy commercially
    • Tools/workflows: Large‑scale streaming training on GPIC; higher‑res curricula; distillation/compression for deployment
    • Assumptions/dependencies: Compute budgets; product‑grade guardrails and RLHF; potential need for supplemental high‑res or domain‑specific images
  • Multimodal grounding and agent systems leveraging structured captions
    • Sector: robotics, assistive tech, AR/vision assistants
    • What: Use GPIC’s hierarchical captions (tag→long) to pretrain vision‑language components for spatial/attribute grounding
    • Tools/workflows: Curriculum learning with caption lengths; integration with action/policy models
    • Assumptions/dependencies: Need for task/embodiment‑specific data; bridging sim‑to‑real gaps; additional annotations (e.g., depth, actions)
  • Bias and safety measurement frameworks anchored to held‑out evaluation
    • Sector: academia, policy, platform safety
    • What: Develop standardized stress tests and fairness audits for generators using FD‑DINOv2‑based suites plus human studies
    • Tools/workflows: Expanded caption test sets targeting protected attributes; audit reporting schemas
    • Assumptions/dependencies: Additional labeling and demographic taxonomies; consensus on fairness metrics and procedures
  • Evaluation‑as‑a‑service and dataset streaming services
    • Sector: cloud/ML platforms
    • What: Managed endpoints for GPIC streaming, 50K‑caption evaluation runs, and metric dashboards
    • Tools/workflows: Hosted gpic‑eval; autoscaling inference/training clusters
    • Assumptions/dependencies: Cost‑effective cloud economics; privacy/security for proprietary model submissions
  • On‑device and edge‑deployed generators trained on GPIC‑Lite
    • Sector: mobile, creative apps, games
    • What: Compressed models distilled from GPIC‑trained systems for offline use
    • Tools/workflows: Quantization, pruning, distillation; mobile inference runtimes
    • Assumptions/dependencies: Maintaining image quality under tight compute/memory; energy constraints
  • Data‑centric research on metrics beyond DINOv2 and robustness to “metric‑gaming”
    • Sector: academia/standards bodies
    • What: New perceptual metrics, human‑aligned benchmarks resistant to over‑optimization
    • Tools/workflows: Larger held‑out sets; multi‑facet scoring (aesthetics, grounding, diversity)
    • Assumptions/dependencies: Community consensus; scalable human evaluation pipelines
  • Federated training with GPIC as public backbone + private data adapters
    • Sector: enterprises with proprietary catalogs
    • What: Combine GPIC pretraining with private fine‑tuning via adapters/LoRA in privacy‑preserving setups
    • Tools/workflows: Federated learning stacks; parameter‑efficient fine‑tuning; secure aggregation
    • Assumptions/dependencies: Legal and privacy frameworks; domain shift management; heterogeneous data quality

Cross‑cutting assumptions and dependencies

  • Compute and storage: GPIC‑Full is ≈12.9 TB and training at scale typically requires multi‑GPU clusters.
  • Licensing and attribution: Although GPIC is permissive (CC BY/CC0/Public Domain/No‑Known‑Restrictions), CC BY may require attribution handling in some deployments; retain metadata.
  • Safety and quality: VLM‑based filters reduce but do not eliminate problematic content; organizations may need stricter domain‑specific filters.
  • Bias and representativeness: Flickr/Wikimedia skew may underrepresent certain geographies, demographics, or modalities; validate before critical deployments.
  • Metric usage: FD‑DINOv2 should not be directly optimized without disclosure; results are most meaningful when adhering to the evaluation protocol.
  • Resolution and detail: Average image resolution (~479×587) may limit ultra‑high‑res generation without additional data or upsampling strategies.

Glossary

  • AdamW: A variant of the Adam optimizer that decouples weight decay from the gradient update for better regularization. "We use AdamW with learning rate 10410^{-4}, betas 0.9 and 0.95, and no weight decay."
  • approximate nearest-neighbor search: A fast method to find close items in high-dimensional spaces without exact comparisons, often used for large-scale similarity search. "and use FAISS for approximate nearest-neighbor search."
  • attribute binding: The capability of a model to correctly associate attributes (like color or position) with the correct objects in an image. "best spatial understanding and attribute binding scores (1.60 and 1.55);"
  • classifier-free guidance (CFG): A sampling technique for conditional generative models that adjusts the strength of conditioning without a separate classifier. "We sample with Euler sampling using 50 steps and evaluate classifier-free guidance (CFG) scales 1.75, 4.0, and 6.25."
  • connected component: A set of nodes in a graph where each node is reachable from any other within the set; used here to group near-duplicate images. "for connected components containing at least five images, we keep only the highest-resolution image in the component,"
  • copy detection features: Visual descriptors designed to identify copies or near-copies of images under transformations. "deduplicate images using similarity scores derived from SSCD~\cite{pizzi2022self} copy detection features (Stage 3)"
  • Coverage (metric): A diversity metric for generative models measuring how much of the reference distribution is covered by generated samples. "We also report Precision and Density, which measure fidelity, and Recall and Coverage, which measure diversity"
  • deduplication: The process of identifying and removing duplicate or near-duplicate items in a dataset. "We run SSCD-based deduplication on six subsets ranging from 108K to 3.4M images, across thresholds θ{0.75,0.80,0.85,0.90,0.95}\theta\in\{0.75,0.80,0.85,0.90,0.95\}."
  • Density (metric): A fidelity-oriented metric assessing how well generated samples fall within dense regions of the reference distribution. "We also report Precision and Density, which measure fidelity, and Recall and Coverage, which measure diversity"
  • DINOv2: A self-supervised vision model whose features are used for evaluation metrics more aligned with human perception. "On GPIC, we evaluate generated images using metrics computed over DINOv2 features."
  • DINOv3: A newer large-scale self-supervised vision model; use of its features during training can bias evaluation on DINO-based metrics. "such as DINOv3 \cite{simeoni2025dinov3} or SigLIP \cite{zhai2023sigmoidlosslanguageimage}."
  • Euler sampling: A numerical sampler used for generative diffusion/flow models, specifying the integration steps during image generation. "We sample with Euler sampling using 50 steps"
  • FAISS: A library for efficient similarity search and clustering of dense vectors at scale. "and use FAISS for approximate nearest-neighbor search."
  • FD-DINOv2: A Frechet Distance metric computed over DINOv2 features, used as the primary evaluation metric in GPIC. "We recommend DINOv2 features because ImageNet-1K FID is saturated, while FD-DINOv2 remains informative for current models."
  • FID: Fréchet Inception Distance, a standard metric for evaluating generative models that has become saturated on some benchmarks. "ImageNet-1K FID is saturated, while FD-DINOv2 remains informative for current models."
  • flow matching: A generative modeling approach that learns continuous flows in data space, here used in pixel space for image generation. "JiT~\cite{li2025back}, a pixel-space flow matching model with a Transformer backbone."
  • Goodhart's law: The principle that once a measure becomes a target, it loses its effectiveness as a measure. "Goodhart's law: When a measure becomes a target, it ceases to be a good measure."
  • link rot: The phenomenon where URLs become inaccessible over time, undermining dataset stability and reproducibility. "which makes comparisons difficult due to link rot~\cite{datacomp, yfcc100m}."
  • LLM-as-a-judge: An evaluation setup where a LLM is used to score or judge outputs along specified criteria. "Each axis is scored on a 0--2 scale using an LLM-as-a-judge pipeline, and we also measure captioning throughput."
  • Maximum Mean Discrepancy: A non-parametric statistical distance between distributions used to evaluate generative models. "and Maximum Mean Discrepancy as a non-parametric alternative to FD."
  • MoE (Mixture of Experts): A model architecture where inputs are routed to specialized sub-networks (experts), often sparsely activated for efficiency. "We evaluate Qwen3-VL-Instruct models at 2B, 4B, 8B, and 30B-A3B (sparse MoE) on this benchmark."
  • OCR: Optical Character Recognition, the task of extracting text from images. "We score captions along five axes: overall summary quality, counting accuracy, spatial understanding, attribute binding, and OCR."
  • Power law: A functional relationship where one quantity varies as a power of another, often used for extrapolating scaling behavior. "and fit a power law D(N)=ANβD(N)=AN^\beta to predict removals at full scale."
  • Precision (metric): A fidelity metric indicating how closely generated samples match the support of the reference distribution. "We also report Precision and Density, which measure fidelity, and Recall and Coverage, which measure diversity"
  • Recall (metric): A diversity metric showing how well generated samples cover the variety present in the reference distribution. "We also report Precision and Density, which measure fidelity, and Recall and Coverage, which measure diversity"
  • SGLang: An inference/serving framework optimized for efficient generation with large language/vision models. "and support efficient inference through standard serving frameworks such as vLLM~\cite{kwon2023efficient} and SGLang~\cite{zheng2024sglang}."
  • SHA-256: A cryptographic hash function used here to detect exact duplicate files. "We also verify that no exact duplicates remain by computing SHA-256 hashes over image file bytes."
  • shard: A partition of a large dataset into manageable chunks to facilitate storage and streaming training. "We package GPIC as tar shards containing images, captions, and metadata, and release the dataset on Hugging Face with documentation."
  • SigLIP: A language-image pretraining approach; using its features during training may confer unfair advantages in evaluation. "such as DINOv3 \cite{simeoni2025dinov3} or SigLIP \cite{zhai2023sigmoidlosslanguageimage}."
  • SSCD: A self-supervised descriptor for image copy detection used to identify near-duplicate images. "Specifically, we extract SSCD features~\cite{pizzi2022self} for all images and use FAISS for approximate nearest-neighbor search."
  • URL index: A dataset format that lists source URLs instead of hosting files directly, which can be unstable over time. "Many modern image datasets are distributed as URL indices, which makes comparisons difficult due to link rot~\cite{datacomp, yfcc100m}."
  • URL-level data poisoning: Malicious manipulation of web-hosted data that can contaminate datasets built from URL lists. "eliminating silent dataset drift, exposure to URL-level data poisoning, and the need to re-scrape source images outside our filtering pipeline."
  • vLLM: An optimized serving system for large language and vision-LLMs aimed at high-throughput inference. "and support efficient inference through standard serving frameworks such as vLLM~\cite{kwon2023efficient} and SGLang~\cite{zheng2024sglang}."

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 7 tweets with 417 likes about this paper.