DataConcept: Concept-Aware Image-Text Dataset
- DataConcept is a large-scale, concept-annotated image–text dataset enriched with localized object and attribute annotations for precise multimodal pretraining.
- It employs a robust annotation pipeline that merges diverse concept sources and uses multi-resolution grounding to achieve both depth and breadth in visual concepts.
- Its Concept-Aware Batch Sampling (CABS) strategies, including diversity and frequency maximization, significantly improve zero-shot classification and image–text retrieval performance.
DataConcept refers to a large-scale, concept-annotated image–text dataset and the associated framework for task-adaptive, concept-aware data curation in multimodal pretraining. The DataConcept resource consists of 128 million web-crawled image–text pairs, each enriched with localized, fine-grained object and attribute annotations drawn from an open-vocabulary, hierarchical concept bank. It serves as both a foundation and a testbed for the development and evaluation of advanced batch sampling and training strategies in vision–language models, particularly Concept-Aware Batch Sampling (CABS), which enables direct, online control over the concept distributions seen by models during pretraining (Ghosh et al., 25 Nov 2025).
1. Dataset Scope and Construction
DataConcept is constructed by sampling 128M pairs from the DataComp XLarge pool (12.8B samples) without prior filtering. Each sample is an image–text pair $(I_i, T_i)$. The annotation pipeline proceeds as follows:
- Concept-bank formation: The vocabulary merges RAM++ (Recognize Anything Model++, ~4k items), V3Det classes, OpenImages labels, and mined concepts from Udandarao et al. The merged bank is normalized by lemmatization, syntactic cleaning, synonym coalescing (using WordNet and Sentence-Transformer cosine similarity), and explicit curation to remove unsafe or sensitive terms, yielding 19,261 entries.
- Image-level tagging: RAM++ tags each image at 384×384 resolution with concepts above a confidence of 0.75, producing a preliminary concept set for each image.
- Concept grounding: For each image, the prefiltered concepts are passed as text prompts to GroundingDINO at four resolutions to obtain bounding-box localizations, using text–box similarity and box-confidence thresholds of 0.27. Weighted Box Fusion clusters predictions with IoU > 0.29 and fuses them for stability.
- Box and concept post-processing: Only concepts that can be localized are retained, and all boxes are associated with detection confidence. This leads to a final set of 12,253 unique grounded concepts across the 128M samples.
- Concept-aware recaptioning: For each sample, Qwen2-VL-7B generates a synthetic caption incorporating the grounded concept set. These recaptions are significantly longer (mean ~33 tokens) and cover a far larger fraction of the annotated concepts than raw alt-text (over 50% of concepts explicitly mentioned, versus ≲4% in alt-text).
Each DataConcept record is a tuple $(I_i, T_i, T_i^{\mathrm{syn}}, \{(b_{ij}, c_{ij}, s_{ij})\}_{j=1}^{n_i})$, where each $b_{ij}$ is a bounding box, $c_{ij}$ a concept id, and $s_{ij}$ a confidence score.
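As a concrete illustration, the sketch below mirrors this record layout in Python. The class and field names (`GroundedConcept`, `DataConceptRecord`, `concept_set`) are hypothetical rather than the released schema; the constants are the thresholds quoted above.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

# Thresholds quoted in the pipeline description above (illustrative constants).
RAM_TAG_CONFIDENCE = 0.75      # image-level tagging threshold (RAM++)
GROUNDING_THRESHOLD = 0.27     # GroundingDINO text-box similarity / box confidence
WBF_IOU_THRESHOLD = 0.29       # Weighted Box Fusion clustering IoU

@dataclass
class GroundedConcept:
    concept_id: int                          # index into the 12,253-entry grounded vocabulary
    box: Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max)
    confidence: float                        # fused detection confidence s_ij

@dataclass
class DataConceptRecord:
    image_id: str
    alt_text: str                            # original web alt-text T_i
    synthetic_caption: str                   # Qwen2-VL-7B concept-aware recaption T_i^syn
    grounded: List[GroundedConcept] = field(default_factory=list)

    def concept_set(self) -> Set[int]:
        """Unique concept ids C_i that the batch samplers operate on."""
        return {g.concept_id for g in self.grounded}
```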
2. Concept Annotation Schema and Taxonomy
Concepts are understood as objects or attributes that can be detected and localized in natural images. The taxonomy is hierarchical, encompassing entities from RAM++, OpenImages, and V3Det; it has 12,253 unique leaf concepts empirically present in the data. Annotation proceeds by model-based multi-label tagging, guided object detection, and fusion and filtering operations. Coverage is both deep (capturing rare, fine-grained concepts) and broad (including frequent, generic classes), ensuring diversity and granularity suitable for modern vision-language research.
At the sample level, the median number of annotated concepts per image is 3, with a long right tail (some images carry many more). At the dataset level, concept frequencies exhibit a heavy-tailed distribution: 86 concepts appear in over 1M images, and 5,326 in at least 1k.
3. Mathematical Formulation: Labels and Distributions
The mathematical representation of concept supervision in DataConcept is as follows:
- Each sample $i$ possesses a concept set $\mathcal{C}_i \subseteq \mathcal{V}$, where $\mathcal{V}$ is the grounded concept vocabulary.
- For each concept $c \in \mathcal{V}$, the frequency $f(c) = |\{\, i : c \in \mathcal{C}_i \,\}|$ measures its global prevalence.
- The distribution of objects per sample and per batch underpins the batch sampling algorithms that interact with the dataset. For targeted pretraining, concept-label distributions can be explicitly controlled at sub-batch granularity.
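A minimal sketch of these two quantities, assuming each sample is reduced to its set of concept ids (e.g., via the hypothetical `concept_set()` accessor above):

```python
from collections import Counter
from typing import Iterable, Set

def global_concept_frequency(concept_sets: Iterable[Set[int]]) -> Counter:
    """f(c): number of images whose concept set contains c (dataset-level prevalence)."""
    freq: Counter = Counter()
    for concepts in concept_sets:
        freq.update(concepts)   # each concept counted at most once per image
    return freq

def batch_concept_counts(batch_concept_sets: Iterable[Set[int]]) -> Counter:
    """n_c: per-(sub-)batch concept counts, the distribution CABS controls online."""
    counts: Counter = Counter()
    for concepts in batch_concept_sets:
        counts.update(concepts)
    return counts
```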
4. Concept-Aware Batch Sampling (CABS) Framework
CABS implements deterministic, online (on-the-fly) sub-batch construction with two main paradigms:
4.1 CABS-Diversity Maximization (CABS-DM)
The goal is to approximate a uniform per-concept frequency in each sub-batch, combating the skew inherent in large-scale, web-scraped data. The algorithm:
- Maintains a tally $n_c$ of the number of already-selected sub-batch samples containing each concept $c$.
- Assigns each candidate sample $x$ a gain score of the form $g(x) = \sum_{c \in \mathcal{C}_x} \max(0, \tau - n_c)$, with target per-concept count $\tau$ for sub-batch size $B$.
- Samples are greedily selected into the sub-batch to maximize concept coverage and flatten per-concept frequency until the quota is filled (a minimal sketch follows this list).
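The sketch below follows that description, assuming candidates are (sample id, concept set) pairs and using the gain form above; the paper's exact scoring and tie-breaking may differ.

```python
from collections import Counter
from typing import List, Set, Tuple

def cabs_dm_select(candidates: List[Tuple[str, Set[int]]],
                   sub_batch_size: int,
                   target_per_concept: float) -> List[str]:
    """Greedy CABS-DM sketch: repeatedly pick the sample whose concepts are most
    under-represented relative to the target tau, flattening per-concept frequency."""
    tally: Counter = Counter()          # n_c for concepts already in the sub-batch
    selected: List[str] = []
    remaining = list(candidates)

    while remaining and len(selected) < sub_batch_size:
        # Gain of a candidate: total remaining headroom of its concepts toward tau.
        def gain(item: Tuple[str, Set[int]]) -> float:
            return sum(max(0.0, target_per_concept - tally[c]) for c in item[1])

        best = max(remaining, key=gain)  # O(|super-batch|) per pick; fine for a sketch
        remaining.remove(best)
        selected.append(best[0])
        tally.update(best[1])
    return selected
```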
4.2 CABS-Frequency Maximization (CABS-FM)
This approach fills the sub-batch with samples of maximal object multiplicity, biasing training toward densely annotated images:
Samples are ranked by grounded-object count, and the top $B$ are selected, where $B$ is the sub-batch size.
Both methods operate over a “super-batch” (default size 20,480; post-filtering sub-batch size 4,096) sampled IID from the full DataConcept pool, with the filter ratio (default 0.8, i.e., 80% of the super-batch is discarded) controlling the desired selectivity.
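A sketch of this super-batch-to-sub-batch flow for CABS-FM, under the assumption that the filter ratio denotes the discarded fraction (so the 20,480 → 4,096 defaults correspond to 0.8):

```python
from typing import List, Tuple

def cabs_fm_select(super_batch: List[Tuple[str, int]],
                   filter_ratio: float = 0.8) -> List[str]:
    """CABS-FM sketch: keep the most densely annotated samples from an IID super-batch.

    super_batch: (sample_id, num_grounded_objects) pairs.
    filter_ratio: fraction discarded; 0.8 keeps the top 20% by object multiplicity.
    """
    keep = int(round(len(super_batch) * (1.0 - filter_ratio)))
    ranked = sorted(super_batch, key=lambda item: item[1], reverse=True)
    return [sample_id for sample_id, _ in ranked[:keep]]

# With the defaults above: a 20,480-sample super-batch yields a 4,096-sample sub-batch.
```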
5. Impact on Pretraining and Evaluation Protocols
DataConcept, with CABS, provides a controllable substrate for training contrastive vision–language models (CLIP, SigLIP) and for evaluating augmentations to the data curation process. Extensive experiments reported in (Ghosh et al., 25 Nov 2025) show:
- CABS-DM yields up to +7 percentage points improvement in zero-shot ImageNet accuracy and a +3 point gain averaged across >20 zero-shot classification benchmarks, compared to IID sampling.
- CABS-FM boosts image–text retrieval R@1 scores by up to +9 points.
- Synthetic, concept-rich recaptions independently yield up to +12 percentage points on ImageNet relative to original alt-text, reflecting the value of explicit semantic labeling.
- Unlike prior dataset curation (MetaCLIP, GRIT-VLP, MAFA), CABS enables explicit, task-aware balancing in the online pretraining loop rather than static, offline filtering, outperforming such baselines in both classification and retrieval performance.
- DataConcept and CABS are open-sourced, with code and dataset available for reproducibility and further research.
6. Target Distributions and Use Cases
By defining explicit concept distribution targets, practitioners can tailor pretraining batches to downstream task requirements:
- Zero-shot classification (e.g., ImageNet, domain transfer): uniform or debiased coverage of the long-tailed concept taxonomy (CABS-DM) leads to less class bias and higher accuracy in challenging settings.
- Image–text retrieval (MSCOCO, Flickr30k): aggregation of highly annotated, object-rich images (CABS-FM) better matches the test label statistics, improving recall.
- For exploratory analysis of batch-level concept diversity, CABS-DM achieves approximately 1.5× the number of unique concepts in a batch and a much flatter coverage profile than IID sampling (one simple way to measure this is sketched after this list).
- The explicit grounding of concepts aligns DataConcept with applications requiring localization, open-world detection, and controlled semantic supervision.
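A possible way to quantify that batch-level claim uses the unique-concept count plus normalized entropy as a flatness score; this particular metric is illustrative rather than the paper's own.

```python
import math
from collections import Counter
from typing import Iterable, Set, Tuple

def batch_diversity(batch_concept_sets: Iterable[Set[int]]) -> Tuple[int, float]:
    """Return (unique-concept count, normalized entropy of concept coverage).

    A flatter, more uniform coverage profile pushes the second value toward 1.0."""
    counts = Counter(c for concepts in batch_concept_sets for c in concepts)
    if not counts:
        return 0, 0.0
    total = sum(counts.values())
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    max_entropy = math.log(len(counts)) if len(counts) > 1 else 1.0
    return len(counts), entropy / max_entropy
```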
7. Reproducibility and Ecosystem Integration
The complete DataConcept dataset (128M samples) and all CABS code are free and openly available. Core hyperparameters (filter ratio, batch sizes, max concept frequency) are documented in (Ghosh et al., 25 Nov 2025) and in the public repository. The DataConcept protocol is compatible with standard backends for large-scale, distributed multimodal pretraining.
| Component | Default Value / Source | Notes |
|---|---|---|
| Pool | DataComp XLarge (12.8B) / Uniform sample, 128M | Avoids link-rot, robust curation |
| Vocabulary size | 19,261 (bank), 12,253 (active, grounded) | Drawn from RAM++, V3Det, OpenImages, Udandarao et al. |
| CABS-DM target | Per-concept count, sub-batch size 4,096 | Max cap 40, min 1 sample per concept |
| Synthetic recaption | Qwen2-VL-7B, mean length 33 tokens | 50%+ of concepts explicitly present; alt-text ≲4% |
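For orientation, the defaults above can be collected into a single configuration block; the key names below are illustrative and should be checked against the public repository rather than read as its actual API.

```python
# Hypothetical configuration mirroring the table above (key names are not the repo's).
DATACONCEPT_DEFAULTS = {
    "source_pool": "DataComp-XLarge",    # 12.8B-sample pool, 128M uniform sample
    "num_samples": 128_000_000,
    "vocab_size_bank": 19_261,
    "vocab_size_grounded": 12_253,
    "super_batch_size": 20_480,
    "sub_batch_size": 4_096,
    "filter_ratio": 0.8,                 # fraction of each super-batch discarded
    "cabs_dm_max_cap": 40,               # per-concept maximum in a sub-batch
    "cabs_dm_min_per_concept": 1,
    "recaption_model": "Qwen2-VL-7B",    # mean caption length ~33 tokens
}
```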
DataConcept thus establishes a paradigm for data-centric, concept-aware, and online-controllable pretraining of foundational vision and vision–language models (Ghosh et al., 25 Nov 2025).