DataConcept: Concept-Aware Image-Text Dataset
- DataConcept is a large-scale, concept-annotated image–text dataset enriched with localized object and attribute annotations for precise multimodal pretraining.
- It employs a robust annotation pipeline that merges diverse concept sources and uses multi-resolution grounding to achieve both depth and breadth in visual concepts.
- Its Concept-Aware Batch Sampling (CABS) strategies, including diversity and frequency maximization, significantly improve zero-shot classification and image–text retrieval performance.
DataConcept refers to a large-scale, concept-annotated image–text dataset and the associated framework for task-adaptive, concept-aware data curation in multimodal pretraining. The DataConcept resource consists of 128 million web-crawled image–text pairs, each enriched with localized, fine-grained object and attribute annotations drawn from an open-vocabulary, hierarchical concept bank. It serves as both a foundation and a testbed for the development and evaluation of advanced batch sampling and training strategies in vision–language models, particularly Concept-Aware Batch Sampling (CABS), which enables direct, online control over the concept distributions seen by models during pretraining (Ghosh et al., 25 Nov 2025).
1. Dataset Scope and Construction
DataConcept is constructed by sampling 128M pairs from the DataComp XLarge pool (12.8B samples) without prior filtering. Each sample is an image–text pair $(I_i, T_i)$. The annotation pipeline proceeds as follows:
- Concept-bank formation: The vocabulary merges RAM++ (Recognize Anything Model++, ~4k items), V3Det classes, OpenImages labels, and mined concepts from Udandarao et al. The merged bank is normalized by lemmatization, syntactic cleaning, synonym coalescing (using WordNet and Sentence-Transformer cosine similarity), and explicit curation to remove unsafe or sensitive terms, yielding 19,261 entries.
- Image-level tagging: RAM++ tags each image at 384×384 resolution with concepts above a confidence of 0.75, producing a preliminary concept set for each image.
- Concept grounding: For each image, the prefiltered concepts are passed as text prompts to GroundingDINO at four resolutions to obtain bounding-box localizations, using text–box similarity and box-confidence thresholds of 0.27. Weighted Box Fusion clusters predictions with IoU > 0.29 and fuses them for stability.
- Box and concept post-processing: Only concepts that can be localized are retained, and all boxes are associated with detection confidence. This leads to a final set of 12,253 unique grounded concepts across the 128M samples.
- Concept-aware recaptioning: For each sample, Qwen2-VL-7B generates a synthetic caption incorporating the grounded concept set. These recaptions are significantly longer (mean ~33 tokens) and cover a far larger fraction of the annotated concepts than raw alt-text (over 50% of concepts explicitly mentioned, versus ≲4% in alt-text).
Each DataConcept record is a tuple $(I_i, T_i, T_i^{\mathrm{syn}}, \{(b_{ij}, c_{ij}, s_{ij})\}_{j=1}^{n_i})$, where each $b_{ij}$ is a bounding box, $c_{ij}$ a concept id, and $s_{ij}$ a confidence score.
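As a concrete illustration, the sketch below mirrors this record layout in Python. The class and field names (`GroundedConcept`, `DataConceptRecord`, `concept_set`) are hypothetical rather than the released schema; the constants are the thresholds quoted above.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

# Thresholds quoted in the pipeline description above (illustrative constants).
RAM_TAG_CONFIDENCE = 0.75      # image-level tagging threshold (RAM++)
GROUNDING_THRESHOLD = 0.27     # GroundingDINO text-box similarity / box confidence
WBF_IOU_THRESHOLD = 0.29       # Weighted Box Fusion clustering IoU

@dataclass
class GroundedConcept:
    concept_id: int                          # index into the 12,253-entry grounded vocabulary
    box: Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max)
    confidence: float                        # fused detection confidence s_ij

@dataclass
class DataConceptRecord:
    image_id: str
    alt_text: str                            # original web alt-text T_i
    synthetic_caption: str                   # Qwen2-VL-7B concept-aware recaption T_i^syn
    grounded: List[GroundedConcept] = field(default_factory=list)

    def concept_set(self) -> Set[int]:
        """Unique concept ids C_i that the batch samplers operate on."""
        return {g.concept_id for g in self.grounded}
```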
2. Concept Annotation Schema and Taxonomy
Concepts are understood as objects or attributes that can be detected and localized in natural images. The taxonomy is hierarchical, encompassing entities from RAM++, OpenImages, and V3Det; it has 12,253 unique leaf concepts empirically present in the data. Annotation proceeds by model-based multi-label tagging, guided object detection, and fusion and filtering operations. Coverage is both deep (capturing rare, fine-grained concepts) and broad (including frequent, generic classes), ensuring diversity and granularity suitable for modern vision-language research.
At the sample level, the median number of annotated concepts per image is 3, with a long right tail (some images carry many more). At the dataset level, concept frequencies exhibit a heavy-tailed distribution: 86 concepts appear in over 1M images, and 5,326 in at least 1k.
3. Mathematical Formulation: Labels and Distributions
The mathematical representation of concept supervision in DataConcept is as follows:
- Each sample $i$ possesses a concept set $\mathcal{C}_i \subseteq \mathcal{V}$, where $\mathcal{V}$ is the grounded concept vocabulary.
- For each concept $c \in \mathcal{V}$, the frequency $f(c) = |\{\, i : c \in \mathcal{C}_i \,\}|$ measures its global prevalence.
- The distribution of objects per sample and per batch underpins the batch sampling algorithms that interact with the dataset. For targeted pretraining, concept-label distributions can be explicitly controlled at sub-batch granularity.
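A minimal sketch of these two quantities, assuming each sample is reduced to its set of concept ids (e.g., via the hypothetical `concept_set()` accessor above):

```python
from collections import Counter
from typing import Iterable, Set

def global_concept_frequency(concept_sets: Iterable[Set[int]]) -> Counter:
    """f(c): number of images whose concept set contains c (dataset-level prevalence)."""
    freq: Counter = Counter()
    for concepts in concept_sets:
        freq.update(concepts)   # each concept counted at most once per image
    return freq

def batch_concept_counts(batch_concept_sets: Iterable[Set[int]]) -> Counter:
    """n_c: per-(sub-)batch concept counts, the distribution CABS controls online."""
    counts: Counter = Counter()
    for concepts in batch_concept_sets:
        counts.update(concepts)
    return counts
```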
4. Concept-Aware Batch Sampling (CABS) Framework
CABS implements deterministic, online (on-the-fly) sub-batch construction with two main paradigms:
4.1 CABS-Diversity Maximization (CABS-DM)
The goal is to approximate a uniform per-concept frequency in each sub-batch, combating the skew inherent in large-scale, web-scraped data. The algorithm:
- Maintains a tally $n_c$ of the number of already-selected sub-batch samples containing each concept $c$.
- Assigns each candidate sample $x$ a gain score of the form $g(x) = \sum_{c \in \mathcal{C}_x} \max(0, \tau - n_c)$, with target per-concept count $\tau$ for sub-batch size $B$.
- Samples are greedily selected into the sub-batch to maximize concept coverage and flatten per-concept frequency until the quota is filled (a minimal sketch follows this list).
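The sketch below follows that description, assuming candidates are (sample id, concept set) pairs and using the gain form above; the paper's exact scoring and tie-breaking may differ.

```python
from collections import Counter
from typing import List, Set, Tuple

def cabs_dm_select(candidates: List[Tuple[str, Set[int]]],
                   sub_batch_size: int,
                   target_per_concept: float) -> List[str]:
    """Greedy CABS-DM sketch: repeatedly pick the sample whose concepts are most
    under-represented relative to the target tau, flattening per-concept frequency."""
    tally: Counter = Counter()          # n_c for concepts already in the sub-batch
    selected: List[str] = []
    remaining = list(candidates)

    while remaining and len(selected) < sub_batch_size:
        # Gain of a candidate: total remaining headroom of its concepts toward tau.
        def gain(item: Tuple[str, Set[int]]) -> float:
            return sum(max(0.0, target_per_concept - tally[c]) for c in item[1])

        best = max(remaining, key=gain)  # O(|super-batch|) per pick; fine for a sketch
        remaining.remove(best)
        selected.append(best[0])
        tally.update(best[1])
    return selected
```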
4.2 CABS-Frequency Maximization (CABS-FM)
This approach fills the sub-batch with samples of maximal object multiplicity, biasing training toward densely annotated images:
Samples are ranked by grounded-object count, and the top $B$ are selected, where $B$ is the sub-batch size.
Both methods operate over a “super-batch” (default size 20,480; post-filtering sub-batch size 4,096) sampled IID from the full DataConcept pool, with the filter ratio (default 0.8, i.e., 80% of the super-batch is discarded) controlling the desired selectivity.
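A sketch of this super-batch-to-sub-batch flow for CABS-FM, under the assumption that the filter ratio denotes the discarded fraction (so the 20,480 → 4,096 defaults correspond to 0.8):

```python
from typing import List, Tuple

def cabs_fm_select(super_batch: List[Tuple[str, int]],
                   filter_ratio: float = 0.8) -> List[str]:
    """CABS-FM sketch: keep the most densely annotated samples from an IID super-batch.

    super_batch: (sample_id, num_grounded_objects) pairs.
    filter_ratio: fraction discarded; 0.8 keeps the top 20% by object multiplicity.
    """
    keep = int(round(len(super_batch) * (1.0 - filter_ratio)))
    ranked = sorted(super_batch, key=lambda item: item[1], reverse=True)
    return [sample_id for sample_id, _ in ranked[:keep]]

# With the defaults above: a 20,480-sample super-batch yields a 4,096-sample sub-batch.
```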
5. Impact on Pretraining and Evaluation Protocols
DataConcept, with CABS, provides a controllable substrate for training contrastive vision–language models (CLIP, SigLIP) and for evaluating augmentations to the data curation process. Extensive experiments reported in (Ghosh et al., 25 Nov 2025) show:
- CABS-DM yields up to +7 percentage points improvement in zero-shot ImageNet accuracy and a +3 point gain averaged across >20 zero-shot classification benchmarks, compared to IID sampling.
- CABS-FM boosts image–text retrieval R@1 scores by up to +9 points.
- Synthetic, concept-rich recaptions independently yield up to +12 percentage points on ImageNet relative to original alt-text, reflecting the value of explicit semantic labeling.
- Unlike prior dataset curation (MetaCLIP, GRIT-VLP, MAFA), CABS enables explicit, task-aware balancing in the online pretraining loop rather than static, offline filtering, outperforming such baselines in both classification and retrieval performance.
- DataConcept and CABS are open-sourced, with code and dataset available for reproducibility and further research.
6. Target Distributions and Use Cases
By defining explicit concept distribution targets, practitioners can tailor pretraining batches to downstream task requirements:
- Zero-shot classification (e.g., ImageNet, domain transfer): uniform or debiased coverage of the long-tailed concept taxonomy (CABS-DM) leads to less class bias and higher accuracy in challenging settings.
- Image–text retrieval (MSCOCO, Flickr30k): aggregation of highly annotated, object-rich images (CABS-FM) better matches the test label statistics, improving recall.
- For exploratory analysis of batch-level concept diversity, CABS-DM achieves approximately 1.5× the number of unique concepts in a batch and a much flatter coverage profile than IID sampling (one simple way to measure this is sketched after this list).
- The explicit grounding of concepts aligns DataConcept with applications requiring localization, open-world detection, and controlled semantic supervision.
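A possible way to quantify that batch-level claim uses the unique-concept count plus normalized entropy as a flatness score; this particular metric is illustrative rather than the paper's own.

```python
import math
from collections import Counter
from typing import Iterable, Set, Tuple

def batch_diversity(batch_concept_sets: Iterable[Set[int]]) -> Tuple[int, float]:
    """Return (unique-concept count, normalized entropy of concept coverage).

    A flatter, more uniform coverage profile pushes the second value toward 1.0."""
    counts = Counter(c for concepts in batch_concept_sets for c in concepts)
    if not counts:
        return 0, 0.0
    total = sum(counts.values())
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    max_entropy = math.log(len(counts)) if len(counts) > 1 else 1.0
    return len(counts), entropy / max_entropy
```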
7. Reproducibility and Ecosystem Integration
The complete DataConcept dataset (128M samples) and all CABS code are free and openly available. Core hyperparameters (filter ratio, batch sizes, max concept frequency) are documented in (Ghosh et al., 25 Nov 2025) and in the public repository. The DataConcept protocol is compatible with standard backends for large-scale, distributed multimodal pretraining.
| Component | Default Value / Source | Notes |
|---|---|---|
| Pool | DataComp XLarge (12.8B) / Uniform sample, 128M | Avoids link-rot, robust curation |
| Vocabulary size | 19,261 (bank), 12,253 (active, grounded) | Drawn from RAM++, V3Det, OpenImages, Udandarao et al. |
| CABS-DM target | Per-concept count, sub-batch size 4,096 | Max cap 40, min 1 sample per concept |
| Synthetic recaption | Qwen2-VL-7B, mean length 33 tokens | 50%+ of concepts explicitly present; alt-text ≲4% |
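For orientation, the defaults above can be collected into a single configuration block; the key names below are illustrative and should be checked against the public repository rather than read as its actual API.

```python
# Hypothetical configuration mirroring the table above (key names are not the repo's).
DATACONCEPT_DEFAULTS = {
    "source_pool": "DataComp-XLarge",    # 12.8B-sample pool, 128M uniform sample
    "num_samples": 128_000_000,
    "vocab_size_bank": 19_261,
    "vocab_size_grounded": 12_253,
    "super_batch_size": 20_480,
    "sub_batch_size": 4_096,
    "filter_ratio": 0.8,                 # fraction of each super-batch discarded
    "cabs_dm_max_cap": 40,               # per-concept maximum in a sub-batch
    "cabs_dm_min_per_concept": 1,
    "recaption_model": "Qwen2-VL-7B",    # mean caption length ~33 tokens
}
```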
DataConcept thus establishes a paradigm for data-centric, concept-aware, and online-controllable pretraining of foundational vision and vision–language models (Ghosh et al., 25 Nov 2025).