ReCap-DataComp-1B v2: Enhanced Vision-Language Dataset
- ReCap-DataComp-1B v2 is a large-scale recaptioned image–text dataset that replaces noisy captions with detailed descriptions using a LLaMA-3-powered pipeline.
- It enhances semantic alignment and textual richness by significantly increasing caption length and vocabulary diversity for improved model training.
- Its captioner is built through a two-stage fine-tuning process, and models trained on the recaptioned data show measurable gains in zero-shot retrieval and image generation tasks.
ReCap-DataComp-1B v2 is a large-scale, recaptioned image–text dataset designed to advance the training and evaluation of vision-language models, emphasizing high semantic alignment and textual richness in image descriptions. Derived from the DataComp-1B dataset (itself a filtered subset of 1.3 billion image–text pairs drawn from a 12.8B-pair web crawl), ReCap-DataComp-1B v2 replaces noisy, terse web-sourced captions with detailed captions generated by a LLaMA-3-powered captioning model, and is intended for broad distribution and open-source research in multimodal learning.
1. Recaptioning Methodology and Model Architecture
The central component of ReCap-DataComp-1B v2 is its recaptioning pipeline, which enhances each image–text pair by generating new, semantically richer captions. The foundation is a vision-language architecture comprising:
- Frozen Vision Backbone: CLIP ViT-L/14 for robust visual feature extraction.
- Projection MLP: Trained to map vision encoder outputs to the embedding space of the LLM.
- LLM: LLaMA-3–8B, selected for its strong open-source language modeling and reasoning capabilities, replacing the default LLM in LLaVA-1.5.
- Two-Stage Fine-Tuning (sketched in code after this list):
- Stage 1: The projection MLP is trained using approximately 558K filtered image–text pairs from public datasets (e.g., LAION, CC3M, SBU).
- Stage 2: Both the projection MLP and LLaMA-3–8B language decoder are fine-tuned using an additional 665K instruction-following (i.e., conversational or reasoning) pairs plus high-quality samples from datasets like HQ-Edit.
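A minimal PyTorch sketch of this wiring and the stage-wise freezing schedule follows. The concrete modules, dimensions, and projector shape are illustrative stand-ins (the paper's pipeline uses a frozen CLIP ViT-L/14 vision tower, a trained MLP projector, and LLaMA-3-8B), not the released implementation.

```python
# Minimal sketch of the captioner's wiring and stage-wise freezing.
# The vision tower and language model are stand-in nn.Modules.
import torch
import torch.nn as nn

class RecaptionerSketch(nn.Module):
    def __init__(self, vision_tower: nn.Module, language_model: nn.Module,
                 vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.vision_tower = vision_tower      # frozen image encoder (CLIP ViT-L/14)
        self.projector = nn.Sequential(       # LLaVA-style MLP projector
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))
        self.language_model = language_model  # autoregressive decoder (LLaMA-3-8B)

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor):
        with torch.no_grad():                 # vision backbone stays frozen throughout
            patch_features = self.vision_tower(pixel_values)
        visual_tokens = self.projector(patch_features)
        # Visual tokens are prepended to the text embeddings and decoded jointly.
        return self.language_model(torch.cat([visual_tokens, text_embeds], dim=1))

def set_trainable(model: RecaptionerSketch, stage: int) -> None:
    """Stage 1 trains only the projector; stage 2 also unfreezes the language model."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.projector.parameters():
        p.requires_grad = True
    if stage == 2:
        for p in model.language_model.parameters():
            p.requires_grad = True
```

Calling set_trainable(model, stage=1) before projector pre-training and set_trainable(model, stage=2) before instruction tuning reproduces the freezing pattern described in the list above.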
Caption generation relies on autoregressive decoding (greedy search, maximum 128 tokens) following prompts such as: “Please generate a detailed caption of this image. Please be as descriptive as possible.”
The process uniformly recaptions all 1.3B image–text pairs, yielding the Recap-DataComp-1B dataset with substantially longer and more informative descriptions compared to the originals (Li et al., 12 Jun 2024).
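For an individual image, the generation step can be approximated with off-the-shelf tooling. The sketch below uses Hugging Face transformers with the public llava-hf/llava-1.5-7b-hf checkpoint as a stand-in for the paper's LLaMA-3-based captioner (the exact released checkpoint is not assumed here), applying the same prompt, greedy decoding, and 128-token cap.

```python
# Hedged sketch of single-image recaptioning with greedy decoding and a
# 128-token cap. The llava-hf/llava-1.5-7b-hf checkpoint is a public stand-in,
# not the paper's LLaMA-3-based captioner.
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

prompt = ("USER: <image>\nPlease generate a detailed caption of this image. "
          "Please be as descriptive as possible. ASSISTANT:")
image = Image.open("example.jpg")

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)  # greedy search
caption = processor.decode(output_ids[0], skip_special_tokens=True).split("ASSISTANT:")[-1].strip()
print(caption)
```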
2. Dataset Characterization and Enhancements
Several distinct enhancements in ReCap-DataComp-1B v2 contrast with previous web-crawled datasets:
- Caption Length and Detail: Average caption length increases from 10.22 words (web-crawled) to 49.43 words (recaptioned), integrating a broader spectrum of attributes, object descriptors, and contextual features.
- Lexical Richness: Recaptioned captions account for 82.86% of the collective token vocabulary of the combined corpus, indicating both expansion and diversification of semantic content (a sketch for computing such corpus statistics follows this list).
- Semantic Alignment: Assessed with a LongCLIP-Large model, the image–caption similarity score rises to 89.91 for recaptioned text versus 10.09 for the original captions.
- GPT-4V Evaluation: A GPT-4V audit over 10,000 samples reports average fluency and alignment ratings rising from 3.71 (original) to 4.14 (recaptioned).
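The length and vocabulary figures above can be approximated with simple corpus measurements; a minimal sketch, assuming whitespace word tokenization and parallel lists of original and recaptioned strings (the paper's exact tokenization is not assumed):

```python
# Hedged sketch for reproducing corpus-level statistics of the kind quoted above:
# average caption length and the recaptioned share of the combined vocabulary.
from typing import List

def avg_word_count(captions: List[str]) -> float:
    return sum(len(c.split()) for c in captions) / max(len(captions), 1)

def recaption_vocab_share(original_captions: List[str], recaptions: List[str]) -> float:
    """Fraction of the combined (original + recaptioned) word vocabulary
    that appears in the recaptioned text."""
    orig_vocab = {w.lower() for c in original_captions for w in c.split()}
    recap_vocab = {w.lower() for c in recaptions for w in c.split()}
    combined = orig_vocab | recap_vocab
    return len(recap_vocab) / max(len(combined), 1)

# Toy example:
originals = ["a dog", "cheap shoes sale"]
recaps = ["A golden retriever lying on green grass beside a red ball.",
          "A pair of white running shoes displayed on a wooden shelf."]
print(avg_word_count(originals), avg_word_count(recaps),
      recaption_vocab_share(originals, recaps))
```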
Figure examples illustrate the qualitative difference: recaptioned samples accurately describe foreground/background relations, object colors, contextual cues, and actions, surpassing the original captions’ brevity or misalignment (Li et al., 12 Jun 2024).
3. Effects on Vision-Language Model Training
Discriminative Models (e.g., CLIP)
- Models trained on Recap-DataComp-1B v2 (“Recap-CLIP”) exhibit consistent improvement on zero-shot cross-modal retrieval.
- A mixed training regime stochastically blends original and recaptioned captions with mixing ratio $p$ (the probability of drawing the recaption); for each image $x_i$, the training caption $t_i$ is sampled as
$$
t_i =
\begin{cases}
t_i^{\text{recap}}, & \text{with probability } p,\\
t_i^{\text{orig}}, & \text{with probability } 1 - p.
\end{cases}
$$
Empirically, $p = 0.8$ (80% recaptioned) yields a 5% lift on standard retrieval benchmarks compared to models trained exclusively on original captions (a minimal sketch of this blending appears after this list).
- Increased text encoder size further augments long-context retrieval (Urban1K) and detailed attribute recognition (VG-Attribution), with score improvements sometimes in the 19–36% range (Li et al., 12 Jun 2024).
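A minimal sketch of the caption-blending step, assuming each training record stores both the original web caption and the recaption; this is illustrative, not the released training code:

```python
# Hedged sketch of stochastic caption blending for Recap-CLIP-style training:
# with probability p the recaption is used, otherwise the original web caption.
import random
from dataclasses import dataclass

@dataclass
class Pair:
    image_path: str
    original_caption: str
    recaption: str

def sample_caption(pair: Pair, p: float = 0.8) -> str:
    """Return the recaption with probability p, else the original caption."""
    return pair.recaption if random.random() < p else pair.original_caption

pair = Pair("img_000.jpg",
            "cheap shoes sale",
            "A pair of white running shoes displayed on a wooden shelf in a store.")
sampled = [sample_caption(pair) for _ in range(10)]  # ~80% recaptions in expectation
```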
Generative Models (Text-to-Image Diffusion Transformers)
- Training DiT models on the recaptioned data yields lower FID (Fréchet Inception Distance), with FID reduced by 8.4 when comparing pure recaption training to pure original-caption training.
- CLIP- and Recap-CLIP-based image–text alignment scores both increase, and GPT-4V ratings of prompt–image alignment improve by 1.1 points, facilitating better compliance with intricate or compositional text prompts (a sketch of CLIP-style alignment scoring follows this list).
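A minimal sketch of CLIP-style image–text alignment scoring, using the open_clip library with an OpenAI ViT-L/14 checkpoint as a stand-in for the evaluators mentioned above (the paper's exact scoring models and preprocessing are not assumed):

```python
# Hedged sketch of CLIP-style image-text alignment scoring.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-L-14")
model.eval()

@torch.no_grad()
def clip_score(image_path: str, caption: str) -> float:
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    text = tokenizer([caption])
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    return (image_features @ text_features.T).item()  # cosine similarity in [-1, 1]

# Example: score a generated image against its prompt.
# print(clip_score("generated.png", "A red bicycle leaning against a brick wall at sunset."))
```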
4. Optimization Strategies and Technical Specifications
The recaptioning and model training process involves several technical decisions tailored to the demands of large-scale, high-fidelity data creation:
- Training Parameters: Recap-CLIP models are trained using AdamW with large batch sizes (tens of thousands) and a two-phase procedure: initial low-resolution image training followed by higher-resolution fine-tuning.
- Tokenization: The tokenizer and text encoder accommodate up to 128 tokens per caption to handle the increased caption length from recaptioning (see the sketch after this list).
- Instruction-Following Supervision: The fine-tuning corpus for the captioner leverages a substantial share of instruction-following examples, assisting the model to produce contextually appropriate, detailed, and semantically grounded descriptions.
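A minimal sketch of two of these choices, 128-token caption tokenization and an AdamW optimizer, using open_clip; the hyperparameter values are illustrative assumptions rather than the paper's recipe, and the text tower's positional embeddings must separately be configured for the longer context.

```python
# Hedged sketch: tokenize a long recaption to a 128-token context and set up AdamW.
# Hyperparameters are assumed, CLIP-style defaults, not the paper's exact recipe.
import torch
import open_clip

long_caption = ("A golden retriever lying on green grass beside a red ball, "
                "photographed in warm late-afternoon light with trees in the background.")
tokens = open_clip.tokenize([long_caption], context_length=128)  # shape [1, 128], padded/truncated

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-16", pretrained=None)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, betas=(0.9, 0.98),
                              eps=1e-6, weight_decay=0.2)  # assumed values
```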
5. Comparisons with Prior Data Filtering and Caption Modification Approaches
Earlier DataComp work focused on boosting data quality through model-based filtering and synthetic captioning methods:
| Approach | Core Model(s) Used | Principal Data Enhancement | Downstream Impact |
|---|---|---|---|
| CLIP/BLIP-2 Filtering (Yokoo et al., 2023) | CLIP ViT-L/14, BLIP-2-COCO | Filter by image–text similarity; modify captions using BLIP-2 | 6.6%–48.5% gain vs. baselines |
| ReCap-DataComp-1B v2 (Li et al., 12 Jun 2024) | LLaVA-1.5, LLaMA-3–8B | Re-caption 1.3B pairs with detailed, context-rich text | +5% retrieval, +1.1 GPT-4V |
- BLIP-2–COCO (fine-tuned on MSCOCO) emerged as optimal for similarity-based filtering in earlier DataComp challenges (Yokoo et al., 2023).
- Caption modification in prior work leveraged BLIP-2/Flan-T5-XL, selecting between the generated and the original caption via CLIP similarity (a minimal sketch of this selection rule follows this list).
- The current LLaMA-3–powered recaptioning surpasses prior synthetic captions in length, fluency, and alignment, yielding greater improvements in downstream metrics.
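A minimal sketch of that selection rule, assuming a clip_score helper like the one sketched in Section 3; it paraphrases the approach of Yokoo et al. (2023) rather than reproducing their code:

```python
# Hedged sketch: keep whichever caption (original web text vs. generated caption)
# scores higher against the image under a CLIP similarity helper.
from typing import Callable

def select_caption(image_path: str, original: str, generated: str,
                   clip_score: Callable[[str, str], float]) -> str:
    return (generated
            if clip_score(image_path, generated) >= clip_score(image_path, original)
            else original)
```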
6. Implications for Future Foundation Model Development
The open-source release of Recap-DataComp-1B v2 is poised to substantially affect the development and evaluation of large-scale vision-language models:
- Data Quality: High-quality textual annotations are expected to become a key competitive differentiator for both discriminative and generative multimodal systems, with direct influence on retrieval, classification, and image generation tasks.
- Open-Source Ecosystem: Availability of a GPT-4–level, richly annotated dataset provides a public alternative to proprietary corpora maintained by industry actors.
- Research Directions: Promising future research includes further refinement of recaptioning pipelines, new ratios or blending strategies for original and generated captions, improved long-text fusion schemes, and the examination of even larger or more specialized vision-language models.
- Application Scope: Enhanced training data supports improved performance not only in conventional retrieval or generation settings but also enables new directions in visually grounded reasoning and interactive multimodal agents (Li et al., 12 Jun 2024).
7. Relationship to DataComp-LM and Broader Data Curation Research
The principle of dataset-centric performance improvement is echoed in parallel work on language-modeling datasets such as DataComp-LM (Li et al., 17 Jun 2024):
- Both in vision-language and language-only domains, aggressive data deduplication, quality filtering (e.g., model-based fastText classifiers; a minimal sketch follows this list), and careful data mixing are crucial to maximizing downstream accuracy per unit of compute.
- For Recap-DataComp-1B v2, the blending of high-quality generated captions parallels the finding that improving annotation quality, whether through curation or synthesis, yields measurable performance gains even with fixed model architectures and training recipes.
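A minimal sketch of such model-based quality filtering in the spirit of DataComp-LM, assuming a pre-trained fastText classifier whose file path, label name, and threshold are hypothetical:

```python
# Hedged sketch of fastText-based quality filtering: score each document and
# keep only those the classifier rates as high quality with sufficient confidence.
import fasttext

quality_model = fasttext.load_model("quality_classifier.bin")  # assumed pre-trained classifier

def keep_document(text: str, threshold: float = 0.9) -> bool:
    labels, probs = quality_model.predict(text.replace("\n", " "), k=1)
    return labels[0] == "__label__high_quality" and probs[0] >= threshold  # assumed label name

corpus = ["A detailed tutorial on training contrastive image-text models ...",
          "click here buy now best price!!!"]
filtered = [doc for doc in corpus if keep_document(doc)]
```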
A plausible implication is that continuous advances in high-capacity, open-source captioners and large-scale synthetic annotation will increasingly underpin future progress in multimodal learning.