
From Pixels to Prose: A Large Dataset of Dense Image Captions (2406.10328v1)

Published 14 Jun 2024 in cs.CV, cs.CL, and cs.LG

Abstract: Training large vision-language models requires extensive, high-quality image-text pairs. Existing web-scraped datasets, however, are noisy and lack detailed image descriptions. To bridge this gap, we introduce PixelProse, a comprehensive dataset of over 16M (million) synthetically generated captions, leveraging cutting-edge vision-language models for detailed and accurate descriptions. To ensure data integrity, we rigorously analyze our dataset for problematic content, including child sexual abuse material (CSAM), personally identifiable information (PII), and toxicity. We also provide valuable metadata such as watermark presence and aesthetic scores, aiding in further dataset filtering. We hope PixelProse will be a valuable resource for future vision-language research. PixelProse is available at https://huggingface.co/datasets/tomg-group-umd/pixelprose

Citations (10)

Summary

  • The paper introduces PixelProse, a dataset of over 16M dense image captions generated with the Google Gemini 1.0 Pro Vision model using five diverse prompts.
  • It addresses the limitations of sparse, noisy alt-text by providing extensive detail on objects, spatial relations, and background attributes.
  • Rigorous quality assurance and ethical filtering reduce the share of captions flagged as toxic to 0.13%, providing reliable data for advanced vision-language applications.

An Overview of "From Pixels to Prose: A Large Dataset of Dense Image Captions"

"From Pixels to Prose: A Large Dataset of Dense Image Captions" presents PixelProse, a novel dataset aimed at addressing significant gaps in current vision-LLM training data. Developed by Singla et al., PixelProse comprises over 16 million synthetically generated image captions using the Google Gemini 1.0 Pro Vision Model. This intricate and detailed dataset represents a leap forward in the quality of image-text pairs, potentially enabling enhanced performance in various vision-LLM (VLM) applications.

Motivation and Dataset Composition

The authors identify a fundamental bottleneck in current vision-language research: the reliance on noisy, web-scraped datasets like LAION and CommonCrawl, which are populated with sparse and often irrelevant alt-text labels. These existing datasets fail to provide the necessary granularity, particularly when it comes to background details, object attributes, and spatial relations. This deficiency has curtailed the performance of open-source models in comparison to their commercial counterparts, which benefit from carefully curated datasets. PixelProse addresses these limitations by offering dense, meticulous captions covering a wide array of image properties.

Data Generation and Sources

PixelProse aggregates over 16 million images from three primary sources: CommonPool, CC12M, and RedCaps. Each source contributes diverse content to the dataset:

  • CommonPool: Provides a broad range of images with extensive metadata, albeit with variable quality.
  • CC12M: Offers higher image quality through a more stringent curation pipeline.
  • RedCaps: Curated from Reddit, this subset includes high-quality images with non-descriptive captions, necessitating re-captioning for detailed annotations.

To generate high-quality captions, the authors built a captioning pipeline around the Google Gemini 1.0 Pro Vision model. Five distinct prompts were used to elicit diverse, context-rich descriptions, ensuring comprehensive coverage of positional, textual, and stylistic attributes within the images.
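As a concrete illustration, the minimal Python sketch below shows what such a captioning loop could look like using the public google-generativeai SDK. The five prompts used by the authors are not reproduced here; the PROMPTS list, the API key placeholder, and the example image path are all illustrative assumptions.

```python
# Minimal sketch of a PixelProse-style dense-captioning loop.
# Assumptions: the PROMPTS list is illustrative (not the paper's five prompts),
# and the API key / "example.jpg" are placeholders.
import random

import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder credential
model = genai.GenerativeModel("gemini-pro-vision")  # Gemini 1.0 Pro Vision

# Illustrative prompts targeting objects, visible text, and style/background.
PROMPTS = [
    "Describe this image in detail, including objects, their attributes, "
    "and their spatial relationships.",
    "Describe this image in detail, transcribing any visible text verbatim.",
    "Describe this image in detail, commenting on style, lighting, and background.",
]

def caption_image(path: str) -> str:
    """Generate one dense caption for the image at `path`."""
    image = Image.open(path)
    prompt = random.choice(PROMPTS)  # the paper samples among five prompts
    response = model.generate_content([prompt, image])
    return response.text

if __name__ == "__main__":
    print(caption_image("example.jpg"))  # hypothetical local image
```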

Quality Assurance and Ethical Considerations

The creation of PixelProse involved meticulous quality control and ethical scrutiny. The dataset is rigorously vetted to exclude problematic content such as CSAM, PII, and toxic language. Notably, the authors employed multiple filtering mechanisms, including PhotoDNA, Google Vision API, and the Gemini API. These tools collectively ensure that the dataset adheres to high ethical standards.

The paper provides compelling evidence of PixelProse’s reduced toxicity compared to existing datasets, with only 0.13% of the captions flagged as toxic using Detoxify. This marks a significant improvement in dataset safety and reliability, critical for preventing harmful model outputs in real-world applications.
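For readers who want to run a comparable check on their own captions, the sketch below applies Detoxify and flags any caption whose category scores exceed a threshold. The 0.5 cutoff is an assumption for illustration, not the authors' exact criterion.

```python
# Minimal sketch of caption-level toxicity screening with Detoxify.
# The 0.5 threshold is an illustrative assumption, not the paper's cutoff.
from detoxify import Detoxify

detector = Detoxify("original")

def is_toxic(caption: str, threshold: float = 0.5) -> bool:
    """Flag a caption if any Detoxify category score exceeds the threshold."""
    scores = detector.predict(caption)  # dict of category -> score
    return any(score > threshold for score in scores.values())

captions = [
    "A golden retriever lying on a sunlit porch next to a blue ceramic bowl.",
]
flagged = [c for c in captions if is_toxic(c)]
print(f"{len(flagged)}/{len(captions)} captions flagged as toxic")
```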

Linguistic and Contextual Richness

PixelProse significantly surpasses existing datasets in terms of linguistic richness and contextual detail. On average, the generated captions contain 506 characters, considerably longer than the 101 characters typically found in the original captions. This enhanced verbosity translates to approximately 1.7 billion tokens, underscoring the dataset's comprehensive nature.

Moreover, the dataset demonstrates remarkable noun diversity, featuring a broader vocabulary than other leading datasets such as ALLaVA and ShareGPT4V. This is particularly beneficial for vision-language tasks that require nuanced object recognition and description.
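These statistics are straightforward to spot-check on a sample of the released data. The sketch below streams a small slice of PixelProse from the Hugging Face Hub and computes the average caption length; the split name and the caption column ("vlm_caption") are assumptions, so consult the dataset card for the actual schema.

```python
# Minimal sketch: estimate average caption length on a streamed sample.
# Assumptions: the split name and the "vlm_caption" column; see the dataset card.
from datasets import load_dataset

ds = load_dataset("tomg-group-umd/pixelprose", split="train", streaming=True)

total_chars, n = 0, 0
for example in ds.take(1_000):  # small sample for illustration
    total_chars += len(example["vlm_caption"])
    n += 1

print(f"Average caption length over {n} samples: {total_chars / n:.0f} characters")
```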

Practical Implications and Applications

PixelProse is designed for versatility and can be refactored into multiple formats to serve various applications, including pre-training, image captioning, and visual question answering (VQA). The dataset's dense captions lend themselves to refactoring by LLMs into high-quality VQA pairs, instruction-tuning data, and other formats essential for advanced vision-language research.
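One way such refactoring could work in practice is sketched below: a dense caption is passed to a text LLM with a template asking for question-answer pairs. The prompt template and the choice of the Gemini text model are illustrative assumptions; the paper does not prescribe a specific refactoring recipe.

```python
# Minimal sketch of refactoring a dense caption into VQA pairs with an LLM.
# The prompt template and model choice are illustrative assumptions.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder credential
llm = genai.GenerativeModel("gemini-pro")

REFACTOR_PROMPT = """Given the following dense image caption, write three
question-answer pairs that could be answered by looking only at the image.

Caption: {caption}
"""

def caption_to_vqa(caption: str) -> str:
    """Ask the LLM to turn one dense caption into VQA-style pairs."""
    response = llm.generate_content(REFACTOR_PROMPT.format(caption=caption))
    return response.text

dense_caption = (
    "A red vintage bicycle leans against a brick wall; a wicker basket on the "
    "handlebars holds sunflowers, and a 'CAFE OPEN' sign hangs in the window."
)
print(caption_to_vqa(dense_caption))
```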

Future Directions

The authors hint at numerous avenues for future exploration. Given PixelProse’s detailed captions, researchers can investigate more effective methods for refactoring dense captions into specific instruction sets. Additionally, the dataset's rich annotations can drive advancements in multimodal applications, including but not limited to image generation, object recognition, and contextual scene understanding.

Conclusion

PixelProse represents a substantial advancement in the quality and utility of image caption datasets. By providing richly detailed, contextually comprehensive captions, it addresses critical deficiencies in current vision-language model training data. It promises to be a valuable resource for future research, helping open-source models approach the performance previously reserved for commercial systems. The authors' rigorous approach to quality assurance and ethical screening further positions PixelProse as a model of responsible dataset curation.