ShareGPT4V: Multi-modal Data & Model
- ShareGPT4V is a comprehensive multi-modal dataset and model architecture designed to facilitate robust vision-language alignment through highly descriptive, semantic captions.
- The resource scales from an initial 100K detailed GPT-4 Vision captions to 1.2M via the Share-Captioner, providing extensive semantic grounding across diverse imagery.
- Integration of ShareGPT4V in pre-training and supervised fine-tuning yields measurable performance improvements on benchmarks and enhances visual reasoning in multi-modal models.
ShareGPT4V is a large-scale multi-modal dataset and associated model architecture designed to advance modality alignment and benchmark performance in vision-language systems through highly descriptive, information-dense image captions. Originating from 100K captions generated with GPT-4 Vision and extended to 1.2 million samples via the Share-Captioner model, ShareGPT4V is both a resource providing rich semantic grounding across diverse imagery and an architectural and algorithmic basis for pre-training and supervised fine-tuning of open and closed large multi-modal models. Its introduction has set new standards for image-text alignment, benchmark accuracy, and downstream utility in fields ranging from image retrieval and open-ended captioning to vision-based Q&A and multimodal reasoning.
1. Dataset Composition and Captioning Engine
ShareGPT4V comprises an initial 100K curated image-caption pairs generated directly by GPT-4 Vision on diverse images drawn from sources such as COCO, LAION, Conceptual Captions, SAM, TextCaps, WikiArt, and web-crawled content (e.g., landmarks, celebrities). Each caption is roughly an order of magnitude longer than in prior caption datasets (mean lengths: ~942 characters for the 100K subset, ~826 for the 1.2M set), explicitly incorporating world knowledge, object properties, spatial relations, aesthetic evaluations, and factual detail that extend far beyond standard visual description. By training the "Share-Captioner" model on this subset, the dataset was scaled to ShareGPT4V-PT (1.2M captions), whose generated captions maintain semantic richness and diversity comparable to their GPT-4V origins.
Table: ShareGPT4V Dataset Structure

| Subset | Generation Source | Number of Captions | Avg. Caption Length |
|---|---|---|---|
| ShareGPT4V (100K) | Direct via GPT-4V | 100,000 | ~942 characters |
| ShareGPT4V-PT | Share-Captioner (trained on the 100K subset) | 1,200,000 | ~826 characters |
Captions in ShareGPT4V are notable for depth and factual coverage; for example, an image of the Eiffel Tower might receive a caption not only describing architecture but also historical significance, context, and visual attributes. This semantic load is pivotal for vision-language alignment and benchmark gains.
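The released data follows a LLaVA-style conversation JSON layout. The sketch below, which assumes that layout (the file name and field names such as `conversations`, `from`, and `value` are illustrative assumptions), shows how records can be inspected and how the average-caption-length statistic in the table above could be reproduced.

```python
import json

# Minimal sketch: inspect ShareGPT4V-style records and recompute the
# average-caption-length statistic. Assumes a LLaVA-style conversation
# JSON layout; file name and field names are illustrative.
def caption_stats(path: str) -> float:
    with open(path, "r", encoding="utf-8") as f:
        records = json.load(f)          # list of {"id", "image", "conversations"}

    lengths = []
    for rec in records:
        # The detailed caption is assumed to be stored as the model ("gpt")
        # turn; the human turn holds the describe-this-image prompt.
        caption = next(
            turn["value"] for turn in rec["conversations"] if turn["from"] == "gpt"
        )
        lengths.append(len(caption))    # character count, matching the table above

    return sum(lengths) / len(lengths)

if __name__ == "__main__":
    # Hypothetical file name standing in for the 100K caption release.
    avg = caption_stats("sharegpt4v_cap100k.json")
    print(f"avg caption length: {avg:.0f} chars")
```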
2. Impact on Multi-Modal Model Training and Evaluation
In supervised fine-tuning (SFT), ShareGPT4V captions are used to partially or wholly replace the existing SFT caption data of several competitive LMMs, including LLaVA-7B, LLaVA-1.5-13B, and Qwen-VL-Chat-7B. Whether replacing 3.5% or 14.5% of the SFT data, the models show marked improvements on the perception and cognition subtasks of MME and on MMBench:
- LLaVA-7B: up to +222.8 on MME perception and +2.7% on MMBench
- LLaVA-1.5-13B and Qwen-VL-Chat-7B: gains of +22.0 and +22.3 on MME, and +1.3% and +1.5% on MMBench, respectively
These improvements emphasize the role of caption quality and diversity in driving representation alignment.
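A minimal sketch of the replacement strategy above: a fixed fraction of an existing SFT caption set is swapped for ShareGPT4V samples before fine-tuning. The function name, random-subset selection, and default fraction are illustrative and not the authors' exact pipeline.

```python
import random

def replace_caption_fraction(sft_records, sharegpt4v_records, fraction=0.035, seed=0):
    """Swap a fraction of an existing SFT caption set for ShareGPT4V samples.

    Sketch of the data-replacement experiment described above (e.g. ~3.5%
    of the SFT captions); not the authors' exact pipeline.
    """
    rng = random.Random(seed)
    n_replace = min(int(len(sft_records) * fraction), len(sharegpt4v_records))

    # Drop a random subset of the original caption samples ...
    keep = rng.sample(range(len(sft_records)), len(sft_records) - n_replace)
    mixed = [sft_records[i] for i in keep]

    # ... and substitute the same number of ShareGPT4V samples.
    mixed.extend(rng.sample(sharegpt4v_records, n_replace))
    rng.shuffle(mixed)
    return mixed
```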
3. Pre-Training and Supervised Fine-Tuning Protocols
ShareGPT4V-PT data is used for large-scale vision-language alignment beyond traditional fixed-encoder SFT protocols. Fine-tuning includes:
- Unlocking final blocks of the vision encoder
- Training the two-layer MLP projector
- Jointly updating LLM parameters
Optimization is performed using a uniform learning rate (2×10⁻⁵); pre-training proceeds for ~4,700 steps and SFT for ~5,200. Pre-training minimizes the standard autoregressive caption-likelihood objective jointly over the trainable modules:

$$
\min_{\theta_v,\,\theta_p,\,\theta_l}\;
\mathbb{E}_{(x,\,c)\sim\mathcal{D}_{\mathrm{PT}}}
\Big[-\sum_{t=1}^{|c|}\log p_{\theta_l}\big(c_t \mid c_{<t},\, f_{\theta_p}(g_{\theta_v}(x))\big)\Big]
$$

where $\theta_v$, $\theta_p$, and $\theta_l$ are the respective parameters of the vision encoder, MLP projector, and LLM; $g_{\theta_v}$ and $f_{\theta_p}$ denote the encoder and projector mappings; and $\mathcal{D}_{\mathrm{PT}}$ is the ShareGPT4V-PT pre-training set.
For SFT, only the projector and LLM are updated (vision encoder frozen). The protocol delivers high efficiency and modular adaptability.
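A minimal PyTorch-style sketch of the trainable-parameter selection described above: pre-training unlocks the final vision-encoder blocks together with the projector and LLM, while SFT keeps the vision encoder frozen, all at the uniform 2×10⁻⁵ learning rate. The module names (`vision_tower`, `mm_projector`, `language_model`) and the number of unlocked blocks are assumptions for illustration, not the released code.

```python
import torch

def build_optimizer(model, stage: str, lr: float = 2e-5, unlock_last_n: int = 12):
    """Select trainable parameters for ShareGPT4V-style pre-training vs. SFT.

    Sketch under assumed module names (`vision_tower`, `mm_projector`,
    `language_model`); the number of unlocked vision blocks is illustrative.
    """
    # Freeze everything, then re-enable the modules trained in this stage.
    for p in model.parameters():
        p.requires_grad = False

    # The projector and LLM are updated in both pre-training and SFT.
    trainable = list(model.mm_projector.parameters()) + \
                list(model.language_model.parameters())

    if stage == "pretrain":
        # Pre-training additionally unlocks the final vision-encoder blocks.
        for block in model.vision_tower.blocks[-unlock_last_n:]:
            trainable += list(block.parameters())
    # For SFT the vision encoder stays frozen, so nothing else is added.

    for p in trainable:
        p.requires_grad = True

    # Uniform learning rate across all trainable parameters.
    return torch.optim.AdamW(trainable, lr=lr)
```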
4. Model Architecture: ShareGPT4V-7B
The resulting ShareGPT4V-7B model leverages:
- CLIP-Large vision encoder (input resolution $336\times336$, patch size $14$, outputting $576$ tokens)
- A lightweight 2-layer MLP for projection
- Vicuna-v1.5 (LLaMA2-derived) LLM, 7B parameters
Despite architectural simplicity, across 11 benchmarks ShareGPT4V-7B performs at or above larger/more complex LMMs, demonstrating that data quality is as critical as model capacity.
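The data flow of this architecture can be summarized in a short sketch: 576 CLIP-L/14 tokens (1024-d features at 336 px input) pass through a two-layer MLP into the 4096-d Vicuna embedding space and are concatenated with the text embeddings. The wiring and dimensions below are a simplified illustration consistent with the components named above, not the released implementation.

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Two-layer MLP mapping CLIP-L features (1024-d) to the LLM width (4096-d).

    Simplified illustration of the projector described above.
    """
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # (batch, 576, 1024) -> (batch, 576, 4096)
        return self.net(visual_tokens)

# 576 tokens = (336 / 14) ** 2 patches from the CLIP-L/14 encoder at 336 px.
projector = MLPProjector()
visual_tokens = torch.randn(1, 576, 1024)   # stand-in for CLIP features
text_embeds = torch.randn(1, 32, 4096)      # stand-in for Vicuna text embeddings
llm_inputs = torch.cat([projector(visual_tokens), text_embeds], dim=1)
print(llm_inputs.shape)                     # torch.Size([1, 608, 4096])
```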
5. Evaluation, Limitations, and Emergent Properties
Benchmark results confirm that integrating ShareGPT4V data in either pre-training or SFT phases yields measurable accuracy gains, robust captioning, and improved open-ended reasoning. These effects are exhibited in both closed-set benchmarks (MME, MMBench) and open-domain tasks. Notably, emergent abilities in abstract reasoning, factual recall, and object/property inference are attributed to the depth of semantic content within ShareGPT4V captions.
Limitations remain, as multimodal models still struggle in fine-grained discrimination tasks (see D3 benchmark (Gaur et al., 23 Sep 2024)), indicating room for improvement in detailed visual reasoning and caption-level self-retrieval.
6. Public Availability and Impact on the Field
Both the ShareGPT4V dataset and related models (including the codebase) are publicly released at https://ShareGPT4V.github.io, facilitating reproducibility and data-centric research. The transparency and detail of the captions establish new standards for image-text paired resources, offering templates for future datasets emphasizing world knowledge, spatial and aesthetic nuance, and factual context.
This approach has catalyzed research in multi-modal learning, video-language alignment (as in ShareGPT4Video (Chen et al., 6 Jun 2024)), contextual captioning (VisCon-100K (Kumar et al., 14 Feb 2025)), and safety alignment through modality-gap reduction (ReGap regularization (Yang et al., 30 May 2025)).
7. Outlook and Research Directions
ShareGPT4V’s impact extends into multiple trajectories:
- Data-centric multi-modal alignment, prioritizing caption diversity/quality over scale alone
- Refinement of benchmark evaluation, including self-retrieval and discriminative captioning
- Safety alignment in LVLMs, leveraging modality, caption structure, and regularization
- Flexible adaptation to downstream tasks (captioning, retrieval, annotation, open-ended Q&A)
- Public open access to large-scale, semantically-rich datasets for future model development
A plausible implication is that the ShareGPT4V approach—where high-quality, diverse captions serve as alignment anchors—will be pivotal in both closed- and open-domain multimodal research, influencing model architecture design, pre-training regimes, and evaluation methodologies for years to come.