ShareGPT4V: Multi-modal Data & Model

Updated 7 August 2025
  • ShareGPT4V is a comprehensive multi-modal dataset and model architecture designed to facilitate robust vision-language alignment through highly descriptive, semantic captions.
  • The resource scales from an initial 100K detailed GPT-4 Vision captions to 1.2M via the Share-Captioner, providing extensive semantic grounding across diverse imagery.
  • Integration of ShareGPT4V in pre-training and supervised fine-tuning yields measurable performance improvements on benchmarks and enhances visual reasoning in multi-modal models.

ShareGPT4V is a large-scale multi-modal dataset and associated model architecture designed to advance modality alignment and benchmark performance in vision-language systems through highly descriptive, information-dense image captions. Originating from 100K captions generated with GPT-4 Vision and extended via a caption model to 1.2 million samples, ShareGPT4V is both a resource—providing rich semantic grounding across diverse imagery—and an architectural/algorithmic basis for supervised fine-tuning and pre-training in open and closed large multi-modal models. Its introduction has set new standards for image-text alignment, benchmark accuracy, and downstream utility in fields ranging from image retrieval and open-ended captioning to vision-based Q&A and multimodal reasoning.

1. Dataset Composition and Captioning Engine

ShareGPT4V comprises an initial 100K curated image-caption pairs generated directly by GPT-4 Vision on diverse images drawn from sources such as COCO, LAION, Conceptual Captions, SAM (Segment Anything) imagery, TextCaps, WikiArt, and web-crawled content (e.g., landmarks, celebrities). Each caption is roughly an order of magnitude longer than those in prior caption datasets (mean lengths: ~942 characters for the 100K subset, ~826 for the 1.2M subset), explicitly incorporating world knowledge, object properties, spatial relations, aesthetic evaluations, and factual detail that extend far beyond standard visual description. By training the "Share-Captioner" model on this subset, the authors scaled the dataset to ShareGPT4V-PT (1.2M captions), with generated captions maintaining semantic richness and diversity comparable to their GPT-4V origins.

Table: ShareGPT4V Dataset Structure

| Subset | Generation Source | Number of Captions | Avg. Caption Length |
|---|---|---|---|
| 100K | Direct via GPT-4V | 100,000 | ~942 characters |
| ShareGPT4V-PT | Share-Captioner (trained on the 100K subset) | 1,200,000 | ~826 characters |

Captions in ShareGPT4V are notable for their depth and factual coverage; for example, a caption for an image of the Eiffel Tower might describe not only the architecture but also its historical significance, context, and visual attributes. This semantic density is pivotal for vision-language alignment and for the benchmark gains reported below.
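
As a concrete illustration of the caption-length statistics in the table above, the following sketch loads a ShareGPT4V-style annotation file and computes the mean caption length. The file name and the LLaVA-style record layout (an `image` field plus `conversations` with a `gpt` turn holding the caption) are assumptions for illustration, not a guaranteed schema.

```python
import json
from statistics import mean

# Assumed record layout (LLaVA-style conversations), e.g.:
# [{"id": ..., "image": "coco/train2017/....jpg",
#   "conversations": [{"from": "human", "value": "<image>\nDescribe ..."},
#                     {"from": "gpt", "value": "<detailed caption>"}]}, ...]
ANNOTATION_FILE = "sharegpt4v_cap100k.json"  # hypothetical path

def caption_lengths(path: str) -> list[int]:
    """Return character lengths of the GPT-side captions in the file."""
    with open(path, "r", encoding="utf-8") as f:
        records = json.load(f)
    lengths = []
    for rec in records:
        # The detailed caption is assumed to be the first "gpt" turn.
        for turn in rec.get("conversations", []):
            if turn.get("from") == "gpt":
                lengths.append(len(turn["value"]))
                break
    return lengths

if __name__ == "__main__":
    lengths = caption_lengths(ANNOTATION_FILE)
    print(f"{len(lengths)} captions, mean length ~{mean(lengths):.0f} characters")
```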

2. Impact on Multi-Modal Model Training and Evaluation

In supervised fine-tuning (SFT), ShareGPT4V captions partially or wholly replace the caption portion of existing SFT datasets for several competitive LMMs, including LLaVA-7B, LLaVA-1.5-13B, and Qwen-VL-Chat-7B. Whether the replacement covers 3.5% or 14.5% of the SFT data, the models show marked performance improvements on both the perception and cognition subtasks of MME and on MMBench:

  • LLaVA-7B: up to +222.8 on MME perception and +2.7% on MMBench
  • LLaVA-1.5-13B and Qwen-VL-Chat-7B: gains of +22.0 and +22.3 on MME, and +1.3% and +1.5% on MMBench, respectively

These improvements emphasize the role of caption quality and diversity in driving representation alignment.
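
A minimal sketch of the replacement experiment described above: swap a fixed fraction of an existing SFT caption set for ShareGPT4V records before fine-tuning. The 3.5% figure follows the description here; the function, file names, and the assumption that both files share a record schema are illustrative only.

```python
import json
import random

def replace_fraction(sft_records, sharegpt4v_records, fraction=0.035, seed=0):
    """Replace `fraction` of the SFT caption records with ShareGPT4V records.

    Assumes both lists use the same record schema so entries can be mixed
    freely; this mirrors the ~3.5% replacement setting described above,
    not an official recipe.
    """
    rng = random.Random(seed)
    n_replace = int(len(sft_records) * fraction)
    kept = rng.sample(sft_records, len(sft_records) - n_replace)
    injected = rng.sample(sharegpt4v_records, min(n_replace, len(sharegpt4v_records)))
    mixed = kept + injected
    rng.shuffle(mixed)
    return mixed

# Illustrative usage with hypothetical file names:
# sft = json.load(open("llava_sft_665k.json"))
# share = json.load(open("sharegpt4v_cap100k.json"))
# json.dump(replace_fraction(sft, share, fraction=0.035),
#           open("llava_sft_mixed.json", "w"))
```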

3. Pre-Training and Supervised Fine-Tuning Protocols

ShareGPT4V-PT data is used for large-scale vision-language alignment that goes beyond traditional frozen-encoder protocols. The pre-training stage involves:

  • Unlocking final blocks of the vision encoder
  • Training the two-layer MLP projector
  • Jointly updating LLM parameters

Optimization uses a uniform learning rate of 2×10⁻⁵; pre-training runs for ~4700 steps and SFT for ~5200. Pre-training optimizes the following joint objective:

$$(\theta^*, \phi^*, \psi^*) = \arg\min_{\theta,\phi,\psi} \sum_{(I,T) \in D_{pt}} L\big(\psi(\phi(\text{Encoder}(I;\theta)), T)\big)$$

where $\theta$, $\phi$, and $\psi$ are the parameters of the vision encoder, MLP projector, and LLM, respectively, and $D_{pt}$ is the ShareGPT4V pre-training set.

For SFT, only the projector and LLM are updated (vision encoder frozen). The protocol delivers high efficiency and modular adaptability.
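
The trainable-parameter split described in this section can be sketched in PyTorch-style code as below. The module names (`vision_encoder`, `projector`, `llm`) and the block-count cutoff are placeholders; only the overall scheme follows the text (unlock the final encoder blocks plus projector plus LLM for pre-training, freeze the encoder for SFT, uniform learning rate of 2×10⁻⁵).

```python
import torch

def configure_trainable(model, stage: str, n_unlocked_blocks: int = 12):
    """Freeze/unfreeze parameter groups for ShareGPT4V-style training.

    `model` is assumed to expose `vision_encoder` (with a `.blocks` list),
    `projector` (2-layer MLP), and `llm`; these attribute names are
    illustrative, not the released implementation.
    """
    # Start with everything frozen, then re-enable per stage.
    for p in model.parameters():
        p.requires_grad = False

    if stage == "pretrain":
        # Unlock only the final transformer blocks of the vision encoder.
        for block in model.vision_encoder.blocks[-n_unlocked_blocks:]:
            for p in block.parameters():
                p.requires_grad = True

    # In both stages the projector and LLM are updated.
    for p in model.projector.parameters():
        p.requires_grad = True
    for p in model.llm.parameters():
        p.requires_grad = True

    trainable = [p for p in model.parameters() if p.requires_grad]
    # Uniform learning rate as described above (2e-5).
    return torch.optim.AdamW(trainable, lr=2e-5)

# optimizer = configure_trainable(model, stage="pretrain")  # ~4700 steps
# optimizer = configure_trainable(model, stage="sft")       # ~5200 steps, encoder frozen
```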

4. Model Architecture: ShareGPT4V-7B

The resulting ShareGPT4V-7B model leverages:

  • CLIP-Large vision encoder (input resolution $336 \times 336$, patch size $14$, producing $576$ visual tokens)
  • A lightweight 2-layer MLP for projection
  • Vicuna-v1.5 (LLaMA2-derived) LLM, 7B parameters

Despite its architectural simplicity, ShareGPT4V-7B performs at or above larger and more complex LMMs across 11 benchmarks, demonstrating that data quality is as critical as model capacity.
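
The component stack can be written down in a few lines. The sketch below only reproduces the shapes implied by the text (a 336×336 input with patch size 14 gives a 24×24 grid, i.e. 576 visual tokens, projected by a two-layer MLP into the LLM embedding space); the hidden sizes and module names are assumptions, not the released configuration.

```python
import torch
import torch.nn as nn

# Shape bookkeeping from the text: 336 / 14 = 24 patches per side -> 576 tokens.
IMAGE_SIZE, PATCH_SIZE = 336, 14
NUM_VISUAL_TOKENS = (IMAGE_SIZE // PATCH_SIZE) ** 2  # 576

# Assumed dimensions for illustration: CLIP-Large hidden size 1024,
# Vicuna-7B embedding size 4096.
VISION_DIM, LLM_DIM = 1024, 4096

class MLPProjector(nn.Module):
    """Two-layer MLP mapping visual tokens into the LLM embedding space."""
    def __init__(self, in_dim: int = VISION_DIM, out_dim: int = LLM_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, 576, VISION_DIM) -> (batch, 576, LLM_DIM)
        return self.net(visual_tokens)

if __name__ == "__main__":
    projector = MLPProjector()
    dummy = torch.randn(2, NUM_VISUAL_TOKENS, VISION_DIM)
    print(projector(dummy).shape)  # torch.Size([2, 576, 4096])
```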

5. Evaluation, Limitations, and Emergent Properties

Benchmark results confirm that integrating ShareGPT4V data in either pre-training or SFT phases yields measurable accuracy gains, robust captioning, and improved open-ended reasoning. These effects are exhibited in both closed-set benchmarks (MME, MMBench) and open-domain tasks. Notably, emergent abilities in abstract reasoning, factual recall, and object/property inference are attributed to the depth of semantic content within ShareGPT4V captions.

Limitations remain, as multimodal models still struggle with fine-grained discrimination tasks (see the D3 benchmark (Gaur et al., 23 Sep 2024)), indicating room for improvement in detailed visual reasoning and caption-level self-retrieval.

6. Public Availability and Impact on the Field

Both the ShareGPT4V dataset and related models (including the codebase) are publicly released at https://ShareGPT4V.github.io, facilitating reproducibility and data-centric research. The transparency and detail of the captions establish new standards for image-text paired resources, offering templates for future datasets emphasizing world knowledge, spatial and aesthetic nuance, and factual context.

This approach has catalyzed research in multi-modal learning, video-language alignment (as in ShareGPT4Video (Chen et al., 6 Jun 2024)), contextual captioning (VisCon-100K (Kumar et al., 14 Feb 2025)), and safety alignment through modality-gap reduction (ReGap regularization (Yang et al., 30 May 2025)).

7. Outlook and Research Directions

ShareGPT4V’s impact extends into multiple trajectories:

  • Data-centric multi-modal alignment, prioritizing caption diversity/quality over scale alone
  • Refinement of benchmark evaluation, including self-retrieval and discriminative captioning
  • Safety alignment in LVLMs, leveraging modality, caption structure, and regularization
  • Flexible adaptation to downstream tasks (captioning, retrieval, annotation, open-ended Q&A)
  • Public open access to large-scale, semantically-rich datasets for future model development

A plausible implication is that the ShareGPT4V approach—where high-quality, diverse captions serve as alignment anchors—will be pivotal in both closed- and open-domain multimodal research, influencing model architecture design, pre-training regimes, and evaluation methodologies for years to come.