ShareGPT4V: Multi-modal Data & Model

Updated 7 August 2025
  • ShareGPT4V is a comprehensive multi-modal dataset and model architecture designed to facilitate robust vision-language alignment through highly descriptive, semantic captions.
  • The resource scales from an initial 100K detailed GPT-4 Vision captions to 1.2M via the Share-Captioner, providing extensive semantic grounding across diverse imagery.
  • Integration of ShareGPT4V in pre-training and supervised fine-tuning yields measurable performance improvements on benchmarks and enhances visual reasoning in multi-modal models.

ShareGPT4V is a large-scale multi-modal dataset and associated model architecture designed to advance modality alignment and benchmark performance in vision-language systems through highly descriptive, information-dense image captions. Starting from 100K captions generated with GPT-4 Vision and extended to 1.2 million samples via a dedicated caption model, ShareGPT4V serves both as a resource providing rich semantic grounding across diverse imagery and as an architectural and algorithmic basis for pre-training and supervised fine-tuning of open and closed large multi-modal models. Its introduction has set new standards for image-text alignment, benchmark accuracy, and downstream utility in tasks ranging from image retrieval and open-ended captioning to vision-based question answering and multimodal reasoning.

1. Dataset Composition and Captioning Engine

ShareGPT4V comprises an initial 100K curated image-caption pairs generated directly by GPT-4 Vision on diverse images drawn from sources such as COCO, LAION, Conceptual Captions, SAM (Segment Anything) imagery, TextCaps, WikiArt, and web-crawled content (e.g., landmarks, celebrities). Each caption is an order of magnitude longer than in prior benchmarks (mean lengths: ~942 characters for the 100K subset, ~826 for the 1.2M set), explicitly incorporating world knowledge, object properties, spatial relations, aesthetic evaluations, and factual detail that extend far beyond standard visual description. By training the "Share-Captioner" model on this subset, the authors scaled the dataset to ShareGPT4V-PT (1.2M captions), whose generated captions maintain semantic richness and diversity comparable to their GPT-4V origins.

Table: ShareGPT4V Dataset Structure

Subset          | Generation Source        | Number of Captions | Avg. Caption Length
100K            | Direct via GPT-4V        | 100,000            | ~942 characters
ShareGPT4V-PT   | Share-Captioner (100K)   | 1,200,000          | ~826 characters

Captions in ShareGPT4V are notable for depth and factual coverage; for example, an image of the Eiffel Tower might receive a caption not only describing architecture but also historical significance, context, and visual attributes. This semantic load is pivotal for vision-language alignment and benchmark gains.
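As a concrete illustration of the scale statistics above, the sketch below loads a release JSON file and recomputes the average caption length. It assumes a LLaVA-style record layout in which each entry carries an image path and a conversation whose "gpt" turn holds the detailed caption; the filename and field names are assumptions for illustration, not guaranteed by this article.

```python
import json
from statistics import mean

# Illustrative filename; substitute the actual ShareGPT4V release file.
PATH = "sharegpt4v_instruct_gpt4-vision_cap100k.json"

with open(PATH, "r", encoding="utf-8") as f:
    records = json.load(f)  # assumed: a list of LLaVA-style dicts

def caption_of(record):
    # Assumption: the detailed caption is the first "gpt" turn of the conversation.
    for turn in record.get("conversations", []):
        if turn.get("from") == "gpt":
            return turn["value"]
    return ""

lengths = [len(caption_of(r)) for r in records]
print(f"{len(records)} captions, mean length ~ {mean(lengths):.0f} characters")
# For the 100K GPT-4V subset this should land near the ~942-character figure cited above.
```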

2. Impact on Multi-Modal Model Training and Evaluation

In supervised fine-tuning (SFT), ShareGPT4V captions partially or wholly replace existing SFT caption data for several competitive LMMs, including LLaVA-7B, LLaVA-1.5-13B, and Qwen-VL-Chat-7B. Whether ShareGPT4V accounts for 3.5% or 14.5% of the SFT data, models show marked performance improvements on both the perception and cognition subtasks of MME and MMBench:

  • LLaVA-7B: up to +222.8 on MME perception, +2.7% on MMBench
  • LLaVA-1.5-13B and Qwen-VL-Chat-7B: +22.0 and +22.3 on MME, and +1.3% and +1.5% on MMBench, respectively

These improvements emphasize the role of caption quality and diversity in driving representation alignment.
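The replacement recipe can be made concrete with a short sketch: swap a chosen fraction of an existing SFT caption set for ShareGPT4V samples before fine-tuning. The function and its handling of records are illustrative and not taken from the released training code.

```python
import random

def mix_sft_data(base_sft, sharegpt4v, replace_frac=0.035, seed=0):
    """Replace a fraction of an existing SFT caption set with ShareGPT4V samples.

    base_sft, sharegpt4v: lists of training records (any dict schema).
    replace_frac: e.g. 0.035 or 0.145, matching the 3.5% / 14.5% settings above.
    """
    rng = random.Random(seed)
    n_replace = int(len(base_sft) * replace_frac)
    keep = rng.sample(base_sft, len(base_sft) - n_replace)
    inject = rng.sample(sharegpt4v, min(n_replace, len(sharegpt4v)))
    mixed = keep + inject
    rng.shuffle(mixed)
    return mixed
```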

3. Pre-Training and Supervised Fine-Tuning Protocols

ShareGPT4V-PT data is used for large-scale vision-language alignment that goes beyond traditional fixed-encoder protocols. The pre-training stage involves:

  • Unlocking final blocks of the vision encoder
  • Training the two-layer MLP projector
  • Jointly updating LLM parameters

Optimization uses a uniform learning rate of 2×10⁻⁵; pre-training runs for ~4700 steps and SFT for ~5200. Pre-training minimizes the following joint objective:

$$(\theta^*, \phi^*, \psi^*) = \arg\min_{\theta,\phi,\psi} \sum_{(I,T) \in D_{pt}} L\big(\psi(\phi(\mathrm{Encoder}(I;\theta)),\, T)\big)$$

where $\theta$, $\phi$, and $\psi$ are the parameters of the vision encoder, the MLP projector, and the LLM, respectively, and $D_{pt}$ is the ShareGPT4V pre-training set.

For SFT, only the projector and LLM are updated (vision encoder frozen). The protocol delivers high efficiency and modular adaptability.
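The two-stage protocol can be summarized with a hedged PyTorch sketch: during pre-training the final vision-encoder blocks, the MLP projector, and the LLM are trainable, while SFT keeps the encoder frozen; both stages use the uniform 2×10⁻⁵ learning rate. The submodule names (vision_tower, mm_projector, llm) and the default number of unlocked blocks are assumptions for illustration, not the released implementation.

```python
import torch

LR = 2e-5  # uniform learning rate for both stages (see above)

def configure_stage(model, stage="pretrain", unlock_last_n=2):
    """Set trainable parameters for ShareGPT4V-style pre-training or SFT.

    Assumes a LLaVA-like model exposing .vision_tower, .mm_projector and .llm
    submodules; attribute names and unlock_last_n are illustrative.
    """
    # Freeze everything, then selectively unlock.
    for p in model.parameters():
        p.requires_grad = False

    if stage == "pretrain":
        # Unlock the final transformer blocks of the vision encoder.
        for blk in list(model.vision_tower.blocks)[-unlock_last_n:]:
            for p in blk.parameters():
                p.requires_grad = True

    # The MLP projector and the LLM are trained in both stages.
    for p in model.mm_projector.parameters():
        p.requires_grad = True
    for p in model.llm.parameters():
        p.requires_grad = True

    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=LR)
```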

4. Model Architecture: ShareGPT4V-7B

The resulting ShareGPT4V-7B model leverages:

  • CLIP-Large vision encoder (336×336 input resolution, patch size 14, yielding 576 visual tokens)
  • A lightweight 2-layer MLP for projection
  • Vicuna-v1.5 (LLaMA2-derived) LLM, 7B parameters

Despite its architectural simplicity, ShareGPT4V-7B performs at or above larger and more complex LMMs across 11 benchmarks, demonstrating that data quality is as critical as model capacity.
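The token count follows directly from the encoder geometry: a 336×336 image with patch size 14 yields (336/14)² = 576 visual tokens, which the two-layer MLP projects into the LLM's embedding space. The sketch below illustrates this; the hidden sizes (1024 for CLIP-Large features, 4096 for the 7B LLM) are standard values assumed here rather than stated in this article.

```python
import torch
import torch.nn as nn

# Token count from encoder geometry: (336 / 14) ** 2 = 24 * 24 = 576.
IMAGE_SIZE, PATCH_SIZE = 336, 14
NUM_TOKENS = (IMAGE_SIZE // PATCH_SIZE) ** 2  # 576

# Assumed hidden sizes: 1024 for CLIP-Large features, 4096 for the 7B Vicuna LLM.
CLIP_DIM, LLM_DIM = 1024, 4096

class MLPProjector(nn.Module):
    """Two-layer MLP mapping vision tokens into the LLM embedding space."""
    def __init__(self, in_dim=CLIP_DIM, out_dim=LLM_DIM):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, vision_tokens):      # (batch, 576, 1024)
        return self.proj(vision_tokens)    # (batch, 576, 4096)

tokens = torch.randn(1, NUM_TOKENS, CLIP_DIM)
print(MLPProjector()(tokens).shape)  # torch.Size([1, 576, 4096])
```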

5. Evaluation, Limitations, and Emergent Properties

Benchmark results confirm that integrating ShareGPT4V data in either pre-training or SFT phases yields measurable accuracy gains, robust captioning, and improved open-ended reasoning. These effects are exhibited in both closed-set benchmarks (MME, MMBench) and open-domain tasks. Notably, emergent abilities in abstract reasoning, factual recall, and object/property inference are attributed to the depth of semantic content within ShareGPT4V captions.

Limitations remain: multimodal models still struggle with fine-grained discrimination tasks (see the D3 benchmark (Gaur et al., 23 Sep 2024)), indicating room for improvement in detailed visual reasoning and caption-level self-retrieval.
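Caption-level self-retrieval, one of the weaknesses noted above, can be probed with a simple check: embed generated captions and their source images with a shared encoder and measure how often each caption ranks its own image first. The sketch below uses an off-the-shelf CLIP model from Hugging Face transformers as an illustrative backbone; it is not the D3 benchmark's official protocol.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14-336")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14-336")

def self_retrieval_accuracy(images, captions):
    """Fraction of captions whose most similar image is their own (rank-1)."""
    # Note: long ShareGPT4V-style captions exceed CLIP's 77-token limit and are truncated.
    inputs = processor(text=captions, images=images,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_text: (num_captions, num_images) similarity matrix.
    preds = out.logits_per_text.argmax(dim=-1)
    targets = torch.arange(len(captions))
    return (preds == targets).float().mean().item()
```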

6. Public Availability and Impact on the Field

Both the ShareGPT4V dataset and related models (including the codebase) are publicly released at https://ShareGPT4V.github.io, facilitating reproducibility and data-centric research. The transparency and detail of the captions establish new standards for image-text paired resources, offering templates for future datasets emphasizing world knowledge, spatial and aesthetic nuance, and factual context.

This approach has catalyzed research in multi-modal learning, video-language alignment (as in ShareGPT4Video (Chen et al., 6 Jun 2024)), contextual captioning (VisCon-100K (Kumar et al., 14 Feb 2025)), and safety alignment through modality-gap reduction (ReGap regularization (Yang et al., 30 May 2025)).

7. Outlook and Research Directions

ShareGPT4V’s impact extends into multiple trajectories:

  • Data-centric multi-modal alignment, prioritizing caption diversity/quality over scale alone
  • Refinement of benchmark evaluation, including self-retrieval and discriminative captioning
  • Safety alignment in LVLMs, leveraging modality, caption structure, and regularization
  • Flexible adaptation to downstream tasks (captioning, retrieval, annotation, open-ended Q&A)
  • Public open access to large-scale, semantically-rich datasets for future model development

A plausible implication is that the ShareGPT4V approach—where high-quality, diverse captions serve as alignment anchors—will be pivotal in both closed- and open-domain multimodal research, influencing model architecture design, pre-training regimes, and evaluation methodologies for years to come.
