ScaleCap: Image Captioning Framework
- ScaleCap is an inference-time framework designed to generate comprehensive and accurate image captions by addressing multimodal and linguistic biases in LVLMs.
- It uses a dual-modality debiasing pipeline involving Heuristic Question Answering (HQA) for enrichment and Contrastive Sentence Rating (CSR) for factual correction, controllable by computational budget.
- ScaleCap-generated captions improve performance when pretraining LVLMs on various benchmarks and demonstrate data efficiency and hallucination reduction compared to alternative methods.
ScaleCap refers to an inference-time scalable image captioning framework designed to produce comprehensive, detailed, and accurate captions while addressing two central challenges of current large vision-language models (LVLMs): multimodal bias, which produces unbalanced and incomplete object descriptions, and linguistic bias, which leads to hallucinated or ungrounded content. Through a dual-modality debiasing pipeline comprising heuristic question answering (HQA) and contrastive sentence rating (CSR), ScaleCap enables progressive enrichment and factual correction of captions, directly controlled by the inference-time computational budget. The methodology produces data-efficient, modality-aligned, and information-rich captions with broad utility for pretraining LVLMs and evaluating visual semantic coverage (2506.19848).
1. Motivations and Challenges
LVLMs have advanced in generating image captions but often fall short in descriptive granularity and factual accuracy. These deficits are primarily attributed to:
- Multimodal bias, which results in over-description of salient or prominent objects and neglect of less salient elements. This creates captions of variable granularity, undermining the completeness of visual descriptions.
- Linguistic bias, manifesting as hallucination, whereby the model invents objects or attributes not present in the image, often driven by LLM priors or dataset co-occurrence statistics.
Traditional approaches, including tool-based pipelines (e.g., object detectors or taggers), partially address completeness but are insufficient in scale and flexibility. A scalable, training-free solution that operates at inference time is sought to maximize utility across datasets and downstream tasks.
2. Architecture: Heuristic Question Answering and Contrastive Sentence Rating
Heuristic Question Answering (HQA)
HQA progressively enriches image captions by generating and answering content-specific questions derived from the current caption. This process unfolds as follows:
- Initial Caption and Golden Sentence Extraction: The model generates an initial caption, from which "golden sentences"—atomic, object- or region-specific statements—are extracted.
- Instruction Generation: For each golden sentence, the framework leverages an LLM to produce targeted prompts, e.g., “Describe more details about the airplane,” as well as questions about spatial and relational context.
- Visual Question Answering: A compact LVLM answers these prompts, yielding object-appearance and spatial-relationship details.
- Integration: Iterative rounds, up to a user-specified question budget, progressively enrich the caption by incorporating the newly retrieved information (a minimal sketch of this loop follows).
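A minimal sketch of the HQA enrichment loop is given below. The helper callables (`extract_goldens`, `make_question`, `answer_vqa`) and the default budget value are illustrative assumptions standing in for the LLM and compact LVLM described above, not the released implementation.

```python
from typing import Callable, List

def hqa_enrich(
    image,
    initial_caption: str,
    extract_goldens: Callable[[str], List[str]],  # LLM: caption -> atomic "golden" sentences
    make_question: Callable[[str], str],          # LLM: golden sentence -> probing question
    answer_vqa: Callable[[object, str], str],     # compact LVLM: (image, question) -> answer
    budget: int = 20,                             # inference-time question budget (illustrative default)
) -> List[str]:
    """Progressively enrich a caption via heuristic question answering.

    Returns the initial caption plus one answered detail per question,
    stopping once the question budget is exhausted.
    """
    facts = [initial_caption]
    asked = 0
    for golden in extract_goldens(initial_caption):
        if asked >= budget:
            break
        question = make_question(golden)      # e.g. "Describe more details about the airplane."
        answer = answer_vqa(image, question)  # appearance / spatial-relationship details
        facts.append(answer)
        asked += 1
    return facts
```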
Contrastive Sentence Rating (CSR)
CSR aims to systematically purge hallucinated or ungrounded content at the sentence level, providing global coherence and correcting errors introduced by LLM priors:
- Token Probability Comparisons: For each sentence $s$, per-token probabilities are computed with the image, $p_\theta(t \mid v, s_{<t})$, and without the image, $p_\theta(t \mid s_{<t})$, using the LVLM.
- Contrastive Score: The per-token difference between the two passes is calculated, and a sentence is retained as a golden sentence if the maximum difference over its content tokens exceeds a specified threshold ($\tau$), indicating image-groundedness (see the sketch at the end of this subsection):

  $$\mathrm{score}(s) \;=\; \max_{t \in \mathcal{C}(s)} \Big[\, p_\theta(t \mid v, s_{<t}) - p_\theta(t \mid s_{<t}) \,\Big] \;>\; \tau,$$

  where $v$ is the image, $s_{<t}$ the tokens preceding $t$, and $\mathcal{C}(s)$ the set of content tokens in $s$.
- Offline, Sentence-level Filtering: This approach is favored over online/token-level methods, maintaining caption fluency and avoiding local incoherence.
All outputs are then summarized into a cohesive and informative caption using a large-scale LLM.
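A sentence-level filter implementing the contrastive rule above can be sketched as follows. It assumes the caller has already obtained per-token probabilities from the LVLM with and without the image; the data layout, helper names, and threshold value are illustrative assumptions.

```python
from typing import Dict, List

def csr_filter(
    sentences: List[str],
    probs_with_image: Dict[str, List[float]],     # sentence -> per-token probs conditioned on the image
    probs_without_image: Dict[str, List[float]],  # sentence -> per-token probs with the image removed
    content_mask: Dict[str, List[bool]],          # sentence -> True for content tokens (objects, attributes, ...)
    threshold: float = 0.1,                       # illustrative value for the grounding threshold
) -> List[str]:
    """Keep only sentences whose content tokens gain probability when the image is present."""
    kept = []
    for s in sentences:
        diffs = [
            p_img - p_txt
            for p_img, p_txt, is_content in zip(
                probs_with_image[s], probs_without_image[s], content_mask[s]
            )
            if is_content
        ]
        # A sentence counts as image-grounded if at least one content token is
        # markedly more likely with the image than without it.
        if diffs and max(diffs) > threshold:
            kept.append(s)
    return kept
```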
3. Debiasing and Progressive Enrichment Mechanism
ScaleCap’s design explicitly debiases captions by:
- Addressing Multimodal Bias: Iterative question answering uncovers and integrates overlooked or under-described elements, yielding balanced, granular object, attribute, and spatial relationship coverage.
- Mitigating Linguistic Bias: CSR removes linguistically plausible but visually unsupported content, safeguarding factual precision and image-alignment.
- Scalability: The process is governed by an inference-time scale budget (the number of heuristic questions asked), affording direct user control over caption richness and inference cost; a sketch of the overall loop follows this list. Results indicate diminishing returns after approximately 20 questions for most images, at which point completeness saturates.
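Putting the stages together, the inference-time procedure can be summarized as a single budget-controlled function. The callables `caption_model`, `enrich`, `debias`, and `summarize` are illustrative stand-ins for the LVLM captioner, the HQA loop and CSR filter sketched above, and the summarizing LLM.

```python
def scalecap_caption(image, budget: int, caption_model, enrich, debias, summarize) -> str:
    """Budget-controlled captioning: enrich with HQA, debias with CSR, then summarize."""
    draft = caption_model(image)                 # initial caption from the LVLM
    facts = enrich(image, draft, budget=budget)  # HQA: up to `budget` questions add grounded details
    grounded = debias(image, facts)              # CSR: drop sentences unsupported by the image
    return summarize(grounded)                   # large LLM fuses the facts into one coherent caption
```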
4. Empirical Evaluation and Modality Alignment
ScaleCap’s efficacy is substantiated through extensive experiments:
- Modality Alignment: Pretraining LVLMs on ScaleCap-generated captions (ScaleCap-450k) robustly improves performance on 11 major benchmarks, including InfoVQA, DocVQA, MathVista, MMVet, SEED, and AI2D.
- For Qwen2.5-7B, ScaleCap-450k yields a +4.3% improvement on InfoVQA over ShareGPT4V-450k, with additional gains on the other benchmarks.
- Data Efficiency: With equal pretraining volumes, ScaleCap-annotated datasets consistently outperform those generated by alternative SOTA methods.
- Hallucination Reduction: Evaluation using the CHAIR benchmark confirms that CSR outperforms rival contrastive decoding strategies in suppressing hallucinated content.
- Downstream Tasks:
- VQA Replacement: In the Prism framework, replacing the image with a ScaleCap-generated caption enables even small models to match or exceed large LVLMs answering from their own captions, demonstrating how much visual information the captions carry (a caption-only VQA call is sketched after this list).
- Image Reconstruction: Human evaluation of images generated from ScaleCap captions via a state-of-the-art generator shows closer alignment to original images compared to those based on captions from baselines such as GPT-4o.
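For illustration, the caption-as-image-proxy evaluation reduces to prompting a text-only model with the caption in place of the image. The prompt wording and the `llm` callable below are assumptions, not the Prism implementation.

```python
def caption_only_vqa(question: str, caption: str, llm) -> str:
    """Answer a visual question from the caption alone; `llm` is any text-only model callable."""
    prompt = (
        "You are given a detailed description of an image instead of the image itself.\n"
        f"Description: {caption}\n"
        f"Question: {question}\n"
        "Answer concisely, using only information from the description."
    )
    return llm(prompt)
```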
5. Inference-Time Scalability and Design Considerations
- Scale Budget: The number of heuristic questions allowed governs computational cost and caption detail. This mechanism is directly tunable at runtime, requiring no pre- or post-training adjustments.
- Component Sizing: Visual question answering is handled efficiently by small LVLMs (e.g., Qwen2-VL-7B), with little gain from larger models at this stage, while summarization benefits from larger LLMs (e.g., 72B); an illustrative configuration is sketched after this list.
- Modularity: The pipeline is model-agnostic and can incorporate the latest open-source or proprietary LLM/LVLM models for each processing step.
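One way to express these modular component choices is a small configuration object. The role-to-model mapping below is an illustrative assumption based on the sizing observations above, not the repository's actual configuration format.

```python
from dataclasses import dataclass

@dataclass
class ScaleCapConfig:
    """Illustrative mapping of pipeline roles to model choices."""
    vqa_model: str = "Qwen2-VL-7B"     # compact LVLM for heuristic question answering
    question_llm: str = "Qwen2.5-7B"   # turns golden sentences into probing questions (assumed size)
    scorer: str = "Qwen2-VL-7B"        # LVLM used for contrastive sentence rating (assumed)
    summarizer_llm: str = "72B-class"  # summarization benefits from a larger LLM
    budget: int = 20                   # heuristic questions per image; returns diminish beyond ~20

cfg = ScaleCapConfig()  # swap any field for the latest open-source or proprietary model
```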
6. Implementation and Dataset Construction
ScaleCap is available as open-source software (https://github.com/Cooperx521/ScaleCap), including:
- Dataset Construction Procedures: Full instructions for constructing the ScaleCap-450k dataset, filtering criteria, and object/region extraction prompts.
- Model Training Recipes: Detailed parameter settings for pretraining and instruction tuning, applicable to multiple backbone model families.
- Prompting Templates and User Study Instructions: All prompts and evaluation protocols are detailed in the supplementary material, supporting reproducibility and benchmarking.
- Compositional Flexibility: The framework supports modular use of models for vision-language grounding, summarization, and question answering.
7. Impact and Limitations
ScaleCap establishes a scalable, bias-mitigating approach to image captioning that is both informative and data-efficient, directly advancing the quality and utility of multimodal pretraining resources for LVLMs. Fundamental contributions include the resolution of multimodal and linguistic biases in a training-free manner and the demonstration of controllable, inference-time caption enrichment.
A noted limitation is the residual risk of uncaught harmful content and the absence of explicit semantic-level debiasing. Societal impacts and ethical considerations are discussed in the project appendix.
| Challenge | ScaleCap Solution | Effect on Caption Quality |
|---|---|---|
| Multimodal bias | Heuristic Question Answering | Improved coverage, scalable granularity |
| Linguistic bias | Contrastive Sentence Rating | Reduced hallucination, factual grounding |
| Scalability / cost | Inference-time budget control | Tunable richness/efficiency tradeoff |
| Training-free flexibility | Modular pipeline | Easy integration with the latest models, easy adoption |
| Informativeness | VQA and image-reconstruction evaluations | Captions faithfully encode visual semantics |
ScaleCap is thus positioned as a scalable, modular framework enabling inference-time control of caption informativeness and fidelity, achieving state-of-the-art results in vision-language pretraining and robustly addressing long-standing captioning biases (2506.19848).