
Prioritizing Image-Related Tokens Enhances Vision-Language Pre-Training (2505.08971v1)

Published 13 May 2025 in cs.CV, cs.CL, and cs.LG

Abstract: In standard large vision-language model (LVLM) pre-training, the model typically maximizes the joint probability of the caption conditioned on the image via next-token prediction (NTP); however, since only a small subset of caption tokens directly relates to the visual content, this naive NTP unintentionally fits the model to noise and increases the risk of hallucination. We present PRIOR, a simple vision-language pre-training approach that addresses this issue by prioritizing image-related tokens through differential weighting in the NTP loss, drawing from the importance sampling framework. PRIOR introduces a reference model, a text-only LLM trained on the captions without image inputs, to weight each token based on its probability for LVLM training. Intuitively, tokens that are directly related to the visual inputs are harder to predict without the image and thus receive lower probabilities from the text-only reference LLM. During training, we implement a token-specific re-weighting term based on the importance scores to adjust each token's loss. We implement PRIOR in two distinct settings: LVLMs with visual encoders and LVLMs without visual encoders. We observe 19% and 8% average relative improvement, respectively, on several vision-language benchmarks compared to NTP. In addition, PRIOR exhibits superior scaling properties, as demonstrated by significantly higher scaling coefficients, indicating greater potential for performance gains compared to NTP given increasing compute and data.

Summary

The paper introduces PRIOR, a vision-language pre-training approach that strengthens the alignment between visual and linguistic data by prioritizing image-related tokens. Traditional pipelines train large vision-language models (LVLMs) on image-caption pairs with a next-token prediction (NTP) objective. Because NTP treats all caption tokens equally, despite a significant proportion being irrelevant to the visual content, it can inadvertently fit the model to noise. PRIOR addresses this limitation with a token-level weighting scheme that emphasizes the tokens most directly associated with the visual input.
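
In symbols, the objective can be sketched as a re-weighted NTP loss. The paper derives its weights from an importance sampling framework; the particular dependence of w_i on the reference probability below is an illustrative sketch, not the paper's exact formula:

```latex
% Re-weighted NTP objective for a caption y_1, ..., y_N given image I.
% The functional form of w_i is a sketch; the paper's weights come from
% an importance-sampling derivation.
\mathcal{L}_{\text{PRIOR}}
  = -\sum_{i=1}^{N} w_i \, \log p_\theta\!\left(y_i \mid y_{<i}, I\right),
\qquad
w_i = f\!\big(p_{\text{ref}}(y_i \mid y_{<i})\big), \quad f \text{ decreasing}.
```

Tokens the text-only reference model already predicts well (high p_ref) are down-weighted, so the gradient concentrates on image-dependent tokens.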

Methodology

PRIOR uses probability scores from a text-only LLM to recalibrate the loss function at the token level. This distinguishes image-related tokens from the rest, assigning higher training priority to tokens that receive low predicted probability absent the visual context. Concretely, the reference LLM is trained solely on caption data, without visual inputs, so the tokens it struggles to predict are, by hypothesis, those carrying the most visual information.
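
A minimal PyTorch sketch of this re-weighting follows. The overall structure, a per-token NTP loss scaled by weights derived from a frozen text-only reference model, follows the paper's description; the specific weight function (1 - p_ref)^alpha, the alpha hyperparameter, and the mean normalization are assumptions made for illustration:

```python
import torch
import torch.nn.functional as F

def prior_weighted_ntp_loss(lvlm_logits, ref_logits, target_ids, alpha=1.0):
    """Sketch of a PRIOR-style re-weighted next-token-prediction loss.

    lvlm_logits: (batch, seq, vocab) logits from the LVLM, conditioned on the image.
    ref_logits:  (batch, seq, vocab) logits from a frozen text-only reference LM
                 run on the caption alone.
    target_ids:  (batch, seq) caption token ids, assumed already shift-aligned
                 with the logits upstream.
    alpha:       hypothetical sharpness hyperparameter (not from the paper).
    """
    # Per-token cross-entropy for the LVLM: the usual NTP loss, unreduced.
    ntp_loss = F.cross_entropy(
        lvlm_logits.transpose(1, 2), target_ids, reduction="none"
    )  # (batch, seq)

    # Probability the text-only reference model assigns to each caption token.
    with torch.no_grad():
        ref_logprobs = F.log_softmax(ref_logits, dim=-1)
        p_ref = ref_logprobs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1).exp()

    # Illustrative importance weight: tokens the reference model finds easy
    # (high p_ref) are down-weighted; hard, image-dependent tokens keep their
    # loss. This functional form is an assumption for the sketch.
    weights = (1.0 - p_ref) ** alpha
    weights = weights / (weights.mean() + 1e-8)  # keep loss scale comparable to NTP

    return (weights.detach() * ntp_loss).mean()
```

Because the weights are detached and the reference pass runs under no_grad, the reference model contributes no gradients; it is consulted only as a per-token difficulty estimate.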

PRIOR is applied in two distinct LVLM configurations: one with a visual encoder and one without. Across several benchmarks, PRIOR yields average relative improvements of 19% and 8%, respectively, over standard NTP. Notably, it also exhibits superior scaling behavior: the fitted scaling coefficients suggest substantial performance gains as computational resources and data availability increase.
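
"Scaling coefficient" here refers to the exponent of a fitted scaling law. Assuming the common power-law parameterization (whether the paper fits exactly this form is not stated in this summary), the comparison reads:

```latex
% Common power-law scaling ansatz: performance as a function of compute C.
% A larger fitted exponent b means faster improvement with added compute/data;
% this exact parameterization is an assumption of the sketch.
\mathrm{Perf}(C) \approx a \cdot C^{\,b},
\qquad
b_{\text{PRIOR}} > b_{\text{NTP}}.
```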

Experimental Outcomes

Experiments indicate that PRIOR consistently outperforms the conventional NTP objective throughout training. It not only improves final performance but also yields greater training stability, with smaller fluctuations in downstream-task results. The method appears to make better use of extended training, largely because it concentrates optimization on the tokens pertinent to visual content.

Critical Analysis

PRIOR challenges the traditional uniform treatment of caption tokens, which can leave models overly dependent on textual context and prone to hallucination in visual comprehension tasks. By prioritizing harder-to-predict tokens, PRIOR strengthens the model's use of visual information while preserving robust language generation capabilities.

One of the notable strengths of PRIOR is that it integrates into existing training frameworks with minimal adjustment: the only change is the loss computation, as the sketch below illustrates. This makes it an attractive option for training LVLMs on extensive datasets without major infrastructural overhauls.
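
Concretely, a training step might look like the following, building on the prior_weighted_ntp_loss sketch above. The model interfaces, batch fields, and the omission of logit/label shifting are all simplifications for illustration, not details from the paper:

```python
import torch

# Hypothetical training step: relative to a standard NTP pipeline, the only
# additions are one frozen text-only forward pass and the re-weighted loss.
def training_step(lvlm, ref_lm, batch, optimizer):
    # Standard LVLM forward pass, conditioned on the image.
    lvlm_logits = lvlm(images=batch["images"], input_ids=batch["caption_ids"]).logits

    # Extra cost of PRIOR: a frozen, text-only pass over the caption alone.
    with torch.no_grad():
        ref_logits = ref_lm(input_ids=batch["caption_ids"]).logits

    loss = prior_weighted_ntp_loss(lvlm_logits, ref_logits, batch["labels"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```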

Future Implications

The theoretical and practical framing introduced by PRIOR holds promising implications for future AI development. As computational demands grow, methods that, like PRIOR, concentrate the training signal on the most informative tokens will become increasingly valuable. They offer pathways toward more adaptable, efficient systems that scale predictably across varied tasks and datasets.

In summary, PRIOR is a pragmatic advance in vision-language pre-training that leverages importance-sampling principles to refine token prioritization within the NTP objective. The approach mitigates the noise contributed by caption tokens unrelated to the image and improves training dynamics, with clear benefits for model stability and performance predictability.
