
CropVLM: RL Cropping for Vision-Language Models

Updated 27 November 2025
  • CropVLM is a cropping policy that improves fine-grained vision-language perception by dynamically zooming in on high-relevance image regions.
  • It employs a two-stage training process combining synthetic crop supervision and reinforcement learning via GRPO to optimize crop selection.
  • The modular design enhances VQA performance with minimal compute and annotation overhead while preserving full-image context.

CropVLM is an external, reinforcement-learning-based cropping policy designed to improve fine-grained vision-language perception in state-of-the-art vision-language models (VLMs) without modifying the VLMs themselves. It enables dynamic “zooming in” on high-relevance image regions at inference, enhancing tasks such as scene-text recognition or document understanding where small-scale details are critical. CropVLM produces significant accuracy improvements with minimal compute or annotation overhead by proposing one task-adaptive, high-resolution crop per image, in addition to the global view. The policy is trained using only synthetic or model-generated crops for supervised pre-training and then optimized with reinforcement learning, specifically Group Relative Policy Optimization (GRPO), without requiring human-labeled bounding boxes or changes to the target VLM weights (Carvalho et al., 25 Nov 2025).

1. Problem Motivation and System Overview

Modern Transformer-based VLMs accept only modest-resolution image inputs (typically 224×224 or 336×336 px) due to quadratic complexity in vision transformer backbones. This limits recognition of small-scale content, fragmenting or blurring fine visual features crucial to tasks such as visual question answering (VQA) on images containing scene text or fine document elements. Increasing the VLM’s input resolution uniformly is prohibitive due to computational overhead. Prior approaches typically require architectural changes, fine-tuning (risking catastrophic forgetting or requiring model re-training), or annotation-heavy supervision—all barriers for practical or proprietary models.

CropVLM addresses these constraints as an “external zoom policy”: for each (image, question) pair, it predicts a crop (bounding box), generates this high-resolution crop from the original image, and passes both the global image and the crop as inputs to the frozen VLM for answer generation. This strategy preserves full-image context while presenting the VLM with enhanced detail for the most relevant region, without modifying the VLM or incurring prohibitive compute costs (Carvalho et al., 25 Nov 2025).

2. Cropping Policy Architecture and Task Formulation

CropVLM utilizes a compact, LoRA-adapted vision-language transformer (SmolVLM Instruct, ≈256M parameters) as a cropping policy. The policy receives the full image $I_0$, the question or prompt $q$, and (during training) the ground-truth answer or label. It outputs crop coordinates as a bounding box $o = [x_1, y_1, x_2, y_2]$, with each coordinate normalized as a percentage of the image width and height (range $[0, 100]$) to facilitate resolution-invariant operation.

The cropping policy $\pi_\theta(o \mid s)$ is instantiated by prompting the VLM with the question and an instruction to output a box: e.g., “{QUESTION} Outline the region… Output [x1,y1,x2,y2].” At inference, the predicted crop $\hat{o}$ (obtained via sampling or argmax decoding) defines the region of interest $I_c$. The system then queries the frozen target VLM with $(I_0, I_c, q)$ to produce an answer. CropVLM is agnostic to the downstream VLM, enabling integration with open-source (e.g., LLaVA) or proprietary models (e.g., GPT-4.1 nano) (Carvalho et al., 25 Nov 2025).
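A minimal sketch of this inference flow is shown below, assuming hypothetical `crop_policy.predict_box` and `target_vlm.generate` interfaces (these names are placeholders, not the released API):

```python
from PIL import Image

def denormalize_box(box_pct, width, height):
    """Convert [x1, y1, x2, y2] given as percentages (0-100) into pixel coordinates."""
    x1, y1, x2, y2 = box_pct
    return (int(x1 / 100 * width), int(y1 / 100 * height),
            int(x2 / 100 * width), int(y2 / 100 * height))

def answer_with_crop(image_path, question, crop_policy, target_vlm):
    """Query a frozen target VLM with the global image plus one policy-proposed crop."""
    image = Image.open(image_path)

    # Ask the cropping policy for a region of interest, in normalized percentages.
    prompt = f"{question} Outline the region needed to answer. Output [x1,y1,x2,y2]."
    box_pct = crop_policy.predict_box(image, prompt)   # e.g. [12.0, 40.5, 55.0, 78.0]

    # Extract the high-resolution crop from the original image, not a downscaled copy.
    crop = image.crop(denormalize_box(box_pct, image.width, image.height))

    # The target VLM stays frozen; it simply receives both views and the question.
    return target_vlm.generate(images=[image, crop],
                               prompt=f"{question} Give a very brief answer.")
```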

3. Reinforcement Learning Training Procedure

CropVLM employs a two-stage training regime:

a. Supervised Fine-Tuning (SFT)

The policy is initially trained on a synthetic crop dataset (≈62K samples) auto-generated via Qwen 2.5-VL. Boxes with very small areas are expanded to ensure adequate coverage. This stage familiarizes the model with producing syntactically valid and plausible crops.
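One plausible reading of this expansion step is sketched below; the area threshold and padding values are illustrative assumptions, not figures from the paper:

```python
def expand_small_box(box_pct, min_area_frac=0.05, pad_pct=5.0):
    """Grow a normalized [x1, y1, x2, y2] box (percent units) whose area is very small,
    so the resulting synthetic crop still covers useful context around the target."""
    x1, y1, x2, y2 = box_pct
    area_frac = (x2 - x1) / 100 * (y2 - y1) / 100   # fraction of the image covered
    if area_frac < min_area_frac:                    # e.g. below 5% of the image area
        x1, y1 = max(0.0, x1 - pad_pct), max(0.0, y1 - pad_pct)
        x2, y2 = min(100.0, x2 + pad_pct), min(100.0, y2 + pad_pct)
    return [x1, y1, x2, y2]
```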

b. Group Relative Policy Optimization (GRPO)

The primary optimization uses GRPO, a PPO variant with no value function. The expected reward objective is

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau)\right]$$

where $R$ is the reward. Two reward types are considered:

  • Accuracy-based:

$$R_{\mathrm{acc}}(I_0, I_c, q, y^*) = \begin{cases} 1 & \text{if the VLM output matches } y^* \\ 0 & \text{otherwise} \end{cases}$$

  • Likelihood-based:

$$R_{\mathrm{ll}}(I_0, I_c, q, a^*) = \sum_{t=1}^{T} \log p_{\mathrm{VLM}}\left(a^*_t \mid I_0, I_c, q, a^*_{<t}\right)$$
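Both rewards can be viewed as thin wrappers around the frozen target VLM; in the sketch below the `generate` and `answer_log_likelihood` methods, and the exact-string-match comparison, are illustrative assumptions:

```python
def accuracy_reward(vlm, full_image, crop, question, gold_answer):
    """Binary reward: 1 if the frozen VLM's answer matches the reference, else 0."""
    pred = vlm.generate(images=[full_image, crop], prompt=question)
    return 1.0 if pred.strip().lower() == gold_answer.strip().lower() else 0.0

def likelihood_reward(vlm, full_image, crop, question, gold_answer):
    """Dense reward: summed log-probability the frozen VLM assigns to the gold answer
    tokens, conditioned on the global image, the crop, and the question."""
    return vlm.answer_log_likelihood(images=[full_image, crop],
                                     prompt=question,
                                     answer=gold_answer)
```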

For each batch, $G$ candidate crops are sampled per input. Raw rewards $r_i$ are standardized within their group as

$$A_i = \frac{r_i - \mu_r}{\sigma_r}$$

Policy updates use

$$\nabla_\theta J \approx \frac{1}{G} \sum_{i=1}^{G} A_i \,\nabla_\theta \log \pi_\theta(o_i \mid I_0, q) + \beta\, \nabla_\theta H[\pi_\theta]$$

where $H[\pi_\theta]$ is the entropy regularizer with weight $\beta$.

No human bounding boxes are used for RL training. The reward model is a 256M-parameter SmolVLM evaluated at 512×512 px.
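Putting the pieces together, a single GRPO update for one (image, question) pair might look like the PyTorch-style sketch below; the group size, entropy weight, and `policy.sample_box` interface are assumptions made for illustration:

```python
import torch

def grpo_step(policy, reward_fn, image, question, gold_answer,
              group_size=8, entropy_weight=0.01):
    """One GRPO update: sample a group of candidate crops, standardize their rewards
    within the group, and take a policy-gradient step weighted by those advantages."""
    log_probs, entropies, rewards = [], [], []

    for _ in range(group_size):
        # Sample a candidate crop o_i ~ pi_theta(o | image, question).
        box, log_prob, entropy = policy.sample_box(image, question)
        rewards.append(reward_fn(image, box, question, gold_answer))
        log_probs.append(log_prob)
        entropies.append(entropy)

    rewards = torch.tensor(rewards)
    # Group-relative advantages: A_i = (r_i - mean) / std, with no learned value function.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    log_probs = torch.stack(log_probs)
    entropies = torch.stack(entropies)

    # Maximize the advantage-weighted log-likelihood plus an entropy bonus,
    # i.e. minimize the negative of the GRPO objective above.
    loss = -(advantages * log_probs).mean() - entropy_weight * entropies.mean()
    loss.backward()
    policy.optimizer.step()
    policy.optimizer.zero_grad()
    return loss.item()
```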

4. Practical Integration and Deployment

CropVLM operates as an external, stand-alone service. At test time, for each question/image pair, the full image and crop—proposed without any VLM fine-tuning—are packaged and sent to a frozen VLM under a minimal prompt (e.g., “{QUESTION} Give a very brief answer.”). This design prevents catastrophic forgetting and allows upgrading models without re-training the cropping policy, accommodating both open-source and closed-source VLMs (Carvalho et al., 25 Nov 2025).

5. Empirical Results and Evaluation

CropVLM demonstrates consistent performance gains on fine-grained and out-of-domain benchmarks. Key quantitative findings include:

| Target VLM | Input resolution (px) | Baseline Acc. (%) | + CropVLM Acc. (%) |
|---|---|---|---|
| SmolVLM | 512 | 28.45 | 38.13 |
| SmolVLM | 1024 | 44.55 | 50.89 |
| SmolVLM | 2048 | 50.16 | 52.64 |
| LLaVA-1.5 | 336 | 36.69 | 42.71 |
| Qwen 2.5 VL | 448 | 56.42 | 67.14 |
| GPT-4.1 nano | 512 | 41.27 | 47.41 |

Accuracy reflects average VQA accuracy across mixed TextVQA/ST-VQA/DocVQA/InfoVQA benchmarks. CropVLM at 2048px consistently outperforms explainability-based (ViCrop) and preference-DPO (UV-CoT) cropping baselines, despite using fewer training examples and a smaller model (Carvalho et al., 25 Nov 2025).

Ablation shows that omitting the full-image context alongside the crop substantially degrades performance (e.g., SmolVLM@2048 accuracy drops from 52.00 to 45.84). Additional analysis shows that GRPO-trained crops (relative to those from SFT) enclose larger, more recall-optimal regions, and that intersection-over-union (IoU) of crops with human boxes is a weak predictor of VQA accuracy—the crucial factor is recall.
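The IoU-versus-recall distinction can be made concrete with a small sketch, reading recall as the fraction of a human-annotated box that the predicted crop covers (a common box-coverage definition; the paper's exact metric may differ):

```python
def box_area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def intersection_area(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return box_area([x1, y1, x2, y2])

def iou(pred, gt):
    inter = intersection_area(pred, gt)
    return inter / (box_area(pred) + box_area(gt) - inter + 1e-8)

def recall(pred, gt):
    """Fraction of the ground-truth box covered by the predicted crop. A large crop
    that fully encloses the target has recall 1 even if its IoU with the target is low."""
    return intersection_area(pred, gt) / (box_area(gt) + 1e-8)
```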

6. Limitations, Modularity, and Design Choices

CropVLM avoids human annotation and can exploit both synthetic and model-generated initial crops. The two-stage (SFT/GRPO) training paradigm is robust: replacing synthetic boxes with exhaustive-search boxes in SFT yields similar final performance. The policy generalizes across resolutions due to the normalization of crop coordinates. The modular design ensures broad compatibility with models whose weights may be inaccessible, and prevents catastrophic forgetting by avoiding target VLM fine-tuning. The system is optimized for modest compute and can be fully trained in approximately 27 hours on a single A100 GPU (3 h SFT, 24 h GRPO at 2048 px).

A plausible implication is that CropVLM’s externalized crop policy architecture could serve as a generic plug-in mechanism for a broad spectrum of VLM-based perception and question-answering pipelines, especially where compute is constrained or detailed visual features are crucial.

7. Relation to Prior Work

CropVLM differs from approaches such as visual token pruning (e.g., CROP (Guo et al., 27 May 2025)), explainability-based cropping (ViCrop), or preference-based selection (UV-CoT) by using a learned, reward-optimized cropping policy trained independently of the target VLM weights. Unlike methods that require adaptation or retraining of the vision backbone, CropVLM is externally modular and agnostic to VLM internals.

By adopting a reinforcement learning approach with coarse initial supervision, CropVLM demonstrates that sophisticated, instance-adaptive pre-processing can yield state-of-the-art accuracy gains in fine-grained and cross-domain VQA, while minimizing both annotation and computational burden (Carvalho et al., 25 Nov 2025).
