CropVLM: RL Cropping for Vision-Language Models
- CropVLM is a cropping policy that improves fine-grained vision-language perception by dynamically zooming in on high-relevance image regions.
- It employs a two-stage training process combining synthetic crop supervision and reinforcement learning via GRPO to optimize crop selection.
- The modular design enhances VQA performance with minimal compute and annotation overhead while preserving full-image context.
CropVLM is an external, reinforcement-learning-based cropping policy designed to improve fine-grained vision-language perception in state-of-the-art Vision-Language Models (VLMs) without modifying the VLMs themselves. It enables dynamic “zooming in” on high-relevance image regions at inference, enhancing tasks such as scene-text recognition or document understanding where small-scale details are critical. CropVLM delivers significant accuracy improvements with minimal compute and annotation overhead by proposing one task-adaptive, high-resolution crop per image, in addition to the global view. The policy is pre-trained with supervision from synthetic or model-generated crops and then optimized with reinforcement learning, specifically Group Relative Policy Optimization (GRPO), without requiring human-labeled bounding boxes or changes to the target VLM's weights (Carvalho et al., 25 Nov 2025).
1. Problem Motivation and System Overview
Modern Transformer-based VLMs accept only modest-resolution image inputs (typically 224×224 or 336×336 px) due to quadratic complexity in vision transformer backbones. This limits recognition of small-scale content, fragmenting or blurring fine visual features crucial to tasks such as visual question answering (VQA) on images containing scene text or fine document elements. Increasing the VLM’s input resolution uniformly is prohibitive due to computational overhead. Prior approaches typically require architectural changes, fine-tuning (risking catastrophic forgetting or requiring model re-training), or annotation-heavy supervision—all barriers for practical or proprietary models.
CropVLM addresses these constraints as an “external zoom policy”: for each (image, question) pair, it predicts a crop (bounding box), generates this high-resolution crop from the original image, and passes both the global image and the crop as inputs to the frozen VLM for answer generation. This strategy preserves full-image context while presenting the VLM with enhanced detail for the most relevant region, without modifying the VLM or incurring prohibitive compute costs (Carvalho et al., 25 Nov 2025).
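A minimal sketch of this zoom step, assuming the normalized-percentage box convention described in the next section (the function name and the use of PIL are illustrative, not the paper's implementation):

```python
from PIL import Image

def zoom_crop(image: Image.Image, box_pct: tuple[float, float, float, float]) -> Image.Image:
    """Extract the proposed region from the original full-resolution image.

    `box_pct` is an (x1, y1, x2, y2) box with coordinates given as percentages
    of image width/height, as produced by the cropping policy (illustrative).
    """
    w, h = image.size
    x1, y1, x2, y2 = box_pct
    # Cropping from the source image preserves fine detail that would be lost
    # if the crop were taken from a copy already downscaled to the VLM input size.
    return image.crop((round(x1 / 100 * w), round(y1 / 100 * h),
                       round(x2 / 100 * w), round(y2 / 100 * h)))
```

Both the untouched global view and this crop are then handed to the frozen target VLM for answer generation.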
2. Cropping Policy Architecture and Task Formulation
CropVLM utilizes a compact, LoRA-adapted vision-language transformer (SmolVLM Instruct, ≈256M parameters) as its cropping policy. The policy receives the full image, the question or prompt, and (during training) the ground-truth answer or label. It outputs crop coordinates as a bounding box [x1, y1, x2, y2], with each coordinate normalized as a percentage of the image width or height (range [0, 100]) to facilitate resolution-invariant operation.
The cropping policy is instantiated by prompting the VLM with the question and an instruction to output a box, e.g., “{QUESTION} Outline the region… Output [x1,y1,x2,y2].” At inference, the predicted crop (obtained via sampling or argmax decoding) defines the region of interest. The system then queries the frozen target VLM with the full image, the high-resolution crop, and the question to produce an answer. CropVLM is agnostic to the downstream VLM, enabling integration with open-source (e.g., LLaVA) or proprietary models (e.g., GPT-4.1 nano) (Carvalho et al., 25 Nov 2025).
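Because the box is emitted as free-form text, a thin parsing and validation layer sits between the policy and the cropping step. A minimal sketch, assuming the bracketed output format above; the exact parsing and fallback behavior of the released system may differ:

```python
import re

def parse_box(policy_output: str) -> tuple[float, float, float, float] | None:
    """Parse a '[x1,y1,x2,y2]' percentage box from the policy's text output.

    Illustrative only: the released policy's exact output format and error
    handling may differ.
    """
    match = re.search(
        r"\[\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*\]",
        policy_output,
    )
    if match is None:
        return None  # malformed output; a caller could fall back to the full image
    x1, y1, x2, y2 = (float(g) for g in match.groups())
    # Clamp to the valid percentage range and require a well-ordered, non-empty box.
    x1, y1, x2, y2 = (min(max(v, 0.0), 100.0) for v in (x1, y1, x2, y2))
    if x2 <= x1 or y2 <= y1:
        return None
    return x1, y1, x2, y2
```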
3. Reinforcement Learning Training Procedure
CropVLM employs a two-stage training regime:
a. Supervised Fine-Tuning (SFT)
The policy is initially trained on a synthetic crop dataset (62K samples) auto-generated with Qwen 2.5-VL; boxes with very small areas are expanded to ensure adequate coverage of the relevant region. This stage familiarizes the model with producing syntactically valid and plausible crops.
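A sketch of the small-box expansion, assuming a center-preserving isotropic expansion and an illustrative minimum-area threshold (the paper states only that small boxes are enlarged for coverage):

```python
def expand_small_box(box, min_area_pct=5.0):
    """Enlarge a normalized [x1, y1, x2, y2] box (percent coordinates) whose area
    is below `min_area_pct` percent of the image, keeping its center fixed.

    The threshold and the center-preserving rule are assumptions for illustration;
    the paper states only that small boxes are expanded to ensure coverage.
    """
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    area_pct = (w / 100.0) * (h / 100.0) * 100.0  # box area as a % of image area
    if area_pct >= min_area_pct:
        return box
    scale = (min_area_pct / max(area_pct, 1e-6)) ** 0.5  # isotropic growth factor
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    new_w, new_h = w * scale, h * scale
    return (max(cx - new_w / 2, 0.0), max(cy - new_h / 2, 0.0),
            min(cx + new_w / 2, 100.0), min(cy + new_h / 2, 100.0))
```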
b. Group Relative Policy Optimization (GRPO)
The primary optimization uses GRPO, a PPO variant that dispenses with a learned value function. The policy $\pi_\theta$ is trained to maximize the expected reward

$$J(\theta) \;=\; \mathbb{E}_{(I,q)\sim\mathcal{D},\; c \sim \pi_\theta(\cdot \mid I, q)}\big[\, R(c, I, q) \,\big],$$

where $R$ is the reward assigned to a sampled crop $c$. Two reward types are considered:
- Accuracy-based: the VQA accuracy of the reward model's answer when conditioned on the proposed crop and the question.
- Likelihood-based: the log-likelihood the reward model assigns to the ground-truth answer given the proposed crop and the question.
For each input, a group of $G$ candidate crops is sampled. Raw rewards $r_1, \dots, r_G$ are standardized within their group to form advantages

$$A_i \;=\; \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}.$$

Policy updates use the clipped surrogate objective

$$\mathcal{L}(\theta) \;=\; -\frac{1}{G}\sum_{i=1}^{G} \min\!\Big(\rho_i A_i,\; \operatorname{clip}(\rho_i,\, 1-\epsilon,\, 1+\epsilon)\, A_i\Big) \;-\; \beta\, \mathcal{H}\big[\pi_\theta\big],$$

where $\rho_i = \pi_\theta(c_i \mid I, q)\,/\,\pi_{\theta_{\text{old}}}(c_i \mid I, q)$ is the importance ratio and $\mathcal{H}$ is the entropy regularizer with weight $\beta$.
No human bounding boxes are used for RL training. The reward model is a 256M-parameter SmolVLM evaluated at a fixed input resolution.
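The group-relative advantage and clipped update can be summarized in a short PyTorch-style sketch; the clipping and entropy coefficients below are illustrative placeholders rather than the paper's hyperparameters:

```python
import torch

def grpo_loss(logp_new, logp_old, rewards, clip_eps=0.2, entropy_coef=0.01, entropy=None):
    """GRPO loss for one group of G sampled crops.

    logp_new / logp_old: log-probabilities of each sampled crop under the current
    and behavior policies (shape [G]); rewards: raw rewards (shape [G]).
    clip_eps and entropy_coef are illustrative values, not the paper's settings.
    """
    # Standardize rewards within the group to obtain advantages A_i.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Importance ratios rho_i and the PPO-style clipped surrogate.
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv)
    loss = -surrogate.mean()

    # Entropy bonus encourages diverse crop proposals during exploration.
    if entropy is not None:
        loss = loss - entropy_coef * entropy.mean()
    return loss
```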
4. Practical Integration and Deployment
CropVLM operates as an external, stand-alone service. At test time, for each question/image pair, the full image and crop—proposed without any VLM fine-tuning—are packaged and sent to a frozen VLM under a minimal prompt (e.g., “{QUESTION} Give a very brief answer.”). This design prevents catastrophic forgetting and allows upgrading models without re-training the cropping policy, accommodating both open-source and closed-source VLMs (Carvalho et al., 25 Nov 2025).
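For an API-served proprietary model, the packaging reduces to attaching both views to a single request. A sketch assuming an OpenAI-compatible chat endpoint; the model identifier, image encoding, and call details are assumptions, not the paper's released integration code:

```python
import base64
import io

from openai import OpenAI
from PIL import Image

def to_data_url(img: Image.Image) -> str:
    """Encode a PIL image as a base64 data URL for a chat-style multimodal API."""
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()

def query_frozen_vlm(full_image: Image.Image, crop: Image.Image, question: str,
                     model: str = "gpt-4.1-nano") -> str:
    """Send the global view plus the proposed crop to a frozen, API-served VLM.

    Hypothetical integration sketch: the prompt wording follows the paper, but
    the endpoint, model name, and message layout are assumptions.
    """
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": f"{question} Give a very brief answer."},
                {"type": "image_url", "image_url": {"url": to_data_url(full_image)}},
                {"type": "image_url", "image_url": {"url": to_data_url(crop)}},
            ],
        }],
    )
    return response.choices[0].message.content
```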
5. Empirical Results and Evaluation
CropVLM demonstrates consistent performance gains on fine-grained and out-of-domain benchmarks. Key quantitative findings include:
| Target VLM | Input resolution (px) | Baseline acc. (%) | + CropVLM acc. (%) |
|---|---|---|---|
| SmolVLM | 512 | 28.45 | 38.13 |
| SmolVLM | 1024 | 44.55 | 50.89 |
| SmolVLM | 2048 | 50.16 | 52.64 |
| LLaVA-1.5 | 336 | 36.69 | 42.71 |
| Qwen 2.5 VL | 448 | 56.42 | 67.14 |
| GPT-4.1 nano | 512 | 41.27 | 47.41 |
Accuracy reflects average VQA accuracy across mixed TextVQA/ST-VQA/DocVQA/InfoVQA benchmarks. CropVLM at 2048px consistently outperforms explainability-based (ViCrop) and preference-DPO (UV-CoT) cropping baselines, despite using fewer training examples and a smaller model (Carvalho et al., 25 Nov 2025).
Ablations show that omitting the full-image context alongside the crop substantially degrades performance (e.g., SmolVLM@2048 accuracy drops from 52.00 to 45.84). Additional analysis shows that GRPO-trained crops (relative to those from SFT) enclose larger regions with higher recall of the relevant content, and that the intersection-over-union (IoU) between crops and human-annotated boxes is a weak predictor of VQA accuracy; the crucial factor is recall.
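The IoU-versus-recall distinction is easy to make concrete: a generous crop that fully contains the annotated region has perfect recall but low IoU. A small illustrative sketch with hypothetical boxes in (x1, y1, x2, y2) format:

```python
def box_area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def intersection(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return box_area((x1, y1, x2, y2))

def iou(crop, gt):
    inter = intersection(crop, gt)
    return inter / (box_area(crop) + box_area(gt) - inter + 1e-8)

def recall(crop, gt):
    """Fraction of the ground-truth region covered by the crop."""
    return intersection(crop, gt) / (box_area(gt) + 1e-8)

# A generous crop that fully encloses a small ground-truth box:
crop_box, gt_box = (0, 0, 80, 80), (30, 30, 50, 50)
print(iou(crop_box, gt_box))     # ~0.06: low IoU because the crop is much larger
print(recall(crop_box, gt_box))  # 1.0: every ground-truth pixel is inside the crop
```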
6. Limitations, Modularity, and Design Choices
CropVLM avoids human annotation and can exploit either synthetic or model-generated initial crops. The two-stage (SFT/GRPO) training paradigm is robust: replacing synthetic boxes with exhaustive-search boxes in SFT yields similar final performance. The policy generalizes across resolutions because crop coordinates are normalized. The modular design ensures broad compatibility with models whose weights may be inaccessible, and it prevents catastrophic forgetting by avoiding target VLM fine-tuning. The system is optimized for modest compute and can be fully trained in approximately 27 hours on a single A100 GPU (≈3 h SFT, ≈24 h GRPO at 2048 px).
A plausible implication is that CropVLM’s externalized crop policy architecture could serve as a generic plug-in mechanism for a broad spectrum of VLM-based perception and question-answering pipelines, especially where compute is constrained or detailed visual features are crucial.
7. Relation to Prior Work
CropVLM differs from approaches such as visual token pruning (e.g., CROP (Guo et al., 27 May 2025)), explainability-based cropping (ViCrop), or preference-based selection (UV-CoT) by using a learned, reward-optimized cropping policy trained independently of the target VLM weights. Unlike methods that require adaptation or retraining of the vision backbone, CropVLM is externally modular and agnostic to VLM internals.
By adopting a reinforcement learning approach with coarse initial supervision, CropVLM demonstrates that sophisticated, instance-adaptive pre-processing can yield state-of-the-art accuracy gains in fine-grained and cross-domain VQA, while minimizing both annotation and computational burden (Carvalho et al., 25 Nov 2025).