Visual Program Distillation (VPD)
- Visual Program Distillation (VPD) is an instruction-tuning framework that distills LLM-generated visual programs into a single vision-language model for efficient end-to-end reasoning.
- It employs a four-phase process—program generation, execution, chain-of-thought conversion, and model distillation—to reduce error propagation and computational latency.
- VPD achieves state-of-the-art performance on diverse VQA benchmarks by integrating compositional reasoning, spatial analysis, counting, and external knowledge retrieval.
Visual Program Distillation (VPD) is an instruction-tuning framework designed to imbue vision-language models (VLMs) with the programmatic reasoning abilities traditionally achieved by explicit tool-augmented pipelines. VPD enables a single VLM to emulate hybrid LLM–specialist toolchains so that complex visual tasks—requiring spatial understanding, counting, and external knowledge access—can be solved in a single forward pass. The approach systematically distills the compositional reasoning steps of LLM-generated visual programs, together with their outputs, into the VLM, resulting in models that achieve state-of-the-art performance on diverse evaluation benchmarks while reducing computation and latency overheads (Hu et al., 2023).
1. Motivations and Precedent
Visual program distillation arises from the need to address the critical weaknesses of prior hybrid systems connecting LLMs and specialist vision tools for solving compositional visual question answering (VQA) tasks. These weaknesses include:
- Error propagation: LLM-generated top-1 programs often omit or misorder key steps. Failure or error in any tool step disrupts the entire reasoning chain.
- Computational inefficiency: Each query necessitates dynamic orchestration of multiple high-latency models (e.g., object detectors, depth estimators, OCR, knowledge retrieval modules), compounding both compute and latency costs.
- Inferior end-to-end performance: Empirically, even specialist-driven pipelines frequently underperform compared to end-to-end–fine-tuned VLMs on standard VQA metrics.
VPD is motivated by compositional reasoning requirements, encompassing spatial analysis (e.g., "which object is on the right?"), counting (e.g., "how many bicycles are in the image?"), and retrieval of extra-visual knowledge (e.g., "Who invented this instrument?")—capabilities evaluated in benchmarks such as VQAv2, OK-VQA, A-OKVQA, GQA, and TallyQA (Hu et al., 2023).
2. Core Technical Process
The VPD framework comprises a four-phase pipeline that transitions from program synthesis to single-model distillation:
- Program generation:
- An LLM (PaLM-2) is prompted with the query and a description of the available tools to sample diverse candidate programs $z_1, \dots, z_k$, each encoding a decomposition of the target visual reasoning task.
- Program execution and automated supervision:
- Each candidate program $z_i$ is executed on the visual input $x$ by an execution engine $\phi$, producing an output $\hat{y}_i = \phi(x, z_i)$ and a detailed execution trace $t_i$.
- If a ground-truth label $y$ exists, candidates with $\hat{y}_i \neq y$ are discarded. Among the remaining valid programs, the one with maximal LLM score is retained for supervision. In the absence of ground truth, a single program is generated and filtering is bypassed.
- Trace-to-natural chain-of-thought (CoT) conversion:
- The execution trace $t_i$ of the selected program is rewritten into a human-readable stepwise rationale $c$ via few-shot prompting of PaLM-2, yielding explanations such as: "detect all buses", "filter by yellow", "count", "return answer".
- VLM distillation and instruction tuning:
- The distilled dataset $\mathcal{D} = \{(x, q, y, c)\}$ is constructed, pairing each image $x$ and query $q$ with both a short-answer annotation $y$ and a rationale $c$.
- The VLM is fine-tuned on $\mathcal{D}$ with the loss $\mathcal{L} = \ell(y \mid x, q_{\text{answer}}) + \ell(c \mid x, q_{\text{rationale}})$, where $\ell$ denotes sequence-length-normalized cross-entropy and $q_{\text{answer}}$, $q_{\text{rationale}}$ are the short-answer and rationale prompts, respectively.
- At inference, only the succinct answer prompt is typically used.
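A minimal sketch of phases 1–3 follows, assuming hypothetical `llm` (standing in for PaLM-2 program synthesis, scoring, and trace rewriting) and `execute_program` (standing in for the visual execution engine and its tools) interfaces; these names are illustrative and do not come from the released implementation.

```python
from dataclasses import dataclass


@dataclass
class DistilledExample:
    """One training tuple for VPD distillation: (x, q, y, c)."""
    image: object      # visual input x
    question: str      # query q
    answer: str        # short answer y
    rationale: str     # chain-of-thought c


def generate_vpd_example(llm, execute_program, image, question,
                         ground_truth=None, k=5):
    """Run phases 1-3 of VPD for one query; return a distilled example,
    or None if no candidate program produces the correct answer."""
    # Phase 1: sample candidate programs with the LLM. When no label is
    # available, a single program is generated and filtering is bypassed.
    num_candidates = k if ground_truth is not None else 1
    programs = [llm.generate_program(question) for _ in range(num_candidates)]

    # Phase 2: execute each program on the image, collecting its output
    # and a detailed execution trace of the tool calls.
    candidates = []
    for program in programs:
        output, trace = execute_program(program, image)
        candidates.append((program, output, trace))

    # Filter by the ground-truth label when it exists, then keep the
    # highest-scoring surviving program as the teacher.
    if ground_truth is not None:
        candidates = [c for c in candidates if c[1] == ground_truth]
        if not candidates:
            return None
    program, output, trace = max(candidates,
                                 key=lambda c: llm.score_program(c[0]))

    # Phase 3: rewrite the raw execution trace into a natural-language
    # chain of thought via few-shot prompting.
    rationale = llm.trace_to_cot(question, trace)

    # Phase 4 consumes tuples like this one for instruction tuning.
    return DistilledExample(image, question, str(output), rationale)
```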
3. Model Architectures and Distillation Details
VPD is instantiated on two VLM backbones:
- PaLI-3 (5B): 2B-parameter SigLIP vision encoder; 3B-parameter UL2 text-vision Transformer (812×812 pixel images).
- PaLI-X (55B): 22B-parameter ViT vision encoder; 32B-parameter UL2 Transformer (756×756 pixel images).
Parameter-efficient tuning is achieved via LoRA adapters (rank 8 for the generalist model; rank 4–8 for specialists) on all attention and MLP layers. The AdamW optimizer is used with a cosine learning-rate schedule and 1% warmup. Generalist fine-tuning uses a batch size of 128 for 8K steps; specialist fine-tuning uses a batch size of 64 for 1–3 epochs.
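As a concrete illustration of the dual fine-tuning objective from Section 2, the sketch below implements the sequence-length-normalized loss in generic PyTorch; the `model(image, prompt, labels=...)` interface returning per-token logits is an assumption for illustration, not the paper's PaLI training code.

```python
import torch.nn.functional as F


def seq_norm_cross_entropy(logits, target_ids, pad_id=0):
    """Token-level cross-entropy averaged over non-padding target tokens
    (sequence-length normalization)."""
    # logits: (batch, seq_len, vocab); target_ids: (batch, seq_len)
    per_token = F.cross_entropy(logits.transpose(1, 2), target_ids,
                                reduction="none")        # (batch, seq_len)
    mask = (target_ids != pad_id).float()
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)


def vpd_loss(model, image, answer_prompt, rationale_prompt,
             answer_ids, rationale_ids, pad_id=0):
    """Sum of the short-answer and rationale objectives, each
    sequence-length normalized. `model` is assumed (hypothetically) to
    return target logits conditioned on the image and the given prompt."""
    answer_logits = model(image, answer_prompt, labels=answer_ids)
    rationale_logits = model(image, rationale_prompt, labels=rationale_ids)
    return (seq_norm_cross_entropy(answer_logits, answer_ids, pad_id)
            + seq_norm_cross_entropy(rationale_logits, rationale_ids, pad_id))
```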
The training data pool consists of a mixture of established VQA and compositional datasets:
- VQAv2 (100K labels)
- OCR-VQA (50K)
- GQA compositional (86K; 38K with CoTs)
- OK-VQA (9K; 6.7K with CoTs)
- A-OKVQA (17.1K; 11.2K with CoTs)
- TallyQA (48.4K; 33.7K with CoTs)
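For reference, the mixture above can be written as a simple configuration dictionary; the layout is illustrative rather than the paper's actual data-loading format, and `None` marks datasets for which no CoT count is listed above.

```python
# VPD training mixture: label counts and the subset with LLM-generated CoTs.
# Layout is illustrative; None = no CoT count reported above.
TRAINING_MIXTURE = {
    "VQAv2":   {"labels": 100_000, "with_cot": None},
    "OCR-VQA": {"labels": 50_000,  "with_cot": None},
    "GQA":     {"labels": 86_000,  "with_cot": 38_000},
    "OK-VQA":  {"labels": 9_000,   "with_cot": 6_700},
    "A-OKVQA": {"labels": 17_100,  "with_cot": 11_200},
    "TallyQA": {"labels": 48_400,  "with_cot": 33_700},
}
```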
4. Empirical Results and Analysis
The VPD-trained PaLI-X model establishes new state-of-the-art (SOTA) results across a diverse benchmark suite. In zero-shot and multi-task scenarios, PaLI-X-VPD (55B) achieves absolute gains over PaLI-X-Instruct (55B) on all tasks, with especially pronounced improvements on highly compositional or knowledge-intensive benchmarks (e.g., GQA: +1.6, TallyQA complex: +1.2, MMBench: +1.2). Table 1 summarizes representative zero-shot benchmark results:
| Model | VQAv2 | GQA | OK-VQA | A-OKVQA(MC) | TallyQA (complex) | MMBench |
|---|---|---|---|---|---|---|
| PaLI-X-Instruct (55B) | 83.6 | 63.3 | 64.3 | 84.1 | 75.4 | 75.0 |
| PaLI-X-VPD (55B) | 83.9 | 64.9 | 64.6 | 84.5 | 76.6 | 76.2 |
Specialist fine-tuning further increases performance:
- GQA: 67.3
- OK-VQA: 66.8
- A-OKVQA(MC): 85.2, A-OKVQA(DA): 71.1
- TallyQA (complex): 76.4
Ablation studies demonstrate that sampling multiple candidate programs, rather than taking only the top-1 program, increases the occurrence of correct teacher programs from ≈30% to ≈75% (GQA/A-OKVQA). The distilled VLM consistently outperforms its own programmatic teacher on all tasks, supporting the efficacy of knowledge transfer.
Human evaluation on 600 samples (GQA, A-OKVQA, TallyQA) found that PaLI-X-VPD yields higher answer correctness (+16.7% absolute), more frequent explanations (+24%), and greater factuality (+14.6%) and consistency (+10%) among explained cases relative to baseline PaLI-X-Instruct. VPD responses were preferred in 25% more examples overall, and 12% more when both compared answers were correct.
5. Rationale Generation and Qualitative Examination
VPD enables the VLM to generate stepwise rationales (CoTs) analogous to detailed tool-call traces of LLM-generated programs. For example, in response to the query "Who invented the musical instrument on the right?", the model produces a coherent multistep justification: (1) detect all instruments, (2) filter bounding boxes by right-half coordinates, (3) classify instrument as "guitar", (4) retrieve inventor information. The distilled VLM synthesizes these steps within a single pass, allowing both succinct answer and full rationale to be produced without external tool invocation (Hu et al., 2023).
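The two inference modes (succinct answer vs. full rationale) can be sketched as below; the prompt wording and the `vlm.generate` interface are placeholders, not the exact prompts used in the paper.

```python
def query_vpd_model(vlm, image, question, with_rationale=False):
    """Query a VPD-tuned VLM for either a short answer or a full rationale.
    Prompt strings and the `vlm.generate` interface are illustrative only."""
    if with_rationale:
        # Rationale mode: elicits the distilled chain of thought, e.g.
        # detect instruments -> keep right-half boxes -> classify "guitar"
        # -> recall the inventor.
        prompt = f"Question: {question} Explain the rationale, then answer."
    else:
        # Default succinct-answer mode typically used at inference time.
        prompt = f"Question: {question} Answer with a short answer."
    return vlm.generate(image=image, prompt=prompt)
```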
6. Real-World Adaptation and Application
On real-world vision-language moderation tasks (e.g., Hateful Memes), VPD yields substantial gains:
- Supervised: PaLI-X-VPD (with CoT) achieves 80.8% accuracy, 89.2% AUC (SOTA; near the human performance of 84.7% accuracy).
- Zero-shot (no human labels): outperforms all prior VLMs and surpasses program-only pipeline performance (69.7%/70.1%).
This demonstrates that VPD is effective even for transfer scenarios with limited or absent labeled data, and can be adapted for downstream application by leveraging generalist or specialist fine-tuning as appropriate (Hu et al., 2023).
7. Limitations and Prospects
Current scaling of VPD is constrained by its reliance on established labeled VQA datasets for supervision. Extension to LLM-generated synthetic queries is theoretically feasible, but requires robust multimodal fact-checking and filtering to ensure distillation quality. Furthermore, VPD presently distills fixed, static programs; certain complex tasks may require agentic, interactive planning rather than one-shot script execution. The integration of richer vision tools—including segmentation models, attribute-level detectors, and 3D reasoning modules—remains an open direction for further enhancing the compositional and perceptual breadth of distilled VLMs (Hu et al., 2023).