Vision-Language Programs (VLP)

Updated 1 December 2025
  • Vision-Language Programs (VLP) are a neuro-symbolic framework that integrates vision-language models with explicit DSL program synthesis for systematic reasoning.
  • The methodology involves symbol grounding, PCFG-based program induction, and accuracy-driven program selection to generate interpretable, executable programs.
  • Empirical evaluations reveal significant accuracy gains (up to +26 percentage points) and enhanced transparency on challenging visual reasoning tasks.

A Vision-Language Program (VLP) is a neuro-symbolic framework that synthesizes explicit, executable programs over visual input from multimodal vision-language models. Rather than relying solely on end-to-end neural inference, a VLP extracts structured visual concepts from a pretrained vision-language model (VLM), induces formal rules in a domain-specific language (DSL), and compiles these into symbolic programs for systematic visual reasoning. This paradigm yields human-interpretable models that retain the robust perception of VLMs while achieving rigorous logical consistency and transparency through explicit program synthesis and execution (Wüst et al., 24 Nov 2025).

1. Formalization and Symbol Grounding

A Vision-Language Program is constructed for few-shot visual reasoning tasks, where the examples are labeled pairs $\{(I_i, y_i)\}_{i=1}^{n}$, with $I_i$ an image and $y_i \in \{0,1\}$ a binary label. VLP comprises three stages:

  1. Symbol Grounding: Given the task set $\mathcal{X}$, a pretrained VLM $\mathcal{M}$ proposes, for each symbol category $G \in \{\text{object}, \text{property}, \text{action}\}$, a set of task-specific ground symbols $E_G$:

$$\mathcal{M}(G, \mathcal{X}) = E_G$$

Aggregated over all categories, the symbol pool $\mathcal{E} = \bigcup_G E_G$ provides the primitive vocabulary for subsequent program induction.
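As a concrete illustration of this stage, the following Python sketch collects a symbol pool per category, assuming a hypothetical `query_vlm` callable (images plus a text prompt, returning the VLM's reply); the prompt wording and line-based parsing are illustrative, not taken from the paper.

```python
# Minimal sketch of symbol grounding, assuming a hypothetical `query_vlm`
# callable (images + prompt -> text reply); prompt wording and parsing are
# illustrative, not the paper's implementation.
from typing import Callable

CATEGORIES = ("object", "property", "action")

def ground_symbols(images: list, query_vlm: Callable[[list, str], str]) -> dict[str, set[str]]:
    """Collect task-specific ground symbols E_G for each category G."""
    symbol_pool: dict[str, set[str]] = {}
    for category in CATEGORIES:
        prompt = f"List the distinct {category}s visible across these labeled examples, one per line."
        reply = query_vlm(images, prompt)          # corresponds to M(G, X) = E_G
        symbol_pool[category] = {line.strip().lower() for line in reply.splitlines() if line.strip()}
    return symbol_pool                             # the union over categories gives E
```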

  2. Program Synthesis: From $\mathcal{E}$ and a fixed DSL, VLP builds a Probabilistic Context-Free Grammar (PCFG):

$$\mathcal{G} = (N, T, R, S, P)$$

  • $N$: nonterminal types, e.g. $\{\mathtt{image}, \mathtt{bool}, \mathtt{int}\}$ plus the symbol categories
  • $T$: terminals (ground symbols from $\mathcal{E}$)
  • $R$: productions over VLM functions (object/action extraction), symbolic primitives (existence, counting), and logical operators ($\mathtt{and}$, $\mathtt{or}$, $\mathtt{not}$, comparators)
  • $S = \mathtt{bool}$: the start type (all synthesized programs must output a Boolean)
  • $P$: rule probabilities, with symbol terminals weighted by frequency in positive/negative examples

Valid programs are generated by expanding $S$ into a type-correct DSL expression $p$ using these rules; each program $p: \mathtt{Image} \to \{\mathtt{True}, \mathtt{False}\}$ can be evaluated directly against an image.
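The sketch below illustrates depth-limited enumeration from such a typed PCFG on a toy grammar fragment; the constructor names, toy symbols, and probabilities are made up for illustration and are not the paper's DSL.

```python
# Toy, depth-limited enumeration from a typed PCFG. Programs are represented
# as nested tuples; the grammar fragment below is illustrative only.
import itertools

PRODUCTIONS = {
    "bool": [
        ("exists",   ["object"],             0.4),  # is the object present in the image?
        ("has_prop", ["object", "property"], 0.2),  # does the object carry the property?
        ("and",      ["bool", "bool"],       0.2),
        ("not",      ["bool"],               0.2),
    ],
    "object":   [("dog", [], 0.5), ("ball", [], 0.5)],   # terminals drawn from E
    "property": [("red", [], 1.0)],
}

def enumerate_programs(symbol_type: str, depth: int):
    """Yield (program, prior) pairs of the requested type up to a depth bound."""
    if depth == 0:
        return
    for name, arg_types, prob in PRODUCTIONS.get(symbol_type, []):
        if not arg_types:                       # terminal symbol
            yield (name,), prob
            continue
        # Recursively expand each argument; the prior multiplies the rule probabilities.
        options = [list(enumerate_programs(t, depth - 1)) for t in arg_types]
        for combo in itertools.product(*options):
            args = tuple(p for p, _ in combo)
            prior = prob
            for _, arg_prior in combo:
                prior *= arg_prior
            yield (name,) + args, prior

candidates = list(enumerate_programs("bool", depth=3))  # all type-correct bool programs
```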

  3. Program Selection: Each candidate program $p$ is scored on the training examples by accuracy:

$$\mathrm{Acc}(p) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}[p(I_i) = y_i]$$

Ties are broken by the program's prior $P(p) = \prod_{r \in \text{rules}(p)} P(r)$. The top-ranked program $p^*$ is selected for downstream execution or interpretation.
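A minimal sketch of this selection rule, assuming candidates carry their PCFG prior and a hypothetical `execute` callable that runs a candidate program on an image:

```python
# Sketch of the selection step: rank by training accuracy, break ties by the
# PCFG prior. `execute` is a hypothetical callable (program, image) -> bool,
# e.g. an interpreter over the program representation used during synthesis.
def select_program(candidates, examples, execute):
    """candidates: [(program, prior)]; examples: [(image, label)] with label in {0, 1}."""
    def score(entry):
        program, prior = entry
        acc = sum(execute(program, image) == bool(label) for image, label in examples) / len(examples)
        return (acc, prior)                  # accuracy first, prior as tie-break
    best_program, _ = max(candidates, key=score)
    return best_program                      # p*
```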

2. Pipeline and DSL Structure

The VLP pipeline is as follows:

  • Symbol Extraction: The VLM is prompted (per symbol category and per task) to enumerate the objects, properties, and actions present in the labeled images. This pool is then incorporated as the terminal set for program induction.
  • Grammar Instantiation: The DSL includes:
    • Typed data (image, bool, int, object, property, action)
    • Visual functions: object/action/property extractors, attribute checkers, counting
    • Logical operators: conjunction, disjunction, negation, comparison
  • Program Synthesis/Search: Depth-limited PCFG search (heap or beam search) with evaluation of all candidate programs over the examples, leveraging both correctness and prior-based weighting.
  • Execution: The induced program $p^*$ is transparent and fully executable, yielding not only predictions on held-out images but also programmatic explanations of the underlying concept.
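To make the execution step concrete, here is a hypothetical interpreter for the toy program tuples enumerated in Section 1; `vlm.get_objects` and `vlm.has_property` are assumed stand-ins for VLM-backed visual primitives, not the paper's API.

```python
# Hypothetical interpreter for the toy program tuples above. Only the logical
# operators are purely symbolic; the visual primitives are VLM-backed.
def execute(program, image, vlm) -> bool:
    op, *args = program
    if op == "exists":
        (obj,) = args
        return obj[0] in vlm.get_objects(image)            # VLM-grounded object check
    if op == "has_prop":
        obj, prop = args
        return vlm.has_property(image, obj[0], prop[0])    # VLM-grounded attribute check
    if op == "and":
        return execute(args[0], image, vlm) and execute(args[1], image, vlm)
    if op == "not":
        return not execute(args[0], image, vlm)
    raise ValueError(f"unsupported operator: {op}")
```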

3. Quantitative Evaluation and Empirical Results

VLPs have been benchmarked on a spectrum of visual reasoning datasets that probe systematic generalization, compositionality, and logical inference:

| Model | AVG | Bongard-HOI | Bongard-OpenWorld | Bongard-RWR | COCOLogic | CLEVR-Hans3 |
|---|---|---|---|---|---|---|
| InternVL3-8B | 57.4 | 60.5 | 59.2 | 47.2 | 71.5 | 48.3 |
| InternVL3-8B+VLP | 70.9 | 77.7 | 67.5 | 53.9 | 81.0 | 74.4 |
| Qwen3-VL-30B | 63.4 | 69.0 | 68.5 | 55.8 | 73.9 | 50.0 |
| Qwen3-VL-30B+VLP | 68.9 | 74.5 | 66.3 | 58.3 | 79.1 | 66.1 |

The VLP approach yields gains of up to +26 percentage points on tasks with complex logical structure (CLEVR-Hans3), along with robust improvements across compositional, abstract, and real-world concept induction. Crucially, VLP is model-agnostic and does not require any domain-specific detectors (Wüst et al., 24 Nov 2025).

4. Objectives, Limitations, and Ablations

The main objective is maximization of accuracy on the few-shot labeled pairs, with the program prior as a secondary criterion. Symbol priors are weighted by positive/negative occurrence frequency:

$$P(e) \propto \frac{n_\mathrm{pos}(e)+\varepsilon}{n_\mathrm{pos}(e)+n_\mathrm{neg}(e)+\varepsilon}$$

Ablations indicate the necessity of occurrence weighting for symbol selection (+0.7 average percentage points), and the inability of structured-prompting baselines (without symbolic search) to match VLP performance, confirming the role of explicit program induction in systematic generalization. Limitations include errors in VLM-based symbol extraction, incomplete symbol vocabularies for challenging concepts (e.g., size attributes in synthetic environments), and sensitivity to dataset label noise (Wüst et al., 24 Nov 2025).
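For illustration, the occurrence-weighted prior can be computed as below; the smoothing constant is an assumed default, not a value from the paper.

```python
# Occurrence-weighted symbol prior: P(e) proportional to
# (n_pos + eps) / (n_pos + n_neg + eps). The eps value is an assumed default.
def symbol_prior(symbol: str, pos_counts: dict, neg_counts: dict, eps: float = 1e-3) -> float:
    n_pos = pos_counts.get(symbol, 0)
    n_neg = neg_counts.get(symbol, 0)
    return (n_pos + eps) / (n_pos + n_neg + eps)
```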

5. Human-Interpretable Explanations and Shortcut Mitigation

A key advantage of VLPs over purely neural prediction is transparency. Induced programs are interpretable and auditable: users can inspect the sub-calls (e.g., the output of get_objects), diagnose spurious shortcuts (e.g., reliance on color over semantic class), and intervene directly by modifying the DSL or filtering symbol pools. For instance, in CLEVR-Hans3, removal of superficial cues (color words) from the property set forces the synthesizer to produce the correct compositional rule, resulting in an accuracy boost of +13 percentage points.
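A sketch of such an intervention, assuming the category-keyed symbol pool from the grounding sketch in Section 1; the list of color words is illustrative:

```python
# Manual shortcut mitigation: drop superficial color cues from the property
# pool so the synthesizer cannot build a color-based rule. The word list and
# pool layout are illustrative, not taken from the paper.
COLOR_WORDS = {"red", "blue", "green", "yellow", "gray", "purple", "cyan", "brown"}

def remove_color_shortcuts(symbol_pool: dict[str, set[str]]) -> dict[str, set[str]]:
    filtered = dict(symbol_pool)
    filtered["property"] = {p for p in symbol_pool.get("property", set()) if p not in COLOR_WORDS}
    return filtered
```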

6. Relationship to Neuro-Symbolic Reasoning and Vision-Language Modeling

VLPs sit at the intersection of two axes: perceptual flexibility (inherited from VLMs) and rigorous, modular reasoning (embodied by symbolic program synthesis). This hybrid approach avoids the rigidity of classical neuro-symbolic models that rely on engineered perception pipelines; instead, it leverages large VLMs for grounding while imposing the formality and interpretability of explicit DSL programs on the reasoning component. The resulting system can be incrementally debugged, extended, or adapted to new domains through DSL or symbol modifications, and it serves as a foundation for “explainable” multimodal reasoning under realistic data regimes (Wüst et al., 24 Nov 2025).

7. Extensions and Outlook

Given the training-free, compositional nature of VLPs and their reliance on scalable VLM perception, plausible future directions include extension to few-shot or zero-shot learning in entirely novel domains, plug-and-play adaptation to new symbol taxonomies, and integration with interactive or corrective user feedback. VLPs provide a template for principled explainability in vision-language intelligence by bridging perceptual AI and formal reasoning, with substantial implications for transparency, robustness, and systematicity.
