PyVision: Agentic Vision with Dynamic Tooling (2507.07998v1)

Published 10 Jul 2025 in cs.CL, cs.AI, and cs.CV

Abstract: LLMs are increasingly deployed as agents, systems capable of planning, reasoning, and dynamically calling external tools. However, in visual reasoning, prior approaches largely remain limited by predefined workflows and static toolsets. In this report, we present PyVision, an interactive, multi-turn framework that enables MLLMs to autonomously generate, execute, and refine Python-based tools tailored to the task at hand, unlocking flexible and interpretable problem-solving. We develop a taxonomy of the tools created by PyVision and analyze their usage across a diverse set of benchmarks. Quantitatively, PyVision achieves consistent performance gains, boosting GPT-4.1 by +7.8% on V* and Claude-4.0-Sonnet by +31.1% on VLMsAreBlind-mini. These results point to a broader shift: dynamic tooling allows models not just to use tools, but to invent them, advancing toward more agentic visual reasoning.

Summary

  • The paper introduces a dynamic agentic framework that enables MLLMs to synthesize custom Python tools for visual reasoning.
  • It leverages iterative, code-based reasoning using Python's scientific libraries to enhance interpretability and adaptability.
  • Empirical results demonstrate significant performance gains over baselines, highlighting its effectiveness across diverse visual domains.

PyVision: Agentic Vision with Dynamic Tooling

PyVision introduces a novel agentic framework for multimodal LLMs (MLLMs), enabling them to autonomously generate, execute, and iteratively refine Python-based tools for visual reasoning tasks. Unlike prior approaches that rely on static toolsets or predefined workflows, PyVision leverages the coding capabilities of advanced MLLMs (e.g., GPT-4.1, Claude-4.0-Sonnet) to dynamically synthesize and employ custom visual tools tailored to each query and input. This paradigm shift allows for flexible, interpretable, and verifiable problem-solving in complex visual domains.

Framework Overview

PyVision operates as an interactive, multi-turn system where the MLLM receives a multimodal query, generates Python code to process the input (typically images or video), executes the code in an isolated runtime, and incorporates the results back into its context for further reasoning. This loop continues until the model produces a final answer. The system is designed to:

  • Encourage code-based reasoning: The system prompt instructs the MLLM to plan, generate, and reflect on code execution, using Python as the sole primitive tool.
  • Leverage Python’s scientific ecosystem: The model can access libraries such as OpenCV, Pillow, NumPy, Pandas, scikit-learn, and scikit-image, enabling a wide range of image processing and analysis operations.
  • Maintain process isolation and state: Each code block is executed in a subprocess, with cross-turn persistence of variables and state, supporting complex multi-step toolchains.
  • Facilitate structured I/O: Communication between the MLLM and runtime is handled via structured variable passing, avoiding direct file system dependencies.
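
To make the interaction loop concrete, here is a minimal sketch of a PyVision-style multi-turn driver. It is illustrative only: `query_mllm` and `extract_code` are hypothetical helpers (not the paper's API), and the paper executes code in an isolated subprocess with structured variable passing, whereas this sketch keeps state in a single in-process namespace for brevity.

```python
import contextlib
import io

def run_agent(query, image_path, query_mllm, extract_code, max_turns=8):
    """Multi-turn loop: the model writes Python, we run it, results feed back in."""
    namespace = {"IMAGE_PATH": image_path}      # state persists across turns
    context = [{"role": "user", "content": query}]

    for _ in range(max_turns):
        reply = query_mllm(context)             # model plans, then writes code or answers
        snippet = extract_code(reply)           # pull out the python block, if any
        if snippet is None:
            return reply                        # no code -> treat reply as the final answer

        # Execute the generated code against the persistent namespace,
        # capturing printed output (and errors) for the next turn.
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(snippet, namespace)
        except Exception as err:                # surface errors so the model can revise
            buffer.write(f"\n{type(err).__name__}: {err}")

        context.append({"role": "assistant", "content": reply})
        context.append({"role": "user",
                        "content": "Execution output:\n" + buffer.getvalue()})

    return "No final answer within the turn budget."
```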

Taxonomy and Patterns of Generated Tools

Analysis of code generated by PyVision across diverse benchmarks reveals a rich taxonomy of tools:

  • Basic Image Processing: Cropping, rotation, and contrast enhancement to focus on regions of interest or improve perceptual clarity.
  • Advanced Image Processing: Segmentation, detection, and OCR, dynamically constructed for mid- to high-level vision tasks.
  • Visual Prompting and Sketching: Rendering marks or lines on images to aid in counting, spatial reasoning, or geometric analysis.
  • Numerical and Statistical Analysis: Computing histograms, areas, perimeters, or other quantitative metrics for symbolic or mathematical reasoning.
  • Long-tail Operations: Creative, task-specific tools such as pixel-wise differencing for “spot the difference” tasks, demonstrating zero-shot tool synthesis.

The distribution of tool usage is highly task- and domain-dependent. For example, cropping dominates in fine-grained visual search, while numerical/statistical tools are prevalent in math and logic benchmarks. In medical imaging, contrast enhancement is frequently invoked, and segmentation is common in remote sensing.
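
As an illustration of what a synthesized long-tail tool might look like, the pixel-wise differencing mentioned above could take roughly the following shape. This is a hedged sketch, not code from the paper; the function name, file paths, and threshold are illustrative, and identical image dimensions are assumed.

```python
import numpy as np
from PIL import Image

def highlight_differences(path_a, path_b, threshold=30):
    # Load both images as same-shaped RGB arrays (assumes identical dimensions).
    a = np.asarray(Image.open(path_a).convert("RGB"), dtype=np.int16)
    b = np.asarray(Image.open(path_b).convert("RGB"), dtype=np.int16)

    # Per-pixel absolute difference, reduced over the color channels.
    mask = np.abs(a - b).max(axis=2) > threshold

    # Summarize where the two images disagree.
    if mask.any():
        ys, xs = np.nonzero(mask)
        print(f"{int(mask.sum())} differing pixels; bounding box "
              f"x=[{xs.min()}, {xs.max()}], y=[{ys.min()}, {ys.max()}]")
    else:
        print("No differences found.")

    # Return a copy with differing pixels highlighted in red for inspection.
    overlay = a.astype(np.uint8).copy()
    overlay[mask] = (255, 0, 0)
    return Image.fromarray(overlay)
```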

Empirical Results

PyVision demonstrates consistent and significant performance improvements across a suite of challenging benchmarks:

| Model | MathVista | MathVision-mini | MMMU | VisualPuzzles | VLMsAreBlind-mini | V* |
|---|---|---|---|---|---|---|
| GPT-4.1 (baseline) | 69.9 | 46.4 | 71.9 | 44.9 | 67.1 | 68.1 |
| PyVision-GPT-4.1 | 71.7 (+1.8) | 48.7 (+2.3) | 74.3 (+2.4) | 47.4 (+2.5) | 69.7 (+2.6) | 75.9 (+7.8) |
| Claude-4.0-Sonnet (baseline) | 71.4 | 48.0 | 74.4 | 42.7 | 48.1 | 56.5 |
| PyVision-Claude | 76.2 (+4.8) | 51.3 (+3.3) | 74.6 (+0.2) | 51.0 (+8.3) | 79.2 (+31.1) | 56.8 (+0.3) |

Notably, PyVision yields a +7.8% gain on V* for GPT-4.1 and a +31.1% gain on VLMsAreBlind-mini for Claude-4.0-Sonnet. These improvements are not uniform; rather, they amplify the inherent strengths of the backend model—perceptual tasks benefit more with perceptually strong models, while abstract reasoning tasks see greater gains with models excelling in reasoning.

Case Studies

Several qualitative examples illustrate PyVision’s capabilities:

  • Visual Search: Iterative cropping and OCR to identify small text in cluttered scenes.
  • Medical Imaging: Contrast enhancement and histogram analysis for subtle abnormality detection.
  • Symbolic Puzzles: Edge detection and contour analysis for geometric counting tasks.
  • Visual Sketching: Diagrammatic reasoning for math and science problems.
  • Video Understanding: Selective frame analysis and evidence synthesis for object counting in egocentric video.

These cases highlight the interpretability and adaptability of the generated toolchains, as well as the system’s ability to decompose complex tasks into verifiable computational steps.
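
For instance, the symbolic-puzzle case above could be served by a small OpenCV routine along the following lines. This is a sketch under illustrative assumptions (Canny thresholds, minimum contour area, and the idea that each puzzle shape yields one external contour), not the code PyVision actually generated.

```python
import cv2

def count_shapes(image_path, canny_low=50, canny_high=150, min_area=25.0):
    # Load the puzzle image in grayscale and extract edges.
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(image_path)
    edges = cv2.Canny(gray, canny_low, canny_high)

    # External contours approximate the individual shapes in the puzzle.
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

    # Ignore tiny contours that are likely noise or stray edge fragments.
    shapes = [c for c in contours if cv2.contourArea(c) >= min_area]
    print(f"Detected {len(shapes)} candidate shapes.")
    return len(shapes)
```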

Implications and Future Directions

PyVision’s agentic, dynamic tooling framework marks a significant step toward more general, autonomous, and creative AI systems for visual reasoning. By enabling MLLMs to invent and execute custom tools on demand, the approach overcomes the rigidity of static toolsets and unlocks new levels of flexibility and transparency.

Practical implications include:

  • Enhanced interpretability: Each reasoning step is explicit and inspectable, facilitating debugging and trust in model outputs.
  • Domain adaptability: The system can rapidly adapt to new domains (e.g., medical, remote sensing) without retraining or manual tool engineering.
  • Reduced reliance on external models: By using Python as the universal tool interface, PyVision avoids bottlenecks associated with fixed visual parsers.

Theoretical implications involve the study of agentic behavior in MLLMs, the emergence of tool synthesis as a core component of intelligence, and the interplay between perception and reasoning in multimodal agents.

Future research may explore:

  • Scaling to more complex, real-world tasks: Integrating PyVision with robotics, scientific discovery, or autonomous systems.
  • Improved safety and reliability: Ensuring robust code generation and execution, especially in safety-critical domains.
  • Meta-learning and tool reuse: Enabling models to remember, generalize, and optimize toolchains across tasks and sessions.
  • Integration with reinforcement learning: Incentivizing efficient tool use and exploration in open-ended environments.

In summary, PyVision demonstrates that dynamic, agentic tool generation is a powerful paradigm for advancing multimodal reasoning, with broad implications for both research and practical deployment of AI systems.
