PyVision-RL: Multimodal Agentic Reinforcement Learning
- PyVision-RL is a reinforcement learning framework for multimodal agents that execute Python code to sustain agentic, multi-turn interactions in vision-language tasks.
- It uses an oversampling–filtering–ranking rollout strategy to mitigate interaction collapse and improve training diversity while managing token efficiency.
- The unified training pipeline integrates image (PyVision-Image) and video (PyVision-Video) agents with a modified GRPO algorithm and accumulative tool reward to enhance multi-step reasoning.
PyVision-RL is a reinforcement learning (RL) framework for open-weight agentic multimodal models designed to sustain stable multi-turn tool use in vision-language tasks. Addressing the phenomenon of "interaction collapse," in which models minimize tool use and regress to degenerate, non-agentic dialogues during RL, PyVision-RL implements an oversampling–filtering–ranking rollout strategy paired with an accumulative tool reward mechanism. The system includes a unified training pipeline for developing both image (PyVision-Image) and video (PyVision-Video) agents, using explicit Python code execution as the primitive tool interface and introducing on-demand context construction to reduce the computational cost of video-based reasoning (Zhao et al., 24 Feb 2026).
1. Objectives and Unified Design
PyVision-RL aims to train open-weight multimodal agents capable of (a) reasoning over images and videos, (b) executing multi-turn tool calls via Python, and (c) achieving stable RL-driven improvements. The system addresses several intrinsic challenges:
- Interaction collapse: Standard RL finetuning induces minimal code calling and short dialogue trajectories, undermining agentic reasoning.
- Unstable rollouts: Broken code execution or zero-variance rewards lead to unreliable gradients.
- Token inefficiency: Naive ingestion of lengthy videos exhausts context windows.
The framework employs a unified pipeline:
- Primitive tool interface: Agents interleave natural language and executable Python code blocks; code is executed in a sandbox with multimodal outputs injected into the LLM context.
- Dual scaffold design:
- PyVision-Image: Images are available both to the LLM and the Python runtime.
- PyVision-Video: Videos are loaded into the Python runtime; agents must explicitly sample and plot frames using on-demand context construction via code.
Training is structured as: (1) supervised finetuning (SFT) using synthetic tool-use traces (7K images, 44K videos), followed by (2) RL with a modified GRPO algorithm, utilizing mixed datasets for visual search, multimodal reasoning, and spatial reasoning.
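A single agentic turn of this interface can be sketched as follows; `run_python_block` is a hypothetical helper (not the paper's API) showing how a fenced code block in a model reply might be executed in a sandboxed subprocess, with its stdout returned for injection into the LLM context:

```python
import re
import subprocess
import sys
import tempfile

def run_python_block(reply: str, timeout_s: int = 10):
    """Extract the first fenced Python block from a model reply, run it in a
    subprocess, and return captured stdout (or an error string) so it can be
    appended to the dialogue context. Illustrative sketch only."""
    match = re.search(r"```python\n(.*?)```", reply, re.DOTALL)
    if match is None:
        return None  # no tool call in this turn
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(match.group(1))
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path], capture_output=True, text=True,
            timeout=timeout_s,
        )
        return proc.stdout if proc.returncode == 0 else f"[error] {proc.stderr}"
    except subprocess.TimeoutExpired:
        return "[error] execution timed out"
```

In the real system the sandbox also returns multimodal outputs (e.g. rendered plots) as image tokens, not just text; the subprocess-with-timeout structure mirrors the runtime-instability concerns noted in Section 7.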
2. Rollout Selection: Oversampling–Filtering–Ranking
To mitigate rollout collapse and improve training-signal diversity, PyVision-RL introduces an oversampling–filtering–ranking strategy. Given a target batch of $B$ prompt groups, the procedure is:
- Prompt oversampling: Sample $M$ prompts $\{x_i\}_{i=1}^{M}$ from the prompt pool, with $M > B$.
- Group rollouts: For each prompt $x_i$, generate $G$ rollouts $\{\tau_{i,j}\}_{j=1}^{G}$.
- Code execution & reward: Execute all generated Python code; mark and filter broken rollouts. Compute a reward $r_{i,j}$ for each surviving rollout.
- Difficulty estimation: For each group $i$, calculate the group mean $\mu_i$ and standard deviation $\sigma_i$ of its rewards.
- Filter groups: Discard groups with $\sigma_i = 0$ (all-correct or all-wrong) or containing only broken rollouts.
- Ranking: Rank the remaining groups by descending $\sigma_i$ (favoring moderate-difficulty prompts).
- Batch selection: Select the top $B$ groups and their valid rollouts for the training batch.
This process ensures exposure to both difficult and instructive trajectories, avoids training on trivial or broken rollouts, and stabilizes policy improvement.
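The filtering and ranking steps can be sketched in a few lines; `select_training_groups` is an illustrative helper, with each group represented simply as a (prompt, reward-list) pair whose broken rollouts have already been removed:

```python
import statistics

def select_training_groups(groups, batch_size):
    """groups: list of (prompt, rewards) pairs, where rewards holds the
    per-rollout rewards of that prompt's surviving rollouts.
    Returns up to `batch_size` groups, ranked by reward std (descending)."""
    scored = []
    for prompt, rewards in groups:
        if len(rewards) < 2:
            continue  # all-broken or single-rollout group: no usable signal
        sigma = statistics.pstdev(rewards)
        if sigma == 0:
            continue  # zero-variance group: trivially easy or hopeless
        scored.append((sigma, prompt, rewards))
    scored.sort(key=lambda t: t[0], reverse=True)  # moderate difficulty first
    return [(p, r) for _, p, r in scored[:batch_size]]
```

Ranking by descending standard deviation surfaces prompts whose rollouts mix successes and failures, which is exactly where group-relative advantages carry the most gradient signal.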
3. Accumulative Tool Reward and Agentic Behavior
The accumulative tool reward mechanism addresses tool-avoidant collapse by explicitly rewarding multi-turn tool use conditional on correct answers. The final reward takes the form
$$R = \mathbb{1}[\text{correct}] \cdot \left(1 + \lambda \, n_{\text{tool}}\right),$$
where $\mathbb{1}[\text{correct}]$ indicates answer correctness, $n_{\text{tool}}$ is the number of Python tool calls, and $\lambda > 0$ weights the accumulative tool bonus. This approach:
- Encourages extended, tool-centric reasoning chains only when they are correct.
- Discourages spurious or random tool usage by zeroing out the tool-use bonus on incorrect final answers.
- Empirically increases average tool call count and dialogue length without reducing accuracy.
Ablations show that omission of the accumulative tool reward leads to rapid collapse in tool use; its inclusion stabilizes long-horizon, multi-step agentic behavior.
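A minimal sketch of such a correctness-gated reward, with illustrative `bonus` and `cap` values rather than the paper's actual coefficients:

```python
def accumulative_tool_reward(correct: bool, n_tool_calls: int,
                             bonus: float = 0.1, cap: int = 4) -> float:
    """Correctness-gated tool reward: incorrect answers get 0 regardless of
    tool use, so the bonus cannot be farmed by spurious code calls. The
    `bonus` weight and call `cap` are assumed values for illustration."""
    if not correct:
        return 0.0  # zero out the tool-use bonus on wrong answers
    return 1.0 + bonus * min(n_tool_calls, cap)
```

The cap mirrors the finite turn budget: once the agent exhausts its allowed turns, additional calls earn nothing, so the incentive is toward useful rather than maximal tool use.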
4. On-Demand Visual Context Construction
To address prohibitive token costs associated with naively loading video content into the LLM, PyVision-Video delegates frame selection to explicit code calls. The process operates as follows:
- The full video is accessible only to the Python runtime, as `video_clue_0`.
- The agent generates Python calls such as `fetch_frames_and_plot(start, end, num_frames)` to retrieve and visualize sequences of frames.
- Visualized frames are injected as image tokens back into the LLM context (`mm_clue_t`).
Formally, at each round $t$:
- The LLM, conditioned on the current context $c_t$, produces code specifying a frame selection.
- Python extracts and visualizes the frames, yielding visual tokens that are appended to form $c_{t+1}$.
- The loop repeats until an answer is produced or the turn budget is exhausted.
This explicit, agent-driven context construction reduces context token usage by up to 9× compared to naive strategies, while preserving the agent’s autonomy to select task-relevant evidence.
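The round structure above can be sketched as an agent loop; `llm_step` and `fetch_frames_and_plot` are stand-in callables for the real model and Python runtime, and the action encoding is an assumption for illustration:

```python
def answer_video_question(llm_step, fetch_frames_and_plot, max_turns=4):
    """Agent loop sketch: the LLM sees only the frames it explicitly
    requests. `llm_step(context)` returns either ("fetch", start, end, n)
    or ("answer", text); `fetch_frames_and_plot` renders the requested
    frames into image tokens. Both callables are illustrative stand-ins."""
    context = []  # grows only with on-demand visual evidence
    for _ in range(max_turns):
        action = llm_step(context)
        if action[0] == "answer":
            return action[1]
        _, start, end, n = action
        context.append(fetch_frames_and_plot(start, end, n))
    return None  # turn budget exhausted without an answer
```

Because the context starts empty and only requested frames are appended, token usage scales with the agent's queries rather than with video length, which is the source of the reported savings over 1-FPS ingestion.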
5. Model Architecture and Training Protocol
The underlying model is Qwen2.5-VL-7B, a 7B-parameter transformer-based multimodal LLM (MLLM). Key characteristics include:
- Tokenization: Byte-pair encoding for text; images embedded via a frozen vision encoder.
- Fusion: Transformer blocks with cross-attention support for image tokens.
- Video scaffolding: Video frames are not directly embedded as tokens; only frames retrieved at runtime enter the LLM context.
Training regimen:
- SFT: 1 epoch across 7K image and 44K video examples using LLaMA-Factory.
- RL: 700 steps with the modified GRPO objective, oversampling 32 prompts per step; batch size up to 128; AdamW optimizer; standard-deviation normalization and PPO-style clipping omitted; max context length 32K tokens; max turn budget 4; trained on 8×NVIDIA H100 GPUs.
6. Empirical Results and Analysis
6.1 Benchmarks
Evaluation covers visual search (V*, HRBench-4K/8K), multimodal math (DynaMath, MathVerse, MathVision, WeMath), agentic reasoning (TIR-Bench), and video spatial reasoning (VSI-Bench).
6.2 Quantitative Outcomes
| Method | V* | HR4K | HR8K | DynaMath | MathVerse | WeMath | TIR-Bench |
|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 78.5 | 71.6 | 67.9 | 53.3 | 45.6 | 34.6 | 16.0 |
| Static Tools | 81.8 | 77.9 | 73.8 | 57.2 | 52.7 | 38.1 | — |
| PyVision-Image | 88.7 | 78.1 | 74.3 | 61.6 | 55.8 | 47.7 | 19.8 |
On video reasoning (VSI-Bench):
| Method | Avg | Token Usage |
|---|---|---|
| Qwen2.5-VL-7B (1 FPS) | 38.0% | 45K |
| VITAL | 41.8% | 20K |
| PyVision-Video | 44.0% | 5K |
Notable ablation findings:
- Increasing max turn budget from 2 to 4 yields +1.9% on V* at 600 steps.
- Accumulative tool reward delivers +1.9% on V* at step 500; removal results in tool use collapse.
- Standard-deviation ranking of rollouts reduces "positive with negative advantage" samples and enhances early-stage stability.
- Removing standard-deviation normalization from the advantage calculation smooths the advantage distribution and improves training stability.
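For reference, group-relative advantage computation with and without the standard-deviation normalization examined in the ablation might look like this minimal sketch:

```python
def grpo_advantages(rewards, std_normalize=False):
    """Group-relative advantages: subtract the group's mean reward;
    optionally divide by the group std (the variant the ablation removes).
    Minimal sketch of the advantage step, not the full GRPO objective."""
    mu = sum(rewards) / len(rewards)
    adv = [r - mu for r in rewards]
    if std_normalize:
        sigma = (sum(a * a for a in adv) / len(adv)) ** 0.5
        adv = [a / (sigma + 1e-8) for a in adv]  # epsilon guards sigma = 0
    return adv
```

Dropping the division leaves advantages on the raw reward scale, so low-variance groups no longer have their small signals amplified into noisy, outsized updates.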
7. Limitations and Prospects
Current limitations encompass:
- Potential security risks and runtime instability due to Python sandbox execution (timeouts, crashes).
- On-demand frame sampling is susceptible to missing key video evidence if agent selection heuristics are suboptimal.
- RL finetuning is compute-intensive, with training empirically restricted to 700 steps; scalability to larger models or longer videos remains unresolved.
Areas for future development include:
- Integration of learned critics or value networks to stabilize advantage estimation.
- Exploration of uncertainty-aware or attention-guided frame selection strategies to prevent omission of critical visual context.
- Extension to additional modalities (audio, 3D point clouds) and multi-agent interactive scenarios.
- Implementation of curriculum RL schedules to gradually increase turn budgets and task difficulty (Zhao et al., 24 Feb 2026).