PhysToolBench Evaluation Framework
- PhysToolBench is an evaluation framework for benchmarking MLLMs' capability in recognizing, understanding, and using physical tools across diverse domains.
- It utilizes a rigorously curated dataset of 1,000 image–text pairs generated via a three-stage pipeline, ensuring tasks span easy to hard difficulty levels.
- The framework employs metrics such as accuracy, precision, recall, and a composite overall score to reveal performance gradients and diagnose failure modes in state-of-the-art models.
PhysToolBench is an evaluation framework specifically designed to benchmark the ability of Multimodal LLMs (MLLMs) to recognize, understand, and inventively utilize physical tools. This benchmark uniquely quantifies MLLMs' comprehension of tools, which is a fundamental aspect of embodied intelligence and critical for advanced Vision-Language-Action (VLA) and general-purpose agent deployment. PhysToolBench organizes a suite of challenging, systematically tiered tasks based on visual question answering (VQA), exposing both the strengths and persistent deficiencies of state-of-the-art MLLMs in this domain (Zhang et al., 10 Oct 2025).
1. Dataset Design and Structure
PhysToolBench consists of 1,000 rigorously curated image–text pairs across four fundamental domains: Daily Life, Industrial, Outdoor Activities, and Professional Settings. The dataset was constructed via a three-stage pipeline:
- Phase 1 (Conceptualization): A pool of 1,500 expert-authored task–scene descriptions was refined down to 1,000 final instances.
- Phase 2 (Image Generation): Approximately 90% of images were generated using GPT-4o-image with human-in-the-loop prompt refinement. The remaining 10% are photographs of real scenes.
- Phase 3 (Annotation & Verification): Each candidate object within a scene received an expert-assigned numeric label. A separate audit team verified image quality, annotation consistency, and correctness.
Each data instance is associated with one of three difficulty levels, reflecting progressively deeper forms of tool knowledge.
| Difficulty | Task Category | Example Task |
|---|---|---|
| Easy | Tool Recognition | Select the knife for cutting vegetables |
| Medium–M.1 | Attribute Understanding | Choose the most heat-resistant skillet |
| Medium–M.2 | Tool Combination | Select flashlight and batteries |
| Medium–M.3 | Availability | Detect that all plungers are broken (“None” as ground truth) |
| Hard | Tool Creation | Substitute a screwdriver with a coin |
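A single benchmark instance can be modeled as a small record pairing a scene image with a task, a difficulty tier, and the numeric labels of the correct tool(s). The sketch below assumes an illustrative schema; the field names are hypothetical, not taken from the official release:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical schema for one PhysToolBench instance; field names are
# illustrative, not from the benchmark's released format.
@dataclass
class ToolBenchInstance:
    image_path: str                  # generated or photographed scene
    task: str                        # natural-language task description
    difficulty: str                  # "Easy", "M.1", "M.2", "M.3", or "Hard"
    domain: str                      # e.g. "Daily Life", "Industrial"
    answer: List[int] = field(default_factory=list)  # object labels; empty = "None"

    def is_none_answer(self) -> bool:
        # M.3 availability tasks use "None" as ground truth when every
        # candidate tool is broken or otherwise unusable.
        return len(self.answer) == 0

# Example: an availability (M.3) instance where all plungers are broken.
inst = ToolBenchInstance(
    image_path="scenes/plungers.png",
    task="Select a working plunger to unclog the sink.",
    difficulty="M.3",
    domain="Daily Life",
)
```

Encoding "None" as an empty answer set keeps single-label, multi-label (M.2), and availability (M.3) instances under one representation.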
2. Evaluation Protocols and Metrics
Benchmark performance is assessed via the following metrics and procedures:
- Accuracy: For single-label tasks,

$$\text{Accuracy} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\left[\hat{y}_i = y_i\right],$$

where $N$ is the total number of instances, $\hat{y}_i$ is the model's prediction, and $y_i$ is the ground-truth answer.
- Precision/Recall/F1: Applied to multi-label sub-tasks (notably M.2 Tool Combination), with the standard definitions $P = \frac{TP}{TP + FP}$, $R = \frac{TP}{TP + FN}$, and $F_1 = \frac{2PR}{P + R}$.
- Composite Overall Score: an aggregate of per-category results, used to rank models on the benchmark as a whole.
- Experimental Setup: All models receive a unified system prompt, which (a) clarifies that only the provided objects are available, (b) instructs stepwise reasoning (Chain-of-Thought), and (c) requires a categorical output or "None". Evaluations are strictly zero-shot; no in-context exemplars are included. Open-source and specialized models are run locally (NVIDIA A100 GPUs, PyTorch/HuggingFace stack); proprietary models are accessed via official APIs.
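The metrics above reduce to short, self-contained computations. The sketch below treats every answer as a set of object labels (the empty set encoding "None"), which covers single-label, multi-label, and availability tasks uniformly; this unified representation is an assumption for illustration:

```python
from typing import List, Set, Tuple

def accuracy(preds: List[Set[int]], golds: List[Set[int]]) -> float:
    """Exact-match accuracy; an empty set encodes the 'None' answer."""
    correct = sum(p == g for p, g in zip(preds, golds))
    return correct / len(golds)

def prf1(pred: Set[int], gold: Set[int]) -> Tuple[float, float, float]:
    """Per-instance precision, recall, and F1 for multi-label tasks (M.2)."""
    tp = len(pred & gold)                       # correctly selected tools
    p = tp / len(pred) if pred else 0.0         # fraction of selections correct
    r = tp / len(gold) if gold else 0.0         # fraction of required tools found
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1
```

For example, predicting {flashlight, charger} when the gold combination is {flashlight, batteries} yields precision 0.5, recall 0.5, and F1 0.5.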
3. Model Coverage and Reasoning Modes
A total of 32 MLLMs were systematically benchmarked, including:
- Proprietary General-Purpose: GPT-5, o3, GPT-4o, Claude 3.7 Sonnet-thinking, Gemini-2.5-pro, Grok-4
- Open-Source General-Purpose: Qwen-2.5-VL (72B/32B/7B/3B), InternVL-3.5, InternVL-3, GLM-4.5V-108B, Ovis-2, DeepSeek-VL-2, Kimi-VL-A3B-thinking
- Embodied-Specific: RoboBrain-2, Embodied-R1, Magma-8B (agent-centric fine-tuning)
- VLA Backbone Models: Prismatic-7B (OpenVLA), PaliGemma-3B (π₀), Qwen-2-VL-2B (DexVLA), Phi-3-Vision-4B (TraceVLA)
Experimentation also compared three reasoning modalities:
- Standard zero-shot prompting
- Chain-of-Thought (CoT) prompting
- Vision-Centric Reasoning (described in Section 6)
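The unified zero-shot protocol can be sketched as a small prompt builder that toggles between standard and CoT modes. The wording below is illustrative only, not the benchmark's official prompt text:

```python
def build_prompt(task: str, num_objects: int, mode: str = "cot") -> str:
    """Assemble a zero-shot evaluation prompt in the spirit of the
    benchmark's unified system prompt. The phrasing is a hypothetical
    reconstruction, not the official text."""
    lines = [
        # (a) only the depicted, numbered objects are available
        f"Only the {num_objects} numbered objects shown in the image are available.",
        f"Task: {task}",
    ]
    if mode == "cot":
        # (b) instruct stepwise Chain-of-Thought reasoning
        lines.append("Think step by step before answering.")
    # (c) require a categorical output or "None"
    lines.append('Answer with the object number(s), or "None" if no suitable tool exists.')
    return "\n".join(lines)
```

Keeping the availability clause and the explicit "None" option in every prompt matters for M.3 tasks, where the correct answer is that no usable tool exists.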
4. Empirical Findings
Human annotators achieve 87.85–93.19% overall accuracy, while the best-performing MLLM (GPT-5) reaches only 62.15%. There is a marked performance gradient across both task difficulty and domain. The table below gives partial per-level results for highlighted models:
| Model | Easy | M.1 | M.2 | M.3 | Hard |
|---|---|---|---|---|---|
| GPT-5 | 90.16% | 63.83% | 50.35% | 36.75% | 46.04% |
| GPT-4o | 86.03% | 70.74% | 48.23% | 35.54% | 44.06% |
| o3 | 93.02% | 67.02% | 46.81% | 22.89% | 49.50% |
| GLM-4.5V | 90.48% | 65.43% | 36.88% | 16.27% | 35.15% |
Performance also varies by scene category: Professional and Industrial scenes range from roughly 18% up to a best case of ~67–68%, Outdoor scenes from ~13% to ~61%, and Daily Life scenes from ~4% to ~58%.
Key findings:
- Model size strongly correlates with performance: Open-source models >10B parameters show emergent tool recognition competence (60–70% on Easy).
- Long-tail recognition remains problematic: Models fail consistently on rare or visually similar tools (e.g., HDMI vs. DisplayPort).
- Embodied model fine-tuning yields negligible benefit: RoboBrain-2 and Embodied-R1 perform no better than, and sometimes worse than, their backbone models.
- Severe “availability hallucinations”: On M.3 tasks (detecting brokenness/functionality), models often achieve less than 10% accuracy.
- VLA backbone models are consistently weak (<15% overall), raising concerns about their foundational “common sense.”
- CoT prompting provides limited improvement (~3–6% overall), though the "thinking" modes of some open-source models outperform their standard counterparts by ~10%.
5. Failure Modes and Diagnostic Insights
Frequent errors include misidentifying subtle visual features (e.g., Type-C vs. Lightning plugs), failing to detect tool breakage (cracked plunger judged usable), and incorrectly reasoning about physical compatibility (mismatching screwdriver heads with screws). Open-source models underperform on more intricate reasoning tasks even when overall scale is substantial. Chain-of-Thought prompting offers modest gains, but several reasoning bottlenecks persist, particularly in tasks where non-obvious visual cues, latent affordances, or disqualifying evidence must be integrated.
6. Preliminary Solutions: Vision-Centric Reasoning
A new vision-centric reasoning pipeline was proposed to remedy text-dominant limitations of standard CoT. This pipeline comprises:
- Global Analysis: Joint parsing of prompt and the entire image for contextual grounding.
- Object Cropping & In-depth Analysis: DINOX object detection is used to crop each numbered tool; each crop is subjected to focused re-analysis.
- Multi-level Integration: Outputs from both global and per-object analysis are fused for final prediction.
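The three stages above can be sketched as a single function with the detector and MLLM injected as callables. The control flow follows the pipeline described in the text, but `detect_objects` and `query_mllm` are placeholder interfaces (standing in for DINOX and an MLLM API) with hypothetical signatures:

```python
def vision_centric_answer(image, prompt, detect_objects, query_mllm):
    """Sketch of the vision-centric reasoning pipeline.
    `detect_objects(image)` -> iterable of (object_id, bounding_box);
    `query_mllm(image, text)` -> str. Both are assumed interfaces."""
    # Stage 1 - Global analysis: parse prompt and full image together.
    global_view = query_mllm(image, f"Describe the scene as it relates to: {prompt}")

    # Stage 2 - Object cropping & in-depth analysis: crop each detected
    # tool and re-examine it in isolation (condition, damage, affordances).
    per_object = {}
    for obj_id, box in detect_objects(image):
        crop = image.crop(box)  # assumes a PIL-style crop(box) method
        per_object[obj_id] = query_mllm(
            crop, f"Is object {obj_id} intact and suitable for: {prompt}?"
        )

    # Stage 3 - Multi-level integration: fuse global and per-object
    # findings into one final query.
    evidence = global_view + "\n" + "\n".join(
        f"Object {i}: {a}" for i, a in sorted(per_object.items())
    )
    return query_mllm(image, f"{prompt}\nEvidence:\n{evidence}\nFinal answer:")
```

Forcing a dedicated pass over each cropped object is what lets the model notice disqualifying details (e.g. a cracked plunger) that a single whole-image pass tends to miss.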
Notably, this strategy increased M.3 (availability) accuracy for GPT-4o from 35.54% to 45.78% (+10.24%), and for GPT-5 from 36.75% to 54.81% (+18.06%). Ablation studies confirm that vision-centric reasoning outperforms CoT alone on availability tasks by 10–18%. This suggests that more integrated visual-object reasoning is crucial for affordance and functionality assessment.
7. Open Challenges and Research Directions
PhysToolBench highlights multiple directions for future research:
- Dataset Expansion: Incorporate 3D, video, and actual robot operation scenarios, which are expected to reveal further gaps in spatiotemporal tool reasoning.
- Modeling Innovations: Pursue architectures integrating graph-based reasoning over object-part relationships and tight coupling with physics simulators.
- Fine-tuning Regimens: Curate tool-centric text and image corpora, as well as logs from annotated robot executions, to target tool-specific learning.
- Safety-Critical Reasoning: Mitigate hallucinations, particularly in tool availability, to prevent dangerous errors (e.g., recommending use of broken or unsafe equipment).
- Benchmark Evolution: Extend scenarios to require temporal sequencing of tool use, multi-agent collaboration, and richer environmental diversity.
PhysToolBench thus establishes a new evaluation paradigm for probing the depth and reliability of MLLMs’ tool-related physical commonsense, defining the standard for measuring and advancing embodied intelligence (Zhang et al., 10 Oct 2025).