
ARMBench-VL: Multimodal Reward Benchmark

Updated 25 January 2026
  • ARMBench-VL is a benchmark that evaluates evidence verification and judgment accuracy in multimodal reward models across fine-grained perception, document QA, and instruction following tasks.
  • It employs specialized toolkits for explicit tool invocation, enabling models to retrieve and verify evidence with a transparent chain-of-thought.
  • The evaluation relies on pairwise and single-response accuracy metrics, which facilitate robust comparisons between agentic and non-agentic models.

ARMBench-VL is a tool-centric, multi-track benchmark designed to evaluate the judgment accuracy and evidence-verification behavior of multimodal reward models, with an explicit focus on models possessing agentic tool-use capabilities. Conceived in the context of aligning multimodal vision-LLMs with human preference signals, ARMBench-VL probes whether reward models can not only judge outputs but also autonomously invoke specific tools to ground their assessments in verifiable evidence across vision and text modalities (Ding et al., 4 Dec 2025).

1. Task Structure and Toolkits

ARMBench-VL is structured around three distinct multimodal judgment tasks, each associated with a specialized toolkit and evaluation format targeting different reasoning capabilities:

  1. Fine-grained Visual Grounding (Fine-grained Perception): The model is presented with a high-resolution image, a question pinpointing a local detail (e.g., "What color is the trim on the bird’s wing?"), and two or four candidate textual responses. The task is to determine which candidate best aligns with the detailed image region, leveraging an image crop-and-zoom tool (a minimal sketch of such a tool appears after this list). The agent can issue JSON-format tool calls specifying a bounding box to crop and magnify an image region, and receives the cropped segment as feedback (Appendix A.5). Outputs are pairwise (2-way or 4-way) rankings, e.g., "Answer 2 is better." Evaluation format: pair-rm (Appendix Fig 8).
  2. Multi-page Document Understanding (Multimodal Long Document QA): The model receives a document rendered as a sequence of page images, a natural-language question (e.g., "What is the membership-fee deadline?"), and one or more candidate answers. Using retrieval tools—doc_page_retrieval_by_query (dense retrieval via a CLIP-based index) and doc_page_retrieval_by_index (direct page access)—the agent can fetch relevant pages based on semantic queries or indices. The output is either a pairwise preference judgment or a single True/False decision according to the evaluation context. Retrieval is critical: answers often require information from specific, non-obvious pages.
  3. Instruction Following (Multimodal Instruction Verification): Given a user instruction (e.g., constraints on format, keywords, length), a candidate response, and an explicit constraint specification, the model must determine whether the response obeys all constraints. The toolkit consists of 19 text-checking validator tools (e.g., WordCountInRangeTool, NotContainSubstringTool), each accessible via function-call semantics. Output is a binary decision: "Overall Judgment: True" or "False" (Appendix Fig 7).
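
To make the first track's toolkit concrete, here is a minimal sketch of a crop-and-zoom tool, assuming a normalized [x1, y1, x2, y2] bounding-box convention and PIL; the minimum-pixel threshold and upsampling mirror the implementation notes in Section 6, but this is not the benchmark's actual code:

    from PIL import Image

    def image_crop_and_zoom_in(image: Image.Image, bbox: list[float],
                               min_pixels: int = 28, zoom: float = 2.0) -> Image.Image:
        """Crop a normalized [x1, y1, x2, y2] region of the image and magnify it."""
        w, h = image.size
        x1, y1 = int(bbox[0] * w), int(bbox[1] * h)
        x2, y2 = int(bbox[2] * w), int(bbox[3] * h)
        # Enforce a minimum crop size so the magnified region stays legible.
        if (x2 - x1) < min_pixels or (y2 - y1) < min_pixels:
            raise ValueError("crop region below minimum pixel threshold")
        crop = image.crop((x1, y1, x2, y2))
        # Upsample so fine details (e.g., the trim on a bird's wing) become visible.
        return crop.resize((int(crop.width * zoom), int(crop.height * zoom)),
                           Image.LANCZOS)

A JSON tool call such as { "name": "image_crop_and_zoom_in", "arguments": { "bbox": [0.4, 0.5, 0.6, 0.7] } } would then dispatch to this function; the argument schema shown here is an assumption.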

2. Dataset Composition and Annotation

Dataset statistics are summarized in the table below (Appendix A.4, Table 3):

Task                               # Samples   Single-RM   Pair-RM 2-way   Pair-RM 4-way
Fine-grained Perception            550         –           163             92
Multimodal Long Document QA        460         173         287             –
Multimodal Instruction Following   489         489         –               –
Total                              1,499       662         450             92
  • Source Datasets:

Fine-grained perception images are drawn from V*Bench/VisualProbe; multi-page document QA inputs are paged-PDF screenshots from MMLongBench-Doc (Multi-Page DocVQA); and instruction-following tasks come from MM-IFEngine.

  • Annotation Protocol:

For pairwise (pair-rm) tasks, each item pairs a "ground-truth" correct response ($r^+$) with one or three plausible but incorrect responses ($r^-$) generated by GPT-4o-mini. For single-response (single-rm) tasks, each candidate is labeled True/False depending on adherence to the specified constraints.

  • Design Considerations:

Items that the base LVLM already solves with 100% accuracy in a zero-shot setting are excluded, ensuring nontriviality (a filtering sketch follows below). Negative (incorrect) responses are controlled to be plausible yet incorrect; any that are excessively similar or out-of-scope are removed.
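
A hypothetical sketch of that nontriviality filter; the number of zero-shot trials and every name here are illustrative, with base_lvlm_answer standing in for the actual model call:

    def base_lvlm_answer(image, question) -> str:
        """Hypothetical stand-in for a zero-shot query to the base LVLM."""
        ...

    def is_trivial(item: dict, n_trials: int = 4) -> bool:
        """An item is trivial if the base model answers it correctly every time."""
        return all(base_lvlm_answer(item["image"], item["question"]) == item["ground_truth"]
                   for _ in range(n_trials))

    candidate_items: list[dict] = []  # raw pool of candidate items before filtering
    dataset = [item for item in candidate_items if not is_trivial(item)]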

3. Evaluation Protocols and Metrics

ARMBench-VL employs simple, accuracy-based metrics for all three tracks:

  • Pairwise Ranking Accuracy (2-way/4-way):

$$\text{Accuracy} = \frac{\text{number of times the model selects the ground-truth response}}{\text{total number of pair-rm questions}}$$

  • Single-Response Accuracy:

$$\text{Accuracy} = \frac{\text{number of correct True/False judgments}}{\text{total number of single-rm questions}}$$

  • Aggregate Score:

$$\text{ARMBench-VL}_{\mathrm{avg}} = \frac{A_{\mathrm{FG}} + A_{\mathrm{Doc}} + A_{\mathrm{IF}}}{3}$$

where $A_{\mathrm{FG}}$, $A_{\mathrm{Doc}}$, and $A_{\mathrm{IF}}$ denote accuracy on the three sub-tasks.

This design allows for precise, interpretable comparisons across models and tracks, reflecting both judgment correctness and procedural verifiability.
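
These metrics translate directly into code; a minimal sketch, with function and argument names that are illustrative rather than the benchmark's own:

    def pairwise_accuracy(selections: list[int], ground_truth: list[int]) -> float:
        """Fraction of pair-rm questions on which the ground-truth response is selected."""
        return sum(s == g for s, g in zip(selections, ground_truth)) / len(ground_truth)

    def single_accuracy(judgments: list[bool], labels: list[bool]) -> float:
        """Fraction of correct True/False judgments over single-rm questions."""
        return sum(j == l for j, l in zip(judgments, labels)) / len(labels)

    def armbench_vl_avg(a_fg: float, a_doc: float, a_if: float) -> float:
        """Unweighted mean of the three sub-task accuracies."""
        return (a_fg + a_doc + a_if) / 3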

4. Tool Invocation, Evidence Retrieval, and Chain-of-Thought

ARMBench-VL is intentionally constructed so that the pertinent evidence is inaccessible without proper tool invocation, making it a direct test of agentic reasoning. For example, Document QA items may require locating information deeply embedded on a specific page, with retrieval tools (e.g., doc_page_retrieval_by_query) being the only feasible mechanism to uncover it.

  • Logging and Traceability:
    • Whether tools are invoked.
    • Frequency and sequence of tool invocations.
    • Utilization strategies (e.g., query refinement, over-cropping).
  • Ablative Findings:

Empirical ablations (Fig 6) confirm that reward models trained solely on answer accuracy do not reliably learn to invoke tools, whereas ARM-Thinker’s two-stage reward (Eq 3 & 4, Sec 3.2.2) scaffolds sufficient tool use for successful verification.

This suggests that chain-of-thought and explicit tool invocation logging enable both fine-grained evaluation and diagnosis of planning behavior in multimodal judgment.
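
These logged traces lend themselves to simple post-hoc analysis. The following sketch extracts the invocation statistics listed above from a transcript in the <tool_call> format shown in Section 5; the parser itself is an illustrative assumption, not part of the benchmark:

    import json
    import re

    def tool_call_trace(transcript: str) -> list[dict]:
        """Parse the ordered sequence of JSON tool calls from an agent transcript."""
        raw = re.findall(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", transcript, re.DOTALL)
        return [json.loads(r) for r in raw]

    def invocation_stats(transcript: str) -> dict:
        """Summarize whether, how often, and in what order tools were invoked."""
        trace = tool_call_trace(transcript)
        return {
            "invoked": bool(trace),                     # whether any tool was used
            "count": len(trace),                        # frequency of invocations
            "sequence": [call.get("name") for call in trace],  # ordered tool names
        }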

5. Example Prompts, Outputs, and Templates

Prompts are standardized into three templates (App A.6):

  • Single Judge: For binary or scalar judgment (e.g., instruction following).
  • N-way Pairwise Judge: For ranking tasks (2-way or 4-way).
  • Agent Chain-of-Thought (CoT): Enforces an explicit sequence: <think> … </think> → <tool_call> … </tool_call> → <tool_response> … </tool_response> → <answer> … </answer> (Appendix Fig 9).

Illustrative prompt and output examples:

  • Fine-grained Perception (2-way):

    <start_of_instruction>
    What color is the rim of the bicycle wheel?
    <end_of_instruction>
    <start_of_resp_1> "The rim of the bicycle wheel is silver." <end_of_resp_1>
    <start_of_resp_2> "The rim of the bicycle wheel is gold." <end_of_resp_2>

    Agent:

    <think> The question asks for a color detail… </think>
    <tool_call> { "name":"image_crop_and_zoom_in", ... } </tool_call>
    <tool_response> [cropped image] </tool_response>
    <think> The cropped rim looks metallic-silver. </think>
    <answer>Answer 1 is better.</answer>

  • Document QA (single-rm): Uses a retrieval tool to find the relevant page, then checks the candidate answer against it.

  • Instruction Following (single-rm): Calls multiple text validator tools, each returning pass/fail for its constraint.

The inclusion of standardized prompt templates and CoT formatting underpins reproducibility and enables cross-model consistency in evaluation.
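
The enforced tag sequence also admits a mechanical format check. A minimal sketch, assuming the grammar implied by the example above (alternating <think> blocks with optional <tool_call>/<tool_response> pairs, closed by a single <answer>):

    import re

    COT_PATTERN = re.compile(
        r"^(?:<think>.*?</think>\s*"
        r"(?:<tool_call>.*?</tool_call>\s*<tool_response>.*?</tool_response>\s*)?)+"
        r"<answer>.*?</answer>\s*$",
        re.DOTALL,
    )

    def is_valid_cot(transcript: str) -> bool:
        """True iff the transcript follows the Agent CoT template's tag sequence."""
        return COT_PATTERN.match(transcript) is not None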

6. Infrastructure and Implementation Details

  • Tool APIs:

All tools are implemented with OpenAI-style function calling and a JSON argument schema. Image cropping enforces normalized bounding boxes with a minimum pixel threshold and optional upsampling. Document retrieval tools index pages with CLIP-ViT-B/32 embeddings stored in a persistent Chroma vector DB, concatenating top-k results as necessary.
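
Under that design, the dense-retrieval tool might be sketched as follows; the model identifier, collection name, and indexing helper are illustrative assumptions, though the CLIP-ViT-B/32-plus-Chroma pairing follows the text:

    import chromadb
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer("clip-ViT-B-32")  # CLIP: text and images share one embedding space
    client = chromadb.PersistentClient(path="./doc_index")
    pages = client.get_or_create_collection("doc_pages")

    def index_pages(page_images: list, doc_id: str) -> None:
        """Embed each rendered page image and store it with its page number."""
        embeddings = encoder.encode(page_images).tolist()
        pages.add(
            ids=[f"{doc_id}-p{i}" for i in range(len(page_images))],
            embeddings=embeddings,
            metadatas=[{"doc": doc_id, "page": i} for i in range(len(page_images))],
        )

    def doc_page_retrieval_by_query(query: str, k: int = 3) -> list[dict]:
        """Dense retrieval: metadata of the top-k pages for a semantic text query."""
        q = encoder.encode([query]).tolist()
        result = pages.query(query_embeddings=q, n_results=k)
        return result["metadatas"][0]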

  • Textual Validators:

These are built using NLTK for tokenization and regex for constraint matching, returning boolean check results per invocation.
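
Two of the named validators could plausibly look like this; the exact signatures are assumptions, while the NLTK-tokenization and regex approach follows the implementation notes:

    import re
    import nltk

    nltk.download("punkt", quiet=True)  # tokenizer data required by word_tokenize

    def word_count_in_range(response: str, lo: int, hi: int) -> bool:
        """WordCountInRangeTool: pass iff the token count lies within [lo, hi]."""
        return lo <= len(nltk.word_tokenize(response)) <= hi

    def not_contain_substring(response: str, banned: str) -> bool:
        """NotContainSubstringTool: pass iff `banned` never appears (case-insensitive)."""
        return re.search(re.escape(banned), response, re.IGNORECASE) is None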

  • Data Construction and Filtering:

Negative samples and question rephrasings are expanded via Qwen3-VL-235B-Thinking and GPT-4o-mini; all trivial questions are filtered out. Additional templates for 2-way/4-way long-response generation are provided to support reproducibility.

  • Reproducibility Measures:

Fixed prompt templates and detailed logging ensure that experiments are both reconstructible and comparable across implementations.

7. Significance in Multimodal Reward Modeling

ARMBench-VL serves as a controlled yet challenging environment for evaluating not only the outcome accuracy but also the evidence-retrieval and planning mechanisms utilized by agentic reward models. By requiring explicit tool invocation and documenting the reasoning chain, it introduces requirements unattainable by static, non-interactive reward scorers. The benchmark has been used to demonstrate that agentic models such as ARM-Thinker achieve substantial gains (+16.2% average on reward modeling benchmarks, +9.6% on tool use tasks) over non-agentic baselines, especially in tasks necessitating verifiable evidence and procedural correctness (Ding et al., 4 Dec 2025).

A plausible implication is that as reward modeling for multimodal generative systems advances, frameworks adhering to the ARMBench-VL philosophy—verifiable, tool-aware, and chain-of-thought transparent—will become increasingly central to both academic and applied evaluation standards.
