Visual-ERM: Multimodal Reward Model

Updated 14 May 2026

Visual-ERM is a multimodal generative reward model that provides a precise evaluation of visual fidelity in vision-to-code outputs.
It leverages supervised error annotation and generative likelihoods to compute granular scalar and structured rewards for charts, tables, and SVGs.
Visual-ERM integrates with reinforcement learning via test-time scaling and KL-regularized policy optimization, achieving significant benchmark improvements.

Visual Equivalence Reward Model (Visual-ERM) is a multimodal generative reward model developed to address the limitations of reinforcement learning (RL) in vision-to-code applications, where models reconstruct structured visual inputs—charts, tables, SVGs—into executable or structured representations. Its core contribution is providing fine-grained, interpretable, and task-agnostic feedback for evaluating the visual fidelity of vision-to-code outputs directly in the rendered visual space, overcoming the shortcomings of both text-based and coarse vision-encoder rewards (Liu et al., 13 Mar 2026).

1. Background and Problem Context

Vision-to-code tasks involve translating structured visual artifacts (e.g., scientific charts, complex tables, vector graphics) into code or data representations (such as Python plotting scripts, Markdown tables, SVG markup). High visual fidelity is essential because minor discrepancies in rendered output—such as slight shifts in layout, color, or numeric value—can fundamentally alter semantics (e.g., swapped bars in a histogram, misaligned table cells). Practical applications include scientific figure analysis, automated UI generation, and document understanding.

Existing RL reward signals fall into two categories:

Textual rule-based metrics (such as edit distance, Tree Edit Distance Similarity) operate exclusively on code, failing to capture alignment, style, or minor numeric errors in the rendering. They are vulnerable to reward hacking, where syntactic edits produce high scores without semantic or visual accuracy.
Visual-embedding rewards (e.g., DINO) use global similarity of rendered images, but these are semantically coarse and insensitive to subtle yet crucial discrepancies (such as misplaced legends or minor layout errors).

Both categories yield overly loose or misaligned RL gradients, preventing consistent improvement on pixel-level visual equivalence—the standard by which humans judge output quality in these domains (Liu et al., 13 Mar 2026).
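The weakness of code-only metrics can be illustrated with a small sketch. It uses Python's `difflib` as a stand-in for an edit-distance reward; the two plotting snippets are hypothetical and are compared as strings only, never executed.

```python
import difflib

# Two hypothetical plotting snippets: the prediction swaps the first and last
# bar heights, which changes the chart's meaning but barely changes the text.
reference_code = "plt.bar(['A', 'B', 'C'], [10, 25, 40])"
predicted_code = "plt.bar(['A', 'B', 'C'], [40, 25, 10])"


def text_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], standing in for an edit-distance reward."""
    return difflib.SequenceMatcher(None, a, b).ratio()


similarity = text_similarity(reference_code, predicted_code)
print(f"text similarity: {similarity:.3f}")
```

The similarity stays above 0.9 even though the rendered chart would show the bars in reversed order, which is the kind of gap a reward computed in rendered visual space is meant to close.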

2. Visual-ERM Model Architecture and Reward Formulation

Visual-ERM is built on a large vision-language model (LVLM), Qwen3-VL-8B-Instruct, fine-tuned as a generative reward model. It combines supervised error annotation with generative likelihoods to deliver both structured and scalar rewards for RL.

Data Collection and Annotation:

Training pairs are composed of ground-truth rendered images $I^*$ and corrupted predictions $\hat{y}$, constructed by (i) targeted perturbations (edit-induced errors on ground-truth text/code) and (ii) natural inference errors from baseline LVLMs.
Each predicted code $\hat{y}$ is rendered to form $\hat{I} = \mathcal{R}_m(\hat{y})$.
A distillation pipeline employing GPT-5-mini annotates every visual discrepancy between the target and the prediction, recording each error's category (e.g., structure, data, style for charts; layout, numeric for tables; shape, style for SVGs), location, description, and severity (minor/moderate/critical).
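One way to picture an annotation record is as a small typed structure. The sketch below is illustrative only: the field names, the JSON layout, and the `parse_annotation` helper are assumptions, not the paper's actual schema.

```python
import json
from dataclasses import dataclass
from typing import List

# Severity labels named in the text above.
SEVERITIES = ("minor", "moderate", "critical")


@dataclass
class Discrepancy:
    """One annotated visual error (hypothetical field names)."""
    category: str      # e.g. "structure", "data", "style"
    location: str      # where in the image the error appears
    description: str   # free-text explanation
    severity: str      # one of SEVERITIES

    def __post_init__(self) -> None:
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")


def parse_annotation(raw: str) -> List[Discrepancy]:
    """Parse a JSON list of error objects into typed records."""
    return [Discrepancy(**item) for item in json.loads(raw)]


example = """[
  {"category": "data", "location": "second bar",
   "description": "bar height is 40 instead of 25", "severity": "critical"},
  {"category": "style", "location": "legend",
   "description": "legend placed at lower left", "severity": "minor"}
]"""
errors = parse_annotation(example)
print([e.severity for e in errors])
```

Validating the severity label at parse time keeps malformed annotations out of any later aggregation step.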

Supervised Reward Model Training:

Given $x = (I^*, \hat{I})$ and annotation sequence $a$, the reward model $f_\theta$ is trained to minimize the negative log-likelihood: $\mathcal{L}(\theta) = \mathbb{E}_{(x,a)}\left[-\log f_\theta(a \mid x)\right].$
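For an autoregressive model, the sequence likelihood factorizes over annotation tokens, so the loss for a single example is the sum of per-token negative log-probabilities. A minimal sketch, using made-up token probabilities rather than real model outputs:

```python
import math
from typing import Sequence


def sequence_nll(token_probs: Sequence[float]) -> float:
    """Negative log-likelihood of one annotation sequence.

    token_probs[t] is the probability the model assigned to the t-th
    ground-truth annotation token, given the image pair and earlier tokens.
    """
    return -sum(math.log(p) for p in token_probs)


# Illustrative values only: a confident model versus an uncertain one.
confident = sequence_nll([0.9, 0.8, 0.95])
uncertain = sequence_nll([0.5, 0.4, 0.3])
print(f"confident: {confident:.3f}, uncertain: {uncertain:.3f}")
```

Averaging this quantity over training pairs gives an empirical estimate of the expectation in the objective.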

Discrepancy Enumeration and Scalar Reward Transformation:

At inference or RL time, Visual-ERM receives $(I^*, \hat{I})$, predicts errors $\{e_k\}_{k=1}^K$ with severities $s_k \geq 0$, and aggregates them into an overall severity $S$. This score is normalized per task and mapped to a bounded scalar reward $r_{\mathrm{ERM}}$, so that more severe discrepancies yield a lower reward. A render-success reward $r_{\mathrm{RSR}}$, granted iff rendering succeeds, is added to give the total reward $r = r_{\mathrm{ERM}} + r_{\mathrm{RSR}}$.

Visual-ERM produces structured outputs (JSON error lists) as well as scalar rewards, enabling both granular analysis and simple RL integration (Liu et al., 13 Mar 2026).
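The path from a structured error list to a scalar reward can be sketched as follows. Everything numeric here is an assumption made for illustration: the severity weights, the cap `s_max` used for normalization, the size of the render-success bonus, and the choice to give zero reward when rendering fails are not taken from the paper.

```python
from typing import Iterable

# Assumed numeric weights for the three severity labels (not from the source).
SEVERITY_WEIGHTS = {"minor": 1.0, "moderate": 2.0, "critical": 3.0}


def severity_reward(severities: Iterable[str], s_max: float = 10.0) -> float:
    """Map severity labels to a reward in [0, 1]; 1.0 means no errors found."""
    total = sum(SEVERITY_WEIGHTS[s] for s in severities)
    normalized = min(total / s_max, 1.0)  # s_max is an assumed per-task cap
    return 1.0 - normalized


def total_reward(severities: Iterable[str], render_ok: bool, rsr: float = 0.1) -> float:
    """Add an assumed render-success bonus to the severity-based reward."""
    if not render_ok:
        # Assumption: with no rendered image there is nothing to compare.
        return 0.0
    return severity_reward(severities) + rsr


clean = total_reward([], render_ok=True)
flawed = total_reward(["critical", "minor"], render_ok=True)
broken = total_reward([], render_ok=False)
print(clean, flawed, broken)
```

Capping the normalized severity keeps the reward bounded regardless of how many errors the model enumerates, which is one simple way to obtain the bounded scalar the text describes.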

3. Reinforcement Learning Integration and Test-Time Reflection

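Visual-ERM is integrated into RL via a KL-regularized policy optimization paradigm (GRPO). For the policy $\pi_\theta$ with reference $\pi_{\mathrm{ref}}$, the objective in its standard GRPO form is: $\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\left(\rho_i A_i,\ \mathrm{clip}(\rho_i, 1-\epsilon, 1+\epsilon)\, A_i\right)\right] - \beta\, D_{\mathrm{KL}}\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right),$ where $\rho_i = \pi_\theta(\hat{y}_i \mid I^*) / \pi_{\theta_{\mathrm{old}}}(\hat{y}_i \mid I^*)$ is the likelihood ratio for the $i$-th of $G$ sampled outputs and $A_i$ is its group-normalized advantage computed from the Visual-ERM reward. This formulation combines likelihoods, reward attribution, and policy regularization.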
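GRPO, as commonly described, scores each sampled output relative to the other samples drawn for the same input instead of using a learned value function. The sketch below shows only that group-relative normalization step; the reward values are invented, and the use of the population standard deviation with a small `eps` is an implementation assumption.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of samples for the same input."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical reward-model scores for four sampled outputs of one input image.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Samples scoring above the group mean receive positive advantages and are reinforced; when all samples score the same, every advantage is zero and that group contributes no reward-driven update.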

Test-time scaling (TTS) employs Visual-ERM for reflection and revision: after an initial prediction $\hat{y}^{(0)}$, the system renders it, scores it, and extracts error feedback $e^{(0)}$. If the reward $r^{(0)}$ is low, the policy conditions on $e^{(0)}$ to generate a revision $\hat{y}^{(1)}$, iterating for 2–3 rounds. This iterative correction significantly improves final output quality by catching errors that a single forward pass misses (Liu et al., 13 Mar 2026).
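The reflect-and-revise loop can be sketched with stand-in components. The `generate`, `render`, and `judge` callables, the stopping threshold, and the toy task (matching a list of integers) are all hypothetical placeholders for the policy, the renderer, and Visual-ERM.

```python
from typing import Any, Callable, Tuple


def reflect_and_revise(
    target: Any,
    generate: Callable,   # (target, previous_code, feedback) -> code
    render: Callable,     # code -> rendered output, or None on failure
    judge: Callable,      # (target, rendered) -> (reward, feedback)
    threshold: float = 0.9,
    max_revisions: int = 3,
) -> Tuple[Any, float]:
    """Return the best (code, reward) found within max_revisions revisions."""
    code, feedback = None, None
    best_code, best_reward = None, float("-inf")
    for _ in range(max_revisions + 1):  # one initial draft plus revisions
        code = generate(target, code, feedback)
        rendered = render(code)
        if rendered is None:
            reward, feedback = 0.0, ["render failed"]
        else:
            reward, feedback = judge(target, rendered)
        if reward > best_reward:
            best_code, best_reward = code, reward
        if reward >= threshold:
            break  # good enough; stop revising
    return best_code, best_reward


# Toy stand-ins: "code" is a list of ints and rendering is the identity.
def toy_generate(target, previous, feedback):
    if previous is None:
        return [0] * len(target)       # deliberately wrong first draft
    revised = list(previous)
    if feedback:
        idx = feedback[0]              # fix the first reported position
        revised[idx] = target[idx]
    return revised


def toy_judge(target, rendered):
    mismatches = [i for i, (t, r) in enumerate(zip(target, rendered)) if t != r]
    return 1.0 - len(mismatches) / len(target), mismatches


best_code, best_reward = reflect_and_revise(
    [1, 2, 0], toy_generate, lambda code: code, toy_judge
)
print(best_code, best_reward)
```

Tracking the best-scoring draft, not just the last one, guards against a revision that makes the output worse.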

4. Empirical Evaluation and Benchmark Results

Visual-ERM demonstrates superior RL and inference-time performance on structured vision-to-code tasks. Benchmarks include Chart-to-Code (ChartMimic), Table-to-Markdown (OmniDocBench, olmOCRBench), and SVG-to-Code (UniSVG/ISVGEN). The key results are:

Metric / Task	Baseline (Qwen3-VL-8B)	RL + DINO	RL + Visual-ERM	Absolute Gain (Visual-ERM)
ChartMimic (overall)	≈69.6	≈76.1	78.0	+8.4
Table avg (OmniDocBench)	≈76.8	65.3 (DINO)	79.5	+2.7
SVG-to-Code (ISVGEN)	≈64.2	71.0 (DINO, weak)	68.3	+4.1

Further, applying Visual-ERM RL to VinciCoder-8B-SFT (72.9/61.9) raises chart-to-code performance by +10.1 on average. Test-time reflection with Visual-ERM (TTS) alone lifts the ChartMimic direct score from 67.7 to ≈75.6, and exceeds 80% when combined with RL (Liu et al., 13 Mar 2026).

5. VisualCritic-RewardBench Development and Comparative Assessment

VisualCritic-RewardBench (VC-RewardBench) is designed to rigorously measure fine-grained image-to-image discrepancies in structured domains. The benchmark comprises 1,335 high-quality instances from 4,500 candidate pairs, annotated using a pipeline involving GPT-5-mini, Gemini-2.5-Pro, Gemini-3-Pro, and PhD-level human consolidation.

Scoring Protocol:

Predicted error items are matched with ground-truth annotations using an LLM-as-Judge, scoring strict F1, soft F1, and the Pearson correlation of overall severity.
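Once the judge has decided which predicted errors match ground-truth errors, the reported metrics reduce to standard formulas. The sketch below assumes the judge supplies a match count; how strict and soft matching differ is left to the judge and is not modeled here. All numbers are invented for illustration.

```python
import math
from typing import Sequence


def f1_score(num_matched: int, num_predicted: int, num_reference: int) -> float:
    """F1 from a judge-provided count of matched error items."""
    if num_matched == 0 or num_predicted == 0 or num_reference == 0:
        return 0.0
    precision = num_matched / num_predicted
    recall = num_matched / num_reference
    return 2 * precision * recall / (precision + recall)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between predicted and reference overall severities."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical instance: 4 predicted errors, 6 reference errors, 3 matched.
f1 = f1_score(num_matched=3, num_predicted=4, num_reference=6)
# Hypothetical overall severities across three benchmark instances.
corr = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(f"F1 = {f1:.2f}, Pearson = {corr:.2f}")
```

The `pearson` helper assumes both severity lists have nonzero variance; a production version would need to handle the constant-input case.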

Model	F1 (strict)	F1 (soft)	Pearson correlation (severity)
Qwen3-VL-8B	5.3	6.5	17.5
Qwen3-VL-235B	29.5	32.4	56.2
Visual-ERM-8B	42.1	44.7	58.4
GPT-5.2 (closed)	32.7	35.0	58.9

This indicates that a specialized 8B model trained on fine-grained discrepancy annotations outperforms much larger generalist LVLMs on both F1 measures. It also exceeds GPT-5.2 on F1 and nearly matches it on severity correlation (58.4 vs. 58.9).

6. Ablations, Interpretability, and Known Limitations

Ablation studies reveal:

Multi-task data yields stronger generalization: Training on mixed Chart+Table+SVG data delivers a strict F1 of 42.1 (vs. ≈32 for any single-task variant). In Table RL, the mixed-data reward model yields a +2.7 average improvement (vs. +1.8 for the table-only RM).
Render-success reward (RSR): Adding the render-success reward to the severity-based reward slightly improves stability and parsing metrics (e.g., Table avg: 79.5 vs 79.0).
TTS iterations: Three rounds of test-time scaling capture most benefit; two are weaker, and four show diminishing returns.
Judge robustness: F1 scores on VC-RewardBench vary by at most ~2 points depending on the LLM judge model.

Visual-ERM uniquely outputs structured JSON error summaries (type, location, description, severity), offering actionable, interpretable, and task-agnostic evaluation not reliant on task-specific reward design.

Limitations:

The annotation pipeline depends on proprietary GPT-5-mini outputs and human validation, implying annotation scalability costs.
The current taxonomy covers only select error types (charts: structure/data/text/style; tables: layout/text/numeric; SVG: shape/style/text_symbol/structure), requiring extension to new domains like UI layouts or infographics.
Evaluation and RL are conducted on 8B-parameter policies. Integration with much larger LLMs, or alternative preference/ranking paradigms (such as BT or DPO), remains unexplored.

A plausible implication is that fine-grained, cross-modal reward supervision in rendered visual space is both necessary and sufficient to enhance RL for vision-to-code tasks, as evidenced by the robust gains from an 8B reward model across multiple domains. Visual-ERM also establishes a methodology for interpretable test-time reflection and sets a new standard for evaluation benchmarks of image-to-image fidelity (Liu et al., 13 Mar 2026).

References (1)

Liu et al. (2026). Visual-ERM: Reward Modeling for Visual Equivalence.
