
MLLM-as-Jury Evaluation Framework

Updated 4 December 2025
  • The paper introduces the MLLM-as-Jury framework, an iterative pipeline that integrates an LLM, a text-to-image model (T2IM), and an MLLM to automate visual pun evaluation and refinement.
  • It employs targeted edit suggestions and a precise stopping rule based on idiom recognition and confidence thresholds, typically converging within 2–3 iterations.
  • Empirical results reveal a 45-point accuracy spread among different MLLMs, underscoring the critical influence of jury selection on multimodal performance.

The “MLLM-as-Jury” framework refers to evaluation protocols and architectures in which one or more Multimodal LLMs (MLLMs) serve as an automated “jury”—forming judgments, aggregating answers, or iteratively refining outputs in multimodal generative tasks. This paradigm is instantiated most concretely in the context of visual pun generation and recognition, as detailed in “Visual Puns from Idioms: An Iterative LLM-T2IM-MLLM Framework” (Xiao et al., 28 Nov 2025). The MLLM-as-Jury model establishes an interaction loop between an LLM, a text-to-image model (T2IM), and an MLLM; the MLLM acts as an automated jury whose decision directly governs when the iterative creative process halts and how corrective feedback is issued.

1. High-Level Pipeline Architecture

The MLLM-as-Jury framework is formulated as a four-stage iterative pipeline, parameterized by a target idiom $I_{\rm input}$, an LLM, a T2IM, an MLLM, and a maximum iteration count $T_{\max} = 5$. The main stages are:

  1. LLM Prompt Generation: The LLM decomposes $I_{\rm input}$ into literal/figurative elements and, conditioned on the previous prompt $P_{t-1}$ and edit suggestions $U_{t-1}$, produces a new visual description prompt $P_t$.
  2. Image Synthesis: The T2IM (fixed to Qwen-Image at $1024 \times 1024$) renders $P_t$ into an image $G_t$.
  3. MLLM Comprehension (Jury): The MLLM receives $G_t$ (and, optionally, $I_{\rm input}$ as a candidate label) and attempts to infer which idiom it encodes, returning a top-1 guessed idiom $R_t$ and a confidence score $s_t \in [0,1]$.
  4. Prompt Refinement: If the jury’s verdict $R_t$ matches $I_{\rm input}$, or $s_t \geq \tau$, or $t = T_{\max}$, the process halts. Otherwise, the MLLM produces targeted edits $U_t$ (e.g., compositional or object-level suggestions), which are fed back to the LLM for the next iteration.

This cyclical pipeline continues until the MLLM jury recognizes the idiom or the iteration budget is exhausted. The flow is formally defined in Algorithm 1 of the paper, whose pseudocode captures each update, verdict, and stopping condition; a minimal code sketch of the loop follows.
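As a concrete illustration, here is a minimal Python sketch of the loop under stated assumptions: `llm_prompt`, `t2im_render`, `mllm_infer`, and `mllm_update` are hypothetical callables standing in for the respective model backends, not interfaces from the paper.

```python
# Minimal sketch of the MLLM-as-Jury loop (paraphrasing Algorithm 1).
# llm_prompt, t2im_render, mllm_infer, and mllm_update are hypothetical
# callables standing in for the LLM, T2IM, and MLLM backends.

def mllm_as_jury(idiom, llm_prompt, t2im_render, mllm_infer, mllm_update,
                 t_max=5, tau=0.5):
    """Iterate until the jury recognizes the idiom, its confidence
    clears tau, or the iteration budget t_max is exhausted."""
    prompt, edits, image = None, None, None         # P_0, U_0 start empty
    for t in range(1, t_max + 1):
        prompt = llm_prompt(idiom, prompt, edits)   # P_t <- LLM_prompt(I, P_{t-1}, U_{t-1})
        image = t2im_render(prompt)                 # G_t <- T2IM(P_t)
        guess, conf = mllm_infer(image)             # (R_t, s_t) <- MLLM_infer(G_t)
        if guess == idiom or conf >= tau:           # Match_t
            return image, prompt, t, True
        edits = mllm_update(guess, image, idiom)    # U_t <- MLLM_update(R_t, G_t, I)
    return image, prompt, t_max, False
```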

2. Formalization and Decision Criteria

The iterative process is mathematically formalized as follows:

  • Inputs: $I_{\rm input}$, LLM, T2IM, MLLM, $T_{\max}$, $\tau$
  • Initialization: $P_0 \gets \varnothing$, $U_0 \gets \varnothing$
  • At iteration $t$:

$$
\begin{aligned}
P_t &\gets \mathrm{LLM}_{\rm prompt}(I_{\rm input}, P_{t-1}, U_{t-1}) \\
G_t &\gets \mathrm{T2IM}(P_t) \\
(R_t, s_t) &\gets \mathrm{MLLM}_{\rm infer}(G_t) \\
\mathrm{Match}_t &= [R_t = I_{\rm input}] \lor [s_t \geq \tau]
\end{aligned}
$$

If $\mathrm{Match}_t = 1$, terminate; otherwise, $U_t \gets \mathrm{MLLM}_{\rm update}(R_t, G_t, I_{\rm input})$ is used to generate the next prompt.

The function $f_{\rm judge}(G, q) = \Pr(\text{idiom} = q \mid \text{image} = G)$, estimated by the MLLM's idiom-classification head, is used to compute $s_t$. Typically, $\tau = 0.5$.
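In practice, most jury MLLMs are served through chat APIs that do not expose a classification head, so $(R_t, s_t)$ must be elicited from the model's output. A minimal sketch, assuming a hypothetical `mllm.generate(image=..., text=...)` call that returns a JSON string (real backends differ in call signature):

```python
import json

def mllm_infer(mllm, image, candidate_idioms):
    """Hypothetical jury call: elicit a top-1 idiom guess R_t and a
    self-reported confidence s_t in [0, 1] from an API-served MLLM.
    mllm.generate(image=..., text=...) is an assumed interface."""
    instruction = (
        "This image encodes exactly one idiom from the candidate list. "
        'Reply as JSON: {"idiom": "<string>", "confidence": <float in [0,1]>}. '
        "Candidates: " + "; ".join(candidate_idioms)
    )
    reply = mllm.generate(image=image, text=instruction)
    parsed = json.loads(reply)                            # assumes a clean JSON reply
    return parsed["idiom"], float(parsed["confidence"])   # (R_t, s_t)
```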

3. Characterization of the MLLM Jury

In each iteration, the MLLM jury operates as follows:

  • Input: $G_t$ (rendered image), optionally $I_{\rm input}$. The closed-class inference scenario over all 1,000 idioms allows direct classification, but open-prompt settings are also possible.
  • Output: $(R_t, s_t)$, where $R_t \in \{ I_1, \ldots, I_{1000} \}$ is the label with the highest predicted probability, and $s_t$ quantifies the classifier's confidence.

The stopping/recognition rule is:

$$\mathrm{Match}_t = 1 \iff (R_t = I_{\rm input}) \lor (s_t \geq \tau)$$

Upon a non-match, the MLLM produces edit instructions $U_t$ that guide the LLM’s next prompt generation; a sketch of this update step follows. This tight feedback loop confers self-refinement capabilities on the system.
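A companion sketch of the update step $U_t \gets \mathrm{MLLM}_{\rm update}(R_t, G_t, I_{\rm input})$, under the same hypothetical `mllm.generate` interface; the instruction wording is illustrative, not taken from the paper:

```python
def mllm_update(mllm, guess, image, target_idiom):
    """Hypothetical feedback call: on a non-match, ask the jury MLLM
    for targeted edit suggestions U_t that would steer the next prompt
    toward the target idiom. mllm.generate is the same assumed API."""
    instruction = (
        f"This image was intended to depict the idiom '{target_idiom}', "
        f"but it currently reads as '{guess}'. List two or three concrete "
        "compositional or object-level edits that would make the intended "
        "idiom recognizable."
    )
    return mllm.generate(image=image, text=instruction)  # U_t (free-form edits)
```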

4. Empirical Results and Quantitative Analysis

The primary benchmark comprises 1,000 idioms, each processed by the pipeline using various LLMs (for prompt generation) and MLLMs (as jury). Key aspects include:

  • T2IM: Qwen-Image, fixed resolution.
  • LLMs: 10 models, including GPT-5, Gemini, Claude, Qwen-3, etc.
  • MLLMs: 10 models, including GPT-MM, Gemini-MM, Claude-MM, etc.
  • Metric: Top-1 recognition accuracy (fraction of idioms for which the process achieves $\mathrm{Match}_t$ within 5 steps).

Accuracy table (selected rows; top-1 recognition accuracy, %; rows = MLLM jury, columns = prompt-generating LLM):

| MLLM (jury) | GPT-OSS | GPT | Gemini | Claude | Llama | Gemma |
|---|---|---|---|---|---|---|
| GPT-MM | 71.1 | 76.9 | 73.7 | 79.8 | 64.8 | 70.1 |
| Gemini-MM | 66.7 | 71.8 | 69.5 | 74.8 | 60.8 | 65.7 |
| Claude-MM | 55.5 | 59.7 | 57.6 | 61.6 | 50.8 | 54.9 |
| Gemma-MM | 52.0 | 56.1 | 54.0 | 58.1 | 47.4 | 51.3 |

Across the full table, row averages range from 25.3% (Mistral-MM) to 70.8% (GPT-MM), and column averages range from 46.8% (Llama prompts) to 57.6% (Claude prompts).

The experimental results demonstrate that MLLM (jury) selection dominates performance, with a 45 percentage-point spread between the weakest and strongest juries. Ablations confirm that using the LLM for visual prompt refinement yields +7–15 points over one-shot generation, and iterative feedback adds a further +4–9 points beyond the first iteration, with convergence by rounds 3–4.
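As a quick check on these figures, the row and column averages of the selected table rows above can be recomputed directly (a sketch; the paper's full 10×10 table, which includes the 25.3% Mistral-MM row, is needed to reproduce the quoted extremes exactly):

```python
# Row/column averages for the selected accuracy rows reproduced above.
# The full paper table covers 10 MLLM juries x 10 LLMs; this is a subset.
table = {
    "GPT-MM":    [71.1, 76.9, 73.7, 79.8, 64.8, 70.1],
    "Gemini-MM": [66.7, 71.8, 69.5, 74.8, 60.8, 65.7],
    "Claude-MM": [55.5, 59.7, 57.6, 61.6, 50.8, 54.9],
    "Gemma-MM":  [52.0, 56.1, 54.0, 58.1, 47.4, 51.3],
}
llms = ["GPT-OSS", "GPT", "Gemini", "Claude", "Llama", "Gemma"]

row_avgs = {mllm: sum(v) / len(v) for mllm, v in table.items()}
col_avgs = {llm: sum(col) / len(col)
            for llm, col in zip(llms, zip(*table.values()))}

print(row_avgs)  # e.g., GPT-MM averages ~72.7 on this subset
print(col_avgs)  # Claude prompts give the highest column average here
```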

5. Design Choices and System Limitations

  • Superior juries: GPT-MM and Gemini-MM outperform others, attributed to pretraining on broad vision-language datasets that facilitate idiom grounding and compositional reasoning.
  • Iteration effectiveness: Most idioms resolve within 2–3 cycles; later rounds confer minimal additional accuracy (<0.5%).
  • Failure modes:
    • Abstract idioms lacking concrete visual elements result in low-confidence or erroneous jury classification.
    • Scenes that are compositionally dense cause MLLMs to focus excessively on literal objects at the expense of figurative cues.
  • Generality: The workflow currently depends on a single T2IM backend (Qwen-Image) and relies exclusively on MLLMs for judgment, which may inflate reported accuracy relative to human-judged recognizability.
  • Broader limitations: The framework’s success criterion (recognition by the automated jury) does not necessarily coincide with human intuitions of pun quality or clarity; future work should include both more diverse image models and parallel human evaluation.

6. Framework Significance and Extensions

The MLLM-as-Jury framework introduces a scalable, modular, and plug-and-play protocol for iterative self-refinement in cross-modal generation and recognition:

  • It operationalizes the concept of an “MLLM jury” not as a static ensemble or simple majority vote, but as the endpoint of an autoregressive feedback loop: the jury not only makes accept/reject decisions but also prescribes actionable edits to upstream prompt generators.
  • The architecture is extensible: additional modalities, more compositional metrics, or hybrid human–MLLM judgments can be incorporated.
  • The modularity (LLM, T2IM, and MLLM as replaceable components) provides avenues for plugging in domain-specialized models as needed.

This protocol has broader implications for closed-loop self-improvement in generative AI systems, particularly for tasks where “recognition-by-expert” is both measurable and actionable (Xiao et al., 28 Nov 2025). It also supplies a benchmark and methodology that enable direct cross-model comparison on compositional and creative multimodal tasks, beyond scalar or pairwise alignment metrics.


References

Xiao et al. “Visual Puns from Idioms: An Iterative LLM-T2IM-MLLM Framework.” 28 Nov 2025.