
MLLM-as-Jury Evaluation Framework

Updated 4 December 2025
  • The paper introduces the MLLM-as-Jury framework, an iterative pipeline that integrates an LLM, a text-to-image model (T2IM), and an MLLM to automate visual pun evaluation and refinement.
  • It employs targeted edit suggestions and a precise stopping rule based on idiom recognition and confidence thresholds, typically converging within 2–3 iterations.
  • Empirical results reveal a 45-point accuracy spread among different MLLMs, underscoring the critical influence of jury selection on multimodal performance.

The “MLLM-as-Jury” framework refers to evaluation protocols and architectures in which one or more Multimodal LLMs (MLLMs) serve as an automated “jury”—forming judgments, aggregating answers, or iteratively refining outputs in multimodal generative tasks. This paradigm is instantiated most concretely in the context of visual pun generation and recognition, as detailed in “Visual Puns from Idioms: An Iterative LLM-T2IM-MLLM Framework” (Xiao et al., 28 Nov 2025). The MLLM-as-Jury model establishes an interaction loop between an LLM, a text-to-image model (T2IM), and an MLLM; the MLLM acts as an automated jury whose decision directly governs when the iterative creative process halts and how corrective feedback is issued.

1. High-Level Pipeline Architecture

The MLLM-as-Jury framework is formulated as a four-stage iterative pipeline, parameterized by a target idiom $I_{\rm input}$, an LLM, a T2IM, an MLLM, and a maximum iteration count $T_{\max} = 5$. The main stages are:

  1. LLM Prompt Generation: The LLM decomposes $I_{\rm input}$ into literal/figurative elements and, conditioned on the previous prompt $P_{t-1}$ and edit suggestions $U_{t-1}$, produces a new visual description prompt $P_t$.
  2. Image Synthesis: The T2IM (fixed to Qwen-Image at $1024 \times 1024$) renders $P_t$ into an image $G_t$.
  3. MLLM Comprehension (Jury): The MLLM receives $G_t$ (and, optionally, $I_{\rm input}$ as a candidate label) and attempts to infer which idiom it encodes, returning a top-1 guessed idiom $R_t$ and a confidence score $s_t \in [0,1]$.
  4. Prompt Refinement: If the jury’s verdict $R_t$ matches $I_{\rm input}$, or $s_t \geq \tau$, or $t = T_{\max}$, the process halts. Otherwise, the MLLM produces targeted edits $U_t$ (e.g., compositional or object-level suggestions), which are fed back to the LLM for the next iteration.

This cyclical pipeline continues until the MLLM jury recognizes the idiom or the iteration budget is exhausted. The flow is formally defined in Algorithm 1 of the paper, whose pseudocode captures each update, verdict, and stopping condition; a minimal code sketch of the loop follows.
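As a concrete illustration, here is a minimal Python sketch of the loop under stated assumptions: `llm_prompt`, `t2im_render`, `mllm_infer`, and `mllm_update` are hypothetical callables standing in for the respective model backends, not interfaces from the paper.

```python
# Minimal sketch of the MLLM-as-Jury loop (paraphrasing Algorithm 1).
# llm_prompt, t2im_render, mllm_infer, and mllm_update are hypothetical
# callables standing in for the LLM, T2IM, and MLLM backends.

def mllm_as_jury(idiom, llm_prompt, t2im_render, mllm_infer, mllm_update,
                 t_max=5, tau=0.5):
    """Iterate until the jury recognizes the idiom, its confidence
    clears tau, or the iteration budget t_max is exhausted."""
    prompt, edits, image = None, None, None         # P_0, U_0 start empty
    for t in range(1, t_max + 1):
        prompt = llm_prompt(idiom, prompt, edits)   # P_t <- LLM_prompt(I, P_{t-1}, U_{t-1})
        image = t2im_render(prompt)                 # G_t <- T2IM(P_t)
        guess, conf = mllm_infer(image)             # (R_t, s_t) <- MLLM_infer(G_t)
        if guess == idiom or conf >= tau:           # Match_t
            return image, prompt, t, True
        edits = mllm_update(guess, image, idiom)    # U_t <- MLLM_update(R_t, G_t, I)
    return image, prompt, t_max, False
```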

2. Formalization and Decision Criteria

The iterative process is mathematically formalized as follows:

  • Inputs: $I_{\rm input}$, LLM, T2IM, MLLM, $T_{\max}$, $\tau$
  • Initialization: $P_0 \gets \varnothing$, $U_0 \gets \varnothing$
  • At iteration $t$:

$$
\begin{aligned}
P_t &\gets \mathrm{LLM}_{\rm prompt}(I_{\rm input}, P_{t-1}, U_{t-1}) \\
G_t &\gets \mathrm{T2IM}(P_t) \\
(R_t, s_t) &\gets \mathrm{MLLM}_{\rm infer}(G_t) \\
\mathrm{Match}_t &= [R_t = I_{\rm input}] \lor [s_t \geq \tau]
\end{aligned}
$$

If $\mathrm{Match}_t = 1$, terminate; otherwise, $U_t \gets \mathrm{MLLM}_{\rm update}(R_t, G_t, I_{\rm input})$ is used to generate the next prompt.

The function $f_{\rm judge}(G, q) = \Pr(\text{idiom} = q \mid \text{image} = G)$, estimated by the MLLM's idiom-classification head, is used to compute $s_t$. Typically, $\tau = 0.5$.
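In practice, most jury MLLMs are served through chat APIs that do not expose a classification head, so $(R_t, s_t)$ must be elicited from the model's output. A minimal sketch, assuming a hypothetical `mllm.generate(image=..., text=...)` call that returns a JSON string (real backends differ in call signature):

```python
import json

def mllm_infer(mllm, image, candidate_idioms):
    """Hypothetical jury call: elicit a top-1 idiom guess R_t and a
    self-reported confidence s_t in [0, 1] from an API-served MLLM.
    mllm.generate(image=..., text=...) is an assumed interface."""
    instruction = (
        "This image encodes exactly one idiom from the candidate list. "
        'Reply as JSON: {"idiom": "<string>", "confidence": <float in [0,1]>}. '
        "Candidates: " + "; ".join(candidate_idioms)
    )
    reply = mllm.generate(image=image, text=instruction)
    parsed = json.loads(reply)                            # assumes a clean JSON reply
    return parsed["idiom"], float(parsed["confidence"])   # (R_t, s_t)
```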

3. Characterization of the MLLM Jury

In each iteration, the MLLM jury operates as follows:

  • Input: $G_t$ (rendered image), optionally $I_{\rm input}$. The closed-class inference scenario over all 1,000 idioms allows direct classification, but open-prompt settings are also possible.
  • Output: $(R_t, s_t)$, where $R_t \in \{ I_1, \ldots, I_{1000} \}$ is the label with the highest predicted probability, and $s_t$ quantifies the classifier's confidence.

The stopping/recognition rule is:

$$\mathrm{Match}_t = 1 \iff (R_t = I_{\rm input}) \lor (s_t \geq \tau)$$

Upon a non-match, the MLLM produces edit instructions $U_t$ that guide the LLM’s next prompt generation; a sketch of this update step follows. This tight feedback loop confers self-refinement capabilities on the system.
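A companion sketch of the update step $U_t \gets \mathrm{MLLM}_{\rm update}(R_t, G_t, I_{\rm input})$, under the same hypothetical `mllm.generate` interface; the instruction wording is illustrative, not taken from the paper:

```python
def mllm_update(mllm, guess, image, target_idiom):
    """Hypothetical feedback call: on a non-match, ask the jury MLLM
    for targeted edit suggestions U_t that would steer the next prompt
    toward the target idiom. mllm.generate is the same assumed API."""
    instruction = (
        f"This image was intended to depict the idiom '{target_idiom}', "
        f"but it currently reads as '{guess}'. List two or three concrete "
        "compositional or object-level edits that would make the intended "
        "idiom recognizable."
    )
    return mllm.generate(image=image, text=instruction)  # U_t (free-form edits)
```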

4. Empirical Results and Quantitative Analysis

The primary benchmark comprises 1,000 idioms, each processed by the pipeline using various LLMs (for prompt generation) and MLLMs (as jury). Key aspects include:

  • T2IM: Qwen-Image, fixed resolution.
  • LLMs: 10 models, including GPT-5, Gemini, Claude, Qwen-3, etc.
  • MLLMs: 10 models, including GPT-MM, Gemini-MM, Claude-MM, etc.
  • Metric: Top-1 recognition accuracy (fraction of idioms for which the process achieves $\mathrm{Match}_t$ within 5 steps).

Accuracy table (selected rows; top-1 recognition accuracy, %; rows = MLLM jury, columns = prompt-generating LLM):

| MLLM (jury) | GPT-OSS | GPT | Gemini | Claude | Llama | Gemma |
|---|---|---|---|---|---|---|
| GPT-MM | 71.1 | 76.9 | 73.7 | 79.8 | 64.8 | 70.1 |
| Gemini-MM | 66.7 | 71.8 | 69.5 | 74.8 | 60.8 | 65.7 |
| Claude-MM | 55.5 | 59.7 | 57.6 | 61.6 | 50.8 | 54.9 |
| Gemma-MM | 52.0 | 56.1 | 54.0 | 58.1 | 47.4 | 51.3 |

Across the full table, row averages range from 25.3% (Mistral-MM) to 70.8% (GPT-MM), and column averages range from 46.8% (Llama prompts) to 57.6% (Claude prompts).

The experimental results demonstrate that MLLM (jury) selection dominates performance, with a 45 percentage-point spread between the weakest and strongest juries. Ablations confirm that using the LLM for visual prompt refinement yields +7–15 points over one-shot generation, and iterative feedback adds a further +4–9 points beyond the first iteration, with convergence by rounds 3–4.
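As a quick check on these figures, the row and column averages of the selected table rows above can be recomputed directly (a sketch; the paper's full 10×10 table, which includes the 25.3% Mistral-MM row, is needed to reproduce the quoted extremes exactly):

```python
# Row/column averages for the selected accuracy rows reproduced above.
# The full paper table covers 10 MLLM juries x 10 LLMs; this is a subset.
table = {
    "GPT-MM":    [71.1, 76.9, 73.7, 79.8, 64.8, 70.1],
    "Gemini-MM": [66.7, 71.8, 69.5, 74.8, 60.8, 65.7],
    "Claude-MM": [55.5, 59.7, 57.6, 61.6, 50.8, 54.9],
    "Gemma-MM":  [52.0, 56.1, 54.0, 58.1, 47.4, 51.3],
}
llms = ["GPT-OSS", "GPT", "Gemini", "Claude", "Llama", "Gemma"]

row_avgs = {mllm: sum(v) / len(v) for mllm, v in table.items()}
col_avgs = {llm: sum(col) / len(col)
            for llm, col in zip(llms, zip(*table.values()))}

print(row_avgs)  # e.g., GPT-MM averages ~72.7 on this subset
print(col_avgs)  # Claude prompts give the highest column average here
```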

5. Design Choices and System Limitations

  • Superior juries: GPT-MM and Gemini-MM outperform others, attributed to pretraining on broad vision-language datasets that facilitate idiom grounding and compositional reasoning.
  • Iteration effectiveness: Most idioms resolve within 2–3 cycles; later rounds confer minimal additional accuracy (<0.5%).
  • Failure modes:
    • Abstract idioms lacking concrete visual elements result in low-confidence or erroneous jury classification.
    • Scenes that are compositionally dense cause MLLMs to focus excessively on literal objects at the expense of figurative cues.
  • Generality: The workflow currently depends on a single T2IM backend (Qwen-Image) and relies exclusively on MLLMs for judgment, which may inflate reported accuracy relative to human-judged recognizability.
  • Broader limitations: The framework’s success criterion (recognition by the automated jury) does not necessarily coincide with human intuitions of pun quality or clarity; future work should include both more diverse image models and parallel human evaluation.

6. Framework Significance and Extensions

The MLLM-as-Jury framework introduces a scalable, modular, and plug-and-play protocol for iterative self-refinement in cross-modal generation and recognition:

  • It operationalizes the concept of an “MLLM jury” not as a static ensemble or simple majority vote, but as the endpoint of an autoregressive feedback loop: the jury not only makes accept/reject decisions but also prescribes actionable edits to upstream prompt generators.
  • The architecture is extensible: additional modalities, more compositional metrics, or hybrid human–MLLM judgments can be incorporated.
  • The modularity (LLM, T2IM, and MLLM as replaceable components) provides avenues for plugging in domain-specialized models as needed.

This protocol has broader implications for closed-loop self-improvement in generative AI systems, particularly for tasks where “recognition-by-expert” is both measurable and actionable (Xiao et al., 28 Nov 2025). It also supplies a benchmark and methodology that enable direct cross-model comparison on compositional and creative multimodal tasks, beyond scalar or pairwise alignment metrics.


References

Xiao et al. “Visual Puns from Idioms: An Iterative LLM-T2IM-MLLM Framework.” 28 Nov 2025.