You Only Forward Once: An Efficient Compositional Judging Paradigm

Published 20 Nov 2025 in cs.AI | (2511.16600v2)

Abstract: Multimodal LLMs (MLLMs) show strong potential as judges. However, existing approaches face a fundamental trade-off: adapting MLLMs to output a single score misaligns with the generative nature of MLLMs and limits fine-grained requirement understanding, whereas autoregressively generating judging analyses is prohibitively slow in high-throughput settings. Observing that judgment reduces to verifying whether inputs satisfy a set of structured requirements, we propose YOFO, a template-conditioned method that judges all requirements in a single forward pass. Built on an autoregressive model, YOFO accepts a structured requirement template and, in one inference step, produces a binary yes/no decision for each requirement by reading the logits of the final token associated with that requirement. This design yields orders-of-magnitude speedups while preserving interpretability. Extensive experiments show that YOFO not only achieves state-of-the-art results on standard recommendation datasets, but also supports dependency-aware analysis -- where subsequent judgments are conditioned on previous ones -- and further benefits from post-hoc CoT.


Authors (6)

Summary

The paper introduces YOFO, an efficient framework that explicitly verifies atomic requirements in multimodal LLMs using a single forward pass.
It employs a template-conditioned approach with dependency-aware judgments and post-hoc Chain-of-Thought supervision to boost accuracy and throughput.
Experiments that train on SA-1B and evaluate on LRVS-Fashion show lower ranking error rates and better interpretability than traditional methods.

Efficient Compositional Judging with YOFO in Multimodal LLMs

Motivation and Background

The paper "You Only Forward Once: An Efficient Compositional Judging Paradigm" (2511.16600) addresses the challenge of fine-grained, scalable judgment in multimodal LLMs (MLLMs), particularly in tasks such as cross-modal retrieval and recommendation. Traditional methods either regress a scalar relevance score, which discards information about which individual requirements are satisfied, or use generative paradigms whose autoregressive decoding is too slow for large-scale deployment. YOFO (You Only Forward Once) is proposed to overcome this trade-off by reframing information matching as explicit compositional requirement verification and introducing a template-conditioned mechanism that operates in a single forward pass.

Methodology

Problem Formulation

YOFO explicitly decomposes user queries into a set of atomic requirements. Given an image $I$ and $N$ requirements $(\bm{p}_i)_{i=1}^N$, YOFO maps them to binary decisions $(\bm{a}_i)_{i=1}^N$ indicating satisfaction.

Architecture

YOFO leverages a decoder-only MLLM, using a structured template encoding the requirements. Each requirement is concatenated with a special marker token; after a single forward pass, the logits for these positions are extracted. The next-token distribution at each marker position is interpreted as the likelihood for "yes" or "no", yielding a binary judgment per requirement without autoregressive output generation.
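The logit-reading step described above can be sketched in a few lines, assuming the per-position logits from one forward pass are already available as plain lists. The token ids `YES_ID` and `NO_ID`, the toy logits, the 0.5 threshold, and the choice to renormalise over only the two answer tokens are illustrative assumptions, not details taken from the paper.

```python
import math

def yes_probability(position_logits, yes_id, no_id):
    """Probability of "yes" after renormalising over the two answer tokens only."""
    y, n = position_logits[yes_id], position_logits[no_id]
    m = max(y, n)  # subtract the max for numerical stability
    ey, en = math.exp(y - m), math.exp(n - m)
    return ey / (ey + en)

def judge(all_logits, marker_positions, yes_id, no_id, threshold=0.5):
    """Return one (p_yes, decision) pair per requirement marker."""
    results = []
    for pos in marker_positions:
        p_yes = yes_probability(all_logits[pos], yes_id, no_id)
        results.append((p_yes, p_yes > threshold))
    return results

# Toy stand-in for one forward pass: 5 positions, vocabulary of 4 tokens.
YES_ID, NO_ID = 2, 3          # hypothetical token ids
toy_logits = [
    [0.3, 0.1, 0.0, 0.0],     # requirement text
    [0.1, 0.2, 3.0, 1.0],     # marker of requirement 1
    [0.5, 0.4, 0.0, 0.0],     # requirement text
    [0.2, 0.0, 0.1, 0.0],     # requirement text
    [0.0, 0.3, -1.0, 2.0],    # marker of requirement 2
]
decisions = judge(toy_logits, marker_positions=[1, 4], yes_id=YES_ID, no_id=NO_ID)
for i, (p_yes, ok) in enumerate(decisions, start=1):
    print(f"requirement {i}: p(yes)={p_yes:.3f} -> {'yes' if ok else 'no'}")
```

No tokens are generated here: every decision comes from logits that the single pass has already produced, which is where the speedup over autoregressive judging comes from.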

Inference is massively parallel: all judgments are produced in a single run, resulting in orders-of-magnitude throughput improvements compared to sequential generative approaches. This approach also enables interpretability, as the satisfaction of each requirement is explicitly available for downstream consumption.
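To make the single-run setup concrete, here is a minimal sketch of packing several requirements into one sequence and recording where each marker lands. The marker string `<judge>`, the "Requirement i:" wording, and whitespace tokenisation are hypothetical stand-ins; a real system would use the backbone's tokenizer and the paper's own template, neither of which is specified in this summary.

```python
MARKER = "<judge>"  # hypothetical marker token, not the paper's actual token

def build_template(requirements):
    """Concatenate requirements into one token list, one marker per requirement.

    Returns the token list and the index of each marker, so that a single
    forward pass over the list yields one readable position per requirement.
    """
    tokens, marker_positions = [], []
    for i, req in enumerate(requirements, start=1):
        # Whitespace splitting stands in for a real tokenizer.
        tokens.extend(f"Requirement {i}: {req}".split())
        marker_positions.append(len(tokens))  # index the marker will occupy
        tokens.append(MARKER)
    return tokens, marker_positions

tokens, marker_positions = build_template(["red dress", "no sleeves"])
print(tokens)
print(marker_positions)
```

Because the sequence length grows only with the total requirement text, adding a requirement adds a few input tokens rather than another decoding loop.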

Training

YOFO is trained via next-token prediction, with cross-entropy loss applied exclusively to the answer and rationale positions in the template. A post-hoc Chain-of-Thought (CoT) variant supervises both answers and accompanying rationales, with the rationale loss scaled by a coefficient $\lambda$. This hybrid approach enables more robust reasoning and explicit rationale generation.
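A minimal sketch of such a masked objective follows, under stated assumptions: each position carries a role label (`"answer"`, `"rationale"`, or `"ignore"`), each group is averaged separately, and the two means are combined as answer loss plus $\lambda$ times rationale loss. The role labels, the per-group averaging, and the value `lam=0.5` are illustrative choices; the summary does not give the paper's exact normalisation or its $\lambda$.

```python
import math

def token_nll(logits, target_id):
    """Cross-entropy of one position: -log softmax(logits)[target_id]."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_id]

def mean(values):
    return sum(values) / len(values) if values else 0.0

def masked_loss(logits, targets, roles, lam=0.5):
    """Answer loss plus lam times rationale loss; 'ignore' positions add nothing."""
    answer, rationale = [], []
    for pos_logits, target, role in zip(logits, targets, roles):
        if role == "answer":
            answer.append(token_nll(pos_logits, target))
        elif role == "rationale":
            rationale.append(token_nll(pos_logits, target))
    return mean(answer) + lam * mean(rationale)

# Toy example: uniform logits over a 4-token vocabulary at every position.
uniform = [0.0, 0.0, 0.0, 0.0]
example_loss = masked_loss(
    logits=[uniform, uniform, uniform, uniform],
    targets=[0, 1, 2, 3],
    roles=["ignore", "answer", "rationale", "rationale"],
    lam=0.5,
)
print(example_loss)
```

Setting `lam=0` in this sketch recovers answer-only supervision, which is one way to compare against the variant without rationales.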

Dependency-Aware Judgment

YOFO supports dependency-aware judgments, in which the answers to later requirements can condition on earlier judgments. This is implemented by carefully constructing templates and masking the supervision so that the model must use this interdependence. The paper reports near-perfect accuracy in such setups, validating YOFO's capacity for compositional reasoning beyond independent requirement assessment.

Experimental Results

Datasets and Setup

Training uses the SA-1B corpus, with LoRA fine-tuning of the Qwen2-VL-2B-Instruct and Qwen3-VL-2B-Instruct backbones. Testing uses the LAION-RVS-Fashion (LRVS-Fashion) dataset to assess cross-domain generalization in fashion recommendation tasks.

Performance Benchmarks

YOFO achieves a ranking error rate of 3.7% on LAION-RVS-Fashion, well below the 16.2% error of the Jina-Reranker-M0 baseline, while also increasing throughput to 47.6 pairs/s. Notably, YOFO models trained on general-purpose images generalize effectively to fashion data, confirming robustness against domain shift and the utility of explicit compositional judgment.

YOFO also demonstrates significant improvements in property-wise and sample-wise accuracy (above 91% and above 40%, respectively) versus pretrained baselines, indicating the effectiveness of template-conditioned supervision. Detailed ablations show the benefit of post-hoc CoT and of LoRA rank selection, with $r=64$ yielding the best results.
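The summary does not define the two accuracy metrics. A minimal sketch under the natural reading, which is an assumption here, is that property-wise accuracy counts every individual requirement judgment, while sample-wise accuracy counts a sample as correct only if all of its requirements are judged correctly. The toy predictions and labels below are invented for illustration.

```python
def property_wise_accuracy(preds, labels):
    """Fraction of individual requirement judgments that match the labels."""
    total = sum(len(row) for row in labels)
    correct = sum(
        p == y
        for pred_row, label_row in zip(preds, labels)
        for p, y in zip(pred_row, label_row)
    )
    return correct / total

def sample_wise_accuracy(preds, labels):
    """Fraction of samples whose requirement judgments are all correct."""
    exact = sum(pred_row == label_row for pred_row, label_row in zip(preds, labels))
    return exact / len(labels)

# Three toy samples with 3, 2, and 3 requirements; one judgment is wrong.
toy_preds = [[True, False, True], [True, True], [False, False, True]]
toy_labels = [[True, False, True], [True, False], [False, False, True]]

print(property_wise_accuracy(toy_preds, toy_labels))
print(sample_wise_accuracy(toy_preds, toy_labels))
```

Under this reading, sample-wise accuracy is the stricter metric, since a single wrong requirement fails the whole sample; that would be consistent with it being much lower than the property-wise figure.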

Qualitative Analysis

Case studies highlight YOFO’s strength in correctly resolving multi-attribute queries and negations, where scalar-ranking models fail. Per-position analysis confirms that YOFO maintains high accuracy as the number of requirements increases, while the accuracy of baseline models degrades rapidly.

Implications and Future Directions

Practical Implications

YOFO establishes a high-throughput, interpretable judging paradigm for MLLMs, enabling real-time, fine-grained analysis suitable for e-commerce recommendation, multi-label classification, and agent-driven decision support scenarios. Explicit requirement-level judgments are directly consumable as reward signals in reinforcement learning (RL) pipelines or as inputs for agent modeling, replacing coarse scalar relevance estimates.

Theoretical Implications

YOFO demonstrates that explicit, template-conditioned requirement satisfaction can be efficiently aligned with the native autoregressive objectives of LLMs, facilitating generalizable and dependency-aware reasoning. The method’s ability to support compositional inference and interpretability bridges the gap between black-box scoring and on-the-fly multi-step reasoning.

Prospects for Further Research

Future work should explore YOFO’s integration as a structured reward model in RL-based LLM or diffusion model training, delivering fine-grained reward vectors for improved sample efficiency and controllable learning. YOFO’s compositional architecture is readily extensible to multi-label, multi-domain, and multi-agent scenarios.

Expanded application domains (beyond reranking) could leverage YOFO’s modularity, including user interest tagging, anomaly detection, and personalized content recommendation, especially where explicit verification of multiple criteria is essential. Further scaling and adaptation to open-world, long-context, and omnimodal judgment settings warrant investigation.

Conclusion

YOFO introduces an efficient compositional judging mechanism for multimodal LLMs, circumventing the throughput limitations and granularity loss of prior paradigms. Through single-pass template conditioning and direct logit reading, YOFO attains state-of-the-art reranking performance, robust cross-domain generalization, and interpretable, dependency-aware judgment. Its explicitness and efficiency open practical and theoretical avenues for structured decision modeling, scalable recommendation, and agent design in multimodal AI systems.
