- The paper introduces YOFO, an efficient framework that explicitly verifies atomic requirements in multimodal LLMs using a single forward pass.
- It employs a template-conditioned approach with dependency-aware judgments and post-hoc Chain-of-Thought supervision to boost accuracy and throughput.
- Experimental results on datasets like SA-1B and LRVS-Fashion demonstrate significant improvements in ranking error rates and interpretability over traditional methods.
Efficient Compositional Judging with YOFO in Multimodal LLMs
Motivation and Background
The paper "You Only Forward Once: An Efficient Compositional Judging Paradigm" (2511.16600) addresses fine-grained, scalable judgment in multimodal LLMs (MLLMs), particularly for tasks such as cross-modal retrieval and recommendation. Existing methods either regress a single scalar relevance score, which discards information about which individual requirements are satisfied, or rely on generative judging, whose autoregressive decoding is too slow for large-scale deployment. YOFO (You Only Forward Once) is proposed to resolve this trade-off by reframing information matching as explicit compositional requirement verification and by introducing a template-conditioned mechanism that operates in a single forward pass.
Methodology
YOFO explicitly decomposes user queries into a set of atomic requirements. Given an image $I$ and $N$ requirements $(p_i)_{i=1}^{N}$, YOFO maps them to binary decisions $(a_i)_{i=1}^{N}$ indicating satisfaction.
Architecture
YOFO leverages a decoder-only MLLM, using a structured template encoding the requirements. Each requirement is concatenated with a special marker token; after a single forward pass, the logits for these positions are extracted. The next-token distribution at each marker position is interpreted as the likelihood for "yes" or "no", yielding a binary judgment per requirement without autoregressive output generation.
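The logit-reading step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `judge_from_logits`, the token ids, the 0.5 threshold, and the choice to renormalize over only the "yes" and "no" tokens are all assumptions made for the example. The logits stand in for the output of one forward pass of a decoder-only model.

```python
import numpy as np

def judge_from_logits(logits, marker_positions, yes_id, no_id, threshold=0.5):
    """Return (p_yes, decisions), one entry per marker position.

    logits: array of shape (seq_len, vocab_size) from ONE forward pass.
    marker_positions: indices whose next-token distribution is read.
    yes_id, no_id: vocabulary ids of the answer tokens (hypothetical here).
    """
    logits = np.asarray(logits, dtype=np.float64)
    # Keep only the two candidate answer tokens at each marker position.
    pair = logits[np.asarray(marker_positions)][:, [yes_id, no_id]]
    # Numerically stable softmax restricted to {yes, no} (an assumption).
    pair = pair - pair.max(axis=1, keepdims=True)
    probs = np.exp(pair) / np.exp(pair).sum(axis=1, keepdims=True)
    p_yes = probs[:, 0]
    return p_yes, p_yes >= threshold

# Toy logits: 12 positions, vocabulary of 50 tokens, two marker positions.
toy_logits = np.zeros((12, 50))
toy_logits[3, 7] = 10.0   # marker at position 3 favors "yes" (id 7)
toy_logits[8, 9] = 10.0   # marker at position 8 favors "no" (id 9)
p_yes, decisions = judge_from_logits(toy_logits, [3, 8], yes_id=7, no_id=9)
```

Because every marker position is read from the same logits array, adding more requirements adds rows to an indexing operation rather than extra decoding steps.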
Inference is massively parallel: all judgments are produced in a single run, resulting in orders-of-magnitude throughput improvements compared to sequential generative approaches. This approach also enables interpretability, as the satisfaction of each requirement is explicitly available for downstream consumption.
Training
YOFO is trained via next-token prediction, with cross-entropy loss applied exclusively to the answer and rationale positions in the template. A post-hoc Chain-of-Thought (CoT) variant supervises both answers and accompanying rationales, with the rationale loss scaled by a coefficient λ. This hybrid approach enables more robust reasoning and explicit rationale generation.
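A rough sketch of such a masked objective is given below. It assumes the combined loss takes the form answer loss plus λ times rationale loss, with each term averaged over its own positions; the averaging scheme, the function names, and the value `lam=0.5` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def masked_ce(logits, targets, mask):
    """Mean cross-entropy over the positions where mask is True."""
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets)
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        return 0.0
    # Stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    token_nll = -log_probs[np.arange(len(targets)), targets]
    return float(token_nll[mask].mean())

def yofo_style_loss(logits, targets, answer_mask, rationale_mask, lam=0.5):
    """Answer loss plus lam times rationale loss; all other positions are ignored."""
    answer_loss = masked_ce(logits, targets, answer_mask)
    rationale_loss = masked_ce(logits, targets, rationale_mask)
    return answer_loss + lam * rationale_loss

# Toy example: 5 positions, vocabulary of 4 tokens, uniform predictions.
logits = np.zeros((5, 4))
targets = np.array([0, 1, 2, 3, 0])
answer_mask = [False, True, False, False, True]      # answer positions
rationale_mask = [False, False, True, True, False]   # rationale positions
loss = yofo_style_loss(logits, targets, answer_mask, rationale_mask, lam=0.5)
```

Setting `lam=0` recovers answer-only supervision, which corresponds to training without the rationale term.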
Dependency-Aware Judgment
YOFO supports dependency-aware judgments, in which the answers to later requirements can condition on earlier judgments. This is implemented through template construction and supervision masking that make the later answers depend on the earlier ones. Experiments demonstrate near-perfect accuracy in such setups, validating YOFO's capacity for compositional reasoning beyond independent requirement assessment.
Experimental Results
Datasets and Setup
Models are trained on the SA-1B corpus by LoRA fine-tuning of the Qwen2-VL-2B-Instruct and Qwen3-VL-2B-Instruct backbones. Evaluation uses the LAION-RVS-Fashion (LRVS-Fashion) dataset to assess cross-domain generalization to fashion recommendation tasks.
YOFO achieves a ranking error rate of 3.7% on LAION-RVS-Fashion, compared with 16.2% for the Jina-Reranker-M0 baseline, while also increasing throughput to 47.6 pairs/s. Notably, YOFO models trained on general-purpose images generalize effectively to fashion data, confirming robustness against domain shift and the utility of explicit compositional judgment.
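To make the ranking metric concrete, here is a small sketch under two loudly stated assumptions: that per-requirement yes-probabilities are aggregated into a single score by taking their mean, and that the ranking error rate is the fraction of candidate pairs in which the candidate that should rank higher does not receive the higher score. The summary does not spell out either definition, so both functions are hypothetical.

```python
import numpy as np

def relevance_score(p_yes):
    """Aggregate per-requirement yes-probabilities into one score (assumed: mean)."""
    return float(np.mean(p_yes))

def ranking_error_rate(pairs):
    """pairs: list of (p_yes_better, p_yes_worse) for two candidates of one query.

    A pair counts as an error when the candidate that should rank higher
    does not get a strictly higher score (assumed definition).
    """
    pairs = list(pairs)
    errors = sum(relevance_score(b) <= relevance_score(w) for b, w in pairs)
    return errors / len(pairs)

pairs = [
    ([0.9, 0.8, 0.95], [0.9, 0.1, 0.95]),  # correctly ordered
    ([0.6, 0.7], [0.2, 0.3]),              # correctly ordered
    ([0.4, 0.5], [0.8, 0.9]),              # misordered
    ([0.99], [0.01]),                      # correctly ordered
]
rate = ranking_error_rate(pairs)  # 1 misordered pair out of 4
```

Other aggregations, such as the product of probabilities or a count of satisfied requirements, would fit the same interface.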
YOFO also demonstrates significant improvements in property-wise and sample-wise accuracy (above 91% and above 40%, respectively) over pretrained baselines, indicating the effectiveness of template-conditioned supervision. Detailed ablations show the benefits of post-hoc CoT and of LoRA rank selection (r=64 yields the best results).
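The gap between the two accuracy figures is easier to see with a worked example. The sketch below assumes that property-wise accuracy is the fraction of individual requirement judgments that are correct and that sample-wise accuracy is the fraction of samples for which every requirement is judged correctly; these definitions are inferred from the metric names rather than quoted from the paper, and the example assumes a fixed number of requirements per sample.

```python
import numpy as np

def property_and_sample_accuracy(pred, gold):
    """pred, gold: arrays of shape (num_samples, num_requirements) with 0/1 entries."""
    pred = np.asarray(pred, dtype=bool)
    gold = np.asarray(gold, dtype=bool)
    correct = pred == gold
    # Property-wise: every individual judgment counts once.
    property_acc = float(correct.mean())
    # Sample-wise: a sample counts only if all its judgments are correct.
    sample_acc = float(correct.all(axis=1).mean())
    return property_acc, sample_acc

pred = [[1, 0, 1], [1, 1, 0], [0, 0, 1]]
gold = [[1, 0, 1], [1, 0, 0], [0, 0, 1]]
prop_acc, samp_acc = property_and_sample_accuracy(pred, gold)
# 8 of 9 judgments are correct, but only 2 of 3 samples are fully correct.
```

Under these definitions a single wrong judgment invalidates the whole sample, so sample-wise accuracy falls quickly as the number of requirements per sample grows.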
Qualitative Analysis
Case studies highlight YOFO’s strength in correctly resolving multi-attribute queries and negations, where scalar-ranking models fail. Per-position analysis confirms that YOFO maintains high accuracy as the number of requirements increases, while the accuracy of baseline models degrades rapidly.
Implications and Future Directions
Practical Implications
YOFO establishes a high-throughput, interpretable judging paradigm for MLLMs, enabling real-time, fine-grained analysis suitable for e-commerce recommendation, multi-label classification, and agent-driven decision support scenarios. Explicit requirement-level judgments are directly consumable as reward signals in reinforcement learning (RL) pipelines or as inputs for agent modeling, replacing coarse scalar relevance estimates.
Theoretical Implications
YOFO demonstrates that explicit, template-conditioned requirement satisfaction can be efficiently aligned with the native autoregressive objectives of LLMs, facilitating generalizable and dependency-aware reasoning. The method’s ability to support compositional inference and interpretability bridges the gap between black-box scoring and on-the-fly multi-step reasoning.
Prospects for Further Research
Future work should explore YOFO’s integration as a structured reward model in RL-based LLM or diffusion model training, delivering fine-grained reward vectors for improved sample efficiency and controllable learning. YOFO’s compositional architecture is readily extensible to multi-label, multi-domain, and multi-agent scenarios.
Expanded application domains (beyond reranking) could leverage YOFO’s modularity, including user interest tagging, anomaly detection, and personalized content recommendation, especially where explicit verification of multiple criteria is essential. Further scaling and adaptation to open-world, long-context, and omnimodal judgment settings warrant investigation.
Conclusion
YOFO introduces an efficient compositional judging mechanism for multimodal LLMs, circumventing the throughput limitations and granularity loss of prior paradigms. Through single-pass template conditioning and direct logit reading, YOFO attains state-of-the-art reranking performance, robust cross-domain generalization, and interpretable, dependency-aware judgment. Its explicitness and efficiency open practical and theoretical avenues for structured decision modeling, scalable recommendation, and agent design in multimodal AI systems.