YOFO: Single-Forward-Pass Prediction & Judging

Updated 4 July 2026

YOFO refers to single-forward-pass methods that enable prediction and rationalization by efficiently selecting key tokens or generating binary judgments.
The 2023 YOFO uses gradual token elimination in pre-trained language models to support predictions, achieving up to 18.4% improvement in token-level F1 scores.
The 2025 YOFO applies a template-conditioned approach for multimodal judging, transforming image and text inputs into structured yes/no decisions with dependency awareness.

YOFO most commonly expands to “You Only Forward Once,” but the acronym does not denote a single canonical method. In the literature considered here, it names two distinct frameworks: “You Only Forward Once: Prediction and Rationalization in A Single Forward Pass” for unsupervised extractive rationale extraction (Jiang et al., 2023), and “You Only Forward Once: An Efficient Compositional Judging Paradigm” for template-conditioned, single-forward-pass multimodal judging (Zhang et al., 20 Nov 2025). A nearby but distinct acronym is FedYolo, “You Only Load Once,” a federated learning method built around pretrained transformers and parameter-efficient tuning modules rather than a YOFO method (Zhang et al., 2023).

1. Nomenclature and scope

The acronym is shared across at least two research directions, both centered on obtaining multiple functions from a single forward pass, but operating on different modalities, tasks, and training assumptions.

Term	Expansion	Research setting
YOFO	You Only Forward Once	Unsupervised rationale extraction; efficient compositional judging
FedYolo	You Only Load Once	Federated learning with pretrained transformers

The 2023 YOFO addresses unsupervised rationale extraction: given an input text $\mathbf{X}$ and label $y$, the goal is to learn a compact subset of tokens $\mathbf{Z} \subseteq \mathbf{X}$ that serves as a rationale for the prediction, without rationale labels during training. The 2025 YOFO addresses judgment as requirement verification: given an image and a list of structured requirements, the model predicts a binary yes/no answer for each requirement in one forward pass. FedYolo, by contrast, is not a YOFO method; it is a federated, multi-task, task-modular parameter-efficient fine-tuning (PEFT) method over a shared frozen backbone (Jiang et al., 2023, Zhang et al., 20 Nov 2025, Zhang et al., 2023).

2. YOFO in unsupervised extractive rationale extraction

The 2023 YOFO is introduced against the Rationalizing Neural Prediction (RNP) framework, which follows a generate-then-predict paradigm. In that formulation,

$P(\mathbf{y}\mid\mathbf{X})=P(\mathbf{y}\mid\mathbf{Z}, \mathbf{X})P(\mathbf{Z}\mid\mathbf{X}) = P(\mathbf{y}\mid\mathbf{Z})P(\mathbf{Z}\mid\mathbf{X}),$

with the graph assumption $\mathbf{y}\leftarrow \mathbf{Z} \leftarrow \mathbf{X}$, i.e. $\mathbf{X}\perp\!\!\!\perp\mathbf{y}\mid\mathbf{Z}$. Operationally, RNP uses a generator $Gen$ to produce a binary mask $\mathbf{m}\in\{0,1\}^L$, and a predictor $Pred$ that predicts from the masked text:

$\mathbf{m}=Gen(\mathbf{x};\boldsymbol{\theta})$

$\mathbf{z}=\mathbf{m}\odot\mathbf{x}$

$\hat{y}=Pred(\mathbf{z};\boldsymbol{\phi})$

$y$ 2

YOFO rejects the strong assumption that the rationale should be sufficient to predict the golden label. Instead, it adopts a relaxed definition in which rationales are short, coherent, human-readable, and sufficient to support the model’s predictions rather than generating them. The conceptual shift is from $y$ 3 to modeling

$y$ 4

This reformulation is motivated by two failure modes attributed to two-stage models. The first is the interlocking problem: if the generator is initially poor, the predictor may overfit to meaningless but label-distinguishable snippets, reinforcing the generator to keep producing them. The second is spurious correlations: because the predictor only sees selected text, it can latch onto irrelevant but label-correlated phrases. A plausible implication is that YOFO’s single-phase formulation is intended not merely as an efficiency change, but as a change in the faithfulness–optimization trade-off (Jiang et al., 2023).

3. Architecture, optimization, and empirical profile of the 2023 YOFO

The 2023 framework deploys a pre-trained language model (PLM) such as BERT and conceptually divides it into three parts: Information Gathering (IG), Rationale Generation (RG), and Performance Boosting (PB). Early layers keep all tokens so the model can collect contextual information; middle layers progressively remove unimportant tokens; later layers operate on the retained tokens to improve task performance. Rather than directly choosing the important tokens in an unsupervised manner, YOFO gradually removes unimportant tokens during forward propagation.

Given input text $\mathbf{x}=(x_1,\dots,x_L)$, token embeddings are

$\mathbf{H}^{0}=\mathrm{Emb}(\mathbf{x}),$

with $\mathbf{H}^{0}\in\mathbb{R}^{L\times d}$. At each layer $l$, YOFO computes a binary token mask from the previous hidden states:

$\mathbf{s}^{l}=f(\mathbf{H}^{l-1})$

$\hat{\mathbf{m}}^{l}=\mathrm{GumbelSoftmax}(\mathbf{s}^{l})$

$\mathbf{m}^{l}=\hat{\mathbf{m}}^{l}\odot\mathbf{m}^{l-1}$

$\mathbf{H}^{l}=\mathrm{Layer}^{l}(\mathbf{H}^{l-1},\mathbf{m}^{l})$

The cumulative product with $\mathbf{m}^{l-1}$ enforces monotonic token reduction across layers; $\mathbf{m}^{0}$ is all ones, and the classification token is never deleted. The paper uses a two-layer MLP for $f$, together with the Gumbel softmax trick. During training, the authors found that directly zeroing hidden states led to very poor training, so they mask attention scores instead:

$A^{l}_{ij}=\begin{cases}\dfrac{\mathbf{q}^{l}_{i}\cdot\mathbf{k}^{l}_{j}}{\sqrt{d_k}}, & m^{l}_{j}=1\\[4pt] -\infty, & m^{l}_{j}=0\end{cases}$

$\mathrm{Attn}^{l}=\mathrm{softmax}(\mathbf{A}^{l})\,\mathbf{V}^{l}$
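The layerwise masking described above can be illustrated with a small, framework-free sketch. It is not the authors' implementation: the per-layer keep/drop decisions are passed in as plain 0/1 lists (in the paper they come from a learned MLP with the Gumbel softmax trick), index 0 is assumed to be the classification token, and the names `cumulative_masks` and `masked_attention_weights` are made up for illustration.

```python
import math


def cumulative_masks(keep_decisions):
    """Turn per-layer 0/1 keep decisions into monotonically shrinking masks.

    keep_decisions[l][t] is 1 if layer l votes to keep token t.
    Index 0 is assumed to be the classification token and is always kept.
    """
    length = len(keep_decisions[0])
    mask = [1] * length                       # initial mask: all ones
    masks = []
    for decision in keep_decisions:
        decision = [1] + list(decision[1:])   # never delete the first token
        # Product with the previous mask: a removed token stays removed.
        mask = [m * d for m, d in zip(mask, decision)]
        masks.append(mask)
    return masks


def masked_attention_weights(scores, mask):
    """Softmax over one query's attention scores, with removed keys set to -inf."""
    masked = [s if m == 1 else -math.inf for s, m in zip(scores, mask)]
    peak = max(masked)                        # finite, since token 0 is always kept
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Two token-removing layers over four tokens.
decisions = [[1, 1, 0, 1], [0, 1, 1, 0]]
masks = cumulative_masks(decisions)
weights = masked_attention_weights([0.0, 0.0, 5.0, 5.0], masks[-1])
```

In this toy run, token 2 is removed by the first layer and stays removed even though the second layer votes to keep it, and the removed keys receive zero attention weight while the hidden states themselves are left untouched.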

The loss combines task supervision, sparsity control, and a contiguity penalty:

$\mathbf{Z} \subseteq \mathbf{X}$ 8

and

$\mathbf{Z} \subseteq \mathbf{X}$ 9

or, with layerwise length control,

$P(\mathbf{y}\mid\mathbf{X})=P(\mathbf{y}\mid\mathbf{Z}, \mathbf{X})P(\mathbf{Z}\mid\mathbf{X}) = P(\mathbf{y}\mid\mathbf{Z})P(\mathbf{Z}\mid\mathbf{X}),$ 0
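For orientation, sparsity and contiguity terms in the rationalization literature are commonly written as the deviation of the kept-token ratio from a target plus a count of mask transitions. The sketch below uses that common form; the exact terms, weights, and layerwise variant in the paper may differ, and `target_ratio`, `lambda_sparse`, and `lambda_contig` are hypothetical names.

```python
def sparsity_penalty(mask, target_ratio):
    """Distance between the fraction of kept tokens and a target ratio."""
    return abs(sum(mask) / len(mask) - target_ratio)


def contiguity_penalty(mask):
    """Number of 0/1 transitions; fewer transitions means longer contiguous spans."""
    return sum(abs(cur - prev) for prev, cur in zip(mask, mask[1:]))


def regularized_loss(task_loss, mask, target_ratio,
                     lambda_sparse=1.0, lambda_contig=1.0):
    """Task loss plus weighted sparsity and contiguity terms (weights are assumptions)."""
    return (task_loss
            + lambda_sparse * sparsity_penalty(mask, target_ratio)
            + lambda_contig * contiguity_penalty(mask))


def layerwise_sparsity(masks, target_ratios):
    """Layerwise length control: one target ratio per token-removing layer."""
    return sum(sparsity_penalty(m, r) for m, r in zip(masks, target_ratios))


mask = [1, 1, 0, 0, 1, 0]
loss = regularized_loss(task_loss=0.7, mask=mask, target_ratio=0.5)
```

With this mask, half of the tokens are kept, so the sparsity term vanishes for a target of 0.5, while the three transitions between kept and removed tokens are what the contiguity term penalizes.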

The empirical evaluation uses BeerAdvocate and Hotel Review, with token-level Precision, Recall, F1, and classification Accuracy (ACC). The framework is implemented with PyTorch + HuggingFace Transformers, uses BERT, AdamW, learning rate $P(\mathbf{y}\mid\mathbf{X})=P(\mathbf{y}\mid\mathbf{Z}, \mathbf{X})P(\mathbf{Z}\mid\mathbf{X}) = P(\mathbf{y}\mid\mathbf{Z})P(\mathbf{Z}\mid\mathbf{X}),$ 1, max sequence length 256, batch size 64, and mostly trains for 10 epochs. Reported gains reach up to 18.4\% in token-level F1 compared to previous state-of-the-art methods. On the analysis side, the paper finds that token deletion is best in the middle layers, especially around layers 3 to 6 in one analysis, and that the best specific setting reported is RG from layer 3 to layer 6 with Log decay. This suggests that YOFO is not merely a single-pass replacement for RNP, but a specific claim about where rationale-relevant token elimination should occur in deep PLMs (Jiang et al., 2023).
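The token-level metrics can be made concrete with a short sketch. It assumes the usual definition, in which the selected rationale tokens are compared against human-annotated rationale tokens as two 0/1 masks of equal length; the function name `token_prf` is hypothetical.

```python
def token_prf(pred_mask, gold_mask):
    """Token-level precision, recall, and F1 between two 0/1 masks of equal length."""
    true_pos = sum(1 for p, g in zip(pred_mask, gold_mask) if p == 1 and g == 1)
    n_pred, n_gold = sum(pred_mask), sum(gold_mask)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if true_pos else 0.0
    return precision, recall, f1


# Three tokens selected, three annotated, two in common.
precision, recall, f1 = token_prf([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

Because F1 is computed over tokens rather than over examples, a rationale that overlaps the annotation only partially still receives partial credit.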

4. YOFO as a compositional judging paradigm for MLLMs

The 2025 YOFO is proposed for multimodal LLMs (MLLMs) used as judges. The motivating trade-off is explicit. Scalar score prediction approaches adapt an MLLM to output a single relevance score; this is efficient at inference, but it compresses multiple semantic constraints into one number and is argued to be misaligned with the generative nature of autoregressive LLMs/MLLMs. Autoregressive judging approaches generate an explanation, rationale, or multi-step analysis, but they are slow because they require token-by-token decoding. YOFO is positioned between these extremes.

The core observation is that judgment can be decomposed into checking whether an input satisfies a set of structured requirements. The paper defines the multimodal input as an image $I$ and requirements $R=\{r_1,\dots,r_n\}$, and formalizes YOFO as

$\{a_1,\dots,a_n\}=\mathrm{YOFO}(I,R),$

where each answer is binary,

$a_i\in\{\text{yes},\text{no}\}.$

The architecture uses a decoder-only MLLM backbone, appends a single unknown answer token to the end of each requirement $r_i$, concatenates all requirements into a template $T$, tokenizes the template into a sequence $\mathbf{t}=(t_1,\dots,t_M)$, and records the positions of the unknown answer tokens as $p_1,\dots,p_n$. The model then processes image and template jointly:

$\boldsymbol{\ell}_{1},\dots,\boldsymbol{\ell}_{M}=\mathrm{MLLM}(I,\mathbf{t})$

For requirement $r_i$, the answer token lies at position $p_i$, and the logits at position $p_i-1$ define the distribution over the next token:

$P_i(v)=\mathrm{softmax}(\boldsymbol{\ell}_{p_i-1})_{v}$

The practical binary decision rule is explicitly restricted to the two verbalizers “yes” and “no”:

$\hat{a}_i=\arg\max_{v\in\{\text{yes},\text{no}\}}P_i(v)$

for $i=1,\dots,n$.

The method is called You Only Forward Once because all requirement judgments are obtained from a single forward pass of the autoregressive MLLM. Each answer is read from its own answer-slot logits, and later requirements can condition on the earlier text of the template, but earlier answer tokens are never generated and fed back one by one. The paper further claims that this structure supports dependency-aware analysis, including a test case in which one requirement states: “The answer to this question is the opposite of the answer to the previous question” (Zhang et al., 20 Nov 2025).
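A toy sketch may help make the template-and-slot idea concrete. Everything here is illustrative rather than taken from the paper: the placeholder string `<unknown>`, the whitespace tokenizer, the two-entry vocabulary, and the function names are assumptions, and the logits are hand-written stand-ins for what one forward pass of a causal model would return. The sketch also assumes, as is standard for causal language models, that the distribution for a slot is read from the logits at the position just before it.

```python
import math

SLOT = "<unknown>"  # hypothetical placeholder string for an answer slot


def build_template(requirements):
    """Concatenate requirements, placing one answer slot after each.

    Returns the token list and the index of every answer slot.
    """
    tokens, slot_positions = [], []
    for requirement in requirements:
        tokens.extend(requirement.split())    # toy whitespace tokenizer
        slot_positions.append(len(tokens))    # index the slot is about to occupy
        tokens.append(SLOT)
    return tokens, slot_positions


def judge(logits, slot_positions, yes_id, no_id):
    """Read one yes/no decision per slot from the logits of a single pass."""
    decisions = []
    for pos in slot_positions:
        row = logits[pos - 1]                 # position pos-1 predicts the token at pos
        peak = max(row[yes_id], row[no_id])   # subtract the max for numerical stability
        e_yes = math.exp(row[yes_id] - peak)
        e_no = math.exp(row[no_id] - peak)
        p_yes = e_yes / (e_yes + e_no)        # softmax restricted to the two verbalizers
        decisions.append(("yes" if p_yes >= 0.5 else "no", p_yes))
    return decisions


tokens, slots = build_template(["The dress is red.", "It has long sleeves."])
# Stand-in logits over a two-word vocabulary: index 0 = "yes", index 1 = "no".
logits = [[0.0, 0.0] for _ in tokens]
logits[slots[0] - 1] = [2.0, 0.0]
logits[slots[1] - 1] = [-1.0, 3.0]
decisions = judge(logits, slots, yes_id=0, no_id=1)
```

No decoding loop appears anywhere: both decisions come from one set of logits, which is the sense in which the model forwards only once.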

5. Training, inference, and empirical profile of the 2025 YOFO

Training and validation are built from SA-1B images. For each image, an MLLM proposes properties that are true or false of the image, along with yes/no labels and reasons. This is used to train and validate YOFO as a general-purpose property judge. The main downstream test setting is multimodal recommendation / reranking using LRVS-Fashion / LAION-RVS-Fashion252, where, given two candidate images and a customer-style query, the model must judge which image better matches the query. The paper also constructs a dependency-aware judgment task from SA-1B validation samples. YOFO requires special fine-tuning rather than prompt engineering over a frozen base model: the backbone MLLM is fine-tuned with LoRA, the processor is modified to add special tokens marking answer and reason spans, and the model is trained on data formatted with the YOFO requirement template.

Without post-hoc CoT, the answer loss is

$\mathbf{y}\leftarrow \mathbf{Z} \leftarrow \mathbf{X}$ 7

With post-hoc CoT, if $\mathbf{y}\leftarrow \mathbf{Z} \leftarrow \mathbf{X}$ 8 is the set of positions of tokens in the reason field,

$\mathbf{y}\leftarrow \mathbf{Z} \leftarrow \mathbf{X}$ 9

and the total loss is

$\mathbf{X}\perp\!\!\!\perp\mathbf{y}\mid\mathbf{Z}$ 0

with default $\mathbf{X}\perp\!\!\!\perp\mathbf{y}\mid\mathbf{Z}$ 1. At inference, the process is: decompose the user query into structured requirements using an LLM, assemble the YOFO template, feed the image and the template into the MLLM, run one forward pass, extract the logits at $\mathbf{X}\perp\!\!\!\perp\mathbf{y}\mid\mathbf{Z}$ 2, compare the probabilities of “yes” and “no,” and optionally map the per-requirement judgments to a final downstream score using a human-defined expression or rule.
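The training objective and the final scoring step can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the losses are taken to be mean negative log-likelihoods of the gold tokens, `reason_weight=0.5` is a placeholder value rather than the paper's default, and the "fraction of satisfied requirements" rule is just one example of a human-defined aggregation.

```python
import math


def mean_nll(gold_probs):
    """Average negative log-likelihood the model assigns to the gold tokens."""
    return -sum(math.log(p) for p in gold_probs) / len(gold_probs)


def training_loss(answer_gold_probs, reason_gold_probs=None, reason_weight=0.5):
    """Answer loss, plus a weighted reason loss when post-hoc reasons are trained.

    reason_weight is a placeholder, not a value taken from the paper.
    """
    loss = mean_nll(answer_gold_probs)
    if reason_gold_probs:
        loss += reason_weight * mean_nll(reason_gold_probs)
    return loss


def satisfied_fraction(decisions):
    """Example aggregation rule: share of requirements judged 'yes'."""
    return sum(1 for d in decisions if d == "yes") / len(decisions)


def pick_better(decisions_a, decisions_b):
    """Rerank two candidates by the example rule; ties go to candidate A."""
    return "A" if satisfied_fraction(decisions_a) >= satisfied_fraction(decisions_b) else "B"


answer_only = training_loss([0.5, 0.5])
with_reasons = training_loss([0.5, 0.5], reason_gold_probs=[0.25])
winner = pick_better(["yes", "yes", "no"], ["yes", "no", "no"])
```

Keeping the aggregation rule outside the model means the same per-requirement judgments can be rescored under a different rule without another forward pass.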

The paper’s headline results are on LAION-RVS-Fashion252. From Table 1, Jina-Reranker-m0 obtains $\mathbf{X}\perp\!\!\!\perp\mathbf{y}\mid\mathbf{Z}$ 3 and throughput 36.41 pairs/s; YOFO (Qwen2-VL) obtains $\mathbf{X}\perp\!\!\!\perp\mathbf{y}\mid\mathbf{Z}$ 4 and 35.08 pairs/s; YOFO (Qwen3-VL) obtains $\mathbf{X}\perp\!\!\!\perp\mathbf{y}\mid\mathbf{Z}$ 5 and 47.6 pairs/s. On SA-1B-derived validation, YOFO improves strongly over base models. For Qwen2-VL, Base gives $\mathbf{X}\perp\!\!\!\perp\mathbf{y}\mid\mathbf{Z}$ 6 and $\mathbf{X}\perp\!\!\!\perp\mathbf{y}\mid\mathbf{Z}$ 7, while YOFO gives $\mathbf{X}\perp\!\!\!\perp\mathbf{y}\mid\mathbf{Z}$ 8 and $\mathbf{X}\perp\!\!\!\perp\mathbf{y}\mid\mathbf{Z}$ 9. For Qwen3-VL, Base gives $Gen$ 0 and $Gen$ 1, while YOFO gives $Gen$ 2 and $Gen$ 3. On dependency-aware judgment, YOFO + dep reaches $Gen$ 4 and $Gen$ 5, whereas normal YOFO training gives $Gen$ 6 and $Gen$ 7. The stated interpretation is that dependency-aware behavior is not automatic; it becomes near-perfect only after training on purposely constructed dependency data (Zhang et al., 20 Nov 2025).

6. Comparative interpretation and naming ambiguities

The two YOFO frameworks share a common naming principle—single-forward-pass computation—but they are methodologically distinct. The 2023 YOFO is a single-phase framework for prediction and rationalization in text classification with a PLM such as BERT, and its defining mechanism is gradually removing unimportant tokens during forward propagation. The 2025 YOFO is a template-conditioned method for compositional judging with a decoder-only MLLM, and its defining mechanism is reading the logits of the final token associated with each requirement to obtain binary decisions. The first treats rationales as token subsets that should support predictions; the second treats judgment as structured yes/no requirement checks (Jiang et al., 2023, Zhang et al., 20 Nov 2025).

The limitations are likewise different. In the 2023 formulation, rationales are supportive, not necessarily sufficient, and the related work discussion notes that RNP has the advantage that unselected text is guaranteed to have no contribution to prediction. In the 2025 formulation, the paper explicitly notes one limitation: it evaluates YOFO only on one downstream task family, namely reranking/recommendation; additional implied limitations include dependence on good requirement decomposition, a binary decomposition assumption, template sensitivity, and the fact that dependency-aware reasoning is learned rather than guaranteed. This suggests that the shared acronym encodes a common computational motif rather than a unified theoretical program (Jiang et al., 2023, Zhang et al., 20 Nov 2025).

A separate source of confusion is FedYolo, which is “You Only Load Once” rather than “You Only Forward Once.” FedYolo belongs to federated learning, where clients load a pretrained transformer once, freeze the backbone, and then perform all future adaptation through small task-specific modules such as adapters, LoRA layers, or prompts, plus a task head. It is therefore not a YOFO method but a closely related, easily misremembered acronym. Accordingly, references to YOFO in the literature should be disambiguated carefully between the two YOFO papers described here and the distinct FedYolo line (Zhang et al., 2023).