Preference-Anchored Rationalization (PARROT)

Updated 4 July 2026

PARROT is a framework that formalizes decision-making by linking a stable true preference with a set of justifiable rationales through lexicographic choice.
Its applications range from economic choice theory to machine learning, where preference signals are enriched with structured or free-form textual justifications.
Extensions of PARROT integrate rationale generation and critique into preference and reward modeling, improving efficiency and consistency in complex decision tasks.

Preference-Anchored Rationalization (PARROT) denotes a family of formalisms that connect observed preference with explicit justification or rationale structure. In its original choice-theoretic form, PARROT models a decision maker who has a stable “true” preference but restricts attention to outcomes that are top-ranked by some “justifiable” preference, and then selects the true-best element within that justifiable set (Ridout, 2020). In more recent machine-learning work, the same anchoring idea is used to enrich pairwise preference data with rationales, either by jointly optimizing preference and rationale likelihood or by recovering rationale supervision from preference labels through anchored generation, filtering, and distillation (Just et al., 2024; Wang et al., 13 Apr 2026). Across these uses, the common structural theme is that a preference signal is not treated as a bare binary outcome: it is tied to an admissible explanation space that constrains inference, learning, or both.

1. Choice-theoretic core

In the original model, the choice universe is a nonempty set of feasible outcomes $\mathcal{A}$ , menus are nonempty finite subsets $A \subseteq \mathcal{A}$ , and a choice correspondence $c$ assigns to each menu a nonempty subset $c(A) \subseteq A$ . The decision maker has a single “true” preference $\succeq^\star$ , assumed complete and transitive, together with a nonempty set $J$ of “justifications,” each itself a complete and transitive order on the same outcome space. PARROT first forms the justifiable set

$M(A) = \bigcup_{\succeq_J \in J} \operatorname{Argmax}(A,\succeq_J),$

and then breaks ties within $M(A)$ by the true preference: $c(A) = \operatorname{Argmax}(M(A),\succeq^\star).$ Equivalently, choice is lexicographic: the decision maker first avoids unjustifiable outcomes and then selects the $\succeq^\star$ -best alternative among those that remain (Ridout, 2020).
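The two-step rule above can be sketched on a finite menu. This is a minimal illustration, assuming each preference is given as a numeric utility; the names `argmax_set`, `parrot_choice`, and the toy utilities are hypothetical and do not come from Ridout (2020).

```python
from typing import Callable, Hashable, Iterable, Sequence, Set

Utility = Callable[[Hashable], float]


def argmax_set(menu: Iterable[Hashable], utility: Utility) -> Set[Hashable]:
    """Return every element of `menu` that attains the maximal utility."""
    items = list(menu)
    best = max(utility(x) for x in items)
    return {x for x in items if utility(x) == best}


def parrot_choice(
    menu: Iterable[Hashable],
    true_utility: Utility,
    justifications: Sequence[Utility],
) -> Set[Hashable]:
    """Compute c(A) = Argmax(M(A), true preference)."""
    items = list(menu)
    # M(A): union of the top-ranked elements under each justification.
    justifiable: Set[Hashable] = set()
    for utility in justifications:
        justifiable |= argmax_set(items, utility)
    # Select the true-best element(s) among the justifiable ones.
    return argmax_set(justifiable, true_utility)


# Toy data (assumed for illustration): the true preference ranks a > b > c.
true_u = {"a": 3, "b": 2, "c": 1}.__getitem__
just_1 = {"a": 1, "b": 3, "c": 2}.__getitem__  # ranks b > c > a
just_2 = {"a": 2, "b": 0, "c": 3}.__getitem__  # ranks c > a > b

small = parrot_choice({"a", "b"}, true_u, [just_1, just_2])
large = parrot_choice({"a", "b", "c"}, true_u, [just_1, just_2])
print(small, large)
```

In the toy data, `a` is justifiable in the menu {a, b} because the second justification ranks it first there, but adding `c` removes that support, so the choice moves from `a` to `b` even though the true preference never changes.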

This formulation was introduced for settings in which behavior is constrained by morality, rationality, or related virtues, while underlying motives may differ. The model therefore separates two layers that are often conflated in standard revealed-preference analysis: a stable underlying ranking and a set of admissible rationales. The data that PARROT seeks to explain are not arbitrary inconsistencies, but patterns in which some options are excluded unless they can be defended by at least one acceptable ordering.

The same paper emphasizes that the justifications are full linear orders rather than unconstrained ex post stories. This restriction is central to tractability and identification. In the expected-utility case, where outcomes are lotteries, the true preference admits an expected-utility representation, each justifiable preference admits an expected-utility representation of its own, and the set of justifiable utilities is compact and convex (Ridout, 2020).

2. Representation, axioms, and identification

The behavioral representation result with known true preference is sharp. A choice correspondence $c$ admits a PARROT representation by a true preference $\succeq^\star$ and a justification set $J$ if and only if it satisfies two conditions. The first is Optimization, which requires choices to be consistent with maximization of the true preference. The second is Irrelevance of Unjustifiable Alternatives (IUA): whenever an alternative is strictly better under the true preference than what is chosen from a menu containing it, yet is not itself chosen, removing that alternative from any larger menu leaves choice unchanged. When an observable dominance relation is added, IUA is strengthened to Irrelevance of Submaximal Alternatives (ISA), and the corresponding representation requires justifiable preferences to respect that dominance relation (Ridout, 2020).

In the expected-utility special case, the representation theorem adds Independence, Continuity, Monotonicity, and Convexity to IUA and Optimization. Under these axioms, $c$ has an EU-PARROT representation, and in that case there are a unique minimal and a unique maximal set of justifiable EU utilities once the true preference is known (Ridout, 2020). This is one of the main reasons the framework is stronger than unconstrained rationalization: the data restrict not only observed choice but also the admissible justification set itself.

When the true preference is unknown, the model still imposes substantial structure. Corollary 5 defines a revealed-preference relation from the choice data and states that acyclicity of this relation is necessary and sufficient for the existence of some true preference and justifications that rationalize $c$. Theorem 6 then introduces Irrelevance of Excluded Alternatives (IEA) as the analog of IUA when the true preference $\succeq^\star$ is unobserved; IEA holds if and only if $c$ admits a PARROT representation. The theorem also yields a canonical representation, in which the true preference is the unique strict extension of the revealed relation on all 3-cycles plus standard WARP pairs, and the justification set is the maximal set of total orders consistent with all revealed exclusions (Ridout, 2020).

The paper’s applications illustrate the intended scope. Snyder and Kleck’s wheelchair-avoidance data are represented through a choice cycle explained by a single true ranking together with justifiable preferences that agree on how some options are ranked but disagree about others. Norton et al.’s hiring example is interpreted as a case where the true preference favors the male candidate, while justifications are restricted to “education matters” or “experience matters.” Additional examples include bribery, distributional preferences, charitable giving under risk, and ambiguity (Ridout, 2020). A plausible implication is that PARROT is best understood not as a generic model of inconsistency, but as a structured model of exclusion by justificatory constraints.
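The hiring interpretation can be made concrete with a small sketch. The candidate attributes and numbers below are invented for illustration and are not data from Norton et al. or Ridout (2020); the only modeled ingredients are a true preference for the male candidate and two admissible justifications, "education matters" and "experience matters".

```python
from typing import Callable, Iterable, NamedTuple, Set


class Candidate(NamedTuple):
    name: str
    is_male: bool
    education: int   # hypothetical score
    experience: int  # hypothetical score


def top(menu: Iterable[Candidate], key: Callable[[Candidate], float]) -> Set[Candidate]:
    """Candidates in `menu` that are ranked first by `key`."""
    items = list(menu)
    best = max(key(c) for c in items)
    return {c for c in items if key(c) == best}


def choose(menu: Iterable[Candidate]) -> Set[Candidate]:
    """Pick the true-best candidate among those some justification ranks first."""
    items = list(menu)
    justifiable = top(items, lambda c: c.education) | top(items, lambda c: c.experience)
    return top(justifiable, lambda c: c.is_male)


# The male candidate wins whenever either criterion can defend him...
m1, f1 = Candidate("M1", True, 4, 2), Candidate("F1", False, 2, 4)
m2, f2 = Candidate("M2", True, 2, 4), Candidate("F2", False, 4, 2)
# ...but not when the female candidate is better on both criteria.
m3, f3 = Candidate("M3", True, 2, 2), Candidate("F3", False, 4, 4)

for menu in ([m1, f1], [m2, f2], [m3, f3]):
    print([c.name for c in choose(menu)])
```

The first two menus show the decision maker switching the invoked criterion to defend the same underlying choice; the third shows that the choice changes only once no admissible justification supports the preferred candidate.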

3. Invariance and generalized rationalizability

A distinct but related use of the PARROT label appears in the invariant-rationalizability literature. Here the basic data are a pair $(R, P)$ of weak and strict revealed-preference relations on a set $X$, and the question is whether there exists a complete, transitive, $\mathcal{M}$-invariant preference $\succeq$ extending those relations, where $\mathcal{M}$ is a family of transformations of $X$. $\mathcal{M}$-invariance means that for each transformation $\omega \in \mathcal{M}$,

$x \succeq y \implies \omega(x) \succeq \omega(y),$

or, equivalently in many cases, that the comparison is preserved in both directions under the transformation family. The same framework covers quasilinearity, homotheticity, mixture-independence, Koopmans stationarity, separability, and several risk- and ambiguity-related axioms by choosing an appropriate $\mathcal{M}$ (Caradonna et al., 2024).

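When the transformation family $\mathcal{M}$ is commutative, the characterization is especially simple. Define the $\mathcal{M}$-closure $R_{\mathcal{M}}$ of a relation $R$ by

$x \mathrel{R_{\mathcal{M}}} y \iff \exists\, \omega \in \mathcal{M} \text{ and } x', y' \text{ with } x' \mathrel{R} y',\ x = \omega(x'),\ y = \omega(y').$

Then there exists a complete, transitive, $\mathcal{M}$-invariant rationalization of the revealed relations $(R, P)$ if and only if the closed pair $(R_{\mathcal{M}}, P_{\mathcal{M}})$ is acyclic. When $\mathcal{M} = \{\mathrm{id}\}$, this reduces to the classical acyclicity condition of Richter (Caradonna et al., 2024).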
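The commutative-case test can be sketched for a finite set. The sketch below rests on stated assumptions rather than on the paper's own construction: the closure is taken to be the smallest superset of the revealed comparisons that is stable under applying each transformation to both sides, and acyclicity means that no cycle of weak-or-strict comparisons contains a strict one. All function names are hypothetical.

```python
from typing import Callable, Hashable, Iterable, Sequence, Set, Tuple

Pair = Tuple[Hashable, Hashable]  # (x, y) reads "x is revealed preferred to y"
Transform = Callable[[Hashable], Hashable]


def closure(pairs: Iterable[Pair], transforms: Sequence[Transform]) -> Set[Pair]:
    """Close `pairs` under applying every transform to both sides of a pair."""
    closed = set(pairs)
    frontier = set(closed)
    while frontier:  # terminates on a finite ground set
        new = {(t(x), t(y)) for (x, y) in frontier for t in transforms} - closed
        closed |= new
        frontier = new
    return closed


def reachable(start: Hashable, edges: Set[Pair]) -> Set[Hashable]:
    """Nodes reachable from `start` along directed edges (start included)."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for a, b in edges:
            if a == node and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen


def is_acyclic(weak: Iterable[Pair], strict: Iterable[Pair]) -> bool:
    """True when no cycle of comparisons contains a strict comparison."""
    strict = set(strict)
    edges = set(weak) | strict
    return all(x not in reachable(y, edges) for (x, y) in strict)


def invariantly_rationalizable(weak, strict, transforms) -> bool:
    """Acyclicity of the closed relations (the assumed commutative-case test)."""
    return is_acyclic(closure(weak, transforms), closure(strict, transforms))


# Toy data on the integers mod 4 with a single shift, so the family commutes.
def shift(x: int) -> int:
    return (x + 2) % 4

weak, strict = {(3, 2)}, {(0, 1)}
print(is_acyclic(weak, strict))                           # classical test only
print(invariantly_rationalizable(weak, strict, [shift]))  # with invariance imposed
```

In the toy data the raw comparisons contain no cycle, but shifting `3 ≽ 2` yields `1 ≽ 0`, which contradicts the strict comparison `0 ≻ 1`; invariance therefore rules out data that the classical condition accepts.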

The noncommuting case is more involved. The framework introduces broken cycles, forbidden subrelations, and a collapse operation. Starting from all forbidden subrelations generated by broken cycles, one iteratively closes the set under collapses until a fixed point is reached, and defines strong acyclicity by the absence of the empty subrelation from that fixed point. The main theorem states that the revealed-preference data are rationalizable by a complete, transitive, invariant preference if and only if they are strongly acyclic (Caradonna et al., 2024).

This construction is tied to an automated-theorem-proving perspective. Existence of an invariant preference is reformulated as propositional satisfiability over variables that encode the weak and strict comparisons between pairs of alternatives; broken cycles generate clauses forbidding patterns of literals, the collapse operation mirrors propositional resolution, and Robinson’s refutation-completeness under negative resolution yields the equivalence between the failure of strong acyclicity and unsatisfiability (Caradonna et al., 2024). The same machinery also gives a generalized Dushnik–Miller theorem: for strongly acyclic data, the intersection of all complete, transitive, invariant rationalizations is exactly the set of comparisons not appearing in any forbidden subrelation of the fixed point.

4. Rationale-enriched preference learning

In large-language-model alignment, the data-centric reinterpretation of PARROT begins from the standard RLHF or DPO setting, where one observes prompts $x$ and paired responses $y_w$ and $y_l$, with the assumption

$y_w \succ y_l \mid x.$

The proposal is to enrich each datapoint $(x, y_w, y_l)$ into $(x, y_w, y_l, r)$, where $r$ is a free-form text rationale explaining why $y_w$ is preferred over $y_l$. The enriched distribution factorizes as

$p(y_w \succ y_l, r \mid x) = p(y_w \succ y_l \mid x)\, p(r \mid x, y_w \succ y_l).$

The first term is the usual Bradley–Terry or DPO preference model, while the second term teaches the model to generate or recognize explanatory text conditioned on the preference event (Just et al., 2024).

Under DPO, the standard loss is

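$\mathcal{L}_{\mathrm{DPO}} = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right].$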

The rationale-augmented version defines

$\mathcal{L}_{\mathrm{RDPO}} = \mathcal{L}_{\mathrm{DPO}} - \gamma\, \mathbb{E}_{(x, y_w, y_l, r)}\left[\log \pi_\theta(r \mid x, y_w \succ y_l)\right],$

where $\gamma$ weights rationale learning. In practice, rationales are generated cheaply by prompting an off-the-shelf LLM such as Mistral-7B, Llama3-8B, or GPT-3.5 with a few-shot template asking why the chosen response $y_w$ is preferred over the rejected response $y_l$. For DPO-based methods, the pipeline first performs one epoch of SFT on the chosen responses $y_w$; for ORPO this step is skipped. Training then proceeds on minibatches of $(x, y_w, y_l, r)$, and at inference time the learned policy is used normally to generate a response from the prompt $x$, without requiring rationales (Just et al., 2024).
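A per-example version of the rationale-augmented objective can be sketched from sequence log-probabilities. This is a minimal illustration under assumptions, not the authors' implementation: it takes the augmented loss to be the DPO loss plus a weighted negative log-likelihood of the rationale, and the argument names, the rationale weight `gamma`, and the numeric inputs are all hypothetical.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(pol_w: float, pol_l: float, ref_w: float, ref_l: float, beta: float) -> float:
    """DPO loss for one pair, given log-probs of the chosen (w) and rejected (l)
    responses under the policy (pol_*) and the frozen reference (ref_*)."""
    margin = beta * ((pol_w - ref_w) - (pol_l - ref_l))
    return -log_sigmoid(margin)


def rdpo_loss(
    pol_w: float,
    pol_l: float,
    ref_w: float,
    ref_l: float,
    rationale_logprob: float,
    beta: float,
    gamma: float,
) -> float:
    """DPO loss plus gamma times the negative log-likelihood of the rationale,
    where `rationale_logprob` is the policy's log-prob of the rationale text
    conditioned on the prompt and the stated preference."""
    return dpo_loss(pol_w, pol_l, ref_w, ref_l, beta) - gamma * rationale_logprob


# Hypothetical log-probabilities for a single training example.
plain = dpo_loss(-5.0, -5.0, -5.0, -5.0, beta=0.1)
augmented = rdpo_loss(-5.0, -5.0, -5.0, -5.0, rationale_logprob=-2.0, beta=0.1, gamma=0.5)
print(plain, augmented)
```

With `gamma = 0` the function reduces to plain DPO, which makes the rationale term easy to ablate in a sweep over the weight.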

The information-theoretic analysis introduces the input $S = (x, y_w, y_l)$, a binary preference variable $Z$, and rationale $R$. The conditional mutual information $I(Z; R \mid S)$ measures how much additional signal the rationale carries about the preference. The paper states that, under mild assumptions, adding the rationale reduces sample complexity: the rationale-aware and unaugmented generalization bounds depend on different information quantities, one of which captures irrelevant information in the rationale (Just et al., 2024). This suggests that rationale quality matters not merely as auxiliary text supervision, but as a control on relevance versus irrelevance in the preference signal.
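To make the role of conditional mutual information concrete, the quantity can be computed exactly for a small discrete joint distribution. The sketch below is illustrative only: the variable roles (context `s`, preference `z`, rationale `r`) follow the paragraph above, while the function name and the two toy distributions are assumptions.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, Tuple

Joint = Dict[Tuple[Hashable, Hashable, Hashable], float]  # {(s, z, r): probability}


def conditional_mutual_information(joint: Joint) -> float:
    """I(Z; R | S) in nats for a joint pmf over (s, z, r)."""
    p_s = defaultdict(float)
    p_sz = defaultdict(float)
    p_sr = defaultdict(float)
    for (s, z, r), p in joint.items():
        p_s[s] += p
        p_sz[(s, z)] += p
        p_sr[(s, r)] += p
    total = 0.0
    for (s, z, r), p in joint.items():
        if p > 0.0:
            # p(s, z, r) * log[ p(s) p(s, z, r) / (p(s, z) p(s, r)) ]
            total += p * math.log(p * p_s[s] / (p_sz[(s, z)] * p_sr[(s, r)]))
    return total


# A rationale that fully reveals the preference within a fixed context.
informative = {("s", 0, "prefers-first"): 0.5, ("s", 1, "prefers-second"): 0.5}

# A rationale that is independent of the preference within the same context.
uninformative = {
    ("s", 0, "generic"): 0.25,
    ("s", 0, "boilerplate"): 0.25,
    ("s", 1, "generic"): 0.25,
    ("s", 1, "boilerplate"): 0.25,
}

print(conditional_mutual_information(informative))
print(conditional_mutual_information(uninformative))
```

The informative rationale carries log 2 nats about a uniform binary preference, while the independent one carries none, which mirrors the contrast between matched and irrelevant rationales.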

The experiments support that interpretation. On Orca-DPO-Pairs, RDPO reaches its reported win-rate against SFT with fewer preference samples than DPO requires, and its win-rate peaks when the full training set is used. Even when DPO is trained on a larger number of points, RDPO trained on substantially fewer remains competitive in head-to-head comparisons. The paper also compares average output length on Orca test prompts and TriviaQA exact match for DPO and RDPO. In the ORPO setting, the win-rate rises when rationales are added (Just et al., 2024).

Ablations further delimit the mechanism. “General” rationales lead to faster early convergence than DPO or SFT, while both general and “Detailed” rationales outperform DPO in the long run. Permuted rationales and “Opposite” rationales are each evaluated by win-rate against correctly matched RDPO. Rationales generated by smaller models such as Phi-3-Mini, Mistral, and Llama3 all improve preference tuning relative to DPO, with best results when source and target coincide. A sweep over the rationale weight shows robust wins across a range of values, and additional training epochs beyond a certain point do not further improve RDPO (Just et al., 2024).

5. Anchored rationale recovery and reasoning rewards

In multimodal reward modeling, PARROT is formulated as a three-phase pipeline for recovering rationales from pairwise human preference data. The setup assumes

$\mathcal{D} = \{(x, y_w, y_l)\},$

where $x$ is the conditioning context and $y_w$, $y_l$ are two outputs such that a human annotator preferred $y_w$ over $y_l$. Writing $\ell$ for this recorded preference, a rationale-generator is introduced, with a Teacher variational posterior $q_\phi(z \mid x, y_w, y_l, \ell)$ over rationales $z$, a Student prior $p_\theta(z \mid x, y_w, y_l)$, and a joint model $p_\theta(z, \ell \mid x, y_w, y_l)$. The framework derives the ELBO

$\log p_\theta(\ell \mid x, y_w, y_l) \ge \mathbb{E}_{z \sim q_\phi}\left[\log p_\theta(\ell \mid x, y_w, y_l, z)\right] - \mathrm{KL}\left(q_\phi(z \mid x, y_w, y_l, \ell) \,\|\, p_\theta(z \mid x, y_w, y_l)\right).$

Phase 1 uses Preference-Anchored Generation: the Teacher is prompted with a preference anchor, such as a hint stating the human preference or its negative counterpart, to generate candidate rationales. Phase 2 uses Consistency Filtering: a sample is retained only if the Teacher can recover the original label from the rationale alone, and empirically only a share of the generated rationales survives this filter. Phase 3 uses Distillation into the Student by maximizing the Student's likelihood of the retained rationales on the filtered dataset (Wang et al., 13 Apr 2026).
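The first two phases can be sketched as a filter over labeled pairs, with the Teacher abstracted as two callables. Everything below is a hypothetical stand-in: the `Example` fields, the hint wording, and the toy Teacher functions are assumptions for illustration, and Phase 3 is represented only by the returned dataset that a Student would be fine-tuned on.

```python
from typing import Callable, List, NamedTuple, Optional


class Example(NamedTuple):
    context: str
    output_a: str
    output_b: str
    label: str  # "A" or "B": the output the annotator preferred


class Distilled(NamedTuple):
    example: Example
    rationale: str


def recover_rationales(
    data: List[Example],
    generate: Callable[[Example, str], str],
    predict: Callable[[str, str, str, str], Optional[str]],
) -> List[Distilled]:
    """Anchored generation followed by consistency filtering."""
    kept = []
    for ex in data:
        # Phase 1: anchor the Teacher on the recorded preference.
        hint = f"Hint: human preference is: {ex.label}"
        rationale = generate(ex, hint)
        # Phase 2: keep the sample only if the label is recoverable
        # from the rationale without seeing the hint.
        if predict(ex.context, ex.output_a, ex.output_b, rationale) == ex.label:
            kept.append(Distilled(ex, rationale))
    return kept  # Phase 3 would train the Student on these pairs.


def toy_generate(ex: Example, hint: str) -> str:
    """Toy Teacher: vague when both outputs are flawed, specific otherwise."""
    if "blurry" in ex.output_a and "blurry" in ex.output_b:
        return "Both outputs look similar."
    preferred = hint.rsplit(" ", 1)[-1]
    return f"Output {preferred} matches the request more closely."


def toy_predict(context: str, output_a: str, output_b: str, rationale: str) -> Optional[str]:
    """Toy judge: read the preferred option back out of the rationale text."""
    for option in ("A", "B"):
        if f"Output {option}" in rationale:
            return option
    return None


data = [
    Example("a red cube", "red cube", "blue cube", "A"),
    Example("a cat", "blurry cat", "blurry dog", "A"),
]
kept = recover_rationales(data, toy_generate, toy_predict)
print(len(kept), kept[0].rationale)
```

The vague rationale for the second example fails the filter because it does not let the judge recover the label, which is the behavior the consistency check is meant to enforce.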

The resulting RationalRewards model uses structured critiques in two ways. First, the rationale contains scalar subscores, with dimensions listed as Text Faithfulness, Image Faithfulness, Physical Plausibility, and Text Rendering, and these are aggregated into a scalar reward.

Second, the model supports a Generate–Critique–Refine loop at test time: generate an image, critique it, and if any subscore falls below a threshold then emit a refined prompt and regenerate. The details specify an example threshold, and state that a single iteration recovers a reported share of the gains of expensive RL fine-tuning (Wang et al., 13 Apr 2026).
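The test-time loop can be sketched with the generator, critic, and prompt refiner passed in as callables. This is a hedged sketch, not the RationalRewards implementation: the score scale, the threshold value, the choice of a simple mean for aggregation, and all toy functions are assumptions introduced for illustration.

```python
from typing import Callable, Dict, Tuple

Scores = Dict[str, float]


def aggregate(scores: Scores) -> float:
    """Collapse subscores into one reward (a plain mean, chosen as an assumption)."""
    return sum(scores.values()) / len(scores)


def generate_critique_refine(
    prompt: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], Scores],
    refine: Callable[[str, Scores], str],
    threshold: float,
    max_rounds: int = 1,
) -> Tuple[str, str, Scores]:
    """Regenerate with a refined prompt while any subscore is below `threshold`."""
    request = prompt  # the critic always judges against the original request
    image = generate(prompt)
    scores = critique(request, image)
    for _ in range(max_rounds):
        if min(scores.values()) >= threshold:
            break
        prompt = refine(prompt, scores)
        image = generate(prompt)
        scores = critique(request, image)
    return prompt, image, scores


def toy_generate(prompt: str) -> str:
    return f"image[{prompt}]"


def toy_critique(request: str, image: str) -> Scores:
    rendering = 4.0 if "legible" in image else 1.0
    return {
        "text_faithfulness": 4.0,
        "image_faithfulness": 4.0,
        "physical_plausibility": 4.0,
        "text_rendering": rendering,
    }


def toy_refine(prompt: str, scores: Scores) -> str:
    return f"{prompt}, with legible lettering"


final_prompt, final_image, final_scores = generate_critique_refine(
    "a shop sign that says OPEN", toy_generate, toy_critique, toy_refine, threshold=3.0
)
print(final_prompt, aggregate(final_scores))
```

Keeping `max_rounds` at one matches the single-iteration setting described above; raising it trades extra inference for further refinement.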

In the reported results, the 8B PARROT-trained model outperforms all open-source scalar models on preference-prediction accuracy for both text-to-image and editing, matching Gemini-2.5-Pro. On UniGenBench++, FLUX.1-dev improves overall when tuned with RationalRewards, in a comparison that also covers MultiReward and the raw Teacher. Figure 1 is summarized as showing that scalar rewards exhibit high variance and “hack” generators, whereas RationalRewards yields smooth, monotonic training curves with decaying standard deviation. In test-time prompt tuning, a single Generate–Critique–Refine iteration raises both ImgEdit-Bench and GEdit-Bench-EN scores, with only a modest amount of extra inference time (Wang et al., 13 Apr 2026).

A plausible implication is that, in this line of work, PARROT is not merely a rationale-extraction procedure. It is an architectural principle for converting pairwise preference labels into deployable critique models whose outputs remain useful both as training rewards and as test-time control signals.

6. PARROT-style anchoring in long-video preference optimization and cross-literature limits

LongVPO describes a “preference-anchored rationalization” implementation for long-form video understanding. Stage 1 synthesizes training triplets from short-clip QA data by anchoring each question to exactly one clip, interleaving that anchor with distractors, and then constructing a pseudo-long video by concatenation. Candidate triples are filtered in two ways: a Visual-Similarity Filtering step computes DINOv2 embeddings and discards or replaces distractors whenever cosine similarity exceeds a threshold, and an optional Question-Specificity Filtering step uses a large LLM to verify that the question requires at least two distinct visual cues from the anchor. The DPO objective is then modified through an anchor-only approximation

$\pi_{\mathrm{ref}}(y \mid V, q) \approx \pi_{\mathrm{ref}}(y \mid v_a, q),$

where $V$ is the pseudo-long video, $v_a$ the anchor clip, $q$ the question, and $y$ a candidate response, because the reference model is a short-clip model and degrades on full long-context input (Huang et al., 2 Feb 2026).
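The Visual-Similarity Filtering step described above can be sketched with plain cosine similarity. The vectors below are toy stand-ins for DINOv2 embeddings, and the threshold value and function names are hypothetical; the sketch only drops distractors and does not model replacement.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_distractors(
    anchor_embedding: Sequence[float],
    distractors: List[Tuple[str, Sequence[float]]],
    threshold: float,
) -> List[str]:
    """Keep the distractor clips whose similarity to the anchor does not exceed
    `threshold`, so the anchored question cannot be answered from a look-alike."""
    return [
        name
        for name, embedding in distractors
        if cosine(anchor_embedding, embedding) <= threshold
    ]


anchor = (1.0, 0.0)
candidates = [
    ("near_duplicate", (0.9, 0.1)),  # almost the same direction as the anchor
    ("unrelated", (0.0, 1.0)),       # orthogonal to the anchor
]
kept = filter_distractors(anchor, candidates, threshold=0.8)
print(kept)
```

Discarding near-duplicates keeps the anchor clip as the only plausible source of the answer, which is what lets the question stay tied to a single clip inside the concatenated video.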

Stage 2 operates on real long videos without long-video annotations. A recursive captioning pipeline produces scene-level metadata by prompting a short-context captioner on each scene together with the history of previous captions. An external LLM then generates multi-segment reasoning queries and a chain-of-thought citing exact scene IDs, from which the relevant-scene set is extracted. Preferred responses are produced from the full video, while dispreferred responses are induced through either Partial Evidence, which supplies only part of the relevant evidence, or Irrelevant Hallucination, which supplies only scenes outside the relevant set. Stage 2 uses standard DPO, with the Stage 1 checkpoint frozen as the reference model and the policy initialized from Stage 1 (Huang et al., 2 Feb 2026).

The concrete setup uses InternVL-2.5-8B, synthetic Stage 1 triples from LLaVA-Video-178K, and Stage 2 long videos from Vript, so that all training examples are synthetic with no human annotations. End-to-end fine-tuning is performed separately for each stage, using a learning-rate schedule with warmup and cosine decay. On LVBench, accuracy rises after Stage 1 and again after Stage 2 relative to the base model; the corresponding base, Stage 1, and Stage 2 results are also reported for LongVideoBench, MLVU, Video-MME, and MVBench (Huang et al., 2 Feb 2026).

The ablations emphasize the dependence on the anchored construction rather than on long-context brute force alone. Removing scene-similarity filtering lowers accuracy, scoring the reference model on the full pseudo-long video rather than on the anchor alone is slower and less effective, and scaling composite length steadily raises performance with no saturation over the tested range of frame counts. Under “padding into a grid” tests, InternVL2.5 collapses near the grid center, while LongVPO remains flat across all positions, showing no “lost in the middle” effect (Huang et al., 2 Feb 2026).

Across the literatures surveyed here, the main limitations are also explicit. In the original justification model, PARROT is lexicographic, so it cannot model trade-offs between justification cost and true-preference cost, and the true preference $\succeq^\star$ is not necessarily a welfare ranking (Ridout, 2020). In rationale-enriched preference learning for LLMs, the reported experiments are limited in model size and in the number of preference samples, focus on paired preferences, and incur computational overhead because adding rationales roughly doubles sequence length in training (Just et al., 2024). A plausible implication is that the unifying idea of anchoring preferences to reasons is stable across domains, but the operational meaning of “reason” differs: a justificatory ordering in choice theory, a free-form text explanation in language-model alignment, a structured critique in multimodal reward modeling, or a scene-grounded reasoning trace in long-video optimization.


References (5)

A Model of Justification (2020)

Data-Centric Human Preference Optimization with Rationales (2024)

Revealed Invariant Preference (2024)

LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization (2026)

RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time (2026)
