
Counterfactual Instruction Following

Updated 22 February 2026
  • Counterfactual instruction following is the ability of AI systems to modify behavior when given redundant, contradictory, or adversarial instructions.
  • Research in this area employs metrics such as SustainScore and Degree of Contrast to quantify performance degradation and pinpoint failure modes under counterfactual scenarios.
  • This topic has practical applications across NLP, vision–language tasks, and autonomous control, promoting robust model design and safer AI deployment.

Counterfactual instruction following refers to the capacity of AI systems—especially LLMs and multimodal models—to modify their behavior or output in response to requirements that are contrary to their pre-trained tendencies, performance defaults, or the prevailing context, often for the purpose of robustness assessment, simulation fidelity, or data augmentation. Research across natural language processing, vision–language, and embodied control has operationalized this concept as the model's ability to follow instructions that are counterfactual, self-evident, semantically misaligned, or intentionally adverse, with the aim of revealing critical failure modes, quantifying robustness, or enhancing downstream instruction-following capabilities (Qi et al., 29 Jan 2026, Kumar et al., 8 Apr 2025, Li et al., 2023, Glossop et al., 19 Aug 2025).

1. Formalizations and Core Concepts

The term "counterfactual instruction" encompasses three major operationalizations in the literature:

  • Self-evident constraint insertion: Augmenting prompts with explicit requirements already satisfied by the model in the unconstrained condition (e.g., asking for an output to include a keyword or use a certain structure that naturally occurs). The expectation is that such counterfactual additions are logically redundant, so a robust model should maintain its prior success rate. Degradation when such constraints are imposed directly quantifies interference from instruction adherence (Qi et al., 29 Jan 2026).
  • Contradictory output mapping: Manipulating target label verbalizers in classification tasks to be semantically unnatural or reversed (e.g., "Say ‘negative’ for a positive sentiment"), thus requiring the model to ignore or override semantic priors and follow the explicit verbal instruction. This approach exposes whether instruction-following reflects true command adherence or is camouflaged by pre-training biases (Li et al., 2023).
  • Persona or performance reversal: Directing a model to simulate a persona with performance antithetical to its default—for instance, producing low-quality answers when it is otherwise a high-performing reasoner. Effective counterfactual instruction following in this regime is measured by the model’s ability to generate underperforming, plausible, and stylistically accurate responses when prompted accordingly (Kumar et al., 8 Apr 2025).
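The three operationalizations above can be illustrated with simple prompt templates. A minimal sketch — the template wording and function names are illustrative, not taken from the cited papers:

```python
# Illustrative prompt templates for the three counterfactual modes.
# All wording here is hypothetical, not drawn from the cited papers.

def self_evident_constraint(task: str, satisfied_property: str) -> str:
    """Append a constraint the model's unconstrained output already satisfies."""
    return f"{task}\nConstraint: your answer must {satisfied_property}."

def verbalizer_reversal(text: str) -> str:
    """Map labels contrary to semantic priors (flipped sentiment verbalizers)."""
    return (f"Review: {text}\n"
            "If the sentiment is positive, answer 'negative'; "
            "if it is negative, answer 'positive'.")

def persona_reversal(question: str) -> str:
    """Ask a strong model to simulate a low-performing persona."""
    return ("Answer as a struggling student who often makes arithmetic "
            f"mistakes:\n{question}")
```

A robust instruction follower should leave its answer unchanged under the first template, flip its label words under the second, and degrade plausibly under the third.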

Table: Operationalizations of Counterfactual Instruction Following

| Approach | Main Paper | Counterfactuality Mode |
|---|---|---|
| Self-evident constraint addition | (Qi et al., 29 Jan 2026) | Redundant/obvious instruction insertion |
| Label verbalizer reversal | (Li et al., 2023) | Output-label mapping contradicts priors |
| Persona/performance reversal | (Kumar et al., 8 Apr 2025) | Simulate anti-default behavioral outputs |

2. Evaluation Metrics and Protocols

Several quantitative frameworks have been introduced:

  • SustainScore (Qi et al., 29 Jan 2026): For a model $M$ and a task set $\mathcal{X}$, SustainScore is the expected accuracy on tasks originally solved by $M$ after the insertion of self-evident constraints extracted from each successful output:

$$\mathrm{SustainScore}_M(\mathcal{X}) = \mathbb{E}_{x \in \mathcal{X}_{\mathrm{succ}},\, C \sim \mathrm{Gen}(M(x))}\left[ J_x\big(M(x \oplus C)\big) \right]$$

This metric isolates the fraction of tasks in which correct task-solving is preserved under logically redundant constraint augmentation.

  • Verbalizer Manipulation Accuracy (Li et al., 2023): Given a set of modified verbalizer mappings (natural, neutral, unnatural), model accuracy is measured on the same classification task under each mapping. A robust instruction-following model should sustain accuracy under unnatural or contradictory mappings.
  • Degree of Contrast (DoC) (Kumar et al., 8 Apr 2025): In persona simulation, DoC scores (1 to 3) quantify how different the high- versus low-performance outputs are in terms of reasoning style and errors, as judged by LLM-based or human raters.
  • Success Rate (SR) with Counterfactual Labels (Glossop et al., 19 Aug 2025): Used in vision–language–action tasks, where datasets are augmented with counterfactual instructions and corresponding actions. Performance is measured as the average success rate across original and counterfactual instruction–action pairs.
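The SustainScore expectation can be estimated by Monte Carlo over the originally solved subset. A minimal sketch under assumed interfaces — `solve`, `judge`, `gen_constraint`, and `augment` are hypothetical stand-ins for the model call, the task judge, the constraint generator, and prompt concatenation, respectively:

```python
def sustain_score(tasks, solve, judge, gen_constraint, augment):
    """Estimate SustainScore: the fraction of originally solved tasks that
    remain solved after inserting a self-evident constraint derived from
    the model's own successful output."""
    # Restrict to tasks the model already solves unconstrained (X_succ).
    succ = [(x, solve(x)) for x in tasks]
    succ = [(x, y) for x, y in succ if judge(x, y)]
    if not succ:
        return 0.0
    kept = 0
    for x, y in succ:
        c = gen_constraint(y)          # constraint derived from success
        y_aug = solve(augment(x, c))   # model output on the augmented task
        kept += judge(x, y_aug)        # task-level judge
    return kept / len(succ)
```

A robust model keeps this estimate near its unconstrained accuracy, since the inserted constraints are, by construction, already satisfied.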

These metrics serve to distinguish between apparent instruction-following induced by model priors versus true sensitivity to arbitrary, even counterfactual, instructions.
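This distinction is easy to see in the verbalizer-manipulation protocol: a prior-driven model scores perfectly under the natural mapping and near zero under the flipped one. A minimal sketch, where `predict` is a hypothetical stand-in for the instructed classifier (it receives the mapping in its prompt):

```python
# Score the same classifier under different label-word mappings.
NATURAL   = {1: "positive", 0: "negative"}
UNNATURAL = {1: "negative", 0: "positive"}   # flipped: contradicts priors

def accuracy_under_mapping(examples, predict, mapping):
    """examples: list of (text, gold_label); predict(text, mapping) -> word."""
    correct = sum(predict(text, mapping) == mapping[gold]
                  for text, gold in examples)
    return correct / len(examples)
```

A model that ignores the instruction and answers from semantic priors (always "positive" for positive text) gets 1.0 under `NATURAL` but 0.0 under `UNNATURAL`, exposing the gap between apparent and true instruction following.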

3. Empirical Findings: Benchmarks and Failure Modes

Experiments consistently report substantial performance degradation when models are subjected to counterfactual instructions:

  • Self-evident constraint interference: Adding redundant constraints leads to marked accuracy drops even in state-of-the-art LLMs (e.g., multi-hop QA: Claude-Sonnet-4.5, accuracy drops from 85.0% to 45.1% under self-evident constraints) (Qi et al., 29 Jan 2026). Output constraint satisfaction remains high (>94%) even as core task accuracy collapses, demonstrating the disconnect between constraint adherence and actual instruction following.
  • Verbalizer manipulation: When presented with unnatural labels (e.g., flipped sentiment tokens), models such as GPT-4 perform at or near random guessing levels (≈50% accuracy) (Li et al., 2023). Zero-shot CoT prompting confers modest gains, but large gaps persist.
  • Persona-reversal: Directing LLMs to role-play low-performing students often fails to induce significant accuracy drops—for instance, OpenAI-o1's accuracy is 99.0% for both high and low personas in math reasoning, with minimal Degree of Contrast (Kumar et al., 8 Apr 2025). Prompt complexity (e.g., intersectional attributes) further blunts the intended effect.
  • Vision–language–action: Adding counterfactual language–action pairs (via CAST) increases navigation SR by 27% compared to vanilla instruction-following data, indicating that data diversity is critical for robust grounding (Glossop et al., 19 Aug 2025).

Identified failure modes include: (i) reasoning derailment due to excessive attention to constraints, (ii) inability to override entrenched output priors, (iii) prompt interference when simulating multiple counterfactual attributes, and (iv) catastrophic order/structure sensitivity in non-sequential instruction scenarios (Qi et al., 29 Jan 2026, Li et al., 2023, Kumar et al., 8 Apr 2025, Jaffe et al., 26 Jan 2026).

4. Mechanistic and Theoretical Insights

Mechanistic studies provide several key insights:

  • Attention interference: Constraint tokens receive disproportionate attention in failed generations, particularly in the deeper transformer layers and during later decoding steps. This "over-attending" to instruction tokens disrupts logical reasoning and degrades core performance, as quantified by the Constraint Attention Score (Qi et al., 29 Jan 2026).
  • Causal modeling of intent/action mapping: Formal counterfactual reasoning frameworks (e.g., the conformal counterfactual generation paradigm) model the interaction loop as a structural causal model $X \rightarrow A \rightarrow Z \rightarrow Y$, with probabilistic abduction over environment variables to generate behaviorally consistent counterfactual outputs and formal coverage guarantees (Farzaneh et al., 27 Jan 2026).
  • Information-theoretic leverage: In vision–language–action tasks, introducing counterfactual instruction–action pairs increases the conditional mutual information $I(a; \ell \mid o)$, forcing policies to attend more directly to the provided language instruction (CAST) (Glossop et al., 19 Aug 2025).
  • Stability and cost under partial fulfillment: In counterfactual explanation and recourse, cost incurred under iterative partial fulfillment of counterfactual instructions is closely tied to algorithmic stability. Unstable or randomized CF generators induce oscillation, increased cost, and fairness risks (Zhou, 2023).
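The attention-interference analysis can be sketched as an attention-mass ratio over constraint tokens. A minimal sketch — the aggregation over layers, heads, and decoding steps is an assumption, and the cited paper's exact Constraint Attention Score may be defined differently:

```python
import numpy as np

def constraint_attention_score(attn, constraint_idx):
    """attn: array [layers, heads, query_pos, key_pos] of attention weights
    (rows sum to 1 over key positions). Returns the mean attention mass that
    generated tokens place on constraint tokens, averaged over layers,
    heads, and query positions."""
    mass = attn[..., constraint_idx].sum(axis=-1)  # mass on constraint keys
    return float(mass.mean())
```

Comparing this score between failed and successful generations, and across layers, is what reveals the "over-attending" pattern described above.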

5. Applications Across Modalities and Domains

  • Language-only: Task-solving under redundant, conflicting, or persona-altering instructions for math, QA, text classification, and dialogue (Qi et al., 29 Jan 2026, Kumar et al., 8 Apr 2025, Li et al., 2023).
  • Vision–language–action: Counterfactual instruction–action relabeling for navigation and robotic policies, enhancing generalization and adherence to fine-grained commands (Glossop et al., 19 Aug 2025, Wang et al., 2022).
  • Conditional generative modeling: Biomedical image generation with progression descriptions as counterfactual instructions (e.g., BiomedJourney), enabling the synthesis of plausible disease progressions or treatment responses (Gu et al., 2023).
  • Autonomous agent control: Formal frameworks for user-facing, intent-driven agent control with reliable counterfactual introspection and conformal guarantees, particularly in closed-loop and safety-critical settings (Farzaneh et al., 27 Jan 2026).

6. Alignment, Robustness, and Open Challenges

Counterfactual instruction following exposes fundamental obstacles in building instruction-robust AI:

  • Alignment coverage gap: High instruction-following (IF) scores and raw accuracy metrics do not correlate with robustness under even trivial, logically redundant constraints. Both task-solving models and generative models are susceptible to reasoning derailment when even non-functional constraints are introduced (Qi et al., 29 Jan 2026).
  • Interference phenomena: Instructional detail (e.g., order of attributes, number of constraints), prompt structure (e.g., sequential vs. non-sequential control flow (Jaffe et al., 26 Jan 2026)), and attribute entanglement produce interference that is not mitigated by scaling or standard alignment protocols (Kumar et al., 8 Apr 2025, Li et al., 2023, Qi et al., 29 Jan 2026).
  • Mitigation approaches: Augmenting training with synthetic self-evident constraints, constraint-aware RL objectives, attention-regularization strategies, and explicit cycle-consistent or causal learning schemas are proposed as directions to build true counterfactual-following capability (Qi et al., 29 Jan 2026, Glossop et al., 19 Aug 2025, Wang et al., 2022).
  • Measurement and evaluation: Metrics like SustainScore, Degree of Contrast, Cycle Consistency losses, and information-theoretic proxies are necessary but not sufficient for robust instruction-following assessment; combined evaluation and adversarial benchmarking are required.

In sum, the study of counterfactual instruction following reveals both the limitations and hidden fragilities of current instruction-tuned models. It constitutes a principal axis for next-generation alignment, robustness, and control research in language, vision, and embodied intelligence (Qi et al., 29 Jan 2026, Kumar et al., 8 Apr 2025, Li et al., 2023, Glossop et al., 19 Aug 2025, Wang et al., 2022, Farzaneh et al., 27 Jan 2026, Gu et al., 2023, Jaffe et al., 26 Jan 2026, Zhou, 2023).
