Format-Level Negative Bias in LLMs
- Format-level negative bias is the systematic underperformance of models when outputs follow prescribed formats, leading to skewed results independent of content.
- Empirical studies reveal significant variations, with performance drops up to 28.8% in multiple-choice tasks and variance reductions from 235.33%² to 0.71%² after mitigation.
- Mitigation strategies, including prompt engineering, fine-tuning with format-specific data, and causal testing, effectively reduce bias and enhance model robustness.
Format-level negative bias denotes the systematic degradation or distortion in outputs or measurements when a model, system, or evaluation pipeline is sensitive to the format in which data, prompts, or responses are structured, irrespective of underlying content or semantics. In LLMs, this bias manifests as consistent performance drops, output distortions, or skewed preferences and judgments when models are required to adhere to specific output schemas (e.g., lists, structured tables, binary vs. continuous labeling). Format-level negative bias is well-documented in LLM evaluation, judgment tasks, RLHF alignment, causal evaluation, and statistical inference pipelines, often with significant implications for reliability, fairness, and scientific validity (Long et al., 16 Aug 2024, Lu et al., 28 Apr 2025, Yuan et al., 26 Sep 2025, Yu et al., 31 Jul 2024, Song et al., 14 Nov 2025, Liu et al., 13 Aug 2025, Zhang et al., 18 Sep 2024, Hamza et al., 2022, Nie et al., 2017).
1. Definitions and Formalization
Format-level negative bias refers to a model’s systematic underperformance, distortion, or undesired skew (typically toward negative or conservative outputs) as a function of the format of the inputs, outputs, or data representations. This phenomenon is characterized by a measurable, statistically robust change in performance or output preference when equivalent content is conveyed under different formats:
- In LLMs: A marked drop in accuracy or F1 when the model must produce responses in prescribed formats (e.g., JSON, YAML, lists, or “Yes/No” versus free text), even if the correct answer is otherwise known to the model (Long et al., 16 Aug 2024).
- In judgment tasks: A systematic shift toward negative or adverse classifications when using binary output formats versus continuous scales, not explainable by content (Lu et al., 28 Apr 2025, Song et al., 14 Nov 2025).
- In data synthesis or meta-analysis: Aggregate data reporting (“AD”) may systematically underestimate effects relative to individual participant data (“IPD”), thus requiring explicit format-level bias adjustments (Hamza et al., 2022).
- In RLHF and reward modeling: Reward models, human judges, or GPT-4 often grant systematically higher or lower ranks to responses formatted with specific patterns (lists, bold text, emojis), which can be easily exploited or amplified in downstream alignment (Zhang et al., 18 Sep 2024).
Key formal definitions include:
- Estimand Difference: For LLM tasks, let $\text{Sys}_f$ be the systematic evaluation score (correct and format-adherent answers) under format $f$, $\text{FI}_f$ the format-instruction following score, and $\text{EstTrueE}_f = \text{Sys}_f / \text{FI}_f$ the normalized estimator of true performance (accounting for format misses) (Long et al., 16 Aug 2024). The variance of $\text{EstTrueE}_f$ over formats defines "format bias" (a computational sketch of these metrics follows this list).
- Format Preference Ratio (FPR): In heterogeneous evidence tasks, let $F_A$, $F_B$ denote two formats. The Format Preference Ratio is:
$\text{FPR}_{F_A \text{ vs } F_B} = \frac{\#\{\text{Pref-A}\}}{\#\{\text{Pref-A}\} + \#\{\text{Pref-B}\}}$
An FPR significantly below $0.5$ indicates negative bias against $F_A$ (Liu et al., 13 Aug 2025).
- Judgment Bias Score: In LLM evaluations, for binary vs. continuous response formats, define
$\delta = \mathbb{E}[\text{judgment} \mid \text{binary format}] - \mathbb{E}[\text{judgment} \mid \text{continuous format}]$
with both judgments mapped to a common scale. A negative $\delta$ quantifies a "negative" bias in the binary format (Lu et al., 28 Apr 2025).
- Negative Attention Score (NAS): For binary outputs, NAS captures the excess attention paid to the “No” token versus “Yes,” quantifying attention-pattern-induced negative bias (Yu et al., 31 Jul 2024).
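The following minimal sketch shows how the first two of these metrics could be computed from per-format evaluation results. The data layout and function names (est_true_e, format_bias, format_preference_ratio) are illustrative assumptions, not interfaces from the cited papers; the EstTrueE ratio follows the definition given above.

```python
# Illustrative computation of EstTrueE, format bias (variance across formats),
# and the Format Preference Ratio (FPR). Names and data layout are assumptions
# of this sketch, not APIs from the cited papers.
from statistics import pvariance

def est_true_e(sys_score: float, fi_score: float) -> float:
    """Normalized true-performance estimate: correct-and-adherent rate
    divided by the format-instruction-following rate."""
    return sys_score / fi_score if fi_score > 0 else 0.0

def format_bias(per_format_scores: dict[str, tuple[float, float]]) -> float:
    """Variance of EstTrueE across formats; each value is a (Sys, FI) pair."""
    estimates = [est_true_e(sys, fi) for sys, fi in per_format_scores.values()]
    return pvariance(estimates)

def format_preference_ratio(pref_a: int, pref_b: int) -> float:
    """FPR_{A vs B}: share of pairwise comparisons in which format A is preferred.
    Values well below 0.5 indicate negative bias against format A."""
    return pref_a / (pref_a + pref_b)

if __name__ == "__main__":
    # Hypothetical per-format (Sys %, FI %) results for one model on one task.
    scores = {"json": (62.0, 95.0), "yaml": (58.0, 90.0), "plain_list": (48.0, 70.0)}
    print("EstTrueE per format:",
          {f: round(est_true_e(s, fi), 2) for f, (s, fi) in scores.items()})
    print("Format bias (variance):", round(format_bias(scores), 2))
    print("FPR A vs B:", format_preference_ratio(pref_a=47, pref_b=153))
```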
2. Empirical Manifestations and Key Findings
2.1 LLM Output and Evaluation Tasks
Empirical studies reveal substantial format-level negative bias across open-source and commercial LLMs and tasks:
- Wrapping formats: Across 8 generation tasks and models including ChatGPT and Mistral, the bias (variance of EstTrueE across 7 wrapping formats on MMLU and GSM8K) was 235.33%² pre-mitigation, dropping to 0.71%² after format-oriented fine-tuning (Long et al., 16 Aug 2024).
- List and mapping formats: For Mistral, variance across list formats was also pronounced, and mapping-format differences (JSON vs. YAML) caused substantial F1 swings in smaller models.
- Multiple-choice QA: Measured accuracy dropped by up to 28.8% when switching from character-based to text-based answer formats.
- Heterogeneous evidence: LLMs tasked with reasoning over text, tables, infoboxes, or knowledge graphs consistently preferred text, with FPRs for infobox vs. text as low as $0.235$ and dual-coverage rates (DCR) well below the 28%-plus observed for homogeneous text-only controls (Liu et al., 13 Aug 2025).
- Binary vs. Continuous Output: All tested models exhibited statistically robust negative bias in binary-format judgments (see the estimation sketch after this list):

| Model | δ (value judgment) | Hier.-model mean bias (value) | δ (sentiment) | Hier.-model mean bias (sentiment) |
| -------------- | ------------------ | ----------------------------- | ------------- | --------------------------------- |
| Llama-3.3-70b | –0.132 | –0.98 | –0.145 | –0.87 |
| Qwen-2.5-72b | –0.141 | –1.05 | –0.153 | –0.90 |
| Deepseek-v3 | –0.139 | –1.02 | –0.160 | –0.92 |
| GPT-4o-mini | –0.122 | –0.96 | –0.140 | –0.83 |

All interval estimates for the hierarchical-model bias exclude zero (Lu et al., 28 Apr 2025).
- RLHF and Reward Models: Preferences for specific format features (bold text, lists, emojis) reached win-rates of 75% or higher relative to baseline, and reward-model fine-tuning with attack data further elevated the list-format win-rate (Zhang et al., 18 Sep 2024).
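As referenced above, a toy estimate of the binary-format bias δ can be formed from paired judgments of the same items under the two formats. The thresholding of the continuous scale at 0.5 and the simple difference-of-means estimator are simplifying assumptions of this sketch; the cited study uses a hierarchical model.

```python
# Toy estimate of the binary-format bias delta from paired judgments of the
# same items: one pass with a binary ("Yes"/"No") output format and one with a
# continuous scale in [0, 1]. Thresholding the continuous judgment at 0.5 to
# compare the two formats is a simplifying assumption of this sketch.
import math

def binary_bias_delta(binary_judgments: list[int], continuous_judgments: list[float]):
    """Return (delta, standard_error) where delta = mean(binary) - mean(thresholded
    continuous). A negative delta indicates a shift toward negative/adverse
    classifications under the binary format."""
    assert len(binary_judgments) == len(continuous_judgments)
    n = len(binary_judgments)
    thresholded = [1 if c >= 0.5 else 0 for c in continuous_judgments]
    diffs = [b - t for b, t in zip(binary_judgments, thresholded)]
    delta = sum(diffs) / n
    var = sum((d - delta) ** 2 for d in diffs) / (n - 1)
    return delta, math.sqrt(var / n)

if __name__ == "__main__":
    binary = [0, 0, 1, 0, 0, 1, 0, 0]            # 1 = positive judgment
    continuous = [0.55, 0.48, 0.81, 0.62, 0.40, 0.73, 0.58, 0.52]
    delta, se = binary_bias_delta(binary, continuous)
    print(f"delta = {delta:+.3f} (SE {se:.3f})")  # negative -> negative bias
```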
2.2 Adaptive Data Collection
Negative bias arises in sample means under adaptive data collection procedures (e.g., multi-armed bandit experiments). Under natural "exploit" and independence-of-irrelevant-options conditions, the expectation of the sample mean for each arm is strictly less than the arm's true mean:
$\mathbb{E}[\hat{\mu}_k] < \mu_k \quad \text{for every arm } k,$
with detailed quantification and debiasing strategies in (Nie et al., 2017).
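A short simulation illustrates the phenomenon: under a purely greedy adaptive allocation with two identical arms, the average of per-arm sample means across runs falls below the true means. The greedy rule and Gaussian rewards are assumptions of this sketch, not the exact setting of Nie et al. (2017).

```python
# Simulation of negative bias in arm sample means under greedy adaptive sampling.
# An arm that gets unlucky early is sampled less afterwards, so its low estimate
# is never corrected; averaged over runs, E[sample mean] < true mean per arm.
import random
import statistics

def run_greedy_bandit(true_means, horizon=200, seed=None):
    rng = random.Random(seed)
    counts = [0] * len(true_means)
    sums = [0.0] * len(true_means)
    # Pull each arm once, then always exploit the current best empirical mean.
    for t in range(horizon):
        if t < len(true_means):
            arm = t
        else:
            arm = max(range(len(true_means)), key=lambda a: sums[a] / counts[a])
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        sums[arm] += reward
    return [sums[a] / counts[a] for a in range(len(true_means))]

if __name__ == "__main__":
    true_means = [0.0, 0.0]                      # two identical arms
    runs = [run_greedy_bandit(true_means, seed=s) for s in range(5000)]
    for a in range(len(true_means)):
        est = statistics.mean(r[a] for r in runs)
        print(f"arm {a}: true mean {true_means[a]:+.3f}, "
              f"mean of sample means {est:+.3f}")  # typically < true mean
```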
3. Mechanisms and Sources
3.1 Structural and Token-Level Effects
- Format-token bias: LLMs assign non-uniform prior probability to tokens in different formats; code-familiar delimiters favor higher FI scores, while unusual or ambiguous wrapping tokens penalize format adherence (Long et al., 16 Aug 2024).
- Format-instruction ambiguity: Formatted outputs may capture unintended context (e.g., bold or parentheses bleeding into rationales), confusing extractors or judges.
- Training data skew: Code-heavy pretraining benefits code-like formats (Python list, JSON) but not textual or unconventional wrappers.
- Surface-form shortcut: In binary QA, absence of knowledge triggers overproduction of default tokens (“No”), a “shortcut behavior” decoupled from semantic calibration. Attention-based NAS measurements rise sharply in absent-knowledge subsets (Song et al., 14 Nov 2025); a toy NAS-style computation follows this list.
- Attention distributions: Imbalanced self-attention across heterogeneous input segments correlates with negative format bias in integrating evidence (Liu et al., 13 Aug 2025).
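A toy computation in the spirit of the Negative Attention Score discussed above: for each head, compare the attention mass that the final (decision) position assigns to the “No” option token versus the “Yes” token. The exact NAS formula and aggregation in Yu et al. (31 Jul 2024) may differ; treat the definition and names here as assumptions.

```python
# Toy negativity score over attention weights, in the spirit of NAS: for each
# head, compare the attention mass that the final (decision) query position
# places on the "No" option token versus the "Yes" option token. Definition
# details are assumptions of this sketch, not the exact NAS of the cited paper.
import numpy as np

def negativity_score(attn: np.ndarray, yes_pos: int, no_pos: int) -> np.ndarray:
    """attn has shape (num_heads, seq_len, seq_len); rows are query positions.
    Returns one score per head: positive when the last query position attends
    more to the "No" token than to the "Yes" token."""
    last_query = attn[:, -1, :]                   # (num_heads, seq_len)
    return last_query[:, no_pos] - last_query[:, yes_pos]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    raw = rng.random((8, 12, 12))
    attn = raw / raw.sum(axis=-1, keepdims=True)  # normalize rows to sum to 1
    per_head = negativity_score(attn, yes_pos=5, no_pos=6)
    print("per-head scores:", np.round(per_head, 3))
    print("model-level score:", round(float(per_head.mean()), 3))
```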
3.2 Procedural and Modeling Factors
- Evaluation metric conflation: Standard metrics often conflate content error and format-nonadherence. “Accuracy with format” cannot disentangle knowledge from instruction-following capacity (Long et al., 16 Aug 2024).
- Reward model sensitivity: Minor data poisoning or RLHF best-of-n selection amplifies format-based preferences; format features may be easier to exploit (“hacks”) than substantive response quality (Zhang et al., 18 Sep 2024).
- Meta-analytic format bias: Aggregate-data reporting risks underestimating effects relative to IPD unless properly modeled via hierarchical bias correction (Hamza et al., 2022).
4. Quantification Metrics and Causal Assessment
Format-level bias is measured via metrics specifically designed to isolate format effects:
- EstTrueE and Bias (variance across formats) (Long et al., 16 Aug 2024).
- Format-Preference Ratio (FPR) and Dual Coverage Rate (DCR) for evidence integration (Liu et al., 13 Aug 2025).
- Binary bias estimand $\delta$ (Lu et al., 28 Apr 2025).
- Negative Attention Score (NAS) at head and model level (Yu et al., 31 Jul 2024).
- Causal-inference protocols: Randomized “A/B” format assignment plus DAG modeling (collider, m-bias, single-cause) and statistical testing (Cochran’s Q, Stouffer's z-combination) to validate the existence and structure of format-level effects (Yuan et al., 26 Sep 2025).
In 43 of 48 format×dataset scenarios tested in (Yuan et al., 26 Sep 2025), no causal effect of structured format on LLM accuracy was found, indicating that many observed differences could be attributed to evaluation or instruction confounds.
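A sketch of the statistical core of such a protocol: per-dataset two-proportion comparisons under randomized format assignment, combined across datasets with Stouffer's z-method, with Cochran's Q as a heterogeneity check. The data values, equal-weight combination, and names are illustrative assumptions, not the authors' released pipeline.

```python
# Randomized A/B format assessment: per-dataset accuracy difference between two
# formats, combined via Stouffer's z-method, with Cochran's Q as a heterogeneity
# check. Data values and names are illustrative assumptions for this sketch.
import math
from scipy.stats import norm, chi2

def per_dataset_effect(correct_a, n_a, correct_b, n_b):
    """Accuracy difference (A - B), its variance, and a z statistic."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    var = p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b
    diff = p_a - p_b
    return diff, var, diff / math.sqrt(var)

def combine(datasets):
    effects = [per_dataset_effect(*d) for d in datasets]
    # Stouffer's method with equal weights across datasets.
    z_comb = sum(z for _, _, z in effects) / math.sqrt(len(effects))
    p_comb = 2 * norm.sf(abs(z_comb))
    # Cochran's Q: inverse-variance-weighted heterogeneity of the effects.
    w = [1 / v for _, v, _ in effects]
    theta_bar = sum(wi * d for wi, (d, _, _) in zip(w, effects)) / sum(w)
    q = sum(wi * (d - theta_bar) ** 2 for wi, (d, _, _) in zip(w, effects))
    p_q = chi2.sf(q, df=len(effects) - 1)
    return z_comb, p_comb, q, p_q

if __name__ == "__main__":
    # (correct_A, n_A, correct_B, n_B) per dataset under random format assignment.
    data = [(412, 500, 399, 500), (301, 400, 305, 400), (268, 350, 260, 350)]
    z, p, q, pq = combine(data)
    print(f"Stouffer z = {z:.2f} (p = {p:.3f}); Cochran Q = {q:.2f} (p = {pq:.3f})")
```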
5. Mitigation and Correction Strategies
5.1 Prompt Engineering and Training Interventions
- Prompting: Multiple in-context format-consistent demonstrations and repeated directives elevate FI and reduce bias (e.g., increasing from 1 to 5 in-prompt demonstrations raised FI and reduced Bias from $235.33$ to $111.78$) (Long et al., 16 Aug 2024).
- Instructional clarity and rephrasing: Controls for ambiguity and monotonic format exposure help.
- Fine-tuning with synthesized format data: Additional training examples are allocated in inverse proportion to per-format performance (see the weighting sketch after this list). ChatGPT’s wrapping-format bias dropped from 235.33%² to 0.71%² after fine-tuning to normalize format exposure (Long et al., 16 Aug 2024).
- NASA (Negative Attention Score Alignment): Parameter-efficient, head-level fine-tuning targeting negative attention heads closes the precision–recall gap and reduces NAS and F1-score disparities in binary yes/no tasks without loss of generalization (Yu et al., 31 Jul 2024).
- Reward model regularization: Multi-head reward models (content vs. format), adversarial filtering, explicit penalization of format tokens, and explicit calibration via logistic regression on format features (Zhang et al., 18 Sep 2024).
- Meta-analytic bias models: Hierarchical Bayesian modeling with explicit format indicators and penalty priors allows for penalization or down-weighting of high-risk-of-bias (including reporting-format) arms/sources (Hamza et al., 2022).
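As referenced above, the fine-tuning intervention can be sketched directly: allocate a budget of synthetic fine-tuning examples across formats in inverse proportion to current per-format performance, so weaker formats receive more exposure. The allocation rule, scores, and names below are assumptions for illustration.

```python
# Allocate a budget of synthetic fine-tuning examples across output formats in
# inverse proportion to current per-format performance, so that poorly handled
# formats get more training exposure. Numbers and names are illustrative.
def allocate_examples(per_format_score: dict[str, float], budget: int) -> dict[str, int]:
    inv = {f: 1.0 / max(score, 1e-6) for f, score in per_format_score.items()}
    total = sum(inv.values())
    return {f: round(budget * w / total) for f, w in inv.items()}

if __name__ == "__main__":
    # Hypothetical EstTrueE (%) per wrapping format before mitigation.
    scores = {"json": 64.0, "yaml": 61.0, "quotes": 52.0, "plain_list": 41.0}
    print(allocate_examples(scores, budget=10_000))
    # Lower-scoring formats (e.g., plain_list) receive the largest share.
```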
5.2 Inference-Time and Evaluation Adjustments
- Attention reweighting: Differentiable self-attention balancing between evidence formats at each generation step increases DCR, improving dual-coverage integration, but does not fully resolve directionality of preference bias (Liu et al., 13 Aug 2025).
- Choice labeling and reformatting: Presenting binary decisions as multiple-choice questions with matched options (even when the options remain “Yes” and “No”) reduces negative bias (Song et al., 14 Nov 2025).
- Continuous scale preference: For judgment tasks, continuous output scales are robust to negative bias relative to binary choices; post-hoc calibration adjusting raw predictions by the estimated bias parameter ($\hat{\delta}$) further mitigates distortion (Lu et al., 28 Apr 2025).
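A small sketch of the post-hoc calibration step just described: shift raw binary-format scores by the estimated bias parameter $\hat{\delta}$ before re-thresholding. The additive correction, clipping to [0, 1], and the 0.5 threshold are assumptions of this illustration.

```python
# Post-hoc calibration of binary-format judgments: shift raw positive-class
# scores by the estimated negative-bias parameter before re-thresholding. The
# correction form (additive shift, clipping to [0, 1]) is an assumption here.
def calibrate(raw_scores: list[float], delta_hat: float, threshold: float = 0.5):
    """Subtract the (negative) bias estimate, i.e. add |delta_hat| back, then
    re-threshold. Returns (adjusted_scores, adjusted_labels)."""
    adjusted = [min(1.0, max(0.0, s - delta_hat)) for s in raw_scores]
    labels = [int(s >= threshold) for s in adjusted]
    return adjusted, labels

if __name__ == "__main__":
    raw = [0.42, 0.48, 0.55, 0.38, 0.61]
    adj, labels = calibrate(raw, delta_hat=-0.10)  # delta_hat estimated as in Section 2
    print(adj, labels)
```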
6. Implications and Best Practices
- Evaluation reporting: Always report FI (format adherence), EstTrueE (normalized true performance), and an explicit format-bias measure (variance or difference metrics) when benchmarking LLMs or comparing systems across formats (Long et al., 16 Aug 2024).
- Designing alignment and RLHF: Model and penalize format shortcuts at the reward model and alignment pipeline level. Routinely test for—and if necessary, regularize against—format-driven exploits (Zhang et al., 18 Sep 2024).
- Prompt and schema selection: Prefer output formats exhibiting high and stable FI; avoid hard-coding a single style across heterogeneous downstream deployments (Long et al., 16 Aug 2024).
- Meta-analysis and evidence synthesis: Explicitly model format as a possible source of systematic bias, especially when combining data from IPD and AD sources (Hamza et al., 2022).
- Mitigation of binary-output bias: When possible, replace binary decision tasks with continuous or multiclass alternatives; explicitly offer “I don’t know” choices to prevent forced negative responses when model knowledge is absent (Song et al., 14 Nov 2025).
- Format normalization: Consider pre-processing pipelines for structured data (tables, graphs, infoboxes) to reduce format-induced disparities in evidence integration (Liu et al., 13 Aug 2025).
- Causal testing: Use randomized, context- and instruction-controlled A/B assessments to detect real format-level effects (and avoid mistaking confounds for genuine causal impact) (Yuan et al., 26 Sep 2025).
7. Broader Context and Future Directions
Format-level negative bias is not restricted to LLMs: it appears in adaptive experimental designs, meta-analytic evidence synthesis, and other multi-format decision/prediction tasks (Nie et al., 2017, Hamza et al., 2022). In statistical and experimental work, correction via selective inference (conditional MLE, data splitting), penalized priors, or hierarchical modeling is essential to restore unbiasedness and maintain inferential validity. In LLM research and deployment, emergent evidence points toward the need for architecture- and training-level remedies (format-agnostic losses, balanced corpora) and rigorously format-agnostic evaluation frameworks.
The pervasiveness, magnitude, and downstream amplifiability of format-level negative bias make it a significant methodological and practical concern across LLM applications, alignment research, and scientific data integration workflows. Ongoing work recommends explicit quantification of format effects, integrated debiasing across preprocessing, training, and evaluation, and multi-layered causal analysis to ensure valid, fair, and reliable modeling across heterogeneous data and application spaces (Long et al., 16 Aug 2024, Yuan et al., 26 Sep 2025, Liu et al., 13 Aug 2025, Song et al., 14 Nov 2025, Lu et al., 28 Apr 2025, Zhang et al., 18 Sep 2024, Hamza et al., 2022, Nie et al., 2017).