LLMs Are Biased Towards Output Formats! Systematically Evaluating and Mitigating Output Format Bias of LLMs (2408.08656v1)

Published 16 Aug 2024 in cs.CL

Abstract: We present the first systematic evaluation examining format bias in performance of LLMs. Our approach distinguishes between two categories of an evaluation metric under format constraints to reliably and accurately assess performance: one measures performance when format constraints are adhered to, while the other evaluates performance regardless of constraint adherence. We then define a metric for measuring the format bias of LLMs and establish effective strategies to reduce it. Subsequently, we present our empirical format bias evaluation spanning four commonly used categories -- multiple-choice question-answer, wrapping, list, and mapping -- covering 15 widely-used formats. Our evaluation on eight generation tasks uncovers significant format bias across state-of-the-art LLMs. We further discover that improving the format-instruction following capabilities of LLMs across formats potentially reduces format bias. Based on our evaluation findings, we study prompting and fine-tuning with synthesized format data techniques to mitigate format bias. Our methods successfully reduce the variance in ChatGPT's performance among wrapping formats from 235.33 to 0.71 (%²).

LLMs Are Biased Towards Output Formats: An Analysis

Introduction

The paper "LLMs Are Biased Towards Output Formats! Systematically Evaluating and Mitigating Output Format Bias of LLMs" by Do Xuan Long et al. presents the first systematic evaluation of format bias in LLMs. The paper meticulously investigates how LLMs perform differently based on the specified output formats and proposes effective strategies to mitigate this bias. The authors analyze the models' performance under two categories: adherence to format constraints and performance irrespective of these constraints. This dual approach is intended to reliably assess and reduce format bias, ensuring that LLMs can be utilized practically across diverse applications without performance discrepancies due to output format variations.

Methodology

The paper's methodology involves defining a metric to quantify format bias and formulating strategies to mitigate it. The authors evaluate format bias across four main categories: multiple-choice question-answering (MCQ), wrapping formats for isolating final answers, lists, and mapping (dictionaries). The systematic evaluation uses widely accepted datasets and state-of-the-art models, and employs comprehensive metrics for fairness and reliability.

Format Bias Evaluation Metrics

Two key evaluation metrics are introduced:

  1. Systematic Evaluation Score (SysE): Measures performance only on outputs that strictly adhere to the specified format constraints.
  2. True Evaluation Score (TrueE): Measures the model’s actual performance regardless of format adherence, but is challenging to compute automatically.

To manage the difficulty of measuring TrueE, the paper proposes an estimator, EstTrueE, which offers a practical means to approximate TrueE reliably in large-scale experiments.
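
To make the bias measurement concrete, here is a minimal sketch of how a format-bias summary might be computed once per-format scores (e.g., EstTrueE values) are available. The function and the numbers are illustrative assumptions, not the paper's implementation; the variance corresponds to the %² figure quoted in the abstract.

```python
from statistics import mean, pvariance

def format_bias(per_format_scores: dict[str, float]) -> dict[str, float]:
    """Summarize format bias as the spread of a model's scores across formats.

    per_format_scores maps a format name (e.g. "json", "yaml") to the
    model's score (%) on the same task under that format instruction.
    """
    scores = list(per_format_scores.values())
    return {
        "mean": round(mean(scores), 2),
        "variance": round(pvariance(scores), 2),         # units of %^2, as in the abstract
        "max_gap": round(max(scores) - min(scores), 2),  # worst-case gap between formats
    }

# Illustrative numbers only, not results from the paper.
wrapping_scores = {"placeholder": 71.2, "bolding": 65.4, "quoting": 48.9}
print(format_bias(wrapping_scores))
# -> {'mean': 61.83, 'variance': 89.24, 'max_gap': 22.3}
```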

Format Categories and Metrics

The paper spans four output format categories; illustrative instruction templates for each category appear in the sketch after this list:

  1. MCQ Answer Formats: Evaluates character identifiers and choice values.
  2. Wrapping Formats: Covers special characters, bolding, italicizing, brackets, parentheses, placeholders, and quoting.
  3. List Formats: Includes Python lists, bullet-point lists, character-separated lists, and newline-separated lists.
  4. Mapping Formats: Encompasses Python dictionaries/JSON and YAML formats.
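
To make these categories concrete, the following is a minimal sketch of how format-constraint instructions might be attached to a task prompt. The template phrasings and the helper function are illustrative assumptions, not the paper's own prompts.

```python
# Illustrative format-constraint instructions for each category evaluated in
# the paper. The exact phrasings below are hypothetical, not the paper's own.
FORMAT_INSTRUCTIONS = {
    # 1. MCQ answer formats
    "mcq_char":  "Answer with the letter of the correct choice (A, B, C, or D).",
    "mcq_value": "Answer with the full text of the correct choice.",
    # 2. Wrapping formats
    "wrap_bold":        "Wrap your final answer in double asterisks, e.g. **answer**.",
    "wrap_placeholder": "End with the line: The answer is <ANSWER>.",
    "wrap_quoting":     'Put your final answer in double quotes, e.g. "answer".',
    # 3. List formats
    "list_python": "Return the items as a Python list, e.g. ['a', 'b'].",
    "list_bullet": "Return the items as a bullet-point list, one per line.",
    # 4. Mapping formats
    "map_json": "Return the result as a JSON object.",
    "map_yaml": "Return the result in YAML.",
}

def build_prompt(task: str, fmt: str) -> str:
    """Append a format constraint to a task prompt."""
    return f"{task}\n\n{FORMAT_INSTRUCTIONS[fmt]}"

# Usage: the same task under two different format constraints.
print(build_prompt("List the primary colors.", "list_python"))
print(build_prompt("List the primary colors.", "list_bullet"))
```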

Key Findings

  1. MCQ Formats: The analysis reveals a substantial bias towards character-based identifiers over textual choice values. The bias is most pronounced in Mistral, which shows a 54.47% performance gap between the two MCQ formats.
  2. Wrapping Formats: All evaluated models show significant bias, with performance differing markedly across the seven wrapping methods. For instance, “Placeholder” wrapping achieves the highest performance, while “Quoting” performs worst because models often misunderstand the formatting instruction.
  3. List Formats: Mistral demonstrates the most significant bias, underperforming particularly on “Bullet-point” lists, while ChatGPT and Gemma perform more consistently across formats.
  4. Mapping Formats: Both open-source models (Mistral and Gemma) display significant performance variance between JSON and YAML, with ChatGPT showing comparatively less bias.

Mitigation Strategies

The authors propose three practical strategies for mitigating format bias; the two prompting-based strategies are sketched after this list:

  1. Integrating Demonstrations: Adding a few correctly formatted demonstrations improves the model's ability to follow format instructions and significantly reduces performance variance across formats.
  2. Repeating Format Instructions: Simply repeating the format instruction also reduces format bias by reinforcing what is expected.
  3. Fine-Tuning with Formatted Data: Fine-tuning LLMs on data synthesized to meet varied formatting requirements substantially reduces performance variance, as evidenced by the drastic drop in ChatGPT’s variance among wrapping formats (from 235.33 to 0.71 %²).
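
The two prompting-based strategies can be sketched as simple prompt builders. The helper names and the demonstration text below are illustrative assumptions, not the paper's code.

```python
# A minimal sketch of the two prompting-based mitigations, under the
# assumption that prompts are plain strings sent to a chat model.

def with_demonstrations(task: str, instruction: str,
                        demos: list[tuple[str, str]]) -> str:
    """Strategy 1: prepend a few demonstrations that already obey the format."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in demos)
    return f"{instruction}\n\n{shots}\n\nQ: {task}\nA:"

def with_repeated_instruction(task: str, instruction: str) -> str:
    """Strategy 2: state the format instruction both before and after the task."""
    return f"{instruction}\n\n{task}\n\n{instruction}"

# Usage (illustrative): enforce a placeholder wrapping format.
fmt = "End your response with the line: The answer is <ANSWER>."
demos = [("What is 2 + 3?", "The answer is <5>.")]
print(with_demonstrations("What is 7 * 6?", fmt, demos))
print(with_repeated_instruction("What is 7 * 6?", fmt))
```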

Implications and Future Work

The research highlights the need for developing LLMs that are robust and reliable across multiple output formats, addressing potential fairness and reproducibility issues in real-world applications. The findings also indicate that current powerful models like ChatGPT still exhibit detectable format bias, underscoring the importance of further fine-tuning and evaluation.

Future research directions could focus on evaluating format bias in more complex and nuanced tasks, exploring architecture-level solutions, and refining fine-tuning methods for reducing inherent token biases in LLMs. Additionally, expanding this research to encompass more models and formats will provide a broader understanding of the impact of format bias in practical settings.

Conclusion

This paper makes a significant contribution to understanding and mitigating format bias in LLMs. By systematically evaluating performance across various output formats and proposing actionable mitigation strategies, it offers a robust framework for future research. Reducing format bias is crucial for the fair and practical deployment of LLMs in diverse real-world applications, underscoring the ongoing need for careful format-specific evaluation and model fine-tuning.

Authors (8)
  1. Do Xuan Long
  2. Hai Nguyen Ngoc
  3. Tiviatis Sim
  4. Hieu Dao
  5. Shafiq Joty
  6. Kenji Kawaguchi
  7. Nancy F. Chen
  8. Min-Yen Kan