
Format Consistency Score (FC-score)

Updated 31 October 2025
  • Format Consistency Score (FC-score) is a quantitative measure that evaluates ML model reliability by checking output invariance under superficial input changes.
  • It incorporates methods such as consistency rate, expert ordinal ratings, and Boolean feasibility checks to ensure robustness and reproducibility.
  • FC-score is critical for applications in NLP, clinical summarization, and binary classification, providing actionable insights into model stability.

The Format Consistency Score (FC-score) encompasses a set of quantitative metrics and procedures used to evaluate the robustness, stability, and fidelity of machine learning systems—particularly in natural language processing—when subjected to input variations in format, aggregation, or syntactic structure. The term FC-score is most commonly associated with robustness and repeatability under non-adversarial input perturbations, but usage in the literature extends to text data cleaning, performance score verification, and factual evaluation heuristics. Leading implementations include the Consistency Rate in the SCORE framework (Nalbandyan et al., 28 Feb 2025), expert-graded FC-scores for factual consistency (Luo et al., 2024), and Boolean consistency scores for validity checking of reported metrics (Fazekas et al., 2023).

1. Definition and Conceptual Scope

Format Consistency Score (FC-score) typically denotes a measure of how reliably a model under test produces identical (or equivalent) outputs in response to the same semantic inputs with altered superficial formatting. Format refers to stylistic, structural, or procedural alterations that should not affect underlying task semantics—examples include prompt paraphrasing, option order shuffling, sentence wrapping, and other non-adversarial cosmetic changes.

In formal terms, FC-score operationalizes the expectation:

  • $\mathrm{FC\text{-}score} = 1$ if predictions are invariant to the chosen format variations.
  • $\mathrm{FC\text{-}score} = 0$ if predictions are completely unreliable under these perturbations.

The precise computation is context-dependent, encompassing output-agreement (SCORE), expert rating scales (TreatFact), and explicit feasibility (confusion matrix consistency).

2. Mathematical Formalization and Algorithms

In LLM robustness, FC-score is formalized as the Consistency Rate (CR):

CR = \frac{1}{|Q|} \sum_{Q_k \in Q} \frac{1}{\binom{|Y_k|}{2}} \sum_{y_i \in Y_k} \sum_{y_j \in Y_k,\, j > i} \mathrm{sim}(y_i, y_j)

Where:

  • $Q$ is a dataset of questions.
  • $Y_k$ is the set of model predictions for question $Q_k$ across $m$ format variants (prompt paraphrase, choice order, random seed).
  • $\mathrm{sim}(y_i, y_j)$ is $1$ if outputs are equivalent (identical class label, symbolically equal math answer), else $0$.

Example Calculation (Multiple-Choice):

For five format variants and predictions “A”, “A”, “B”, “A”, “A”, there are $10$ pairs and $6$ matching pairs, so $\mathrm{FC\text{-}score} = 6/10 = 0.6$.
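The worked example above can be sketched in Python. This is a minimal illustration; the function name `consistency_rate` and the exact-match similarity are assumptions, not the SCORE implementation:

```python
from itertools import combinations

def consistency_rate(predictions_per_question):
    """Consistency Rate (CR): fraction of agreeing prediction pairs per
    question, averaged over all questions. Here sim() is exact equality."""
    total = 0.0
    for preds in predictions_per_question:
        pairs = list(combinations(preds, 2))
        matches = sum(1 for a, b in pairs if a == b)
        total += matches / len(pairs)
    return total / len(predictions_per_question)

# Worked example: five format variants predicting A, A, B, A, A
print(consistency_rate([["A", "A", "B", "A", "A"]]))  # 6 of 10 pairs match -> 0.6
```

Exact equality stands in for $\mathrm{sim}$; for free-form math answers a symbolic-equivalence check would replace `a == b`.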

In factual evaluation (TreatFact clinical benchmark), FC-score is an expert-graded ordinal rating:

  • 0 = completely inconsistent
  • 1 = major factual errors
  • 2 = minor factual errors
  • 3 = fully consistent

For performance reporting and discriminative analysis, scores are binarized (3 = consistent, <3 = inconsistent), and overall system accuracy is measured via balanced accuracy:

\mathrm{Balanced\ Accuracy} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)
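The binarize-then-score protocol can be sketched in Python. The ratings and system labels below are hypothetical, and `balanced_accuracy` is an illustrative helper, not the TreatFact code:

```python
def balanced_accuracy(expert_ratings, system_labels):
    """Binarize 0-3 expert ratings (3 = consistent, <3 = inconsistent) and
    score a system's binary consistency judgments with balanced accuracy,
    the mean of sensitivity and specificity."""
    gold = [r == 3 for r in expert_ratings]
    tp = sum(1 for g, s in zip(gold, system_labels) if g and s)
    tn = sum(1 for g, s in zip(gold, system_labels) if not g and not s)
    pos = sum(gold)            # assumes both classes are present
    neg = len(gold) - pos
    return 0.5 * (tp / pos + tn / neg)

ratings = [3, 3, 1, 0, 2, 3]                        # expert ordinal scores
system = [True, False, False, False, False, True]   # system says "consistent"?
print(balanced_accuracy(ratings, system))           # sensitivity 2/3, specificity 1.0
```

Balanced accuracy, rather than raw accuracy, keeps a system honest when consistent and inconsistent summaries are unevenly represented.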

FC-score as a mathematical truth-criterion: Given reported performance metrics $\hat{a}$, $\hat{s}$, $\hat{p}$ (accuracy, sensitivity, specificity) and test set class counts $n_+$, $n_-$, the FC-score (editor’s term) is:

  • $1$ (consistent) if there exists a feasible confusion matrix $(TP, FN, FP, TN)$ satisfying all reported metrics under rounding uncertainty.
  • $0$ (inconsistent) otherwise.

This is determined via exhaustive search, analytic inversion, or integer linear programming (for cross-validation or aggregated results).
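The exhaustive-search variant can be sketched as follows; `metrics_feasible` and its rounding-tolerance handling are illustrative assumptions, not the published implementation:

```python
def metrics_feasible(acc, sens, spec, n_pos, n_neg, decimals=2):
    """Boolean FC-score for reported binary-classification metrics: search
    every confusion matrix with the given class counts and check whether
    some (TP, FN, FP, TN) reproduces the reported accuracy, sensitivity,
    and specificity up to rounding at the reported precision."""
    tol = 0.5 * 10 ** (-decimals)   # half-interval of rounding uncertainty
    n = n_pos + n_neg
    for tp in range(n_pos + 1):     # FN = n_pos - tp is implied
        for tn in range(n_neg + 1): # FP = n_neg - tn is implied
            if (abs((tp + tn) / n - acc) <= tol
                    and abs(tp / n_pos - sens) <= tol
                    and abs(tn / n_neg - spec) <= tol):
                return True
    return False

# Consistent report: TP=45, TN=40 of 50/50 yields acc=0.85, sens=0.90, spec=0.80
print(metrics_feasible(0.85, 0.90, 0.80, 50, 50))   # True
# Infeasible report: no confusion matrix gives acc=0.99 with these sens/spec
print(metrics_feasible(0.99, 0.90, 0.80, 50, 50))   # False
```

The search is $O(n_+ \cdot n_-)$, which is cheap for single test sets; the ILP formulation becomes necessary when metrics are aggregated across cross-validation folds.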

Table: FC-score usage by domain

Domain | FC-score Purpose | Calculation Principle
LLM robustness | Output agreement across formats | CR equation, averaged over variants
Clinical summarization evaluation | Expert rating, binarization | Ordinal scale, balanced accuracy
Binary classification | Metric report integrity | Boolean feasibility, ILP, search space

3. Practical Applications and Experimental Insights

SCORE experiments show that:

  • FC-score/CR uncovers marked instability, with accuracy and FC-score sometimes decoupled.
  • Models such as Llama-3.1 405B achieve high CR on AGIEval, while other models fall markedly lower.
  • Format robustness is application-critical: higher FC-score implies reliability in deployed, user-facing settings where wording and minor format choices are not controlled.

Analysis on TreatFact finds that:

  • Even top proprietary LLMs (GPT-4) achieve only moderate balanced accuracy on clinical consistency evaluation.
  • Prior metrics and open-source LLMs fall to chance.
  • FC-score protocols here underscore the need for aspect-based factuality and richer benchmarks.

In medical imaging segmentation and diagnostic prediction, FC-score consistency tests:

  • Identified a substantial fraction of published results in a subfield as mathematically inconsistent, prompting corrections across multiple papers.
  • Automated FC-score validation accelerates review cycles.

4. Evaluation Protocols and Limitations

  • In robustness studies (SCORE), FC-score is evaluated via large-scale, repeated model queries per dataset-permutation and averaged for leaderboard comparison.
  • In clinical/factual domains, FC-score depends on reliable expert annotation and binarization with balanced accuracy to mitigate class imbalance.
  • For classification integrity, FC-score computation assumes all reported metrics and sample sizes are disclosed; limitations emerge under unspecified cross-validation aggregation or missing denominator information.
  • In textual data cleaning (n-gram approaches; Chiu et al., 2021), the consistency score is a related but distinct metric: it tracks regularity with respect to language-model expectations rather than output format invariance.
  • FC-score differs from the format faithfulness rate (FFR; Yao et al., 2024), which is computed strictly via deterministic format-checker validation.
  • Distinct from factual consistency scores in summarization (Guo et al., 2022, Bishop et al., 2023)—here, "factual consistency" typically refers to entailment, supported by reference models or neural metrics, rather than cross-format output invariance.
  • BERTScore, BLANC, and ESTIME focus on semantic or factual alignment, not format consistency sensu stricto.

5. Significance and Future Directions

The FC-score, in its various operationalizations, serves as a foundational measure for deployability, repeatability, and reliability in contemporary NLP and ML. Increasingly, research communities expect robust systems that do not overfit to single prompt choices or special-case formatting. In clinical and scientific domains, FC-scores (whether formal output agreement, expert ordinal ratings, or mathematical feasibility checks) are central to reproducibility and trust.

Emergent directions include:

  • Joint optimization of FC-score and performance, especially in LLM training and reinforcement pipelines.
  • Fine-grained aspects: axis-wise FC-score computation (e.g., population/intervention in clinical summaries).
  • Automated integration into paper review and meta-analysis tools (e.g., open-source packages for large-scale automated consistency verification).

6. Summary Table: Representative FC-score Implementations

Paper/Framework | Definition/Formula | Domain/Application
SCORE (Nalbandyan et al., 28 Feb 2025) | Consistency Rate $CR$ (see above) | LLM robustness
TreatFact (Luo et al., 2024) | 0–3 expert rating, balanced accuracy | Clinical summary evaluation
Consistency Test (Fazekas et al., 2023) | Boolean feasible-solution existence | Binary classification reports

FC-score is foundational in evaluating the reliability and robustness of model outputs, directly influencing research integrity, deployment safety, and the interpretability of ML systems across scientific and engineering fields.
