Format Consistency Score (FC-score)
- Format Consistency Score (FC-score) is a quantitative measure that evaluates ML model reliability by checking output invariance under superficial input changes.
- It incorporates methods such as consistency rate, expert ordinal ratings, and Boolean feasibility checks to ensure robustness and reproducibility.
- FC-score is critical for applications in NLP, clinical summarization, and binary classification, providing actionable insights into model stability.
The Format Consistency Score (FC-score) encompasses a set of quantitative metrics and procedures used to evaluate the robustness, stability, and fidelity of machine learning systems—particularly in natural language processing—when subjected to input variations in format, aggregation, or syntactic structure. The term FC-score is most commonly associated with robustness and repeatability under non-adversarial input perturbations, but usage in the literature extends to text data cleaning, performance score verification, and factual evaluation heuristics. Leading implementations include the Consistency Rate in the SCORE framework (Nalbandyan et al., 28 Feb 2025), expert-graded FC-scores for factual consistency (Luo et al., 2024), and Boolean consistency scores for validity checking of reported metrics (Fazekas et al., 2023).
1. Definition and Conceptual Scope
Format Consistency Score (FC-score) typically denotes a measure of how reliably a model under test produces identical (or equivalent) outputs in response to the same semantic inputs with altered superficial formatting. Format refers to stylistic, structural, or procedural alterations that should not affect underlying task semantics—examples include prompt paraphrasing, option order shuffling, sentence wrapping, and other non-adversarial cosmetic changes.
In formal terms, FC-score operationalizes the expectation that semantically equivalent inputs yield equivalent outputs:
- FC-score $= 1$ if predictions are invariant to the chosen format variations.
- FC-score $= 0$ if predictions are completely unreliable under these perturbations.
The precise computation is context-dependent, encompassing output-agreement (SCORE), expert rating scales (TreatFact), and explicit feasibility (confusion matrix consistency).
2. Mathematical Formalization and Algorithms
2.1 Consistency Rate (SCORE Framework, (Nalbandyan et al., 28 Feb 2025))
In LLM robustness, FC-score is formalized as the Consistency Rate (CR):

$$\mathrm{CR} = \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{1}{\binom{|P_i|}{2}} \sum_{j<k} \mathbb{1}\big[y_{i,j} \equiv y_{i,k}\big]$$

Where:
- $D$ is a dataset of questions.
- $P_i = \{y_{i,1}, \ldots, y_{i,m}\}$ is the set of model predictions for question $i$ across format variants (prompt, choice order, random seed).
- $\mathbb{1}[y_{i,j} \equiv y_{i,k}]$ is 1 if the two outputs are equivalent (identical class label, symbolically same math answer), else 0.

Example Calculation (Multiple-Choice):
For five format variants and predictions “A”, “A”, “B”, “A”, “A”, there are $\binom{5}{2} = 10$ pairs and $6$ matches: $\mathrm{CR} = 6/10 = 0.6$.
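The pairwise-agreement computation behind CR can be sketched in a few lines of Python; this is a minimal illustration of the averaging described above, not the SCORE framework's reference implementation:

```python
from itertools import combinations

def consistency_rate(predictions_per_question):
    """Mean pairwise agreement across format variants, averaged over questions.

    predictions_per_question: one inner list per question, holding the model's
    output under each format variant (prompt paraphrase, choice order, seed).
    """
    per_question = []
    for preds in predictions_per_question:
        pairs = list(combinations(preds, 2))          # all unordered pairs
        agree = sum(1 for a, b in pairs if a == b)    # equivalent-output pairs
        per_question.append(agree / len(pairs))
    return sum(per_question) / len(per_question)

# Worked example from the text: "A", "A", "B", "A", "A"
# -> 10 pairs, 6 agreeing
print(consistency_rate([["A", "A", "B", "A", "A"]]))  # 0.6
```

Here string equality stands in for the equivalence check; a real harness would normalize outputs first (e.g., symbolic comparison for math answers).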
2.2 Factual Consistency FC-score (Luo et al., 2024)
In factual evaluation (TreatFact clinical benchmark), FC-score is an expert-graded ordinal rating:
- 0 = completely inconsistent
- 1 = major factual errors
- 2 = minor factual errors
- 3 = fully consistent
For performance reporting and discriminative analysis, scores are binarized (3 = consistent, <3 = inconsistent), and overall system accuracy is measured via balanced accuracy:

$$\mathrm{BA} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)$$
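A minimal sketch of the binarize-then-score step, with illustrative rating values (the evaluator judgments are hypothetical, not TreatFact data):

```python
def balanced_accuracy(y_true, y_pred):
    """(sensitivity + specificity) / 2 for binary labels (1 = consistent)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return (sens + spec) / 2

# Binarize 0-3 expert FC ratings: 3 -> consistent (1), <3 -> inconsistent (0)
ratings = [3, 2, 3, 0, 1, 3]                 # illustrative expert ratings
labels = [1 if r == 3 else 0 for r in ratings]
evaluator = [1, 0, 1, 0, 1, 1]               # hypothetical automatic judgments
print(balanced_accuracy(labels, evaluator))  # ~0.833
```

Balanced accuracy is used here because fully consistent summaries typically outnumber inconsistent ones, so plain accuracy would reward always predicting "consistent."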
2.3 Format Consistency Score (Verification for Binary Classification, (Fazekas et al., 2023))
FC-score as a mathematical truth-criterion: given reported performance metrics $\hat{a}$, $\hat{s}$, $\hat{c}$ (accuracy, sensitivity, specificity) and test set counts $(p, n)$ of positive and negative samples, the FC-score (editor’s term) is:
- $1$ if there exists a feasible confusion matrix $(tp, fn, fp, tn)$ satisfying all reported metrics under rounding uncertainty.
- $0$ otherwise.
This is determined via exhaustive search, analytic inversion, or integer linear programming (for cross-validation or aggregated results).
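The exhaustive-search variant is small enough to sketch directly. This is an illustrative feasibility check under a simple rounding model (half-width of the last reported decimal), not the published test's exact tolerance handling:

```python
def fc_feasible(acc, sens, spec, p, n, decimals=2):
    """True if some confusion matrix (tp, fn, fp, tn) with tp + fn = p and
    tn + fp = n reproduces the reported accuracy, sensitivity, and
    specificity within rounding uncertainty at `decimals` places."""
    tol = 0.5 * 10 ** (-decimals)  # half-width of the rounding interval
    for tp in range(p + 1):
        fn = p - tp
        for tn in range(n + 1):
            fp = n - tn
            if (abs(tp / p - sens) <= tol
                    and abs(tn / n - spec) <= tol
                    and abs((tp + tn) / (p + n) - acc) <= tol):
                return True
    return False

# 50 positives, 50 negatives; reported acc=0.90, sens=0.88, spec=0.92
print(fc_feasible(0.90, 0.88, 0.92, 50, 50))  # True  (tp=44, tn=46)
# Internally contradictory report: sens=spec=0.50 cannot yield acc=0.99
print(fc_feasible(0.99, 0.50, 0.50, 50, 50))  # False
```

For aggregated or cross-validated results, where per-fold counts are unknown, the search space grows and the ILP formulation mentioned above becomes the practical route.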
Table: FC-score usage by domain
| Domain | FC-score Purpose | Calculation Principle |
|---|---|---|
| LLM Robustness | Output agreement across format | Eq. for CR, average over variants |
| Clinical Summ. Eval | Expert rating, binarization | Ordinal scale, balanced accuracy |
| Binary Classif. | Metric report integrity | Boolean feasibility, ILP, search space |
3. Practical Applications and Experimental Insights
3.1 Robustness and Repeatability in LLMs (Nalbandyan et al., 28 Feb 2025)
SCORE experiments show that:
- FC-score/CR uncovers marked instability, with accuracy and FC-score sometimes decoupled.
- Consistency varies widely across models: some, such as Llama-3.1 405B on AGIEval, approach perfect CR, while others fall substantially lower.
- Format robustness is application-critical: higher FC-score implies reliability in deployed, user-facing settings where wording and minor format choices are not controlled.
3.2 Clinical and Scientific Summarization (Luo et al., 2024)
Analysis on TreatFact finds that:
- Even top proprietary LLMs (GPT-4) achieve only modest balanced accuracy on clinical consistency evaluation.
- Prior metrics and open-source LLMs fall to near-chance performance.
- FC-score protocols here underscore the need for aspect-based factuality and richer benchmarks.
3.3 Score Verification and Meta-Research (Fazekas et al., 2023)
In medical imaging segmentation and diagnostic prediction, FC-score consistency tests:
- Identified a substantial share of published results in a subfield as mathematically inconsistent with their reported test set sizes, prompting corrections across multiple papers.
- Automated FC-score validation accelerates review cycles.
4. Evaluation Protocols and Limitations
- In robustness studies (SCORE), FC-score is evaluated via large-scale, repeated model queries per dataset-permutation and averaged for leaderboard comparison.
- In clinical/factual domains, FC-score depends on reliable expert annotation and binarization with balanced accuracy to mitigate class imbalance.
- For classification integrity, FC-score computation assumes all reported metrics and sample sizes are disclosed; limitations emerge under unspecified cross-validation aggregation or missing denominator information.
- In textual data cleaning (n-gram approaches, (Chiu et al., 2021)), consistency score is a related metric but distinct from FC-score in robustness; it tracks regularity with respect to expectations from language modeling rather than output format invariance.
5. Relationships to Related Metrics and Benchmarks
- FC-score differs from the format faithfulness rate (FFR, (Yao et al., 2024)), which is computed via deterministic format-checker validation rather than cross-variant output agreement.
- Distinct from factual consistency scores in summarization (Guo et al., 2022, Bishop et al., 2023)—here, "factual consistency" typically refers to entailment, supported by reference models or neural metrics, rather than cross-format output invariance.
- BERTScore, BLANC, ESTIME are focused on semantic or factual alignment, not format consistency sensu stricto.
6. Significance and Future Directions
The FC-score, in its various operationalizations, serves as a foundational measure for deployability, repeatability, and reliability in contemporary NLP and ML. Increasingly, research communities expect robust systems that do not overfit to single prompt choices or special-case formatting. In clinical and scientific domains, FC-scores (whether formal output agreement, expert ordinal ratings, or mathematical feasibility checks) are central to reproducibility and trust.
Emergent directions include:
- Joint optimization of FC-score and performance, especially in LLM training and reinforcement pipelines.
- Fine-grained aspects: axis-wise FC-score computation (e.g., population/intervention in clinical summaries).
- Automated integration into paper review and meta-analysis tools (e.g., open-source packages for large-scale automated consistency verification).
7. Summary Table: Representative FC-score Implementations
| Paper/Framework | Definition/Formula | Domain/Application |
|---|---|---|
| SCORE (Nalbandyan et al., 28 Feb 2025) | Pairwise Consistency Rate (CR; see above) | LLM robustness |
| TreatFact (Luo et al., 2024) | 0–3 expert rating, balanced accuracy | Clinical summary evaluation |
| Consistency Test (Fazekas et al., 2023) | Boolean feasible solution existence | Binary classification report |
FC-score is foundational in evaluating the reliability and robustness of model outputs, directly influencing research integrity, deployment safety, and the interpretability of ML systems across scientific and engineering fields.