Format Consistency Score (FC-score)
- Format Consistency Score (FC-score) is a quantitative measure that evaluates ML model reliability by checking output invariance under superficial input changes.
- It incorporates methods such as consistency rate, expert ordinal ratings, and Boolean feasibility checks to ensure robustness and reproducibility.
- FC-score is critical for applications in NLP, clinical summarization, and binary classification, providing actionable insights into model stability.
The Format Consistency Score (FC-score) encompasses a set of quantitative metrics and procedures used to evaluate the robustness, stability, and fidelity of machine learning systems—particularly in natural language processing—when subjected to input variations in format, aggregation, or syntactic structure. The term FC-score is most commonly associated with robustness and repeatability under non-adversarial input perturbations, but usage in the literature extends to text data cleaning, performance score verification, and factual evaluation heuristics. Leading implementations include the Consistency Rate in the SCORE framework (Nalbandyan et al., 28 Feb 2025), expert-graded FC-scores for factual consistency (Luo et al., 2024), and Boolean consistency scores for validity checking of reported metrics (Fazekas et al., 2023).
1. Definition and Conceptual Scope
Format Consistency Score (FC-score) typically denotes a measure of how reliably a model under test produces identical (or equivalent) outputs in response to the same semantic inputs with altered superficial formatting. Format refers to stylistic, structural, or procedural alterations that should not affect underlying task semantics—examples include prompt paraphrasing, option order shuffling, sentence wrapping, and other non-adversarial cosmetic changes.
In formal terms, FC-score operationalizes the expectation that semantically equivalent inputs yield equivalent outputs:
- FC-score $= 1$ if predictions are invariant to the chosen format variations.
- FC-score $= 0$ if predictions are completely unreliable under these perturbations.
The precise computation is context-dependent, encompassing output-agreement (SCORE), expert rating scales (TreatFact), and explicit feasibility (confusion matrix consistency).
2. Mathematical Formalization and Algorithms
2.1 Consistency Rate (SCORE Framework, (Nalbandyan et al., 28 Feb 2025))
In LLM robustness, FC-score is formalized as the Consistency Rate (CR):

$$\mathrm{CR} = \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{1}{\binom{|P_i|}{2}} \sum_{j<k} \mathbb{1}\big[y_{i,j} \equiv y_{i,k}\big]$$

Where:
- $D$ is a dataset of questions.
- $P_i = \{y_{i,1}, \ldots, y_{i,m}\}$ is the set of model predictions for question $i$ across format variants (prompt, choice order, random seed).
- $\mathbb{1}[y_{i,j} \equiv y_{i,k}]$ is 1 if the two outputs are equivalent (identical class label, symbolically same math answer), else 0.

Example Calculation (Multiple-Choice):
For five format variants and predictions “A”, “A”, “B”, “A”, “A”, there are $\binom{5}{2} = 10$ pairs and $6$ matches: $\mathrm{CR} = 6/10 = 0.6$.
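The pairwise-agreement computation behind CR can be sketched in a few lines of Python; this is a minimal illustration of the averaging described above, not the SCORE framework's reference implementation:

```python
from itertools import combinations

def consistency_rate(predictions_per_question):
    """Mean pairwise agreement across format variants, averaged over questions.

    predictions_per_question: one inner list per question, holding the model's
    output under each format variant (prompt paraphrase, choice order, seed).
    """
    per_question = []
    for preds in predictions_per_question:
        pairs = list(combinations(preds, 2))          # all unordered pairs
        agree = sum(1 for a, b in pairs if a == b)    # equivalent-output pairs
        per_question.append(agree / len(pairs))
    return sum(per_question) / len(per_question)

# Worked example from the text: "A", "A", "B", "A", "A"
# -> 10 pairs, 6 agreeing
print(consistency_rate([["A", "A", "B", "A", "A"]]))  # 0.6
```

Here string equality stands in for the equivalence check; a real harness would normalize outputs first (e.g., symbolic comparison for math answers).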
2.2 Factual Consistency FC-score (Luo et al., 2024)
In factual evaluation (TreatFact clinical benchmark), FC-score is an expert-graded ordinal rating:
- 0 = completely inconsistent
- 1 = major factual errors
- 2 = minor factual errors
- 3 = fully consistent
For performance reporting and discriminative analysis, scores are binarized (3 = consistent, <3 = inconsistent), and overall system accuracy is measured via balanced accuracy:

$$\mathrm{BA} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)$$
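A minimal sketch of the binarize-then-score step, with illustrative rating values (the evaluator judgments are hypothetical, not TreatFact data):

```python
def balanced_accuracy(y_true, y_pred):
    """(sensitivity + specificity) / 2 for binary labels (1 = consistent)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return (sens + spec) / 2

# Binarize 0-3 expert FC ratings: 3 -> consistent (1), <3 -> inconsistent (0)
ratings = [3, 2, 3, 0, 1, 3]                 # illustrative expert ratings
labels = [1 if r == 3 else 0 for r in ratings]
evaluator = [1, 0, 1, 0, 1, 1]               # hypothetical automatic judgments
print(balanced_accuracy(labels, evaluator))  # ~0.833
```

Balanced accuracy is used here because fully consistent summaries typically outnumber inconsistent ones, so plain accuracy would reward always predicting "consistent."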
2.3 Format Consistency Score (Verification for Binary Classification, (Fazekas et al., 2023))
FC-score as a mathematical truth-criterion: given reported performance metrics $\hat{a}$, $\hat{s}$, $\hat{c}$ (accuracy, sensitivity, specificity) and test set counts $(p, n)$ of positive and negative samples, the FC-score (editor’s term) is:
- $1$ if there exists a feasible confusion matrix $(tp, fn, fp, tn)$ satisfying all reported metrics under rounding uncertainty.
- $0$ otherwise.
This is determined via exhaustive search, analytic inversion, or integer linear programming (for cross-validation or aggregated results).
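The exhaustive-search variant is small enough to sketch directly. This is an illustrative feasibility check under a simple rounding model (half-width of the last reported decimal), not the published test's exact tolerance handling:

```python
def fc_feasible(acc, sens, spec, p, n, decimals=2):
    """True if some confusion matrix (tp, fn, fp, tn) with tp + fn = p and
    tn + fp = n reproduces the reported accuracy, sensitivity, and
    specificity within rounding uncertainty at `decimals` places."""
    tol = 0.5 * 10 ** (-decimals)  # half-width of the rounding interval
    for tp in range(p + 1):
        fn = p - tp
        for tn in range(n + 1):
            fp = n - tn
            if (abs(tp / p - sens) <= tol
                    and abs(tn / n - spec) <= tol
                    and abs((tp + tn) / (p + n) - acc) <= tol):
                return True
    return False

# 50 positives, 50 negatives; reported acc=0.90, sens=0.88, spec=0.92
print(fc_feasible(0.90, 0.88, 0.92, 50, 50))  # True  (tp=44, tn=46)
# Internally contradictory report: sens=spec=0.50 cannot yield acc=0.99
print(fc_feasible(0.99, 0.50, 0.50, 50, 50))  # False
```

For aggregated or cross-validated results, where per-fold counts are unknown, the search space grows and the ILP formulation mentioned above becomes the practical route.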
Table: FC-score usage by domain
| Domain | FC-score Purpose | Calculation Principle |
|---|---|---|
| LLM Robustness | Output agreement across format | Eq. for CR, average over variants |
| Clinical Summ. Eval | Expert rating, binarization | Ordinal scale, balanced accuracy |
| Binary Classif. | Metric report integrity | Boolean feasibility, ILP, search space |
3. Practical Applications and Experimental Insights
3.1 Robustness and Repeatability in LLMs (Nalbandyan et al., 28 Feb 2025)
SCORE experiments show that:
- FC-score/CR uncovers marked instability, with accuracy and FC-score sometimes decoupled.
- Consistency varies widely across models: some, such as Llama-3.1 405B on AGIEval, approach perfect CR, while others fall substantially lower.
- Format robustness is application-critical: higher FC-score implies reliability in deployed, user-facing settings where wording and minor format choices are not controlled.
3.2 Clinical and Scientific Summarization (Luo et al., 2024)
Analysis on TreatFact finds that:
- Even top proprietary LLMs (GPT-4) achieve only modest balanced accuracy on clinical consistency evaluation.
- Prior metrics and open-source LLMs fall to near-chance performance.
- FC-score protocols here underscore the need for aspect-based factuality and richer benchmarks.
3.3 Score Verification and Meta-Research (Fazekas et al., 2023)
In medical imaging segmentation and diagnostic prediction, FC-score consistency tests:
- Identified a substantial share of published results in a subfield as mathematically inconsistent with their reported test set sizes, prompting corrections across multiple papers.
- Automated FC-score validation accelerates review cycles.
4. Evaluation Protocols and Limitations
- In robustness studies (SCORE), FC-score is evaluated via large-scale, repeated model queries per dataset-permutation and averaged for leaderboard comparison.
- In clinical/factual domains, FC-score depends on reliable expert annotation and binarization with balanced accuracy to mitigate class imbalance.
- For classification integrity, FC-score computation assumes all reported metrics and sample sizes are disclosed; limitations emerge under unspecified cross-validation aggregation or missing denominator information.
- In textual data cleaning (n-gram approaches, (Chiu et al., 2021)), consistency score is a related metric but distinct from FC-score in robustness; it tracks regularity with respect to expectations from language modeling rather than output format invariance.
5. Relationships to Related Metrics and Benchmarks
- FC-score differs from the format faithfulness rate (FFR, (Yao et al., 2024)), which is computed via deterministic format-checker validation rather than cross-variant output agreement.
- Distinct from factual consistency scores in summarization (Guo et al., 2022, Bishop et al., 2023)—here, "factual consistency" typically refers to entailment, supported by reference models or neural metrics, rather than cross-format output invariance.
- BERTScore, BLANC, ESTIME are focused on semantic or factual alignment, not format consistency sensu stricto.
6. Significance and Future Directions
The FC-score, in its various operationalizations, serves as a foundational measure for deployability, repeatability, and reliability in contemporary NLP and ML. Increasingly, research communities expect robust systems that do not overfit to single prompt choices or special-case formatting. In clinical and scientific domains, FC-scores (whether formal output agreement, expert ordinal ratings, or mathematical feasibility checks) are central to reproducibility and trust.
Emergent directions include:
- Joint optimization of FC-score and performance, especially in LLM training and reinforcement pipelines.
- Fine-grained aspects: axis-wise FC-score computation (e.g., population/intervention in clinical summaries).
- Automated integration into paper review and meta-analysis tools (e.g., open-source packages for large-scale automated consistency verification).
7. Summary Table: Representative FC-score Implementations
| Paper/Framework | Definition/Formula | Domain/Application |
|---|---|---|
| SCORE (Nalbandyan et al., 28 Feb 2025) | Pairwise Consistency Rate (CR; see above) | LLM robustness |
| TreatFact (Luo et al., 2024) | 0–3 expert rating, balanced accuracy | Clinical summary evaluation |
| Consistency Test (Fazekas et al., 2023) | Boolean feasible solution existence | Binary classification report |
FC-score is foundational in evaluating the reliability and robustness of model outputs, directly influencing research integrity, deployment safety, and the interpretability of ML systems across scientific and engineering fields.