FACTS Leaderboard Suite

Updated 14 December 2025
  • FACTS Leaderboard Suite is a modular framework that standardizes multifactorial model evaluations across diverse domains such as NLP, deepfake detection, and scientific extraction.
  • It employs comprehensive metric suites and robust protocols to measure factual accuracy, grounding, fairness, and stability, ensuring methodological rigor.
  • The suite features customizable scoring, adversary-resistant protocols, and open-source toolkits that foster reproducible and transparent scientific progress.

The FACTS Leaderboard Suite is a collection of systems, frameworks, and benchmarking protocols designed for systematic, standardized, and multifactorial evaluation of models in research domains ranging from natural language processing and deepfake detection to scientific results extraction and ranking fairness. Across its implementations, FACTS emphasizes factual fidelity, transparency, methodological rigor, and robustness. Its modular architecture enables granular performance analysis and cross-domain validation, supports customized scoring, and fosters reproducible scientific progress.

1. Rationale and Foundational Principles

The fundamental motivation for the FACTS Leaderboard Suite is the need for rigorous, transparent, and reproducible benchmarking within rapidly evolving fields where model performance is multidimensional and sensitive to data distribution shifts, evaluation protocol choices, and adversarial manipulations. Existing leaderboards often lack standardized metrics, uniform data splits, or mechanisms to account for robustness. FACTS aims to rectify these gaps by providing:

  • Comprehensive axis coverage: Factual accuracy, robustness, grounding, fairness, stability, and diversity.
  • Standardized metric suites: Explicit formulae and protocols for scoring, error rates, and cross-model comparison.
  • Reproducibility and transparency: Open-source toolkits and clearly documented protocols.
  • Adversary resistance: Private/public splits, eligibility filters, and automated judge ensembles mitigate overfitting and bias.
  • Customizability: Weighting schemes, difficulty stratification, and visual analytics allow detailed, user-driven diagnosis (Cheng et al., 11 Dec 2025, Jacovi et al., 6 Jan 2025, Yang et al., 2018, Mishra et al., 2021, Hou et al., 2019, Dowerah et al., 2 Sep 2025).

2. Evolving Suite Architectures and Evaluation Protocols

The FACTS Leaderboard Suite is realized in several reference implementations:

  • FACTS Leaderboard for LLMs: Composed of four weighted sub-leaderboards.

    • FACTS Multimodal: Evaluates factuality over image-based QA, combining coverage and contradiction verdicts.
    • FACTS Parametric: Assesses closed-book factual recall through direct parametric QA, scored as Accuracy, F1, and hedging.
    • FACTS Search: Quantifies models’ real-world information seeking and synthesis via API usage and answer evaluation.
    • FACTS Grounding: Measures the fidelity of long-form generation explicitly grounded in provided context documents; verdicts and eligibility determined by automated judge models.
    • Overall FACTS Score (a minimal aggregation sketch follows this list):

    $$\mathrm{Score}_{suite} = \frac{1}{4}\left(\mathrm{Score}_{MM} + \mathrm{Acc}_{Param} + \mathrm{Acc}_{Search} + F_{Ground}\right)$$

(Cheng et al., 11 Dec 2025)

  • FACTS Grounding Leaderboard: Reference for document-faithful long-form text generation. Implements two-phase evaluation (eligibility then factual grounding), judge ensembling, and strict context fidelity. (Jacovi et al., 6 Jan 2025)
  • Speech DF Arena: Standardizes audio deepfake detection benchmarking across 14 datasets with protocols enforcing EER, pooled EER, Accuracy, F1, and optional minDCF, plus an open-source Python toolkit for reproducible evaluation and leaderboard integration. (Dowerah et al., 2 Sep 2025)
  • Ranking Facts Nutritional Label: Implements leaderboard analysis with widgets for transparency (score recipe, attribute influence), stability, fairness (multi-test), diversity (entropy measures), and noise/top-k perturbation. (Yang et al., 2018)
  • Equitable Evaluation and Customization Tool: Leaderboard customization via sample difficulty weighting (using spurious bias, OOD similarity, softmax confidence), continuous/discrete weighting, split-based analysis, and visualization. Enables adversarial stress-testing and focus-area performance selection. (Mishra et al., 2021)
  • TDMS-IE Automated Extraction: End-to-end pipeline for scientific leaderboards—extracts (Task, Dataset, Metric, Score) quadruples from research papers using NLI-style Transformer classifiers, PDF parsing, and table analysis. (Hou et al., 2019)
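
As referenced above, the following is a minimal sketch of the equal-weight aggregation behind the overall FACTS score, assuming each sub-leaderboard score is already normalized to [0, 1]. The function name and example values are illustrative assumptions, not the official scoring code.

```python
# Illustrative sketch (not the official implementation): equal-weight
# aggregation of the four FACTS sub-leaderboard scores, each assumed
# to lie in [0, 1].

def facts_suite_score(score_mm: float, acc_param: float,
                      acc_search: float, f_ground: float) -> float:
    """Average the four sub-leaderboard scores with equal weights."""
    return (score_mm + acc_param + acc_search + f_ground) / 4.0

# Hypothetical scores for a single model.
print(facts_suite_score(0.71, 0.64, 0.58, 0.83))  # -> 0.69
```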

3. Metric Formulations and Automated Judging

The suite rigorously defines and implements metrics for factuality, grounding, error rates, fairness, and robustness. Key equations include:

  • FACTS Multimodal Accuracy:

$$\mathrm{Score}_{MM} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{I}\left[C_i = 1 \;\wedge\; N_i = 1\right]$$

where $C_i$ and $N_i$ are the coverage and contradiction verdicts.

  • FACTS Grounding Score (v2):

$$F_{Ground} = \frac{1}{N}\sum_{i=1}^{N} \bigl(E_i \wedge \bigl(G_{i,1} = 1 \wedge G_{i,2} = 1\bigr)\bigr)$$

where $E_i$ is eligibility, and $G_{i,1}$ and $G_{i,2}$ are groundings from two judge models.

  • Speech DF Arena EER:

$$\mathrm{EER} = P_{FA}(\theta^*), \qquad \theta^* = \arg\min_\theta \bigl|P_{FA}(\theta) - P_{FR}(\theta)\bigr|$$

with $P_{FA}(\theta)$ the false acceptance rate and $P_{FR}(\theta)$ the false rejection rate at threshold $\theta$.

  • Leaderboard Weighted Score (Difficulty Weighting):

$$\mathrm{Metric} = \frac{\sum_{i \in T} K_i\, w_i}{\sum_{i \in T} d\, w_i} \times 100$$

with $w_i$ the difficulty-based weight, $K_i$ the reward/penalty for sample $i$, and $d$ the reward multiplier for a correct prediction (Mishra et al., 2021); a minimal sketch follows.
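
The sketch below illustrates the difficulty-weighted metric above. It assumes $K_i$ equals the reward $d$ for a correct prediction and a configurable penalty otherwise, with weights $w_i$ supplied by an external difficulty estimate; these conventions are assumptions for illustration rather than the tool's exact reward scheme.

```python
# Illustrative sketch of the difficulty-weighted leaderboard metric.
# Assumptions for illustration: K_i = d for a correct prediction and
# K_i = penalty otherwise; w_i comes from an external difficulty
# estimate (spurious bias, OOD similarity, or softmax confidence).

def weighted_metric(correct, weights, d=1.0, penalty=0.0):
    """Return the difficulty-weighted score on a 0-100 scale."""
    rewards = [d if c else penalty for c in correct]
    numerator = sum(k * w for k, w in zip(rewards, weights))
    denominator = sum(d * w for w in weights)  # maximum attainable weighted reward
    return 100.0 * numerator / denominator

# Hypothetical example: harder samples (larger w_i) count more.
print(weighted_metric(correct=[1, 0, 1, 1], weights=[0.2, 0.9, 0.5, 0.4]))  # -> 55.0
```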

Metrics are backed by automated judge models (e.g., Gemini, GPT, Claude), validated via macro-F1 and human annotation. Mitigation of judge bias (self-preference) is achieved through ensembling and ranking fusion (Cheng et al., 11 Dec 2025, Jacovi et al., 6 Jan 2025).
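
For concreteness, here is a minimal sketch of the two-phase grounding verdict combination, assuming binary eligibility and per-judge grounding verdicts are already available. The simple AND over two judges mirrors the $F_{Ground}$ formula above; production ensembles fuse more judges and rankings, which is not modeled here.

```python
# Illustrative sketch of the two-phase FACTS Grounding score: a response
# contributes only if it passes the eligibility filter AND both automated
# judges mark it as grounded in the provided context documents.

def grounding_score(eligible, judge1, judge2):
    """Fraction of responses that are eligible and grounded per both judges."""
    assert len(eligible) == len(judge1) == len(judge2)
    hits = sum(1 for e, g1, g2 in zip(eligible, judge1, judge2)
               if e and g1 and g2)
    return hits / len(eligible)

# Hypothetical verdicts for four responses.
print(grounding_score(eligible=[1, 1, 0, 1],
                      judge1=[1, 0, 1, 1],
                      judge2=[1, 1, 1, 1]))  # -> 0.5
```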

4. Dataset Design, Task Diversity, and Cross-Domain Robustness

Dataset construction emphasizes domain coverage, attack diversity, and annotation fidelity:

  • Speech DF Arena: 14 corpora spanning English and Mandarin, studio-quality and in-the-wild recordings, TTS, VC, codec, and partial-spoof manipulations.
  • FACTS Grounding: >1,700 prompts across legal, medical, technical, and financial domains, with documents up to 32K tokens, user requests demanding nontrivial synthesis.
  • FACTS Multimodal, Parametric, Search: Balanced image categories, adversarial filtering, real user-traffic queries, multi-hop knowledge graph questions, and hard tail subsets.

Cross-domain generalization is a central concern: models show major degradation under distributional shift, noise, and language mismatch. For example, Speech DF Arena reports EER increases of up to 15× for raw-waveform CNNs when moving from studio TTS to noisy celebrity deepfakes, and detectors fail on cross-lingual tracks unless their backbone is multilingual (Dowerah et al., 2 Sep 2025).
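
Since EER is the headline Speech DF Arena metric behind these comparisons, the following is a minimal sketch of its computation per the definition in Section 3. The score convention (higher = more likely bonafide) and the threshold sweep are assumptions for illustration; the actual toolkit may compute EER differently (e.g., from a full DET curve).

```python
import numpy as np

# Illustrative sketch (not the Speech DF Arena toolkit): estimate EER by
# sweeping thresholds and locating the point where the false-acceptance
# rate on spoofed audio meets the false-rejection rate on bonafide audio.
# Assumed convention: higher score = more likely bonafide.

def equal_error_rate(bonafide_scores, spoof_scores):
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    best_gap, eer = np.inf, None
    for t in thresholds:
        far = np.mean(spoof_scores >= t)    # spoof accepted as bonafide
        frr = np.mean(bonafide_scores < t)  # bonafide rejected as spoof
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Hypothetical detector scores.
print(equal_error_rate(np.array([0.9, 0.8, 0.7, 0.4]),
                       np.array([0.6, 0.3, 0.2, 0.1])))  # -> 0.25
```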

5. Implementation Toolkits, Customization, and Interactive Analysis

FACTS Suite implementations provide open-source toolkits, leaderboard APIs, and interactive visualization tools:

  • Speech DF Arena toolkit: Python package with Dataset, Scorer, and Runner classes for standardized evaluation; integration with Huggingface Spaces for automatic leaderboard posting.
  • Leaderboard Customization Tool: Web UI supporting selection of evaluation metric (WSBias/WOOD/WMProb), split configuration, reward/penalty assignment, and real-time update of model ranks and difficulty analysis (Mishra et al., 2021).
  • TDMS-IE pipeline: Modular micro-services for PDF/table parsing, NLI inference, database integration, and human-in-the-loop curation. Extraction of TDMS tuples supports dynamic leaderboard growth and zero-shot classification of novel triples (Hou et al., 2019); an NLI-style sketch follows this list.
  • Ranking Facts Widgets: Transparency (score composition and attribute impact), stability (slope fitting), fairness (multi-test), diversity (entropy, histograms), and top-$k$ sensitivity.
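
To illustrate the NLI-style classification idea behind TDMS extraction (referenced in the TDMS-IE item above), the sketch below scores candidate (Task, Dataset, Metric) labels against a paper excerpt with an off-the-shelf zero-shot NLI classifier. The model choice, excerpt, and candidate labels are illustrative assumptions; this is not the TDMS-IE pipeline itself.

```python
# Illustrative sketch of the NLI-style idea behind TDMS extraction:
# score candidate (Task, Dataset, Metric) labels against paper text with
# a generic zero-shot NLI classifier. Not the TDMS-IE pipeline itself.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

paper_excerpt = ("We evaluate our parser on the Penn Treebank and report "
                 "a new state-of-the-art F1 score for constituency parsing.")
candidate_tdm = [
    "constituency parsing; Penn Treebank; F1",
    "machine translation; WMT14; BLEU",
    "speech recognition; LibriSpeech; WER",
]

result = classifier(paper_excerpt, candidate_labels=candidate_tdm)
for label, score in zip(result["labels"], result["scores"]):
    print(f"{score:.3f}  {label}")
```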

User studies (n=32, industry teams) demonstrate a mean 41% reduction in pre-deployment development and testing effort via FACTS leaderboard customization and difficulty-weighted analysis (Mishra et al., 2021).

6. Integrity, Governance, and Future Directions

Maintaining robustness and adversary resistance is achieved through:

  • Private splits and blind evaluation: Public/private testsets reduce overfitting and leaderboard hacking.
  • Eligibility filtering: Two-phase scoring prevents gaming by trivial responses.
  • Automated judge versioning and human validation: Periodic updates based on manual review maintain score reliability.
  • Human-in-the-loop curation: Rapid correction cycles paired with active taxonomy growth ensure leaderboard relevance and precision (Hou et al., 2019).

Future development areas include:

  • Incorporation of multi-modality (audio, video, visual grounding)
  • Enhanced metrics (cost-based minDCF, calibration C_llr, detection-error tradeoff curves)
  • Expansion to additional domains (e.g., scientific ranking, automated leaderboard construction across disciplines)
  • Real-time demos and interactive test APIs within leaderboard spaces
  • Release of unlabeled evaluation sets to further guard against overfitting
  • Adoption of focus-area driven leaderboard customization for targeted model selection and deployment.

7. Impact and Significance

The FACTS Leaderboard Suite provides a reproducible, multifaceted, and transparent standard for model evaluation, enabling stakeholders to track progress on factuality, robustness, fairness, and stability. By integrating modular architecture, open APIs, difficulty weighting, and automated judging, FACTS enables both the scientific community and industry practitioners to diagnose weaknesses, drive targeted improvements, and deploy models with verified reliability across use cases. Empirical studies confirm its practical utility in reducing development effort and strengthening model selection procedures (Cheng et al., 11 Dec 2025, Dowerah et al., 2 Sep 2025, Jacovi et al., 6 Jan 2025, Yang et al., 2018, Mishra et al., 2021, Hou et al., 2019).
