
FGVeriBench: Fine-Grained Verification Benchmarks

Updated 14 February 2026
  • FGVeriBench is a dual-domain benchmark offering fine-grained evaluation for both LLM factuality and geometric verification, enhancing nuance in model assessment.
  • It employs a triplet-based ranking system categorizing outputs as correct, intermediate, or incorrect to enable detailed performance analysis.
  • The benchmark integrates systematic dataset construction, robust annotation pipelines, and rigorous metrics for both textual and robotic applications.

FGVeriBench is a term used for two distinct high-quality open benchmarks targeting fine-grained verification in separate domains: (1) factuality in LLM question answering and (2) geometric verification in long-term visual loop closure detection for robotics. Both benchmarks address critical limitations in the granularity of evaluation within their domains by introducing systematically constructed datasets, robust annotation pipelines, and rigorous evaluation metrics. The following entry separately describes these benchmarks as established in recent arXiv literature.

1. Fine-grained Factuality Verification Benchmark for LLMs

1.1 Motivation and Distinction from Previous Work

Existing LLM factuality benchmarks (notably TruthfulQA, HaluEval, FELM, WiCE) primarily adopt binary (correct/incorrect) or sentence-level judgments, failing to discriminate between varying degrees of incorrectness. This severely limits their suitability for nuanced model evaluation, preference optimization in reinforcement learning from human feedback, and generation filtering. FGVeriBench is introduced as the first ranking-based, fine-grained testbed for factuality verification on short question–answer (QA) pairs. Unlike predecessors, it provides explicit annotation of error severity: answers are labeled as "correct," "intermediate" (minor error), or "incorrect" (major factual error or hallucination), supporting nuanced model development and comparative analysis (Huang et al., 7 Jan 2026).

1.2 Dataset Construction Pipeline and Properties

Data Sources and QA Types

FGVeriBench draws general QA questions (single-hop) from Natural Questions (NQ), TriviaQA, and PopQA, as well as multi-hop QA (requiring chained reasoning) from HotpotQA, MuSique, and 2WikiMultihop. This diversification ensures comprehensive coverage of both elementary and compositional reasoning challenges.

Annotation Protocol

  1. Instruction Filtering: Questions are selected if they possess a single, unambiguous ground-truth answer.
  2. Answer Pool Generation: Multiple strong LLMs generate diverse candidate answers for each question.
  3. Initial LLM-Judge Classification: An LLM judge assigns each candidate to one of three classes: Correct, Intermediate, or Incorrect.
  4. Triplet Sampling: For each question, one representative of each LLM-labeled category is randomly selected to form a triplet.
  5. LLM + Human Verification: Three human annotators independently assess and order each triplet. Triplets for which fewer than two annotators agree with the LLM-judge ordering are discarded (a sketch of this filter follows the list).
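
The agreement filter in step 5 admits a short illustration. The following Python sketch is an assumption about one way to implement it (no reference code is published here), and it assumes "agreement" means an identical full ordering rather than per-label agreement: a triplet is kept only if at least two of the three human orderings coincide with the LLM-judge ordering.

from typing import List, Tuple

# A ranking is a tuple of candidate indices ordered best-to-worst, e.g. (0, 2, 1).
Ranking = Tuple[int, int, int]

def keep_triplet(llm_ranking: Ranking, human_rankings: List[Ranking],
                 min_agreement: int = 2) -> bool:
    """Keep a triplet only if at least `min_agreement` human annotators
    produced the same ordering as the LLM judge (step 5 of the protocol)."""
    agreements = sum(1 for ordering in human_rankings if ordering == llm_ranking)
    return agreements >= min_agreement

# Example: two of three annotators agree with the LLM ordering, so the triplet is kept.
print(keep_triplet((0, 1, 2), [(0, 1, 2), (0, 1, 2), (1, 0, 2)]))  # True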

Label Taxonomy

Rank | Description
1 | Correct: fully factual, no errors
2 | Intermediate: minor factual inaccuracy or omission
3 | Incorrect: major factual error or hallucination

Inter-annotator reliability is reported via Cohen's κ together with the observed agreement rate of 64.8%.

Dataset Statistics

Domain | Questions | Avg. Answer Length (tokens)
NQ | 203 | 15.7
TriviaQA | 372 | 13.7
PopQA | 365 | 14.3
HotpotQA | 287 | 16.1
MuSique | 154 | 21.0
2Wiki | 298 | 14.9

Each of the 1,679 triplets contains one answer per rank.

1.3 Benchmark Format and Example Structure

Each dataset record is represented as a JSON object holding the QA instruction, a ground-truth answer (reference), and three candidate answers, each labeled with its rank. Example:

{
  "instruction": "What nationality is Alice Delamar’s father?",
  "reference": "United States",
  "candidates": [
    {"rank":1, "text":"American"},
    {"rank":2, "text":"I don't have information on Alice Delamar."},
    {"rank":3, "text":"Alice Delamar's father’s nationality is French."}
  ]
}
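
For illustration only, records in this shape can be loaded and iterated as below; the JSON Lines storage format and the file name fgveribench.jsonl are assumptions, not details taken from the benchmark release.

import json

def load_records(path: str):
    """Hypothetical loader assuming one JSON record per line (JSON Lines)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            # Sort candidates so that index 0 holds the human Rank-1 answer.
            record["candidates"].sort(key=lambda c: c["rank"])
            yield record

for rec in load_records("fgveribench.jsonl"):  # file name is an assumption
    print(rec["instruction"], "->", rec["candidates"][0]["text"])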

1.4 Evaluation Protocol and Metrics

Ranking Metrics

  • Precision@1 (P@1): Proportion of questions where the model’s top choice aligns with human Rank 1.
  • Kendall's τ: Measures concordance between the model's ranking and the gold triplet ranking; for $k = 3$ candidates, $\tau = (C - D) / \left[\tfrac{1}{2}k(k-1)\right]$, where $C$ and $D$ are the numbers of concordant and discordant pairs (see the sketch following this list).
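
A minimal sketch of both ranking metrics on a single triplet, assuming ranks are encoded as integers with 1 denoting the best answer; the helper names are illustrative and not part of the benchmark's tooling.

from itertools import combinations

def kendall_tau(pred_ranks, gold_ranks):
    """Kendall's tau between predicted and gold ranks of the same candidates.

    Both arguments list ranks (lower = better), aligned by candidate index.
    For k candidates there are k*(k-1)/2 pairs; tau = (C - D) / pairs.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(pred_ranks)), 2):
        sign = (pred_ranks[i] - pred_ranks[j]) * (gold_ranks[i] - gold_ranks[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    num_pairs = len(pred_ranks) * (len(pred_ranks) - 1) / 2
    return (concordant - discordant) / num_pairs

def precision_at_1(pred_ranks, gold_ranks):
    """1.0 if the model's top-ranked candidate is the gold Rank-1 answer, else 0.0."""
    return float(pred_ranks.index(min(pred_ranks)) == gold_ranks.index(min(gold_ranks)))

# Example: the model swaps the intermediate and incorrect answers.
print(kendall_tau([1, 3, 2], [1, 2, 3]))      # 1/3
print(precision_at_1([1, 3, 2], [1, 2, 3]))   # 1.0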

Binary Adaptation

By merging Ranks 2 and 3 into a single negative class, standard binary classification metrics (accuracy, precision, recall, F1) can be computed, as sketched below.
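
A brief sketch of this binary adaptation using scikit-learn; the example labels are made up for illustration.

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def to_binary(ranks):
    """Map Rank 1 to the positive class (correct) and Ranks 2-3 to the negative class."""
    return [1 if r == 1 else 0 for r in ranks]

gold_ranks = [1, 2, 3, 1, 3]   # human labels for five candidate answers (illustrative)
pred_ranks = [1, 1, 3, 2, 3]   # verifier-assigned labels (illustrative)

y_true, y_pred = to_binary(gold_ranks), to_binary(pred_ranks)
acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
print(acc, prec, rec, f1)  # 0.6 0.5 0.5 0.5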

Weighted Scoring (Proposed)

A severity-weighted average score is proposed,

$$S = \frac{1}{N}\sum_{i=1}^{N}\sum_{c} w_c \,[\mathrm{pred}_i = c],$$

where $w_1 = 1.0$ (correct), $w_2 = \alpha$ (minor inaccuracy), $w_3 = 0.0$ (incorrect), with $\alpha \in (0, 1)$.
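
A short numeric sketch of the severity-weighted score; the value α = 0.5 is chosen arbitrarily for illustration.

def weighted_score(pred_ranks, alpha=0.5):
    """Severity-weighted score S: weight 1.0 for Rank 1, alpha for Rank 2, 0.0 for Rank 3."""
    weights = {1: 1.0, 2: alpha, 3: 0.0}
    return sum(weights[r] for r in pred_ranks) / len(pred_ranks)

# Three fully correct answers, two with minor inaccuracies, one hallucination:
print(weighted_score([1, 1, 1, 2, 2, 3], alpha=0.5))  # (3*1.0 + 2*0.5 + 1*0.0) / 6 = 0.666...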

1.5 Comparative Analysis with Existing Benchmarks

Feature | TruthfulQA / HaluEval / FELM / WiCE | FGVeriBench
Annotation level | Binary (sentence/claim) | Triplet ranking (answer-level)
Scope | Hallucination/summarization | Short QA, single/multi-hop
Granularity | 2-way | 3-way (fine-grained)
Protocol | Pointwise | Pairwise/ranking

FGVeriBench’s triplet design ensures each question features a "best," "medium," and "worst" answer, providing a robust scaffold for severity-aware ranking and learning (Huang et al., 7 Jan 2026).

2. Geometric Verification Benchmark for Visual Loop Closure Detection

2.1 Purpose and Context

In visual SLAM, loop closure verification is essential to reject erroneous matches and enforce global map consistency. While most benchmarks target image retrieval, pose estimation, or homography estimation, FGVeriBench (also termed GV-Bench) systematically benchmarks the geometric verification stage, where local feature matching and epipolar constraints decide the validity of loop closure candidates under challenging appearance and viewpoint changes (Yu et al., 2024).

2.2 Dataset Sequences and Environmental Variations

FGVeriBench constructs six image-pair sequences from three public datasets, each covering a distinct combination of environmental condition and viewpoint:

Sequence | Dataset / Condition | Viewpoint
Day | Oxford RobotCar / Day | Moderate
Night | Oxford RobotCar / Night | Moderate, illumination difference
Season | Oxford RobotCar / Autumn-Winter | Moderate, seasonal difference
Weather | Oxford RobotCar / Sun-Rain | Moderate, weather difference
Nordland | Train video, seasons | Negligible, fixed camera
UAcampus | Campus (day-night) | Strong, path variation

Each sequence comprises approximately 1,400–1,700 queries, each matched against its top 10–20 retrieved database images, yielding tens of thousands of evaluation pairs.

2.3 Benchmark Task, Pipeline, and Evaluation Protocol

Loop Closure Verification Pipeline

  1. Keypoint Extraction and Description: Algorithms such as SIFT, SuperPoint, and DISK extract and describe local features.
  2. Feature Matching: Nearest-neighbor or graph neural matcher (e.g., SuperGlue, LightGlue) associates keypoints.
  3. Geometric Verification: RANSAC estimates the fundamental matrix $\mathbf{F}$ enforcing the epipolar constraint $\mathbf{x}_2^\top \mathbf{F} \mathbf{x}_1 = 0$; the inlier set $\mathcal{I}$ is extracted.
  4. Binary Decision: A candidate pair $(q_i, r_{i,j})$ is declared a true loop closure if $|\mathcal{I}| \geq \tau_{\mathrm{inlier}}$ (a sketch of steps 3 and 4 follows the list).
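
A minimal sketch of steps 3 and 4 using OpenCV, assuming corresponding keypoint coordinates are already available from the matching stage; the inlier threshold of 20 and the RANSAC parameters are illustrative values, not settings prescribed by the benchmark.

import numpy as np
import cv2

def verify_loop_closure(pts1: np.ndarray, pts2: np.ndarray,
                        inlier_threshold: int = 20) -> bool:
    """RANSAC fundamental-matrix estimation followed by an inlier-count decision.

    pts1, pts2: corresponding keypoint coordinates (N x 2, float32) from the
    query and the retrieved reference image, produced by the matching stage.
    """
    if len(pts1) < 8:  # the 8-point algorithm needs at least eight correspondences
        return False
    # Positional arguments: method, RANSAC reprojection threshold (px), confidence.
    F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.999)
    if F is None:
        return False
    num_inliers = int(inlier_mask.ravel().sum())  # matches consistent with epipolar geometry
    return num_inliers >= inlier_threshold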

Evaluation Metrics

  • Precision: $P = \mathrm{TP} / (\mathrm{TP} + \mathrm{FP})$
  • Recall: $R = \mathrm{TP} / (\mathrm{TP} + \mathrm{FN})$
  • F1-Score: $F_1 = 2PR / (P + R)$
  • Average Precision (AP): area under the precision–recall curve.
  • Maximum Recall at 100% Precision (MR): $\mathrm{MR} = \max_{\tau}\{R(\tau) \mid P(\tau) = 1.0\}$, where $\tau$ is the inlier-count threshold (a computation sketch follows this list).
  • Inlier Ratio: $\mathrm{IR} = |\mathcal{I}| / |\mathcal{M}|$, with $\mathcal{M}$ the set of tentative matches before RANSAC.
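
A short sketch of computing AP and MR from per-pair inlier counts and ground-truth labels using scikit-learn; the example scores and labels are made up for illustration.

import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Inlier counts serve as the confidence score for each candidate loop-closure pair.
labels = np.array([1, 1, 0, 1, 0, 0])        # ground truth: loop / not a loop
inliers = np.array([120, 45, 30, 80, 5, 2])  # |I| from geometric verification

ap = average_precision_score(labels, inliers)
precision, recall, _ = precision_recall_curve(labels, inliers)

# Maximum Recall at 100% Precision: best recall among operating points with P = 1.0.
mr = recall[precision >= 1.0].max()
print(f"AP = {ap:.3f}, MR@100%P = {mr:.3f}")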

2.4 Feature Matching Algorithms and Comparative Results

Six representative local feature matching pipelines are compared (RANSAC-based verification is fixed):

Method | Key technique
SIFT + NN | Handcrafted DoG detector + SIFT descriptor, ratio-test NN matching
SuperPoint + NN | Jointly learned detector and descriptor, NN matching
SuperPoint + SuperGlue | Learned GNN matcher on SuperPoint descriptors
DISK + NN | Policy-gradient learned detector and descriptor, NN matching
DISK + LightGlue | Learned lightweight LightGlue matcher on DISK features
LoFTR | Detector-free, end-to-end transformer matching

  • SuperPoint + SuperGlue achieves the highest average MR of 44.4% across all sequences and is best on 4 of the 6 sequences.
  • LoFTR achieves the highest average AP of 93.2% but a substantially lower MR (26.6%). Performance varies considerably across sequences, with environmental condition and viewpoint change strongly impacting robustness (Yu et al., 2024).

2.5 Limitations and Prospective Directions

  • Feature robustness: Networks such as SuperPoint and LoFTR are not trained on data with severe weather or seasonal change; augmenting training with long-term appearance variability and harder negatives is recommended.
  • Multi-condition databases: Aliasing increases with severe condition gaps; storing multi-experience reference images can mitigate this.
  • Advanced outlier rejection: Vanilla RANSAC is vulnerable to outlier-dominated matches; alternatives like MAGSAC++ or differentiable RANSAC are highlighted.
  • Ambiguity in ground truth: GPS-based labeling may misclassify visually valid cases; task-aware, geometry-driven labeling is advised.
  • Beyond geometric verification: Combining geometric and learned verifiers (e.g., Doppelgangers CNN, SUE) promises more robust loop closure verification pipelines.

3. Significance and Comparative Context

Both instantiations of FGVeriBench address a core challenge in their respective fields: capturing gradations of correctness (for LLMs) or robustness (for visual SLAM) that are otherwise obscured under conventional binary evaluation. For factuality, FGVeriBench uniquely provides triplet-based, severity-ranked evaluation necessary for advanced model selection and preference optimization—a feature absent in prior datasets (Huang et al., 7 Jan 2026). In robotics, FGVeriBench (GV-Bench) directly benchmarks the geometric verification stage, facilitating fair, controlled comparison of feature matchers and establishing new baselines for recall under strict precision. These innovations enable meaningful progress in model development, rigorous benchmarking, and the identification of persistent bottlenecks (Yu et al., 2024).

4. References

  • "DiVA: Fine-grained Factuality Verification with Agentic-Discriminative Verifier" (Huang et al., 7 Jan 2026)
  • "GV-Bench: Benchmarking Local Feature Matching for Geometric Verification of Long-term Loop Closure Detection" (Yu et al., 2024)
