FGVeriBench: Fine-Grained Verification Benchmarks
- FGVeriBench refers to two open benchmarks offering fine-grained evaluation, one for LLM factuality and one for geometric verification in visual loop closure detection, adding nuance to model assessment in both domains.
- The factuality benchmark employs a triplet-based ranking that categorizes outputs as correct, intermediate, or incorrect, enabling detailed, severity-aware performance analysis.
- The benchmark integrates systematic dataset construction, robust annotation pipelines, and rigorous metrics for both textual and robotic applications.
FGVeriBench is a term used for two distinct high-quality open benchmarks targeting fine-grained verification in separate domains: (1) factuality in LLM question answering and (2) geometric verification in long-term visual loop closure detection for robotics. Both benchmarks address critical limitations in the granularity of evaluation within their domains by introducing systematically constructed datasets, robust annotation pipelines, and rigorous evaluation metrics. The following entry separately describes these benchmarks as established in recent arXiv literature.
1. Fine-grained Factuality Verification Benchmark for LLMs
1.1 Motivation and Distinction from Previous Work
Existing LLM factuality benchmarks (notably TruthfulQA, HaluEval, FELM, WiCE) primarily adopt binary (correct/incorrect) or sentence-level judgments, failing to discriminate between varying degrees of incorrectness. This severely limits their suitability for nuanced model evaluation, preference optimization in reinforcement learning from human feedback, and generation filtering. FGVeriBench is introduced as the first ranking-based, fine-grained testbed for factuality verification on short question–answer (QA) pairs. Unlike predecessors, it provides explicit annotation of error severity: answers are labeled as "correct," "intermediate" (minor error), or "incorrect" (major factual error or hallucination), supporting nuanced model development and comparative analysis (Huang et al., 7 Jan 2026).
1.2 Dataset Construction Pipeline and Properties
Data Sources and QA Types
FGVeriBench draws general QA questions (single-hop) from Natural Questions (NQ), TriviaQA, and PopQA, as well as multi-hop QA (requiring chained reasoning) from HotpotQA, MuSique, and 2WikiMultihop. This diversification ensures comprehensive coverage of both elementary and compositional reasoning challenges.
Annotation Protocol
- Instruction Filtering: Questions are selected if they possess a single, unambiguous ground-truth answer.
- Answer Pool Generation: Multiple strong LLMs generate diverse candidate answers for each question.
- Initial LLM-Judge Classification: An LLM judge assigns each candidate to one of three classes: Correct, Intermediate, or Incorrect.
- Triplet Sampling: For each question, one representative of each LLM-labeled category is randomly selected to form a triplet.
- LLM + Human Verification: Three human annotators independently assess and order each triplet. Triplets for which fewer than two annotators agree with the LLM-judge ordering are discarded (see the sketch below).
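A minimal sketch of the sampling and agreement-filtering steps, assuming orderings are represented as tuples of candidate identifiers (best first); the function names and data layout are illustrative, not taken from the benchmark's release:

```python
import random

def sample_triplet(candidates_by_rank, seed=None):
    """Pick one candidate answer per LLM-assigned class (1=correct, 2=intermediate, 3=incorrect)."""
    rng = random.Random(seed)
    return {rank: rng.choice(pool) for rank, pool in candidates_by_rank.items()}

def keep_triplet(llm_order, human_orders, min_agree=2):
    """Retain a triplet only if at least `min_agree` human orderings match the LLM-judge ordering."""
    return sum(order == llm_order for order in human_orders) >= min_agree

# Illustrative usage: orderings are tuples of candidate ids, best first.
llm_order = ("a", "b", "c")
human_orders = [("a", "b", "c"), ("a", "b", "c"), ("a", "c", "b")]
assert keep_triplet(llm_order, human_orders)  # two of three annotators agree with the LLM
```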
Label Taxonomy
| Rank | Description |
|---|---|
| 1 | Correct—fully factual, no errors |
| 2 | Intermediate—minor factual inaccuracy or omission |
| 3 | Incorrect—major factual error or hallucination |
Cohen's κ and observed inter-annotator agreement (64.8%) are reported for reliability analysis.
Dataset Statistics
| Domain | Questions | Avg. Answer Length (tokens) |
|---|---|---|
| NQ | 203 | 15.7 |
| TriviaQA | 372 | 13.7 |
| PopQA | 365 | 14.3 |
| HotpotQA | 287 | 16.1 |
| MuSique | 154 | 21.0 |
| 2Wiki | 298 | 14.9 |
Each of the 1,679 triplets contains one answer per rank.
1.3 Benchmark Format and Example Structure
Each dataset record is represented as a JSON object holding the QA instruction, a ground-truth answer (reference), and three candidate answers, each labeled with its rank. Example:
```json
{
  "instruction": "What nationality is Alice Delamar’s father?",
  "reference": "United States",
  "candidates": [
    {"rank": 1, "text": "American"},
    {"rank": 2, "text": "I don't have information on Alice Delamar."},
    {"rank": 3, "text": "Alice Delamar's father’s nationality is French."}
  ]
}
```
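A record in this format can be read and converted into the gold ordering consumed by the ranking metrics of Section 1.4. A minimal sketch; the file path and JSONL layout are hypothetical, and only the fields shown in the example above are assumed:

```python
import json

def gold_order(record):
    """Return candidate answer texts ordered by annotated rank (1 = best, 3 = worst)."""
    return [c["text"] for c in sorted(record["candidates"], key=lambda c: c["rank"])]

# Hypothetical path: one JSON record per line (JSONL layout assumed for illustration).
with open("fgveribench_factuality.jsonl") as f:
    for line in f:
        record = json.loads(line)
        print(record["instruction"], "->", gold_order(record))
```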
1.4 Evaluation Protocol and Metrics
Ranking Metrics
- Precision@1 (P@1): Proportion of questions where the model’s top choice aligns with human Rank 1.
- Kendall’s τ: Measures concordance between the model's ranking and the gold triplet ranking; for $n$ candidates, $\tau = \frac{C - D}{n(n-1)/2}$, where $C$ and $D$ are the numbers of concordant and discordant pairs.
Binary Adaptation
By merging Rank 2 and 3, standard binary classification metrics (Accuracy, Precision, Recall, F1) can be computed.
Weighted Scoring (Proposed)
A severity-weighted average score,
$S = \frac{1}{N} \sum_{i=1}^{N} w_{r_i}$,
where $r_i$ is the gold rank of the answer selected for question $i$, and $w_1$ (correct), $w_2$ (minor inaccuracy), and $w_3$ (major error) are severity weights satisfying $w_1 > w_2 > w_3$.
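Under these definitions, the ranking metrics reduce to a few lines of Python. The sketch below assumes each ranking is a list of candidate identifiers ordered best-first; the weights $(1, 0.5, 0)$ used for the severity-weighted average are an illustrative assumption, not the benchmark's specification:

```python
from itertools import combinations

def kendall_tau(pred_order, gold_order):
    """Kendall's tau between two rankings given as lists of candidate ids, best first."""
    n = len(gold_order)
    pred_pos = {c: i for i, c in enumerate(pred_order)}
    gold_pos = {c: i for i, c in enumerate(gold_order)}
    concordant = discordant = 0
    for a, b in combinations(gold_order, 2):
        sign = (pred_pos[a] - pred_pos[b]) * (gold_pos[a] - gold_pos[b])
        concordant += sign > 0
        discordant += sign < 0
    return (concordant - discordant) / (n * (n - 1) / 2)

def precision_at_1(pred_orders, gold_orders):
    """Fraction of questions whose top-ranked candidate matches the gold Rank-1 answer."""
    return sum(p[0] == g[0] for p, g in zip(pred_orders, gold_orders)) / len(gold_orders)

def weighted_score(chosen_ranks, weights=(1.0, 0.5, 0.0)):
    """Severity-weighted average over the gold rank (1-3) of each model-selected answer.
    The weight values are illustrative placeholders."""
    return sum(weights[r - 1] for r in chosen_ranks) / len(chosen_ranks)
```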
1.5 Comparative Analysis with Existing Benchmarks
| Feature | TruthfulQA/HaluEval/FELM/WiCE | FGVeriBench |
|---|---|---|
| Annotation level | Binary (sentence/claim) | Triplet ranking (answer-level) |
| Scope | Hallucination/summarization | Short QA, single/multi-hop |
| Granularity | 2-way | 3-way (fine-grained) |
| Protocol | Pointwise | Pairwise/ranking |
FGVeriBench’s triplet design ensures each question features a "best," "medium," and "worst" answer, providing a robust scaffold for severity-aware ranking and learning (Huang et al., 7 Jan 2026).
2. Geometric Verification Benchmark for Visual Loop Closure Detection
2.1 Purpose and Context
In visual SLAM, loop closure verification is essential to reject erroneous matches and enforce global map consistency. While most benchmarks target image retrieval, pose estimation, or homography, FGVeriBench (also termed GV-Bench) systematically benchmarks the geometric verification stage, where local feature matching and epipolar constraints decide the veracity of loop closure candidates under challenging appearance and viewpoint changes (Yu et al., 2024).
2.2 Dataset Sequences and Environmental Variations
FGVeriBench constructs six image-pair sequences from three public datasets, each covering a distinct combination of environmental condition and viewpoint:
| Sequence | Dataset/Condition | Viewpoint |
|---|---|---|
| Day | Oxford RobotCar / Day | Moderate |
| Night | Oxford RobotCar / Night | Moderate, illumination difference |
| Season | RobotCar (Autumn-Winter) | Moderate, seasonal difference |
| Weather | RobotCar (Sun-Rain) | Moderate, weather difference |
| Nordland | Train video, seasons | Negligible, fixed camera |
| UAcampus | Campus (day-night) | Strong, path variation |
Each sequence comprises approximately 1,400–1,700 queries, each matched against its top 10–20 retrieved database images, yielding tens of thousands of evaluation pairs.
2.3 Benchmark Task, Pipeline, and Evaluation Protocol
Loop Closure Verification Pipeline
- Keypoint Extraction and Description: Algorithms such as SIFT, SuperPoint, and DISK extract and describe local features.
- Feature Matching: Nearest-neighbor or graph neural matcher (e.g., SuperGlue, LightGlue) associates keypoints.
- Geometric Verification: RANSAC estimates the fundamental matrix $F$ enforcing the epipolar constraint $x'^\top F x = 0$ for matched points $x \leftrightarrow x'$; the inlier set $\mathcal{I}$ is extracted.
- Binary Decision: A candidate pair is declared a true loop if the inlier count exceeds a threshold, $|\mathcal{I}| \geq \tau$ (see the sketch after this list).
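A minimal sketch of this pipeline using OpenCV's SIFT detector, nearest-neighbour matching with a ratio test, and RANSAC fundamental-matrix estimation; the inlier threshold of 30 is an illustrative value, not the benchmark's setting, and the actual GV-Bench implementation may differ:

```python
import cv2
import numpy as np

def verify_loop_closure(img_query, img_ref, inlier_threshold=30):
    """Return (is_loop, inlier_count) for a candidate pair of grayscale images."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img_query, None)
    kp2, des2 = sift.detectAndCompute(img_ref, None)
    if des1 is None or des2 is None:
        return False, 0

    # Nearest-neighbour matching with Lowe's ratio test.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    good = []
    for pair in matcher.knnMatch(des1, des2, k=2):
        if len(pair) == 2 and pair[0].distance < 0.8 * pair[1].distance:
            good.append(pair[0])
    if len(good) < 8:  # the 8-point algorithm needs at least 8 correspondences
        return False, len(good)

    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

    # RANSAC fundamental-matrix estimation; the mask marks epipolar-consistent inliers.
    F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 3.0, 0.99)
    n_inliers = int(mask.sum()) if mask is not None else 0
    return n_inliers >= inlier_threshold, n_inliers
```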
Evaluation Metrics
- Precision: $P = \frac{TP}{TP + FP}$
- Recall: $R = \frac{TP}{TP + FN}$
- F1-Score: $F_1 = \frac{2PR}{P + R}$
- Average Precision (AP): Area under the precision–recall curve.
- Maximum Recall at 100% Precision (MR): the largest recall attainable while precision remains 1.0 as the decision threshold is swept (sketched after this list).
- Inlier Ratio: $|\mathcal{I}| / |\mathcal{M}|$, with $\mathcal{M}$ the tentative match set before RANSAC.
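Maximum Recall at 100% Precision can be computed directly from per-pair confidence scores (e.g. RANSAC inlier counts) and binary ground-truth labels. A minimal sketch; variable names are illustrative:

```python
import numpy as np

def max_recall_at_full_precision(scores, labels):
    """Largest recall reachable while precision stays at 1.0, sweeping a score threshold.

    scores: per-pair confidence (higher = more likely a true loop closure).
    labels: 1 for true loop closures, 0 for false candidates.
    """
    order = np.argsort(-np.asarray(scores, dtype=float))
    y = np.asarray(labels)[order]
    tp = np.cumsum(y)            # true positives at each threshold
    fp = np.cumsum(1 - y)        # false positives at each threshold
    precision = tp / (tp + fp)
    recall = tp / y.sum()
    perfect = precision == 1.0   # thresholds with zero false positives
    return float(recall[perfect].max()) if perfect.any() else 0.0
```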
2.4 Feature Matching Algorithms and Comparative Results
Six representative local feature matching pipelines are compared (RANSAC-based verification is fixed):
| Method | Key technique |
|---|---|
| SIFT + NN | Handcrafted DoG + SIFT, ratio test |
| SuperPoint + NN | Jointly learned corner+descriptor, NN matching |
| SuperPoint + SuperGlue | Learned matcher (GNN, SuperPoint descriptors) |
| DISK + NN | Policy-gradient learned detector+desc., NN |
| DISK + LightGlue | DISK features matched with the lightweight learned matcher LightGlue |
| LoFTR | Detector-free, end-to-end transformer |
- SuperPoint + SuperGlue achieves the highest average MR (44.4%) across all sequences and is best on 4 of the 6 sequences.
- LoFTR achieves the highest average AP (93.2%) but a substantially lower MR (26.6%). Performance varies considerably across sequences, with environmental condition and viewpoint strongly impacting robustness (Yu et al., 2024).
2.5 Limitations and Prospective Directions
- Feature robustness: Networks like SuperPoint and LoFTR are not trained on data with severe weather or seasonal change; augmenting training data with long-term variability and harder negatives is recommended.
- Multi-condition databases: Aliasing increases with severe condition gaps; storing multi-experience reference images can mitigate this.
- Advanced outlier rejection: Vanilla RANSAC is vulnerable to outlier-dominated matches; alternatives like MAGSAC++ or differentiable RANSAC are highlighted.
- Ambiguity in ground truth: GPS-based labeling may misclassify visually valid cases; task-aware, geometry-driven labeling is advised.
- Beyond geometric verification: Combining geometric and learned verifiers (e.g., Doppelgangers CNN, SUE) promises more robust loop closure verification pipelines.
3. Significance and Comparative Context
Both instantiations of FGVeriBench address a core challenge in their respective fields: capturing gradations of correctness (for LLMs) or robustness (for visual SLAM) that are otherwise obscured under conventional binary evaluation. For factuality, FGVeriBench uniquely provides triplet-based, severity-ranked evaluation necessary for advanced model selection and preference optimization—a feature absent in prior datasets (Huang et al., 7 Jan 2026). In robotics, FGVeriBench (GV-Bench) directly benchmarks the geometric verification stage, facilitating fair, controlled comparison of feature matchers and establishing new baselines for recall under strict precision. These innovations enable meaningful progress in model development, rigorous benchmarking, and the identification of persistent bottlenecks (Yu et al., 2024).
4. References
- "DiVA: Fine-grained Factuality Verification with Agentic-Discriminative Verifier" (Huang et al., 7 Jan 2026)
- "GV-Bench: Benchmarking Local Feature Matching for Geometric Verification of Long-term Loop Closure Detection" (Yu et al., 2024)