Bi'an: A Bilingual Benchmark and Model for Hallucination Detection in Retrieval-Augmented Generation (2502.19209v1)

Published 26 Feb 2025 in cs.CL

Abstract: Retrieval-Augmented Generation (RAG) effectively reduces hallucinations in LLMs but can still produce inconsistent or unsupported content. Although LLM-as-a-Judge is widely used for RAG hallucination detection due to its implementation simplicity, it faces two main challenges: the absence of comprehensive evaluation benchmarks and the lack of domain-optimized judge models. To bridge these gaps, we introduce Bi'an, a novel framework featuring a bilingual benchmark dataset and lightweight judge models. The dataset supports rigorous evaluation across multiple RAG scenarios, while the judge models are fine-tuned from compact open-source LLMs. Extensive experimental evaluations on Bi'anBench show our 14B model outperforms baseline models with over five times larger parameter scales and rivals state-of-the-art closed-source LLMs. We will release our data and models soon at https://github.com/OpenSPG/KAG.


'{"REASONING": ["The DOCUMENT indicates that the Arrêt de Mérindol was issued on January 1, 1545.", "The DOCUMENT also mentions that several villages were devastated, but it does not specify the exact timing of the devastation in relation to the Arrêt de Mérindol.", "The QUESTION asks what happened second between the Arrêt de Mérindol and the devastation of villages.", "The ANSWER provided is the devastation of villages.", "Since the DOCUMENT does not provide the exact timing of the devastation of villages, it cannot be concluded that it happened after the Arrêt de Mérindol."], "SCORE": "FAIL"}', 'rejected': '{"REASONING": ["The DOCUMENT states that the French king, Francis I, issued the "Arrêt de Mérindol" on January 1, 1545, and armed a crusade against the Waldensians of Provence.", "The DOCUMENT also mentions that deaths ranged from hundreds to thousands, and several villages were devastated.", "The QUESTION asks what happened second: Arrêt de Mérindol or devastation of villages.", "The ANSWER provided is Arrêt de Mérindol.", "Based on the DOCUMENT, the Arrêt de Mérindol occurred before the devastation of villages."], "SCORE": "PASS"}'} \\
& \multirow{15}{*}{ZH} & {'input': 'Given the following "DOCUMENT" and "SUMMARY", you must analyze the provided "SUMMARY" and determine whether it is faithful to the contents of the "DOCUMENT".\textbackslash{}n The "SUMMARY" must not offer new information beyond the context provided in the "DOCUMENT". The "SUMMARY" also must not contradict information provided in the "DOCUMENT". Output your final judgment strictly in the following format: "pass" if the "SUMMARY" is faithful to the "DOCUMENT", and "fail" if the summary is not faithful to the document.\textbackslash{}n --\textbackslash{}n DOCUMENT: Agent: What problem have you run into that I can help you handle or resolve?\textbackslash{}nUser: I would like to ask when my invoice will be mailed back\textbackslash{}nAgent: The electronic invoice for this order has already been issued\textbackslash{}nAgent: On PC: My JD > Customer Service > My Invoices > download it from the invoice details; in the app: Me > Customer Service > Invoice Service > view it in the invoice details\textbackslash{}nUser: I need a paper special invoice\textbackslash{}nUser: Send me my orders first and I will pick the one for you\textbackslash{}nAgent: [order number] Is it this order?\textbackslash{}nUser: No\textbackslash{}nUser: [order number]\textbackslash{}nUser: It is this one\textbackslash{}nAgent: [order number] is the waybill number of the invoice; it is already out for delivery\textbackslash{}nUser: Can I check where it has got to?\textbackslash{}nAgent: [station] Your order has been dispatched from JD [station] and is about to be sent to JD [station]\textbackslash{}nUser: OK, fine\textbackslash{}nUser: Can it arrive tomorrow?\textbackslash{}nAgent: I cannot guarantee that either\textbackslash{}nUser: OK, fine, [name]\textbackslash{}nAgent: You are welcome, dear, it is what I should do\textbackslash{}nAgent: Dear, is there anything else you need help with?\textbackslash{}nAgent: If not, please remember to give [name] a rating, just a quick tap with your lucky hand\textbackslash{}n \textbackslash{}n--\textbackslash{}n SUMMARY: The user asks when the invoice can be mailed back and whether it can arrive tomorrow. The agent replies that the invoice shipment has been dispatched from JD [station] and is about to be sent to JD [station], but cannot guarantee the user that it will arrive tomorrow.\textbackslash{}n --\textbackslash{}nYour output should be in JSON format with the keys "reasoning process" and "judgment": {"reasoning process": <the steps of how you reasoned to the judgment>, "judgment": <your final judgment>}', 'chosen': '{"reasoning process": ["The user asks when the invoice will be mailed back; in the document the agent replies that the electronic invoice has already been issued and provides the download method.", "On whether the agent guarantees that the invoice arrives tomorrow, the agent in the document says this cannot be guaranteed.", "The summary is missing the important information that the user needs a paper special invoice and that the agent provided the order number.", "This shows that the summary is not faithful enough."], "judgment": "fail"}', 'rejected': '{"reasoning process": ["The user asks when the invoice will be mailed back; in the document the agent replies that the electronic invoice has already been issued and provides the download method.", "The user asks whether it can arrive tomorrow; in the document the agent says this cannot be guaranteed.", "The content mentioned in the summary is basically consistent with the dialogue in the document, and [illegible].", "The summary is missing the important information that the user needs a paper special invoice and that the agent provided the order number.", "This shows that the summary is not faithful enough."], "judgment": "fail"}'} \\
\bottomrule \end{tabular} } \caption{Examples of training dataset.} \label{tab:train_example} \end{table} \end{CJK*}

\subsection{Two-Stage Training Process} During the SFT stage, we use a learning rate of 1e-5, a batch size of 4, and train for 3 epochs. In the DPO stage, we set the beta value to 0.1, the learning rate to 5e-7, the batch size to 4, and train for 3 epochs. For the LoRA configuration, we set r=16, LoRA_alpha=32, and LoRA_dropout=0.05, and only fine-tune the Q, V, K, and O matrices.
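
The hyperparameters above can be collected in one place, as in the minimal Python sketch below. It is not the authors' training code. The class names `StageConfig` and `LoraSettings` are hypothetical, the LoRA field names are chosen to mirror the keyword arguments of `peft.LoraConfig`, and the module names `q_proj`, `k_proj`, `v_proj`, and `o_proj` are an assumption about how the Q, K, V, and O matrices are named in the base model. The `dpo_loss` helper is the standard DPO objective for a single preference pair, included only to show where beta enters.

```python
from dataclasses import dataclass, asdict
import math


@dataclass(frozen=True)
class StageConfig:
    """Optimizer settings for one training stage (hypothetical container)."""
    learning_rate: float
    batch_size: int
    num_epochs: int


@dataclass(frozen=True)
class LoraSettings:
    """LoRA hyperparameters from the text; names mirror peft.LoraConfig."""
    r: int = 16
    lora_alpha: int = 32
    lora_dropout: float = 0.05
    # Assumed module names for the Q, K, V, and O projection matrices.
    target_modules: tuple = ("q_proj", "k_proj", "v_proj", "o_proj")


# Values quoted from the section above.
SFT = StageConfig(learning_rate=1e-5, batch_size=4, num_epochs=3)
DPO = StageConfig(learning_rate=5e-7, batch_size=4, num_epochs=3)
DPO_BETA = 0.1
LORA = LoraSettings()


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected,
             beta=DPO_BETA):
    """Standard DPO loss for one pair of sequence log-probabilities."""
    # How much more the policy prefers the chosen answer than the reference does.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)) written as log(1 + exp(-beta * margin)).
    return math.log1p(math.exp(-beta * margin))


if __name__ == "__main__":
    print(asdict(LORA))
    print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

With a small beta such as 0.1, the loss changes slowly with the log-probability margin, so the DPO stage moves the policy only gently away from the SFT reference model.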

\subsection{Inference Details} We use the same prompt templates as during training when conducting RAG hallucination detection inference. During inference, we set temperature=0.01 and top_p=0.1. Since the model's output is in JSON format, we use regular expressions for parsing. The computation is performed using a single Nvidia A100-80G GPU.
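
A possible form of the regular-expression parsing step is sketched below in Python. The paper does not give the pattern it uses, so `_SCORE_RE`, `parse_score`, and `is_hallucination` are hypothetical names and the pattern itself is an assumption. The `"SCORE"` key and the `PASS` and `FAIL` values are taken from the English training examples. One plausible reason to prefer a regular expression over a strict JSON parser is that a reasoning string containing unescaped quotation marks would make the whole response invalid JSON while the verdict field stays readable.

```python
import re
from typing import Optional

# Sampling settings quoted from the text above.
GENERATION_KWARGS = {"temperature": 0.01, "top_p": 0.1}

# Matches the verdict field even if the rest of the JSON is malformed.
_SCORE_RE = re.compile(r'"SCORE"\s*:\s*"(PASS|FAIL)"')


def parse_score(raw_output: str) -> Optional[str]:
    """Return "PASS" or "FAIL" from a judge response, or None if absent."""
    matches = _SCORE_RE.findall(raw_output)
    # Use the last match in case the reasoning text quotes the field itself.
    return matches[-1] if matches else None


def is_hallucination(raw_output: str) -> Optional[bool]:
    """Map the verdict to a binary label; FAIL marks unsupported content."""
    score = parse_score(raw_output)
    return None if score is None else score == "FAIL"


if __name__ == "__main__":
    # The inner quotes make this string invalid JSON, yet the verdict parses.
    sample = '{"REASONING": ["The DOCUMENT says "A" came first."], "SCORE": "FAIL"}'
    print(parse_score(sample), is_hallucination(sample))
```

Returning `None` for a response without a verdict keeps malformed generations separate from genuine `PASS` or `FAIL` predictions, so they can be counted or retried explicitly.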

\section{Experiment Results} \label{app:exp} Table \ref{tab:exp_en} and Table \ref{tab:exp_zh} present the detailed experimental results for Bi'anBench_EN and Bi'anBench_ZH, respectively.

\begin{table*}[] \scalebox{0.75}{ \begin{tabular}{lccccc}
 & \multicolumn{5}{c}{Bi'anBench\_EN} \\ \cline{2-6}
Model / Dataset & QA & Summarization & Data-to-Text & Machine Translation & Avg. \\ \hline
GPT-4o-0806 & 86.6 & 75.5 & 85.6 & 86.4 & 84.8 \\
\quad HaluEval\_qa & 83.8 & - & - & - & - \\
\quad RAGTruth\_qa & 86.6 & - & - & - & - \\
\quad FinanceBench & 86.3 & - & - & - & - \\
\quad DROP & 86.5 & - & - & - & - \\
\quad CovidQA & 86.6 & - & - & - & - \\
\quad PubMedQA & 89.0 & - & - & - & - \\
\quad ASQA & 86.3 & - & - & - & - \\
\quad IfQA & 88.5 & - & - & - & - \\
\quad FIB & - & 75.5 & - & - & - \\
\quad HaluEval\_sum & - & 75.5 & - & - & - \\
\quad WebNLG & - & - & 85.6 & - & - \\
\quad RAGTruth\_d2t & - & - & 85.5 & - & - \\
\quad PDC & - & - & - & 86.5 & - \\
\quad WMT21 & - & - & - & 86.4 & - \\ \hline
GPT-4o-mini & 82.9 & 58.9 & 82.3 & 79.6 & 78.9 \\
\quad HaluEval\_qa & 78.2 & - & - & - & - \\
\quad RAGTruth\_qa & 84.2 & - & - & - & - \\
\quad FinanceBench & 76.5 & - & - & - & - \\
\quad DROP & 85.5 & - & - & - & - \\
\quad CovidQA & 82.1 & - & - & - & - \\
\quad PubMedQA & 84.3 & - & - & - & - \\
\quad ASQA & 83.0 & - & - & - & - \\
\quad IfQA & 84.4 & - & - & - & - \\
\quad FIB & - & 59.6 & - & - & - \\
\quad HaluEval\_sum & - & 58.3 & - & - & - \\
\quad WebNLG & - & - & 82.3 & - & - \\
\quad RAGTruth\_d2t & - & - & 82.3 & - & - \\
\quad PDC & - & - & - & 80.0 & - \\
\quad WMT21 & - & - & - & 79.2 & - \\ \hline
Llama3.1-8B-Instruct & 72.3 & 60.2 & 62.6 & 68.3 & 68.6 \\
\quad HaluEval\_qa & 71.6 & - & - & - & - \\
\quad RAGTruth\_qa & 73.3 & - & - & - & - \\
\quad FinanceBench & 70.0 & - & - & - & - \\
\quad DROP & 74.1 & - & - & - & - \\
\quad CovidQA & 72.8 & - & - & - & - \\
\quad PubMedQA & 72.7 & - & - & - & - \\
\quad ASQA & 72.3 & - & - & - & - \\
\quad IfQA & 72.3 & - & - & - & - \\
\quad FIB & - & 60.7 & - & - & - \\
\quad HaluEval\_sum & - & 59.7 & - & - & - \\
\quad WebNLG & - & - & 62.6 & - & - \\
\quad RAGTruth\_d2t & - & - & 62.6 & - & - \\
\quad PDC & - & - & - & 67.7 & - \\
\quad WMT21 & - & - & - & 68.9 & - \\ \hline
Llama3.1-70B-Instruct & 83.2 & 75.2 & 80.9 & 73.3 & 80.3 \\
\quad HaluEval\_qa & 81.9 & - & - & - & - \\
\quad RAGTruth\_qa & 85.0 & - & - & - & - \\
\quad FinanceBench & 81.1 & - & - & - & - \\
\quad DROP & 83.7 & - & - & - & - \\
\quad CovidQA & 82.4 & - & - & - & - \\
\quad PubMedQA & 83.9 & - & - & - & - \\
\quad ASQA & 83.3 & - & - & - & - \\
\quad IfQA & 83.6 & - & - & - & - \\
\quad FIB & - & 75.2 & - & - & - \\
\quad HaluEval\_sum & - & 75.2 & - & - & - \\
\quad WebNLG & - & - & 80.9 & - & - \\
\quad RAGTruth\_d2t & - & - & 80.9 & - & - \\
\quad PDC & - & - & - & 73.4 & - \\
\quad WMT21 & - & - & - & 73.2 & - \\ \hline
Qwen2-7B-Instruct & 64.2 & 56.8 & 66.4 & 74.8 & 64.9 \\
\quad HaluEval\_qa & 63.5 & - & - & - & - \\
\quad RAGTruth\_qa & 64.9 & - & - & - & - \\
\quad FinanceBench & 61.2 & - & - & - & - \\
\quad DROP & 66.3 & - & - & - & - \\
\quad CovidQA & 62.6 & - & - & - & - \\
\quad PubMedQA & 65.5 & - & - & - & - \\
\quad ASQA & 64.0 & - & - & - & - \\
\quad IfQA & 64.8 & - & - & - & - \\
\quad FIB & - & 56.7 & - & - & - \\
\quad HaluEval\_sum & - & 56.9 & - & - & - \\
\quad WebNLG & - & - & 66.4 & - & - \\
\quad RAGTruth\_d2t & - & - & 66.4 & - & - \\
\quad PDC & - & - & - & 74.9 & - \\
\quad WMT21 & - & - & - & 74.7 & - \\ \hline
Qwen2-72B-Instruct & 82.7 & 73.6 & 77.0 & 82.1 & 80.5 \\
\quad HaluEval\_qa & 81.7 & - & - & - & - \\
\quad RAGTruth\_qa & 82.9 & - & - & - & - \\
\quad FinanceBench & 80.7 & - & - & - & - \\
\quad DROP & 84.1 & - & - & - & - \\
\quad CovidQA & 82.2 & - & - & - & - \\
\quad PubMedQA & 83.3 & - & - & - & - \\
\quad ASQA & 82.5 & - & - & - & - \\
\quad IfQA & 83.2 & - & - & - & - \\
\quad FIB & - & 73.7 & - & - & - \\
\quad HaluEval\_sum & - & 73.4 & - & - & - \\
\quad WebNLG & - & - & 77.0 & - & - \\
\quad RAGTruth\_d2t & - & - & 77.1 & - & - \\
\quad PDC & - & - & - & 82.6 & - \\
\quad WMT21 & - & - & - & 81.5 & - \\ \hline
Qwen2.5-7B-Instruct & 71.6 & 66.1 & 72.8 & 80.9 & 72.3 \\
\quad HaluEval\_qa & 71.1 & - & - & - & - \\
\quad RAGTruth\_qa & 72.2 & - & - & - & - \\
\quad FinanceBench & 68.7 & - & - & - & - \\
\quad DROP & 73.0 & - & - & - & - \\
\quad CovidQA & 70.1 & - & - & - & - \\
\quad PubMedQA & 72.5 & - & - & - & - \\
\quad ASQA & 71.7 & - & - & - & - \\
\quad IfQA & 72.0 & - & - & - & - \\
\quad FIB & - & 66.7 & - & - & - \\
\quad HaluEval\_sum & - & 65.4 & - & - & - \\
\quad WebNLG & - & - & 72.8 & - & - \\
\quad RAGTruth\_d2t & - & - & 72.8 & - & - \\
\quad PDC & - & - & - & 80.6 & - \\
\quad WMT21 & - & - & - & 81.2 & - \\ \hline
Qwen2.5-14B-Instruct & 79.8 & 73.1 & 79.6 & 87.2 & 79.8 \\
\quad HaluEval\_qa & 79.1 & - & - & - & - \\
\quad RAGTruth\_qa & 80.4 & - & - & - & - \\
\quad FinanceBench & 76.7 & - & - & - & - \\
\quad DROP & 81.3 & - & - & - & - \\
\quad CovidQA & 78.8 & - & - & - & - \\
\quad PubMedQA & 80.4 & - & - & - & - \\
\quad ASQA & 79.6 & - & - & - & - \\
\quad IfQA & 79.5 & - & - & - & - \\
\quad FIB & - & 73.6 & - & - & - \\
\quad HaluEval\_sum & - & 72.5 & - & - & - \\
\quad WebNLG & - & - & 79.6 & - & - \\
\quad RAGTruth\_d2t & - & - & 79.6 & - & - \\
\quad PDC & - & - & - & 86.8 & - \\
\quad WMT21 & - & - & - & 87.6 & - \\ \hline
Qwen2.5-72B-Instruct & {\ul 85.7} & {\ul 74.7} & 78.7 & 86.6 & 83.3 \\
\quad HaluEval\_qa & {\ul 84.9} & - & - & - & - \\
\quad RAGTruth\_qa & {\ul 86.2} & - & - & - & - \\
\quad FinanceBench & 83.1 & - & - & - & - \\
\quad DROP & {\ul 86.4} & - & - & - & - \\
\quad CovidQA & 84.7 & - & - & - & - \\
\quad PubMedQA & {\ul 86.0} & - & - & - & - \\