
Bi'an: A Bilingual Benchmark and Model for Hallucination Detection in Retrieval-Augmented Generation (2502.19209v1)

Published 26 Feb 2025 in cs.CL

Abstract: Retrieval-Augmented Generation (RAG) effectively reduces hallucinations in LLMs but can still produce inconsistent or unsupported content. Although LLM-as-a-Judge is widely used for RAG hallucination detection due to its implementation simplicity, it faces two main challenges: the absence of comprehensive evaluation benchmarks and the lack of domain-optimized judge models. To bridge these gaps, we introduce \textbf{Bi'an}, a novel framework featuring a bilingual benchmark dataset and lightweight judge models. The dataset supports rigorous evaluation across multiple RAG scenarios, while the judge models are fine-tuned from compact open-source LLMs. Extensive experimental evaluations on Bi'anBench show our 14B model outperforms baseline models with over five times larger parameter scales and rivals state-of-the-art closed-source LLMs. We will release our data and models soon at https://github.com/OpenSPG/KAG.

'{"REASONING": ["The DOCUMENT indicates that the Arrêt de Mérindol was issued on January 1, 1545.", "The DOCUMENT also mentions that several villages were devastated, but it does not specify the exact timing of the devastation in relation to the Arrêt de Mérindol.", "The QUESTION asks what happened second between the Arrêt de Mérindol and the devastation of villages.", "The ANSWER provided is the devastation of villages.", "Since the DOCUMENT does not provide the exact timing of the devastation of villages, it cannot be concluded that it happened after the Arrêt de Mérindol."], "SCORE": "FAIL"}', 'rejected': '{"REASONING": ["The DOCUMENT states that the French king, Francis I, issued the "Arrêt de Mérindol" on January 1, 1545, and armed a crusade against the Waldensians of Provence.", "The DOCUMENT also mentions that deaths ranged from hundreds to thousands, and several villages were devastated.", "The QUESTION asks what happened second: Arrêt de Mérindol or devastation of villages.", "The ANSWER provided is Arrêt de Mérindol.", "Based on the DOCUMENT, the Arrêt de Mérindol occurred before the devastation of villages."], "SCORE": "PASS"}'} \\
& \multirow{15}{*}{ZH} & {'input': 'Given the following DOCUMENT and SUMMARY, you must analyze the provided SUMMARY and determine whether it is faithful to the content of the DOCUMENT.\textbackslash{}n The SUMMARY must not provide new information beyond the context given in the DOCUMENT. The SUMMARY must also not contradict information provided in the DOCUMENT. Output your final judgment strictly in the following format: "PASS" if the SUMMARY is faithful to the DOCUMENT; "FAIL" if the SUMMARY is not faithful to the DOCUMENT.\textbackslash{}n --\textbackslash{}n DOCUMENT: Agent: May I ask what problem you have run into that I can help you handle or resolve?\textbackslash{}n User: I would like to ask when my invoice will be mailed back.\textbackslash{}n Agent: The electronic invoice for this order has already been issued.\textbackslash{}n Agent: On PC: My JD - Customer Service - My Invoices - download it from the invoice details; in the app: Me - Customer Service - Invoice Service - view the invoice details.\textbackslash{}n User: I need a paper special VAT invoice.\textbackslash{}n User: Send me my orders again and I will pick one.\textbackslash{}n Agent: [order number] Is it this order?\textbackslash{}n User: No.\textbackslash{}n User: [order number]\textbackslash{}n User: It is this one.\textbackslash{}n Agent: [order number] is the waybill number for the invoice; it is already out for delivery.\textbackslash{}n User: Can I check where it is now?\textbackslash{}n Agent: [station] Your order has finished shipping from JD [station] and is being prepared for delivery to JD [station].\textbackslash{}n User: OK.\textbackslash{}n User: Can it arrive tomorrow?\textbackslash{}n Agent: I cannot guarantee that.\textbackslash{}n User: OK, thank you, [name].\textbackslash{}n Agent: You are welcome; it is what I should do.\textbackslash{}n Agent: Is there anything else I can help you with?\textbackslash{}n Agent: If not, please remember to leave a review for [name]!\textbackslash{}n \textbackslash{}n--\textbackslash{}n SUMMARY: The user asks when their invoice can be mailed back, and asks whether it can arrive tomorrow. The agent replies that the invoice waybill has finished shipping from JD [station] and is being prepared for delivery to JD [station], but cannot guarantee the user that it will arrive tomorrow.\textbackslash{}n --\textbackslash{}n Your output should be in JSON format with the keys "REASONING" and "SCORE": {"REASONING": <the step-by-step points of how you reasoned to your judgment>, "SCORE": <your final judgment>}', 'chosen': '{"REASONING": ["The user asks when the invoice will be mailed back; in the DOCUMENT the agent replies that the electronic invoice has already been issued and provides a way to download it.", "As to whether the agent guarantees the invoice will arrive tomorrow, in the DOCUMENT the agent states that this cannot be guaranteed.", "The SUMMARY omits important information such as the user needing a paper special VAT invoice and the agent providing the order-number information.", "This shows the SUMMARY is not sufficiently faithful."], "SCORE": "FAIL"}', 'rejected': '{"REASONING": ["The user asks when the invoice will be mailed back; in the DOCUMENT the agent replies that the electronic invoice has already been issued and provides a way to download it.", "The user asks whether it can arrive tomorrow; in the DOCUMENT the agent states that this cannot be guaranteed.", "The content mentioned in the SUMMARY is largely consistent with the dialogue in the DOCUMENT, but it is not sufficient.", "The SUMMARY omits important information such as the user needing a paper special VAT invoice and the agent providing the order-number information.", "This shows the SUMMARY is not sufficiently faithful."], "SCORE": "FAIL"}'} \\
\bottomrule
\end{tabular}
}
\caption{Examples from the training dataset (the ZH example is shown in English translation).}
\label{tab:train_example}
\end{table}
\end{CJK*}
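Each training record above pairs one shared judge prompt (`input`) with a preferred (`chosen`) and a dispreferred (`rejected`) response, the usual layout for DPO preference data. As a minimal sketch (not the authors' actual tooling), a record in the EN format of Table \ref{tab:train_example} can be validated like this; the helper name and the abridged prompt string are illustrative only:

```python
import json

def validate_preference_record(record: dict) -> bool:
    """Check that a DPO record has a shared 'input' prompt plus 'chosen'/'rejected'
    responses that parse as JSON with a step-list REASONING and a PASS/FAIL SCORE,
    matching the EN examples in the training-data table."""
    if not {"input", "chosen", "rejected"} <= record.keys():
        return False
    for key in ("chosen", "rejected"):
        try:
            judgment = json.loads(record[key])
        except json.JSONDecodeError:
            return False
        if not isinstance(judgment.get("REASONING"), list):
            return False
        if judgment.get("SCORE") not in ("PASS", "FAIL"):
            return False
    return True

# Illustrative record with an abridged prompt.
record = {
    "input": "Given the DOCUMENT, QUESTION and ANSWER, judge the ANSWER ...",
    "chosen": json.dumps({"REASONING": ["step 1", "step 2"], "SCORE": "FAIL"}),
    "rejected": json.dumps({"REASONING": ["step 1"], "SCORE": "PASS"}),
}
```

Note that for faithfulness judging, `chosen` and `rejected` may carry the same final SCORE and differ only in reasoning quality, as in the ZH example above.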

\subsection{Two-Stage Training Process} During the SFT stage, we use a learning rate of 1e-5, a batch size of 4, and train for 3 epochs. In the DPO stage, we set the beta value to 0.1, the learning rate to 5e-7, and the batch size to 4, and train for 3 epochs. For the LoRA configuration, we set r=16, LoRA\_alpha=32, and LoRA\_dropout=0.05, and fine-tune only the Q, K, V, and O projection matrices.
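The section above gives the DPO hyperparameters but not the objective itself. For reference, a minimal sketch of the standard DPO loss with the beta value of 0.1 stated above; the function name and the example log-probabilities are hypothetical, and this is a scalar illustration rather than the authors' training code:

```python
import math

def dpo_loss(beta: float,
             policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float) -> float:
    """Standard DPO loss: -log sigmoid(beta * (policy margin - reference margin)),
    where each margin is log p(chosen) - log p(rejected)."""
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    logits = beta * (policy_margin - ref_margin)
    # -log(sigmoid(x)) rewritten as log(1 + exp(-x))
    return math.log1p(math.exp(-logits))

# At zero margin the loss is log(2); it falls below log(2) as the policy
# separates chosen from rejected more than the reference model does.
baseline = dpo_loss(0.1, 0.0, 0.0, 0.0, 0.0)          # == log(2)
improved = dpo_loss(0.1, -1.0, -3.0, -2.0, -2.0)      # < log(2)
```

In practice the per-token log-probabilities come from the policy and a frozen reference model; libraries such as TRL implement exactly this objective, with `beta` as the corresponding config field.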

\subsection{Inference Details} When conducting RAG hallucination detection inference, we use the same prompt templates as during training. During inference, we set temperature=0.01 and top\_p=0.1. Since the model's output is in JSON format, we parse it with regular expressions. All computation is performed on a single NVIDIA A100-80G GPU.
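The regex-based parsing mentioned above can be sketched as follows, assuming the judge emits a single JSON object that may be surrounded by extra text; the function name is hypothetical, and the key names follow the EN examples in Table \ref{tab:train_example}:

```python
import json
import re

def parse_judge_output(raw: str):
    """Extract the first {...} span from the model output and parse it as JSON.
    Returns the parsed dict, or None when no well-formed JSON object is found."""
    # Greedy match spans from the first '{' to the last '}', so nested braces
    # inside the object are kept; DOTALL lets the object span multiple lines.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

raw_output = 'Here is my verdict: {"REASONING": ["The ANSWER is supported."], "SCORE": "PASS"}'
```

Returning `None` on malformed output (rather than raising) keeps batch evaluation loops simple: unparsable generations can be counted and skipped.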

\section{Experiment Results} \label{app:exp} Table \ref{tab:exp_en} and Table \ref{tab:exp_zh} present the detailed experimental results for Bi'anBench\_EN and Bi'anBench\_ZH, respectively.

\begin{table*}[]
\scalebox{0.75}{
\begin{tabular}{cccccc}
Model & \multicolumn{4}{c}{Bi'anBench\_EN} & \\ \cline{2-5}
 & QA & Summarization & Data-to-Text & Machine Translation & Avg. \\ \hline
GPT-4o-0806 & 86.6 & 75.5 & 85.6 & 86.4 & 84.8 \\
HaluEval\_qa & 83.8 & - & - & - & - \\
RAGTruth\_qa & 86.6 & - & - & - & - \\
FinanceBench & 86.3 & - & - & - & - \\
DROP & 86.5 & - & - & - & - \\
CovidQA & 86.6 & - & - & - & - \\
PubMedQA & 89.0 & - & - & - & - \\
ASQA & 86.3 & - & - & - & - \\
IfQA & 88.5 & - & - & - & - \\
FIB & - & 75.5 & - & - & - \\
HaluEval\_sum & - & 75.5 & - & - & - \\
WebNLG & - & - & 85.6 & - & - \\
RAGTruth\_d2t & - & - & 85.5 & - & - \\
PDC & - & - & - & 86.5 & - \\
WMT21 & - & - & - & 86.4 & - \\ \hline
GPT-4o-mini & 82.9 & 58.9 & 82.3 & 79.6 & 78.9 \\
HaluEval\_qa & 78.2 & - & - & - & - \\
RAGTruth\_qa & 84.2 & - & - & - & - \\
FinanceBench & 76.5 & - & - & - & - \\
DROP & 85.5 & - & - & - & - \\
CovidQA & 82.1 & - & - & - & - \\
PubMedQA & 84.3 & - & - & - & - \\
ASQA & 83.0 & - & - & - & - \\
IfQA & 84.4 & - & - & - & - \\
FIB & - & 59.6 & - & - & - \\
HaluEval\_sum & - & 58.3 & - & - & - \\
WebNLG & - & - & 82.3 & - & - \\
RAGTruth\_d2t & - & - & 82.3 & - & - \\
PDC & - & - & - & 80.0 & - \\
WMT21 & - & - & - & 79.2 & - \\ \hline
Llama3.1-8B-Instruct & 72.3 & 60.2 & 62.6 & 68.3 & 68.6 \\
HaluEval\_qa & 71.6 & - & - & - & - \\
RAGTruth\_qa & 73.3 & - & - & - & - \\
FinanceBench & 70.0 & - & - & - & - \\
DROP & 74.1 & - & - & - & - \\
CovidQA & 72.8 & - & - & - & - \\
PubMedQA & 72.7 & - & - & - & - \\
ASQA & 72.3 & - & - & - & - \\
IfQA & 72.3 & - & - & - & - \\
FIB & - & 60.7 & - & - & - \\
HaluEval\_sum & - & 59.7 & - & - & - \\
WebNLG & - & - & 62.6 & - & - \\
RAGTruth\_d2t & - & - & 62.6 & - & - \\
PDC & - & - & - & 67.7 & - \\
WMT21 & - & - & - & 68.9 & - \\ \hline
Llama3.1-70B-Instruct & 83.2 & 75.2 & 80.9 & 73.3 & 80.3 \\
HaluEval\_qa & 81.9 & - & - & - & - \\
RAGTruth\_qa & 85.0 & - & - & - & - \\
FinanceBench & 81.1 & - & - & - & - \\
DROP & 83.7 & - & - & - & - \\
CovidQA & 82.4 & - & - & - & - \\
PubMedQA & 83.9 & - & - & - & - \\
ASQA & 83.3 & - & - & - & - \\
IfQA & 83.6 & - & - & - & - \\
FIB & - & 75.2 & - & - & - \\
HaluEval\_sum & - & 75.2 & - & - & - \\
WebNLG & - & - & 80.9 & - & - \\
RAGTruth\_d2t & - & - & 80.9 & - & - \\
PDC & - & - & - & 73.4 & - \\
WMT21 & - & - & - & 73.2 & - \\ \hline
Qwen2-7B-Instruct & 64.2 & 56.8 & 66.4 & 74.8 & 64.9 \\
HaluEval\_qa & 63.5 & - & - & - & - \\
RAGTruth\_qa & 64.9 & - & - & - & - \\
FinanceBench & 61.2 & - & - & - & - \\
DROP & 66.3 & - & - & - & - \\
CovidQA & 62.6 & - & - & - & - \\
PubMedQA & 65.5 & - & - & - & - \\
ASQA & 64.0 & - & - & - & - \\
IfQA & 64.8 & - & - & - & - \\
FIB & - & 56.7 & - & - & - \\
HaluEval\_sum & - & 56.9 & - & - & - \\
WebNLG & - & - & 66.4 & - & - \\
RAGTruth\_d2t & - & - & 66.4 & - & - \\
PDC & - & - & - & 74.9 & - \\
WMT21 & - & - & - & 74.7 & - \\ \hline
Qwen2-72B-Instruct & 82.7 & 73.6 & 77.0 & 82.1 & 80.5 \\
HaluEval\_qa & 81.7 & - & - & - & - \\
RAGTruth\_qa & 82.9 & - & - & - & - \\
FinanceBench & 80.7 & - & - & - & - \\
DROP & 84.1 & - & - & - & - \\
CovidQA & 82.2 & - & - & - & - \\
PubMedQA & 83.3 & - & - & - & - \\
ASQA & 82.5 & - & - & - & - \\
IfQA & 83.2 & - & - & - & - \\
FIB & - & 73.7 & - & - & - \\
HaluEval\_sum & - & 73.4 & - & - & - \\
WebNLG & - & - & 77.0 & - & - \\
RAGTruth\_d2t & - & - & 77.1 & - & - \\
PDC & - & - & - & 82.6 & - \\
WMT21 & - & - & - & 81.5 & - \\ \hline
Qwen2.5-7B-Instruct & 71.6 & 66.1 & 72.8 & 80.9 & 72.3 \\
HaluEval\_qa & 71.1 & - & - & - & - \\
RAGTruth\_qa & 72.2 & - & - & - & - \\
FinanceBench & 68.7 & - & - & - & - \\
DROP & 73.0 & - & - & - & - \\
CovidQA & 70.1 & - & - & - & - \\
PubMedQA & 72.5 & - & - & - & - \\
ASQA & 71.7 & - & - & - & - \\
IfQA & 72.0 & - & - & - & - \\
FIB & - & 66.7 & - & - & - \\
HaluEval\_sum & - & 65.4 & - & - & - \\
WebNLG & - & - & 72.8 & - & - \\
RAGTruth\_d2t & - & - & 72.8 & - & - \\
PDC & - & - & - & 80.6 & - \\
WMT21 & - & - & - & 81.2 & - \\ \hline
Qwen2.5-14B-Instruct & 79.8 & 73.1 & 79.6 & 87.2 & 79.8 \\
HaluEval\_qa & 79.1 & - & - & - & - \\
RAGTruth\_qa & 80.4 & - & - & - & - \\
FinanceBench & 76.7 & - & - & - & - \\
DROP & 81.3 & - & - & - & - \\
CovidQA & 78.8 & - & - & - & - \\
PubMedQA & 80.4 & - & - & - & - \\
ASQA & 79.6 & - & - & - & - \\
IfQA & 79.5 & - & - & - & - \\
FIB & - & 73.6 & - & - & - \\
HaluEval\_sum & - & 72.5 & - & - & - \\
WebNLG & - & - & 79.6 & - & - \\
RAGTruth\_d2t & - & - & 79.6 & - & - \\
PDC & - & - & - & 86.8 & - \\
WMT21 & - & - & - & 87.6 & - \\ \hline
Qwen2.5-72B-Instruct & {\ul 85.7} & {\ul 74.7} & 78.7 & 86.6 & 83.3 \\
HaluEval\_qa & {\ul 84.9} & - & - & - & - \\
RAGTruth\_qa & {\ul 86.2} & - & - & - & - \\
FinanceBench & 83.1 & - & - & - & - \\
DROP & {\ul 86.4} & - & - & - & - \\
CovidQA & 84.7 & - & - & - & - \\
PubMedQA & {\ul 86.0} & - & - & - & - \\
ASQA & {\ul 8

Authors (4)
  1. Zhouyu Jiang
  2. Mengshu Sun
  3. Zhiqiang Zhang
  4. Lei Liang