ARHN: Answer-Centric Relabeling of Hard Negatives with Open-Source LLMs for Dense Retrieval

Published 13 Apr 2026 in cs.IR | (2604.11092v1)

Abstract: Neural retrievers are often trained on large-scale triplet data comprising a query, a positive passage, and a set of hard negatives. In practice, hard-negative mining can introduce false negatives and other ambiguous negatives, including passages that are relevant or contain partial answers to the query. Such label noise yields inconsistent supervision and can degrade retrieval effectiveness. We propose ARHN (Answer-centric Relabeling of Hard Negatives), a two-stage framework that leverages open-source LLMs to refine hard negative samples using answer-centric relevance signals. In the first stage, for each query-passage pair, ARHN prompts the LLM to generate a passage-grounded answer snippet or to indicate that the passage does not support an answer. In the second stage, ARHN applies an LLM-based listwise ranking over the candidate set to order passages by direct answerability to the query. Passages ranked above the original positive are relabeled to additional positives. Among passages ranked below the positive, ARHN excludes any that contain an answer snippet from the negative set to avoid ambiguous supervision. We evaluated ARHN on the BEIR benchmark under three configurations: relabeling only, filtering only, and their combination. Across datasets, the combined strategy consistently improves over either step in isolation, indicating that jointly relabeling false negatives and filtering ambiguous negatives yields cleaner supervision for training neural retrieval models. By relying strictly on open-source models, ARHN establishes a cost-effective and scalable refinement pipeline suitable for large-scale training.


Authors (7)

Summary

The paper introduces ARHN, which refines mined hard negatives by extracting answer snippets and using them to relabel or filter negatives for cleaner supervision.
It employs a two-stage pipeline that first extracts answer evidence with an open-source LLM and then listwise-reranks the snippets to decide which negatives to promote to positives and which to drop.
Empirical results demonstrate improved nDCG@10 and OOD generalization across multiple datasets using scalable LLM-driven methods.


Motivation and Problem Setting

Training neural text retrievers for IR, RAG, and QA tasks typically employs contrastive learning with large triplet collections: a query, a positive passage, and multiple hard negative passages. While hard negatives (difficult non-relevant candidates) are intended to maximize discriminative signal, standard mining techniques frequently yield label contamination—passages annotated as negatives despite actually being relevant or partially answer-supporting. Such false negatives distort the representation geometry, degrade generalization, and are especially problematic for zero-shot and OOD retrieval due to compromised supervision.

The paper proposes ARHN (Answer-centric Relabeling of Hard Negatives), a systematic approach to correcting the label noise introduced by hard-negative mining, using open-source LLMs for answer-aware relabeling. ARHN applies two answer-centric interventions: (1) relabeling answer-containing negatives as additional positives and (2) removing ambiguously relevant negatives to construct a cleaner supervision set.

Figure 1: Schematic motivation for ARHN—standard hard-negative mining introduces answer-bearing false negatives which ARHN either relabels as positives or removes as ambiguous negatives.

ARHN Methodology

The ARHN pipeline is two-stage and LLM-driven:

Stage 1—Answer Snippet Extraction: For each (query, passage) pair, an open-source LLM is prompted to identify a contiguous answer-supporting span as a verbatim snippet, or output NO_ANSWER if the passage lacks such evidence. This ensures that relabeling is grounded in explicit, document-derived support, reducing LLM hallucinations and focusing solely on extractive evidence.
Stage 2—Answer-Centric Listwise Reranking: All extracted answer snippets for a given query (including those from the positive and the hard negatives) are listwise reranked by the LLM according to directness and sufficiency of answer support. Negatives whose snippets rank above the original positive's snippet are promoted to positives; those ranked below it but still containing a snippet are filtered out as ambiguous. Only negatives marked NO_ANSWER are retained as true negatives.
Figure 2: End-to-end ARHN workflow—LLMs extract answer snippets or NO_ANSWER (Stage 1), after which snippets are listwise reranked and labels reconstructed (Stage 2).
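
The label-reconstruction rule described in Stage 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Candidate` and `RefinedLabels` types and the `refine` function are hypothetical names, the Stage 1 snippets and the Stage 2 ranking are assumed to have been produced by the LLM already, and the original positive is assumed to have a snippet and therefore to appear in the ranking.

```python
from dataclasses import dataclass, field

NO_ANSWER = "NO_ANSWER"  # sentinel emitted in Stage 1 when no span supports an answer


@dataclass
class Candidate:
    pid: str
    snippet: str               # Stage 1 output: verbatim span or NO_ANSWER
    is_positive: bool = False  # True only for the originally labeled positive


@dataclass
class RefinedLabels:
    positives: list = field(default_factory=list)
    negatives: list = field(default_factory=list)
    filtered: list = field(default_factory=list)


def refine(candidates, ranking):
    """Rebuild labels for one query.

    ranking: passage ids of snippet-bearing candidates, best first
             (assumed to be the Stage 2 listwise output).
    """
    original = next(c for c in candidates if c.is_positive)
    rank_of = {pid: i for i, pid in enumerate(ranking)}
    # Assumption: the original positive has a snippet and was ranked.
    positive_rank = rank_of[original.pid]

    out = RefinedLabels(positives=[original.pid])
    for c in candidates:
        if c.is_positive:
            continue
        if c.snippet == NO_ANSWER:
            out.negatives.append(c.pid)   # retained as a true negative
        elif rank_of.get(c.pid, len(ranking)) < positive_rank:
            out.positives.append(c.pid)   # ranked above the positive: relabel
        else:
            # Ranked below the positive but answer-bearing: drop as ambiguous.
            # Assumption: candidates missing from the ranking land here too.
            out.filtered.append(c.pid)
    return out


candidates = [
    Candidate("p0", "Paris", is_positive=True),
    Candidate("n1", "Paris is the capital of France"),
    Candidate("n2", NO_ANSWER),
    Candidate("n3", "a city in France"),
]
labels = refine(candidates, ranking=["n1", "p0", "n3"])
print(labels)
```

In this toy query, `n1` outranks the original positive and is promoted, `n3` carries a snippet but ranks lower and is dropped, and `n2` remains the only negative used for training.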

This snippet-centric operationalization contrasts with prior passage-level relabeling (which is confounded by shared background and verbose context), yielding more focused and less ambiguous supervision correction.

Figure 3: LLM prompt design for snippet extraction and snippet-centric listwise reranking, facilitating interpretable answers and transparent passage assessment.

Empirical Evaluation

The authors benchmark ARHN on BEIR across 16 datasets, employing E5-base and LG-ANNA-Embedding (Mistral-7B) as retrievers. They compare three main variants:

Relabeling only: promote false negatives ranked above the original positive to positives,
Filtering only: remove ambiguous negatives,
Combined (R+F): both relabel false negatives and filter ambiguous negatives.

Key empirical findings:

On E5-base, the combined R+F strategy increases nDCG@10 from 0.508 to 0.521 (Avg. 16 datasets), outperforming no-refinement, prior relabeling (RLHN, 0.515), and advanced hard-negative mining baselines.
OOD (out-of-domain) generalization improves more substantially: nDCG@10 (Avg. 7 OOD datasets) increases from 0.425 to 0.446.
The improvements persist with an LLM-based embedder backbone, evidencing architecture-agnostic gains.
Larger open-source LLMs yield more reliable and stable refinements. With Qwen3-32B (versus Qwen3-8B or Qwen3-14B), refinement precision and nDCG@10 both improve, with smaller models frequently introducing spurious relabeling or over-filtering.
Figure 4: Retrieval quality (nDCG@10) monotonically improves as the refinement LLM scale increases, peaking with Qwen3-32B.
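
For readers unfamiliar with the metric reported above, nDCG@10 can be computed as in the sketch below. It assumes the linear-gain form of DCG (gain equal to the relevance grade, discounted by log2 of rank + 1); the summary does not state which evaluation tooling or gain function the authors used, and the function names here are illustrative.

```python
import math


def dcg_at_k(gains, k):
    # Rank positions are 1-based, so position i (0-based) is discounted by log2(i + 2).
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids, qrels, k=10):
    """ranked_ids: retrieved passage ids, best first.
    qrels: mapping from passage id to relevance grade (unlisted ids count as 0)."""
    gains = [qrels.get(pid, 0) for pid in ranked_ids]
    ideal = sorted(qrels.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


qrels = {"a": 1, "b": 1}
score = ndcg_at_k(["a", "x", "b"], qrels)
print(round(score, 4))
```

The benchmark figures quoted above are averages of this per-query score over each dataset's queries and then over datasets.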

Cohen's $\kappa$ for LLM/human agreement on 500 relabeling cases increases with LLM scale (from 0.31 to 0.37), underscoring the importance of LLM capacity for supervision reliability.
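
As a reminder of what the agreement statistic measures, Cohen's kappa compares observed agreement with the agreement expected by chance from each annotator's label frequencies. The sketch below is a generic implementation on toy labels, not the authors' evaluation script; `cohens_kappa` and the example labels are illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    p_o = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[l] * count_b[l] for l in count_a.keys() | count_b.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


llm_labels = [1, 1, 0, 0]    # toy data: 1 = relabel as positive, 0 = keep as negative
human_labels = [1, 0, 0, 0]
kappa = cohens_kappa(llm_labels, human_labels)
print(kappa)
```

Here the annotators agree on 3 of 4 items (0.75) while chance agreement is 0.5, giving a kappa of 0.5.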

Comparative and Qualitative Analysis

The snippet-centric protocol (ARHN) consistently outperforms passage-centric relabeling (PRHN), especially on OOD datasets and in settings where background context or question ambiguity misleads passage-level judgments. Answer-centric signals are critical for separating partial or ambiguous negatives from true negatives. Qualitative error analysis highlights frequent positive-negative label contamination, often with negatives containing answer spans that are near-identical or highly complementary to those in the labeled positives. ARHN mitigates these errors by promoting negatives whose snippets outrank the original positive and excising the remaining answer-bearing negatives from the negative set.

Practical and Theoretical Implications

Practically, ARHN synthesizes the advantages of answer grounding (extractive answer snippets, not just passage similarity) with cost-effective open-source LLMs, eliminating reliance on black-box commercial APIs. The pipeline is reproducible, high-throughput, and adaptable to arbitrary training sets and retrieval architectures.

Theoretically, the answer-centric paradigm reframes noisy contrastive learning from a passive data selection task to an evidence-centric, listwise signal reconstruction problem, leveraging evolving LLMs as scalable semantic annotators. By dissecting the hierarchy of snippet granularity, ARHN exposes the multifaceted nature of 'relevance' and the brittleness of purely instance-level annotation, especially in low-data or adversarial generalization regimes.

Future Directions

Open-source LLM scale remains a limiting factor. As open models continue to close the gap with commercial LLMs, further performance and consistency gains are expected.
There's potential to extend ARHN to non-extractive QA, multi-hop reasoning, and domains where full supervision transfer is intractable.
Integrating ARHN with dynamic data curation pipelines, continual learning, and in-deployment model monitoring can further tighten the data-model alignment loop.

Conclusion

ARHN delivers a robust framework for the correction of hard-negative label noise in dense text retrieval. By extracting answer snippets and leveraging listwise, answer-centric relabeling and filtering with open-source LLMs, it constructs higher-precision supervision signals, increases OOD generalization, and is largely backbone-agnostic. The results substantiate that principled answer-centric supervision—rather than heuristic instance-level mining—should form the foundation for future large-scale IR and RAG retrieval pipelines (2604.11092).
