
Irrelevant Query Neutrals (IQN) Overview

Updated 3 February 2026
  • Irrelevant Query Neutrals (IQN) are query or prompt instances distinguished by semantic irrelevance and annotation divergence, affecting model responses and retrieval performance.
  • IQN employs rigorous annotation protocols with multiple annotator inputs and agreement metrics like Fleiss’s κ to split true neutrals from conflicting cases.
  • In fields like IR and LLM evaluation, IQN informs synthetic query generation and statistical testing, leading to improved negative sampling and enhanced model robustness.

Irrelevant Query Neutrals (IQN) refer to categories or operationalizations of query, label, or prompt instances designated as "neutral" because semantically irrelevant variations, annotation divergences, or contextually orthogonal relationships lead to intermediate model behavior, annotation ambiguity, or indistinguishable model response or retrieval performance. The IQN concept appears in several research threads, including natural language inference (NLI), statistical hypothesis testing for generative models, and synthetic query generation for information retrieval (IR), with each area specifying the technical and empirical status of "irrelevant" or "neutral" cases via domain-appropriate protocols, agreement metrics, or test statistics (Nighojkar et al., 2023, Chaudhary et al., 2023, Acharyya et al., 13 Sep 2025).

1. Distinct Senses of Neutrality: Beyond the Catch-All

In classic NLI settings, the neutral label functions as a residual category in three-way classification (entailment, contradiction, neutral), meant to capture pairs neither entailed nor contradicted. However, empirical annotation analysis reveals that "neutral" conflates orthogonal phenomena:

  • True Neutral: Annotators unanimously agree that no relation can be inferred.
  • Conflicting Neutral (i.e., IQN in the NLI context): Annotators split between entailment and contradiction, leading the aggregate or majority label to default to "neutral" due to balancing, despite underlying disagreement.

Given N annotations with counts n_E, n_N, n_C, the agreement statistic A = max(n_E, n_N, n_C)/N isolates such cases. A "true neutral" arises if n_N > n_E, n_N > n_C, and A = 1.0; a "conflicting neutral" is defined analogously but with A < t for some threshold t < 1 (Nighojkar et al., 2023). This taxonomy is critical because it exposes low construct validity in prior labeling protocols and motivates finer-grained annotation and model evaluation schemes.
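The item-typing rule above can be sketched as a small function. This is a minimal illustration of the A statistic; the label strings and the default threshold t = 0.8 are illustrative choices, not prescribed by the cited paper:

```python
from collections import Counter

def classify_item(labels, t=0.8):
    """Type an NLI item from its per-annotator labels.

    labels: list of annotator labels drawn from {"E", "N", "C"}
            (entailment, neutral, contradiction).
    Returns "true_neutral", "conflicting_neutral", or "non_neutral",
    using the agreement statistic A = max(n_E, n_N, n_C) / N.
    """
    counts = Counter(labels)
    n = len(labels)
    n_e, n_n, n_c = counts["E"], counts["N"], counts["C"]
    a = max(n_e, n_n, n_c) / n
    # Neutral must be the strict plurality label for either neutral type.
    if n_n > n_e and n_n > n_c:
        if a == 1.0:
            return "true_neutral"      # unanimous: no relation inferable
        if a < t:
            return "conflicting_neutral"  # IQN: masked disagreement
    return "non_neutral"
```

For example, five unanimous "N" votes yield a true neutral, while a 3/1/1 split over N/E/C (A = 0.6 < 0.8) is typed as conflicting.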

2. Protocols and Metrics for Operationalizing IQN

Robust identification and use of IQN cases require explicit protocols:

  1. Annotation Procedures: Collect N ≥ 5 annotations per NLI item, recording both the categorical choice and a justification that determines whether "neutral" reflects lack of evidence or evidence for opposing relations.
  2. Agreement Metrics:
    • Fleiss's κ and Krippendorff's α provide reliability assessment, with standard cutoffs (κ, α > 0.6) indicating acceptable agreement.
  3. Item Typing: Using the A statistic and a configurable threshold t (e.g., t = 0.8), neutral-labeled items can be algorithmically split into true vs. conflicting (IQN) via per-item agreement.

This protocol is also relevant in IR and LLM prompt perturbation studies, where semantically irrelevant changes define family-level equivalency classes for statistical evaluation (Chaudhary et al., 2023, Acharyya et al., 13 Sep 2025).
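The reliability check in step 2 can be computed directly. The sketch below is a standard Fleiss's κ over an item-by-category count matrix, not code from the cited papers; it assumes every item received the same number of annotations:

```python
def fleiss_kappa(counts):
    """Fleiss's kappa for a ratings matrix.

    counts: one row per item; each row holds the number of annotators
    who assigned each category, and every row sums to the same n.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    # Per-item observed agreement P_i.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items
    # Chance agreement P_e from the marginal category proportions.
    total = n_items * n_raters
    p_j = [sum(row[j] for row in counts) / total
           for j in range(len(counts[0]))]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)
```

Perfect agreement (e.g., rows `[5, 0, 0]` and `[0, 5, 0]`) gives κ = 1; values above the 0.6 cutoff in step 2 would indicate acceptable reliability.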

3. IQN in Synthetic Query Generation and Information Retrieval

In the context of IR and zero-shot learning, the IQN concept is operationalized by constructing synthetic queries purposely labeled as irrelevant with respect to a given context:

  • Pairwise Generation: Rather than generating "irrelevant" queries ab initio, LLMs are conditioned on a previously generated relevant query for the same context, then prompted to create a distinct, non-relevant (neutral/irrelevant) query.
  • Modeling and Filtering: The approach involves a generation model G(·; ϕ) (e.g., PaLM-2) and a filtering model F(·; ψ) to enforce label fidelity, with a downstream relevance scorer R(q, d; θ) optimized with a cross-entropy loss on the synthetic triples.
  • Empirical Results: NDCG@10 gains of up to +7 over label-conditioned generation have been reported on BEIR and WANDS datasets, with pairwise generation producing harder negatives—i.e., on-topic but irrelevant queries that force finer decision boundaries (Chaudhary et al., 2023).

This demonstrates that the "relative" setup leverages the LLM's comparative reasoning, thus generating high-quality IQN instances and improving model robustness.
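The generate-then-filter loop above might be sketched as follows. The prompt wording, function names, and the stubbed callables standing in for G(·; ϕ) and F(·; ψ) are all hypothetical illustrations, not the paper's actual prompts or API:

```python
def generate_pair(context, generate, filter_ok):
    """Produce one (relevant, irrelevant) synthetic query pair.

    context: the document/passage to condition on.
    generate(prompt) -> str stands in for the generation model G.
    filter_ok(query, context, label) -> bool stands in for the filter F.
    Returns (relevant_query, irrelevant_query), or None if filtered out.
    """
    rel = generate(
        f"Write a query that this passage answers:\n{context}"
    )
    # Condition on the relevant query so the irrelevant one stays
    # on-topic but non-answerable: a hard negative, not a random one.
    irr = generate(
        f"Passage:\n{context}\nRelevant query: {rel}\n"
        "Write a different query on the same topic that this passage "
        "does NOT answer:"
    )
    if (filter_ok(rel, context, "relevant")
            and filter_ok(irr, context, "irrelevant")):
        return rel, irr
    return None
```

The surviving triples (context, relevant, irrelevant) would then feed the cross-entropy training of the relevance scorer R(q, d; θ).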

4. Statistical Testing with Irrelevant-Query-Neutral Perturbations

IQN acquires a formal statistical status in the evaluation of LLM response distribution invariance:

  • Definition and Null Construction: For a base query q_0, the IQN set Q_0 contains the queries that differ from q_0 by user-designated semantically irrelevant perturbations, forming the composite null H_0: min_{F ∈ F_0} d(F_{q'}, F) = 0 for some distributional metric d, where F_0 is the family of response distributions induced by Q_0.
  • Test Statistic and Estimation: In the Bernoulli response case, the response distributions F_q = Bernoulli(p_q) induce an interval [a, b] in parameter space. The statistic T_{m,r} = min_{1 ≤ j ≤ m} |p̂_j − p̂′| is compared to a rejection threshold ϵ.
  • Sampling Strategy: The protocol divides the budget ν between pilot estimation of [a, b] and optimized main-stage sampling, balancing Type I error, power, and computational cost. Asymptotic validity and consistency are established under independence and uniformity assumptions (Acharyya et al., 13 Sep 2025).

Practical recommendations emphasize explicit definition of Q_0, pilot estimation, and bootstrapping for error control, ensuring that only truly relevant distributional changes trigger rejection while suppressing false positives due to harmless, "neutral" prompt perturbations.
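The Bernoulli-case decision rule can be illustrated in a few lines. This is a sketch of the statistic T_{m,r} only; the pilot/main-stage budget split and the calibration of ϵ from the paper are omitted, and the reject-when-large direction is an assumption consistent with the composite null above:

```python
def iqn_test(p_hats, p_hat_prime, eps):
    """Minimum-distance test statistic over an IQN set.

    p_hats: estimated response probabilities p̂_j for the m queries in Q_0.
    p_hat_prime: estimated probability p̂′ for the query under test.
    eps: rejection threshold ϵ.
    Returns (T, reject): T = min_j |p̂_j − p̂′|, and reject=True when the
    tested query's response distribution deviates from every neutral
    variant by more than ϵ (evidence against invariance H_0).
    """
    t = min(abs(p - p_hat_prime) for p in p_hats)
    return t, t > eps
```

With estimates {0.50, 0.52, 0.48} for Q_0, a tested query at p̂′ = 0.51 sits inside the neutral band (T = 0.01, no rejection), while p̂′ = 0.90 would reject for any reasonable ϵ.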

5. Empirical Evidence and Benchmark Construction

Empirical studies reinforce the necessity and validity of IQN distinctions:

  • NLI Models: Experimental separation of true vs. conflicting neutrals in SNLI and MultiNLI produces consistently higher macro F_1 than disagreement-based schemes, indicating that explicit acknowledgment of IQN leads to more ecologically valid benchmarks and more discriminating models. For instance, "Neu" labeling (true/conflicting neutral separation) outperforms "Dis" labeling (catch-all disagreement) across multiple transformer architectures (Nighojkar et al., 2023).
  • Failure of Continuous Scales: UNLI's continuous-scale operationalization fails to resolve the IQN distinction, as neutral and conflicting cases collapse into bi-modal distributions indexed by slider position, undermining construct validity and transfer performance.
  • IR and LLMs: Pairwise synthetic IQN generation outperforms baseline (retriever-negative and label-conditioned) QGen approaches on NDCG metrics for IR, establishing the practical value of model-ready IQN distinctions (Chaudhary et al., 2023).

6. Recommendations for Framework Adoption and Future Directions

To integrate IQN in NLI, IR, or LLM evaluation pipelines:

  1. Annotation Transparency: Retain full per-annotator judgment sets to enable explicit IQN classification and post-hoc analysis.
  2. Explicit Instructions: Instruct annotators and models to distinguish between evidence-absence "neutral" and evidence-conflict (IQN).
  3. Thresholded Labeling: Use agreement statistics and rational thresholds for item typing, avoiding aggregate catch-all neutral labels.
  4. Comprehensive Evaluation: Report reliability coefficients (e.g., Fleiss's κ), per-class or per-subtype metrics (accuracy, F_1), and full annotator-distribution recovery (e.g., via Jensen–Shannon divergence).
  5. Retention of Ambiguity: Rather than discarding low-agreement or ambiguous (IQN) cases, include and label them to facilitate modeling of genuine uncertainty or conflicting evidence.
  6. Scale Validation: For continuous judgments, ensure each point is interpretable and empirically tested to obviate mid-scale confusion or construct artifacts.
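The distribution-recovery metric in recommendation 4 can be computed directly. Below is a minimal base-2 Jensen–Shannon divergence between a model's predicted label distribution and the empirical annotator distribution; it is a standard formula, not code from the cited papers:

```python
from math import log2

def js_divergence(p, q):
    """Jensen–Shannon divergence (base 2) between two label distributions.

    p, q: probability vectors over the label set (e.g., E/N/C), each
    summing to 1. Returns a value in [0, 1]; 0 means the model exactly
    recovers the annotator distribution.
    """
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability terms.
        return sum(x * log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Identical distributions score 0, fully disjoint ones score 1, so a model that spreads mass over E and C for a conflicting-neutral item is rewarded over one that concentrates on N.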

By adopting these protocols, benchmark designers and model evaluators will preserve the full nuance of human and model indeterminacy, avoid artificial inflation or deflation of model performance via category collapse, and support statistically principled inferences about model invariance under semantically irrelevant or neutral perturbations (Nighojkar et al., 2023, Chaudhary et al., 2023, Acharyya et al., 13 Sep 2025).
