OBLIQ-Bench: When Retrievers Can't Find What LLMs Can Verify
OBLIQ-Bench exposes a critical gap in modern information retrieval: current systems fail to surface documents for queries based on implicit attributes, abstract relationships, or latent structure, even though reasoning language models can easily verify relevance once candidates are presented. The benchmark introduces oblique queries across five real-world tasks, revealing that dense retrievers, late-interaction models, and even agentic multi-hop systems achieve near-zero recall on queries defined by authorial style, behavioral patterns, or analogical reasoning, while verifier models reach near-perfect accuracy on the same tasks given proper candidate pools.
Current retrieval systems cannot find documents that language models can instantly verify as relevant. This asymmetry reveals a fundamental bottleneck: when relevance depends on implicit attributes or abstract relationships rather than surface keywords, even state-of-the-art dense retrievers return nearly empty results.
OBLIQ-Bench introduces three categories of oblique queries that verification models solve but retrievers cannot. Descriptive queries target implicit behavioral features, such as ironic stance in tweets. Analogue queries seek abstract structural similarities, such as math problems sharing the same proof technique across different topics. Tip-of-the-tongue queries seek specific passages from partial, indirect recollections that deliberately omit direct lexical clues. Illustrative examples of each category follow below.
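To make the taxonomy concrete, here is a small set of invented queries in the spirit of each category. These are illustrative examples written for this summary, not items from the benchmark itself:

```python
# Hypothetical examples of the three oblique query categories.
# Invented for exposition; not actual OBLIQ-Bench queries.
oblique_query_examples = {
    # Descriptive: relevance hinges on an implicit behavioral attribute.
    "descriptive": "Find tweets whose author takes an ironic stance "
                   "toward the event they describe.",
    # Analogue: relevance is an abstract structural parallel.
    "analogue": "Find math problems solved by the same exchange "
                "argument as this scheduling problem, in any topic area.",
    # Tip-of-the-tongue: partial recollection, no direct lexical clues.
    "tip_of_the_tongue": "A passage where a narrator revisits a childhood "
                         "kitchen and feels a vague guilt; I recall no "
                         "names, places, or distinctive phrases.",
}
```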
The quantitative gap is stark. Across all five benchmark tasks, the best dense retrievers, late-interaction models, and agentic systems achieve NDCG@10 below 0.22. The oracle verifier, a reasoning language model given proper candidate pools, reaches 0.33 to 0.91 on the same queries. On congressional hearing retrieval, classic and modern retrievers yield near-zero Recall@10 and Recall@50, while the verifier approaches perfect recovery when the gold documents are present in its candidate pool.
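For reference, here is a minimal sketch of the two metrics cited above, assuming binary relevance labels (the benchmark's exact gain scheme may differ):

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of relevant documents appearing in the top-k results."""
    relevant_ids = set(relevant_ids)
    hits = set(ranked_ids[:k]) & relevant_ids
    return len(hits) / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """NDCG@k with binary gains: 1 for relevant documents, 0 otherwise."""
    relevant_ids = set(relevant_ids)
    dcg = sum(1.0 / math.log2(rank + 2)          # rank 0 -> discount log2(2)
              for rank, doc_id in enumerate(ranked_ids[:k])
              if doc_id in relevant_ids)
    # Ideal DCG: all relevant documents ranked first.
    ideal = sum(1.0 / math.log2(rank + 2)
                for rank in range(min(len(relevant_ids), k)))
    return dcg / ideal if ideal > 0 else 0.0
```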
The benchmark construction pipeline ensures robust relevance labeling for latent tasks. A human defines an implicit attribute; an LLM annotates the entire corpus through that lens; a clustering step groups documents with similar attribute values; and the LLM then generates abstract queries while forbidding source vocabulary. This methodology enables large-scale, high-recall annotation where traditional pooling methods would fail.
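A schematic sketch of that pipeline is below. The helper callables (`llm_annotate`, `cluster_value`, `llm_write_query`) are hypothetical stand-ins for the authors' actual models and prompts:

```python
# Sketch of the construction pipeline, under the assumption that each
# stage is a pluggable callable. Not the authors' implementation.
from collections import defaultdict

def build_oblique_queries(corpus, attribute,
                          llm_annotate, cluster_value, llm_write_query):
    # 1. An LLM annotates every document through the attribute's lens.
    labels = {doc_id: llm_annotate(text, attribute)
              for doc_id, text in corpus.items()}

    # 2. Clustering groups documents sharing similar attribute values.
    clusters = defaultdict(list)
    for doc_id, value in labels.items():
        clusters[cluster_value(value)].append(doc_id)

    # 3. For each cluster, an LLM writes an abstract query while banned
    #    from reusing vocabulary drawn from the source documents.
    queries = {}
    for cluster_id, doc_ids in clusters.items():
        banned = {word for d in doc_ids for word in corpus[d].split()}
        query = llm_write_query(attribute, doc_ids, banned_words=banned)
        queries[cluster_id] = (query, doc_ids)  # members serve as golds
    return queries
```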
No amount of query rewriting or multi-hop agent scaffolding closes the gap when relevance is defined by distributed or highly abstract meta-labels. This failure reveals that dual-encoder and late-interaction architectures are structurally incapable of exposing latent signals without full query-document context at retrieval time. Future systems must internalize attribute extraction and align corpus indexing with implicit signal access.
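The structural point can be stated in code: a dual-encoder must commit to a document representation at indexing time, before any query exists, whereas a verifier scores the pair jointly and can extract the latent attribute on demand. A minimal sketch, where the encoder and judge functions are placeholders for real models:

```python
import numpy as np

def dual_encoder_score(query, doc, encode_query, encode_doc):
    # The document embedding is fixed at indexing time, with no knowledge
    # of which latent attribute a future query will ask about.
    q_vec = encode_query(query)
    d_vec = encode_doc(doc)
    return float(np.dot(q_vec, d_vec))

def verifier_score(query, doc, llm_judge):
    # The verifier reads query and document together, so it can reason
    # about implicit attributes (e.g., ironic stance) per query.
    prompt = f"Query: {query}\nDocument: {doc}\nIs it relevant? yes/no"
    return llm_judge(prompt)
```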
OBLIQ-Bench establishes that latent attribute retrieval is a rigorous and unsolved challenge, demanding architectural innovations beyond current parametric priors and surface decomposition. As reasoning models trivialize verification, the bottleneck shifts entirely to surfacing candidates defined by attributes that cannot be indexed through shallow correlation.