Determine the hardest qrel matrix under sign-rank-based limits

Determine, for a fixed number of documents and a fixed number k of relevant documents per query, which binary query relevance (qrel) matrix is provably the hardest for single-vector embedding models to represent. Formally, identify the matrices that require the largest embedding dimension or that maximize sign-rank, and give a rigorous proof of that identification.
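One way to write the problem down, as a sketch under stated assumptions rather than the authors' own formulation: let A be the binary qrel matrix with m queries (rows) and n documents (columns), and let M = 2A - 1 be its ±1 version. The symbols A, M, m, n, and the notation rank_± are introduced here for illustration. The statement above fixes n and k only, so treating m as fixed is an additional assumption. The first line is the standard definition of sign-rank; the second restates the open problem as a maximization over matrices with exactly k relevant documents per row.

```latex
\[
\operatorname{rank}_{\pm}(M)
  \;=\; \min\bigl\{ \operatorname{rank}(B) \;:\; B \in \mathbb{R}^{m \times n},\
        \operatorname{sign}(B_{ij}) = M_{ij}\ \text{for all } i, j \bigr\},
\qquad M = 2A - \mathbf{1}_{m \times n},
\]
\[
A^{\star} \;\in\; \operatorname*{arg\,max}_{\substack{A \in \{0,1\}^{m \times n} \\ \sum_{j} A_{ij} = k \ \text{for all } i}}
  \operatorname{rank}_{\pm}\bigl(2A - \mathbf{1}_{m \times n}\bigr).
\]
```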

Background

In constructing the LIMIT dataset, the authors aimed to stress-test single-vector models by instantiating qrel matrices that are hard to represent. They conjecture that more interconnected (denser) qrel patterns are harder, but acknowledge that hardness is difficult to prove because the sign-rank of a specific matrix is notoriously hard to establish.

They explicitly state they could not prove which qrel matrix is hardest, leaving a concrete theoretical identification and proof as an open task.
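To make the "dense with all combinations" pattern concrete, the following minimal Python sketch builds a qrel matrix with one query for every k-subset of documents and checks whether a given set of embeddings represents it. The function names (`dense_qrels`, `to_sign_matrix`, `is_represented`) are hypothetical and are not taken from the LIMIT codebase. The check assumes dot-product scoring and requires every relevant document to score strictly above every non-relevant one for the same query.

```python
from itertools import combinations


def dense_qrels(n_docs, k):
    """Build a binary qrel matrix with one query row per k-subset of documents."""
    rows = []
    for relevant in combinations(range(n_docs), k):
        row = [0] * n_docs
        for doc in relevant:
            row[doc] = 1
        rows.append(row)
    return rows


def to_sign_matrix(qrels):
    """Map 0/1 relevance to -1/+1, the form used when discussing sign-rank."""
    return [[2 * x - 1 for x in row] for row in qrels]


def is_represented(qrels, query_vecs, doc_vecs):
    """Return True if, for every query, each relevant document has a strictly
    higher dot-product score than each non-relevant document."""
    for row, q in zip(qrels, query_vecs):
        scores = [sum(a * b for a, b in zip(q, d)) for d in doc_vecs]
        rel = [s for s, x in zip(scores, row) if x == 1]
        irr = [s for s, x in zip(scores, row) if x == 0]
        if rel and irr and min(rel) <= max(irr):
            return False
    return True


if __name__ == "__main__":
    qrels = dense_qrels(4, 2)  # all 2-subsets of 4 documents
    for row in to_sign_matrix(qrels):
        print(row)
```

A check like this can only confirm that a particular embedding of some dimension works. It cannot show that no lower-dimensional embedding exists, which is the part that requires a sign-rank lower bound and remains open.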

References

Although we could not prove the hardest qrel matrix definitively with theory (as the sign rank is notoriously hard to prove), we speculate based on intuition that our theoretical results imply that the more interconnected the qrel matrix is (e.g. dense with all combinations) the harder it would be for models to represent.

— On the Theoretical Limitations of Embedding-Based Retrieval (2508.21038 - Weller et al., 28 Aug 2025) in Section: The LIMIT Dataset, Dataset Construction
