TREC 2025 ToT Track Overview
- TREC 2025 ToT Track is a specialized IR challenge focused on retrieving entities from verbose, episodic, and uncertain queries when canonical identifiers are missing.
- It employs diverse methodologies, including sparse, dense, and hybrid retrieval models along with LLM-generated synthetic queries for system evaluation.
- The track highlights key challenges such as handling false memories, hedging, and complex relational clues, paving the way for advanced query understanding.
Tip-of-the-tongue (ToT) retrieval is a specialized information retrieval (IR) challenge focused on re-finding entities when the searcher cannot supply a reliable identifier. The TREC 2025 ToT track systematically evaluated retrieval systems on this task, expanding it into general-domain IR and introducing a spectrum of human- and LLM-generated queries. The task emphasizes verbose, uncertain, and error-prone queries, which strain the assumptions of standard sparse and dense retrieval models. TREC 2025 saw significant methodological innovation, strong multi-system participation, and the identification of key open challenges for IR research (Arguello et al., 28 Jan 2026; Zhou et al., 21 Jan 2026).
1. Task Definition and Unique Query Characteristics
The TREC 2025 ToT track addresses known-item retrieval when the searcher cannot recall a canonical identifier (e.g., movie title, celebrity name, landmark). Queries typically contain:
- Semantic and episodic memory cues: Features of the item, and contextual details such as where or when the entity was last experienced.
- Non-standard linguistic phenomena: Hedging or uncertainty indicators (“I think”), explicit negative constraints (“but not this one”), false memories (incorrect details), relative comparisons (“more like X than Y”), and social niceties.
- Query verbosity: ToT queries are substantially longer and more complex than typical keyword or natural language queries, often exceeding 300 characters.
These features fundamentally challenge standard IR systems, which rely on concise, focused queries and often lack mechanisms to process uncertainty, negate facts, or handle misremembered details (Arguello et al., 28 Jan 2026).
2. Data Resources and Test Query Design
Corpus
- The document collection is a 2023 English Wikipedia dump, filtered to approximately 6.4 million articles spanning 53 broad domains.
- Each article includes a unique identifier, title, URL, full text, and section markers.
- It is guaranteed that each query's correct answer is present in the corpus.
Training and Development Sets
- Training: 143 movie-domain ToT queries from the Microsoft MS-ToT Known-Item Retrieval Dataset, rooted in real-world ToT scenarios sourced from a user forum.
- Development: Three dev sets comprising two splits of MS-ToT movie queries from past TREC tracks and a synthetic set (movie, celebrity, landmark) created with LLMs.
Test Queries
A total of 622 test queries were constructed:
| Source | Domains | Quantity | Generation Protocol |
|---|---|---|---|
| MS-ToT | Movie | 172 | Sampled from MS-ToT dataset |
| Human-elicited | Movie, celebrity, landmark | 150 | Four-phase image-based protocol: image, recognizability check, memory check, verbose ToT query elicitation (≥300 chars), Wikipedia confirmation |
| Synthetic | 50 broad domains | 300 | 150 each by Llama-3.1-8B-Instruct and GPT-4o; LLMs prompted to create ToT-style 200-word forum posts per entity |
No additional corpus cleaning was mandated; participants could filter or post-process the collection as desired. An explicit warning was issued against using MS-ToT test queries for system tuning (Arguello et al., 28 Jan 2026).
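The synthetic-query protocol above (LLMs prompted to write ToT-style forum posts per entity) can be sketched as a prompt-construction step. The wording below is a hypothetical reconstruction for illustration, not the track's actual prompt; `build_tot_prompt` is an assumed helper name.

```python
# Sketch of the synthetic-query generation step: build a prompt asking an LLM
# to write a ToT-style forum post about a known entity without naming it.
# The prompt text is a hypothetical reconstruction, not the track's prompt.

def build_tot_prompt(entity_title: str, entity_summary: str, max_words: int = 200) -> str:
    """Assemble a prompt requesting a ToT-style forum post for one entity."""
    return (
        "You are a forum user who cannot remember the name of something.\n"
        f"Write a tip-of-the-tongue post of at most {max_words} words asking "
        f"for help identifying it. Never mention the name '{entity_title}'.\n"
        "Include hedged, uncertain memories ('I think...', 'maybe...'), one or "
        "two slightly wrong details, and where or when you encountered it.\n"
        f"Half-remembered facts about the item:\n{entity_summary}\n"
    )

prompt = build_tot_prompt(
    "The Matrix",
    "1999 science-fiction film about a simulated reality; green-tinted visuals.",
)
# The prompt would then be sent to Llama-3.1-8B-Instruct or GPT-4o via the
# provider's chat API to obtain the synthetic query text.
```

In the track's setup, 150 such posts were generated per model, paired with the source Wikipedia article as the gold answer.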
3. System Participation and Methodological Diversity
Nine research groups, plus three baseline submissions, contributed a total of 32 system runs. The approaches spanned:
- Baseline methods: BM25 (Anserini and PyTerrier) and a dual-encoder dense retriever (Lightning IR).
- Hybrid/fusion pipelines: SRCB fused sparse BM25 retrieval, dense embedding retrieval, and a learned reranker, leveraging external corpora for pretraining.
- Two-stage cascades: DS@GT implemented a multi-method first stage (LLM retrieval, BM25, dense embedding, topic-aware multi-index retrieval) with a subsequent LambdaMART or Gemini-2.5-flash LLM reranker (Zhou et al., 21 Jan 2026).
- Query-aware expansion: UVA ILLC explored hedge- and negation-detection to better capture uncertainty in user expressions.
- Synthetic index comparison: Webis directly compared BM25 runs over GPT- and Llama-generated document indices.
Out of 29 non-baseline runs, 18 relied solely on provided data, six combined track and outside data, and eight used exclusively external data. The table summarizes participant strategies:
| Group | Distinctive Methodologies | Reranking |
|---|---|---|
| SRCB | Sparse+dense fusion, learned reranker, pretraining | Transformer-based |
| DS@GT | BM25+LLM+dense+topic, hybrid fusion | LambdaMART, Gemini-2.5-flash LLM |
| UVA ILLC | Hedge/negation-aware query expansion | Transformer |
| Webis | BM25 over synthetic indices | BM25 |
4. Evaluation Methodology and Performance Metrics
Systems were required to output a ranked list of up to 1,000 Wikipedia document IDs per query. The primary and auxiliary metrics included:
- Normalized Discounted Cumulative Gain at k (nDCG@k):
$$\mathrm{nDCG@}k = \frac{1}{Z_k}\sum_{i=1}^{k} \frac{2^{\mathrm{rel}_i}-1}{\log_2(i+1)},$$
where $Z_k$ is the normalization constant for the ideal ranking.
- Mean Reciprocal Rank (MRR):
$$\mathrm{MRR} = \frac{1}{|Q|}\sum_{q \in Q} \frac{1}{\mathrm{rank}_q}$$
- Precision@k (P@k):
$$\mathrm{P@}k = \frac{\#\{\text{relevant documents in top } k\}}{k}$$
- Recall@1000: Fraction of queries whose ground-truth item appeared in the top 1000 results.
Metric correlations over all runs showed nDCG@10, nDCG@1000, and MRR to be strongly correlated (Pearson), while Recall@1000 correlated less strongly with the other metrics (Arguello et al., 28 Jan 2026).
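Because each ToT query has exactly one relevant document, the metrics above collapse to simple functions of the answer's rank: ideal DCG is 1, so nDCG@k reduces to $1/\log_2(\mathrm{rank}+1)$ and MRR to $1/\mathrm{rank}$. A minimal per-query implementation under that known-item assumption:

```python
import math

def known_item_metrics(ranked_ids, gold_id, k=10, depth=1000):
    """nDCG@k, reciprocal rank, and Recall@depth for a known-item query
    (exactly one relevant document, so the ideal DCG equals 1)."""
    try:
        rank = ranked_ids.index(gold_id) + 1  # 1-based rank of the answer
    except ValueError:
        return {"ndcg": 0.0, "mrr": 0.0, "recall": 0.0}
    return {
        "ndcg": 1.0 / math.log2(rank + 1) if rank <= k else 0.0,
        "mrr": 1.0 / rank,
        "recall": 1.0 if rank <= depth else 0.0,
    }

m = known_item_metrics(["d3", "d7", "d1"], "d7")
# answer at rank 2: nDCG@10 = 1/log2(3), MRR = 0.5, Recall@1000 = 1.0
```

Track-level scores are then means of these per-query values over all 622 test queries.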
5. Leading Results and Technical Insights
Best-Performing Systems
| Run (Group) | nDCG@1000 | MRR | Recall@1000 | nDCG@10 |
|---|---|---|---|---|
| scrb-tot-04 (SRCB) | 0.6824 | 0.6258 | 0.9051 | 0.6576 |
| scrb-tot-03 (SRCB) | 0.6787 | — | — | — |
| scrb-tot-02 (SRCB) | 0.6700 | — | — | — |
| scrb-tot-01 (SRCB) | 0.6458 | — | — | — |
| gmn-rerank-500 (DS@GT) | 0.4106 | — | 0.6559 | — |
Hybridization—merging BM25, dense (BGE-M3), and LLM-based signals in first-stage retrieval, then using learned rerankers such as LambdaMART or LLM-based rerankers (Gemini-2.5-flash)—consistently outperformed individual retrieval models. Synthetic query generation proved highly effective for training rerankers and generating evaluation sets, with Pearson correlation exceeding 0.93 between LLM-synthesized and real query trends. Topic-aware multi-index dense retrieval, where queries are routed through topic-partitioned FAISS indices, further improved both recall and nDCG@1000 (Zhou et al., 21 Jan 2026).
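One common way to merge sparse, dense, and LLM-based first-stage runs of the kind described above is reciprocal rank fusion (RRF); the sketch below uses RRF as an illustrative fusion scheme, not necessarily the exact method SRCB or DS@GT used, and the run contents are toy data.

```python
def rrf_fuse(runs, k=60, depth=1000):
    """Reciprocal rank fusion of several ranked lists (e.g., BM25, BGE-M3,
    LLM retrieval). k=60 is the customary smoothing constant."""
    scores = {}
    for run in runs:
        for rank, doc_id in enumerate(run[:depth], start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_run = ["d1", "d2", "d3"]    # sparse first stage
dense_run = ["d3", "d1", "d4"]   # dense first stage
fused = rrf_fuse([bm25_run, dense_run])
```

Documents retrieved by multiple first-stage systems accumulate score and rise in the fused list, which is then passed to the learned reranker.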
Notable reranking pipeline:
- For LambdaMART, features include BGE-M3 and BM25 scores, normalized pageview, PageRank, and query word count. Optimization used a pairwise NDCG-oriented (LambdaRank-style) objective with gradients
$$\lambda_{ij} = \frac{-\sigma}{1 + e^{\sigma(s_i - s_j)}}\,\bigl|\Delta\mathrm{NDCG}_{ij}\bigr|,$$
where $s_i, s_j$ are the model scores of documents $i$ and $j$, and $\Delta\mathrm{NDCG}_{ij}$ is the NDCG change from swapping the pair.
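The pairwise NDCG-oriented gradients that drive LambdaMART can be computed directly; the sketch below is a minimal, self-contained illustration of the standard LambdaRank gradient for one query's candidate list, not the track participants' actual training code.

```python
import math

def lambda_gradients(scores, rels, sigma=1.0):
    """Pairwise LambdaRank gradients for one query:
    lambda_ij = sigma / (1 + exp(sigma * (s_i - s_j))) * |dNDCG_ij|,
    accumulated positively on the more-relevant document of each pair."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    pos = {doc: r for r, doc in enumerate(order, start=1)}  # 1-based ranks
    idcg = sum((2 ** r - 1) / math.log2(i + 2)
               for i, r in enumerate(sorted(rels, reverse=True)))
    lambdas = [0.0] * n
    for i in range(n):
        for j in range(n):
            if rels[i] <= rels[j]:
                continue  # only pairs where doc i is more relevant than doc j
            gain = (2 ** rels[i]) - (2 ** rels[j])
            disc = abs(1 / math.log2(pos[i] + 1) - 1 / math.log2(pos[j] + 1))
            delta_ndcg = gain * disc / idcg       # |NDCG change if swapped|
            rho = 1.0 / (1.0 + math.exp(sigma * (scores[i] - scores[j])))
            lambdas[i] += sigma * rho * delta_ndcg  # push relevant doc up
            lambdas[j] -= sigma * rho * delta_ndcg  # push non-relevant doc down
    return lambdas

# Relevant doc currently scored below a non-relevant one gets a positive push:
grads = lambda_gradients(scores=[0.1, 0.9], rels=[1, 0])
```

A gradient-boosted tree ensemble fit to these per-document lambdas over the feature set above (BGE-M3 score, BM25 score, pageview, PageRank, query length) yields the LambdaMART reranker.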
6. Query Source Analysis and Error Modes
Performance on different query sources (MS-ToT, human-elicited, synthetic) varied:
- Systems' rankings on MS-ToT vs. synthetic queries agreed strongly (Kendall's τ); agreement was also measured between MS-ToT and NIST, and between synthetic and NIST, query subsets.
- MS-ToT movie queries, with higher rates of false memories and complicated relational clues, were on average more difficult than synthetic or NIST-crafted queries.
- All systems demonstrated sensitivity to strong false memories and multi-hop or relative comparisons, such as “more like X than Y,” and were less equipped to handle episodic/temporal cues that do not appear in static document representations.
- Some LLM-based rerankers overfit to surface lexical cues, underperforming when identifier tokens were absent from either the query or retrieved passage.
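The cross-subset agreement analysis above compares how the same systems rank under different query sources. A minimal Kendall's τ (tau-a, assuming no ties) over hypothetical per-system scores shows the computation; the score values are invented for illustration.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a between two score lists over the same systems
    (assumes no tied scores)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical nDCG@1000 of four systems on two query subsets:
ms_tot = [0.68, 0.41, 0.35, 0.22]
synthetic = [0.71, 0.44, 0.30, 0.25]
tau = kendall_tau(ms_tot, synthetic)  # identical system ordering -> tau = 1.0
```

High τ between subsets means a cheaper query source (e.g., synthetic) can stand in for expensive human-elicited queries when comparing systems.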
7. Open Challenges and Future Research Directions
Key challenges and avenues for progress include:
- Handling of hallucinated memories: Systems must better identify and down-weight incorrect details within verbose user queries.
- Multi-hop and relative comparison reasoning: Advancing methods for interpreting and resolving relational and comparative expressions.
- Contextual and episodic information: Incorporating time, setting, and experiential descriptors that are “off-index” in static corpora.
- Query understanding and decomposition: Automated detection of hedges, negation, and social niceties, with decomposition into semantic and contextual fragments, is needed to isolate actionable retrieval cues.
- Representation learning: Integrating temporal/contextual “memory embeddings” and developing multi-modal ToT retrieval capabilities (e.g., combining image features with text).
- Synthetic data methodology: Further advances in LLM prompt engineering and large-scale query-item pair generation across new domains (e.g., music, products) are essential for robust model training and evaluation.
- Mixed-initiative and interactive retrieval: Proposals for retrieval systems to engage in clarifying dialogue with users, to iteratively refine and disambiguate vague or error-prone queries, align with the fundamental cognitive challenge posed by ToT phenomena (Arguello et al., 28 Jan 2026).
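As a toy illustration of the hedge- and negation-detection direction above, surface cues can be flagged with pattern matching so downstream retrieval can down-weight or invert them. The cue lexicons below are hypothetical; real systems (e.g., UVA ILLC's runs) would use learned classifiers rather than these patterns.

```python
import re

# Hypothetical cue lexicons for illustration only.
HEDGES = re.compile(r"\b(i think|maybe|possibly|not sure|i believe|might)\b", re.I)
NEGATIONS = re.compile(r"\b(but not|isn't|is not|wasn't|definitely not|except)\b", re.I)

def annotate_query(query: str) -> dict:
    """Flag hedged and negated cues so a retriever can down-weight uncertain
    details and exclude explicitly rejected candidates."""
    return {
        "hedges": HEDGES.findall(query),
        "negations": NEGATIONS.findall(query),
    }

q = "I think it was a 90s movie about dreams, but not Inception, maybe European?"
ann = annotate_query(q)
```

Spans following a negation cue ("but not Inception") mark candidates to exclude, while hedged spans ("I think", "maybe") mark details whose mismatch should not be penalized heavily.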
TREC 2025 established that hybrid sparse-dense pipelines, strong reranking architectures, and synthetic data augmentation are best-in-class for ToT retrieval, but the domain remains a proving ground for advanced query understanding, contextual modeling, and multi-modal reasoning (Arguello et al., 28 Jan 2026, Zhou et al., 21 Jan 2026).