- The paper presents a deterministic, query-type-based routing strategy that achieves a Recall@5 of 0.800, outperforming uniform dense and hybrid retrieval methods.
- It combines multiple pipelines—including lexical, dense, hybrid, and vocabulary-enriched searches—to tailor retrieval according to the semantic structure of the query.
- Experimental results demonstrate significant per-type gains, with up to a 0.161 improvement in Recall@5, and highlight the system’s efficiency in resource-constrained settings.
Query-Type-Aware Retrieval Routing for Long-Term Conversational Memory: An Analysis of SelRoute
Overview
The paper "SelRoute: Query-Type-Aware Routing for Long-Term Conversational Memory Retrieval" (2604.02431) systematically investigates retrieval strategies for conversational long-term memory, proposing that explicit query-type-aware routing surpasses improvements from further scaling dense retrievers or augmenting indices with LLM-generated summaries. SelRoute leverages deterministic routing based on query type, combining lexical, semantic, hybrid, and vocabulary-enriched pipelines. The work substantiates that retrieval effectiveness is maximized not by a universal model or strategy, but by aligning the retrieval approach to the structure and semantics of the incoming query.
Motivation and Foundational Observations
Existing approaches to conversational memory retrieval, especially on benchmarks like LongMemEval [wu2024longmemeval], heavily utilize parameter-rich dense retrievers (110M–1.5B parameters) and LLM-based fact-key expansions at index time. These methods establish an implicit scaling hypothesis—that larger models and LLM augmentations will naturally yield superior recall and NDCG. The authors challenge this paradigm by highlighting:
- Complementarity Across Retrieval Methods: Lexical (sparse) and embedding-based (dense) retrievers excel on distinct query types due to inherent differences in specificity and semantic coverage.
- Storage-Time Vocabulary Enrichment Asymmetry: Manual, rule-based enrichment improves lexical retrieval effectiveness but consistently degrades dense-retriever performance.
- Query-Type as a Prior: Empirical analysis shows that the optimal retrieval pipeline can often be predictively determined by discrete query categories, making a routing-based architecture both feasible and advantageous.
SelRoute Framework
Retrieval Pipelines
SelRoute integrates the following retrieval paradigms:
- FTS5 Full-Text Search: BM25-ranked, lexically driven search, including zero-ML operation modes.
- BERT-Derived Embedding Search: Multiple model scales (MiniLM, bge-small, bge-base), using cosine similarity over 384–768-dimensional vectors.
- Hybrid Search: Reciprocal Rank Fusion (RRF) to combine the outputs of sparse and dense methods.
- Vocabulary-Enriched Search: Manual augmentation via hypernyms, action bridges, and topic rooms, selectively applied.
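The hybrid pipeline's Reciprocal Rank Fusion step follows the standard RRF formulation; a minimal sketch, where the constant k = 60 is the conventional RRF default rather than a value reported for SelRoute:

```python
def rrf_fuse(rankings, k=60, top_n=5):
    """Reciprocal Rank Fusion over ranked lists of document IDs.

    Each document's fused score is the sum of 1 / (k + rank) across
    every ranking in which it appears (ranks are 1-based). k=60 is
    the customary default, not a SelRoute-specific setting.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Example: fuse a lexical (BM25) ranking with a dense (embedding) ranking.
sparse = ["d3", "d1", "d7", "d2"]
dense = ["d1", "d5", "d3", "d9"]
print(rrf_fuse([sparse, dense]))
```

Documents appearing near the top of both lists (here, "d1" and "d3") dominate the fused ranking, which is what makes RRF a robust, score-free way to combine sparse and dense results.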
Storage-Time Vocabulary Enrichment
Content is expanded before indexing, using handcrafted mappings:
- Hypernyms: Generalize content (cocktail → drink, beverage).
- Action Bridges: Match action verbs and events (attended → went, participated).
- Topic Rooms: Add contextually relevant domain terms.
Importantly, this enrichment is pipeline-specific: only queries routed to lexical pipelines leverage the enriched content, preventing the observed semantic drift that harms embedding retrieval.
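A minimal sketch of this storage-time enrichment, using hypothetical mapping entries drawn from the examples above; per the pipeline-specific design, only the lexical (FTS) index would receive the enriched text, while dense-retriever documents are indexed from the raw text:

```python
# Illustrative enrichment mappings; the paper's full rule set is larger.
HYPERNYMS = {"cocktail": ["drink", "beverage"]}
ACTION_BRIDGES = {"attended": ["went", "participated"]}

def enrich_for_fts(text: str) -> str:
    """Append hypernym and action-bridge terms for lexical indexing only."""
    extra = []
    for token in text.lower().split():
        word = token.strip(".,!?")
        extra += HYPERNYMS.get(word, [])
        extra += ACTION_BRIDGES.get(word, [])
    # The original text is retained verbatim and enrichment terms are
    # appended, mirroring the finding that replacing content (the
    # "dream cycle") degrades recall.
    return text + (" " + " ".join(extra) if extra else "")

print(enrich_for_fts("I attended a cocktail party."))
# → I attended a cocktail party. went participated drink beverage
```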
Selective Routing Algorithm
A deterministic routing table, derived from empirical results over “failure instances,” prescribes the optimal pipeline per query category:
| Query Type | Pipeline | Enrichment | Embeddings |
| --- | --- | --- | --- |
| knowledge-update | enriched_fts | Yes | No |
| multi-session | enriched_hybrid | Yes | Yes |
| single-session-assistant | embeddings | No | Yes |
| single-session-preference | embeddings | No | Yes |
| single-session-user | baseline_fts | No | No |
| temporal-reasoning | hybrid | No | No |
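Because the routing is deterministic, it reduces to a dictionary lookup; a minimal sketch, where the fallback pipeline for an unrecognized query type is an assumption rather than something the paper specifies:

```python
# Pipeline names follow the routing table in the text.
ROUTING_TABLE = {
    "knowledge-update": "enriched_fts",
    "multi-session": "enriched_hybrid",
    "single-session-assistant": "embeddings",
    "single-session-preference": "embeddings",
    "single-session-user": "baseline_fts",
    "temporal-reasoning": "hybrid",
}

def route(query_type: str) -> str:
    # Falling back to hybrid for unseen types is a reasonable default,
    # not a choice documented in the paper.
    return ROUTING_TABLE.get(query_type, "hybrid")

print(route("knowledge-update"))  # → enriched_fts
```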
Experimental Results
Main Results: LongMemEval_M
SelRoute (bge-base) achieves Recall@5 = 0.800, exceeding the strongest Contriever + fact-keys baseline (Recall@5 = 0.762) without any LLM inference at query time. Notably, even the zero-ML FTS5 baseline (Recall@5 = 0.745, NDCG@5 = 0.692) surpasses published BM25 (Recall@5 = 0.634), indicating that implementation specifics can significantly influence results.
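The Recall@5 and NDCG@5 figures follow the standard definitions of these metrics; a minimal sketch, assuming binary relevance labels:

```python
import math

def recall_at_k(retrieved, relevant, k=5):
    """Fraction of relevant items that appear in the top-k results."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant) if relevant else 0.0

def ndcg_at_k(retrieved, relevant, k=5):
    """Binary-relevance NDCG@k: DCG of the ranking over the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(retrieved[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0
```

NDCG additionally rewards placing relevant items earlier in the top-k list, which is why the two metrics can diverge for the same retrieval run.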
Routing yields robust improvements:
- Per-type Gains: Single-session-assistant queries see a +0.161 gain in Recall@5, while single-session-user queries exhibit a ceiling effect with FTS5 alone.
- Ablation: No single uniform pipeline—hybrid, enriched FTS5, or pure embeddings—matches routed performance, establishing the efficacy of the routing design.
- Negative Result: Aggressive preprocessing (the “dream cycle”) catastrophically degrades recall, emphasizing the necessity of retaining original content alongside enrichment for effective retrieval.
Robustness and Generalization
- Cross-Validation: Five-fold stratified CV yields a small generalization gap (1.3–2.4 Recall@5 points), and routing assignments are stable across folds for most query types.
- Query-Type Classification: A regex-based non-ML classifier achieves 83% effective routing accuracy. Even with 72% raw classification accuracy, end-to-end retrieval with predicted types (Recall@5 = 0.689) still outperforms uniform strategies.
- Cross-Benchmark Evaluation: The routing table generalized without tuning across 8 further datasets (e.g., MSDialog, LoCoMo, QReCC, PerLTQA) totaling 62k+ instances. SelRoute maintained or exceeded baseline recall except on highly reasoning-intensive benchmarks (RECOR, Recall@5 = 0.149), establishing a performance boundary.
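A regex-based, non-ML query-type classifier of the kind described might look like the following sketch; the patterns and the fallback bucket are illustrative assumptions, not the paper's published rules:

```python
import re

# Hypothetical patterns, ordered by priority; first match wins.
PATTERNS = [
    ("temporal-reasoning",
     re.compile(r"\b(when|before|after|how long|last time)\b", re.I)),
    ("knowledge-update",
     re.compile(r"\b(now|currently|latest|still|anymore)\b", re.I)),
    ("multi-session",
     re.compile(r"\b(how many times|every time|across|overall)\b", re.I)),
]

def classify(query: str) -> str:
    for qtype, pattern in PATTERNS:
        if pattern.search(query):
            return qtype
    # Default bucket for unmatched queries; the paper's exact fallback
    # behavior is not specified here.
    return "single-session-user"

print(classify("When did I last visit the dentist?"))  # → temporal-reasoning
```

Such a classifier is cheap enough to run on every query, which is what makes 83% effective routing accuracy attainable with no ML at all.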
Statistical Significance
Bootstrap significance tests confirm that SelRoute’s improvements over pure FTS5, pure embeddings, and uniform hybrid search are statistically robust (p<0.001 for each).
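A paired bootstrap test of this kind can be sketched as follows; the resample count and one-sided formulation are assumptions, since the exact procedure is not reproduced here:

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=10000, seed=0):
    """One-sided paired bootstrap: estimate the probability that system A's
    summed per-query score does not exceed system B's under resampling."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Resample query indices with replacement and compare totals.
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] - scores_b[i] for i in idx) > 0:
            wins += 1
    return 1.0 - wins / n_resamples  # small p => A reliably beats B
```

Resampling at the query level respects the paired structure of the comparison: both systems are scored on the same resampled query set in every iteration.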
Implications
Practical System Design
- Retrieval as Heterogeneous Sub-Problem: Conversational memory retrieval is not monolithic. Optimizing for query-type-specific characteristics yields greater practical gains than scaling model size or deploying compute-intensive LLM augmentations.
- Enrichment–Embedding Asymmetry: The finding mandates that enrichment only be applied to lexical indices. This design principle is immediately actionable for all systems employing hybrid retrieval strategies.
- Efficiency: SelRoute does not require GPUs or LLM inference at query time, making it suitable for deployment in resource-constrained settings.
Comparison with Prior Art
- Contriever + Fact Keys vs. Rule-Based Enrichment: SelRoute achieves stronger recall using deterministic, hand-crafted enrichment and query-type routing, while incurring no inference-time LLM cost.
- FTS5 vs. BM25: The observed baseline gaps accentuate the importance of standardizing evaluation pipelines and index settings.
Limitations & Future Directions
- Query-Type Classifier: While non-ML query-type prediction is feasible and valuable, there remains a measurable gap (2.1 Recall@5 points) between oracle-type and predicted-type routing. Integrating lightweight ML classifiers or few-shot LLM reasoning may close this.
- Reasoning-Intensive Queries: SelRoute cannot substitute for inference-time LLM reasoning in complex multi-hop scenarios (evidenced by poor RECOR performance). For such regimes, hybridization with LLM-augmented reasoning remains essential.
- Manual Enrichment Maintenance: The method’s reliance on handcrafted rules can impede extensibility; automated or data-driven vocabulary expansion may further enhance performance and reduce labor.
Conclusion
SelRoute establishes that query-type-aware routing among classical and embedding-based pipelines is a superior strategy for long-term conversational memory retrieval when compared to further scaling monolithic retrievers or relying exclusively on LLM augmentations. Performance is robust across benchmarks and systems without query-time LLM inference, offering a new axis of architectural leverage: exploitation of query structure rather than brute-force model scaling. As query routing, hybridization, and per-query enrichment see broader adoption, SelRoute’s results delineate design principles relevant both for empirical study and practical system deployment, while identifying sharp limitations where deep, inference-time semantic reasoning is non-negotiable.