Learned Soft Queries in Neural Systems
- Learned Soft Queries are adaptive, trainable query embeddings that capture nuanced semantics for improved retrieval, preference alignment, and cache management in Transformers.
- They integrate LLM-augmented teacher–student distillation, soft aggregation of multi-judge outputs, and trainable soft tokens to enhance model robustness and global attention.
- Empirical results indicate marginal in-domain gains but significant out-of-domain improvements and performance boosts in long-context generation tasks.
Learned Soft Queries (Judge Q) cover a spectrum of neural mechanisms in which queries—whether for retrieval, evaluation, or attention—are not restricted to hard-coded objects, tokens, or pooling strategies, but are themselves trainable or synthesized to better capture the intended semantics or utility. The “Judge Q” designation has been applied in three distinct research contexts: dense retrieval with LLM-augmented teacher–student distillation, soft aggregation of multi-rubric judge outputs for preference modeling, and trainable soft queries for key–value (KV) cache eviction in Transformer LLMs. Each instantiation leverages the learnability and adaptability of queries to improve efficiency, robustness, or generalization in modern AI systems.
1. Soft Queries in Dense Retrieval with LLM Expansion
In dense retrieval, “soft queries” denote a learned embedding space into which input queries are mapped, with the goal of inheriting the expressive semantics of expanded queries (typically produced by an LLM) without incurring inference-time cost. The SoftQE framework (Pimpalkhute et al., 2024) establishes this paradigm by mapping a vanilla query $q$ via a student encoder directly into the embedding space of the teacher's LLM-augmented expansions.
Technical Formulation
- Define $E_s(q)$ as the initial query embedding produced by the student side of a dual-encoder.
- Use an LLM (e.g., text-davinci-003) with a fixed prompt to generate a pseudo-document $d$ for the query $q$.
- The expanded query is $q^{+} = \operatorname{concat}(q, d)$.
- The teacher encoder $E_t$ embeds $q^{+}$ as $E_t(q^{+})$.
- The student encoder learns to produce $E_s(q) \approx E_t(q^{+})$, known as the “soft query” embedding.
Model Architecture and Optimization
Both teacher and student encoders are standard Transformer-based dual encoders (e.g., BERT), producing a [CLS]-pooled vector, possibly normalized. The training objective is a convex combination of a contrastive retrieval loss and a mean-squared-error (MSE) distillation loss:

$\mathcal{L} = \lambda \, \mathcal{L}_{\mathrm{MSE}} + (1 - \lambda) \, \mathcal{L}_{\mathrm{CL}},$

where $\lambda \in [0, 1]$, $\mathcal{L}_{\mathrm{MSE}} = \lVert E_s(q) - E_t(q^{+}) \rVert_2^2$, and $\mathcal{L}_{\mathrm{CL}}$ is the standard contrastive loss over (query, positive passage, negatives) triples. Empirically, a “warm-up” schedule that emphasizes the distillation term for the first three epochs, then trains on the combined objective for three more epochs, yields the best results.
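The combined objective can be sketched in NumPy as follows; the function and argument names are illustrative, and the in-batch softmax over one positive and several negatives is one common instantiation of the contrastive term, not necessarily the paper's exact choice:

```python
import numpy as np

def softqe_loss(student_q, teacher_q, pos_d, neg_d, lam=0.5):
    """Illustrative SoftQE-style objective: convex combination of an
    MSE distillation term and a contrastive retrieval loss.

    student_q : (B, D) student embeddings of the raw queries
    teacher_q : (B, D) teacher embeddings of the LLM-expanded queries
    pos_d     : (B, D) embeddings of positive passages
    neg_d     : (B, N, D) embeddings of negative passages
    """
    # MSE distillation: pull the soft query toward the teacher embedding.
    mse = np.mean((student_q - teacher_q) ** 2)

    # Contrastive loss: softmax over [positive, negatives] inner products.
    pos_scores = np.sum(student_q * pos_d, axis=1, keepdims=True)   # (B, 1)
    neg_scores = np.einsum("bd,bnd->bn", student_q, neg_d)          # (B, N)
    scores = np.concatenate([pos_scores, neg_scores], axis=1)       # (B, 1+N)
    scores -= scores.max(axis=1, keepdims=True)                     # stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    contrastive = -np.mean(log_probs[:, 0])  # negative log-likelihood of positive

    return lam * mse + (1.0 - lam) * contrastive
```

Setting `lam=1.0` recovers the pure distillation phase of the warm-up schedule.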
Inference and Impact
At inference, no LLM expansion is used: queries are mapped by the student encoder, and ranking is performed by inner product against precomputed document embeddings. Notably, SoftQE yields only marginal in-domain gains (+0.13 absolute MRR@10 on MS MARCO), but consistently improves out-of-domain BEIR tasks by an average of +2.83 nDCG@10, indicating enhanced robustness to domain shift (Pimpalkhute et al., 2024).
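This LLM-free inference path reduces to a single matrix product against a precomputed index; a minimal sketch (names are illustrative):

```python
import numpy as np

def rank_documents(query_emb, doc_embs, top_k=5):
    """Rank precomputed document embeddings by inner product with the
    student's soft-query embedding. No LLM call is made at inference.

    query_emb : (D,) soft-query embedding from the student encoder
    doc_embs  : (num_docs, D) precomputed document embeddings
    """
    scores = doc_embs @ query_emb            # (num_docs,) inner products
    order = np.argsort(-scores)              # descending by score
    top = order[:top_k]
    return top, scores[top]
```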
2. Learned Judge Q Aggregators for Preference Modeling
In the context of aligning LLM-based judges with human preferences, “Judge Q” refers to a learned soft aggregation mechanism that synthesizes preference predictions from multiple rubric-conditioned LLM judges. The key objective is to approximate (potentially diverse or conflicting) human-like persona-based preferences, enabling more robust reward modeling for RLHF or LLM routing decisions (Sprejer et al., 29 Oct 2025).
Multi-Judge Setup and Aggregation
- $m$ rubric-conditioned LLM judges yield scalar scores $s_1(x), \ldots, s_m(x)$ for each input $x$ (typically a prompt–answer pair).
- $n$ personas serve as synthetic human raters, each outputting ground-truth preference labels $y(x)$, to simulate human heterogeneity.
- For each $x$, collect the score vector $\mathbf{s}(x) = (s_1(x), \ldots, s_m(x))$.
- A parametric aggregator $f_\theta$ is trained to match persona outputs: $f_\theta(\mathbf{s}(x)) \approx y(x)$.
Aggregator Implementations
- Generalized Additive Model (GAM): $f_\theta(\mathbf{s}) = \beta_0 + \sum_{i=1}^{m} g_i(s_i)$, where each $g_i$ is a learned spline that recalibrates judge $i$'s output.
- Multi-Layer Perceptron (MLP): a small feed-forward network $f_\theta(\mathbf{s}) = \mathrm{MLP}(\mathbf{s})$, with depth and width selected via hyperparameter search.
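As a minimal stand-in for the GAM/MLP variants, the sketch below fits a purely linear aggregator $f_\theta(\mathbf{s}) = \mathbf{w} \cdot \mathbf{s} + b$ to persona labels by least squares; the actual aggregators are nonlinear, so this only illustrates the fit-to-persona-scores setup, and all names are assumptions:

```python
import numpy as np

def fit_linear_aggregator(S, y):
    """Fit a linear aggregator f(s) = w . s + b by least squares.

    S : (num_examples, m) matrix of per-judge scores s_i(x)
    y : (num_examples,) persona preference labels y(x)
    """
    X = np.hstack([S, np.ones((S.shape[0], 1))])    # append bias column
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    w, b = theta[:-1], theta[-1]
    return w, b

def aggregate(S, w, b):
    """Apply the fitted aggregator to a batch of judge-score vectors."""
    return S @ w + b
```

The fitted weights play the same diagnostic role as judge-importance measures: a judge whose weight shrinks toward zero contributes little to the synthetic preference.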
Objective, Evaluation, and Robustness
The core objective is MSE regression:

$\min_\theta \; \mathbb{E}_x \big[ \big( f_\theta(\mathbf{s}(x)) - y(x) \big)^2 \big].$

Robustness is assessed under both persona-label and judge-rubric perturbations. Notably, GAM and MLP aggregators maintain $R^2 \approx 0.58$ even under substantial noise, where naive score averaging deteriorates. Judge importance (assessed as $1-p$ of each spline term) reveals dimensions such as Truthfulness and Logical Consistency to be most influential on the synthetic preference metric (Sprejer et al., 29 Oct 2025).
3. Trainable Soft Queries for KV Cache Eviction in Transformers
The “Judge Q” approach in Transformer LLMs addresses the challenge of efficient KV cache eviction during long-context sequence generation. Traditional strategies use the last few tokens of the input (a local observation window) to compute importance scores for cache eviction, an approach biased toward local context. Judge Q instead learns a small set of soft-token embeddings that, once appended to the prompt, compute attention over all positions, yielding improved importance estimation for global information retention.
Methodology
- Define $k$ learnable soft tokens; their embeddings are the only tunable parameters.
- At each training step, two sequences are used:
  - Input A: (prompt, soft tokens)
  - Input B: (prompt, response tokens)
- For both, attention maps from the query tokens (soft or response) to each prompt token are extracted, yielding $A_{\mathrm{soft}}$ and $A_{\mathrm{resp}}$.
- Objective: minimize the MSE between $A_{\mathrm{soft}}$ and $A_{\mathrm{resp}}$:

$\mathcal{L} = \lVert A_{\mathrm{soft}} - A_{\mathrm{resp}} \rVert_2^2.$
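The training signal can be sketched as follows, assuming single-head dot-product attention and mean-pooling over query positions to reconcile the differing numbers of soft and response tokens (both are simplifications; names are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_over_prompt(queries, prompt_keys):
    """Attention weights from each query token to each prompt position."""
    d = queries.shape[-1]
    return softmax(queries @ prompt_keys.T / np.sqrt(d), axis=-1)

def judge_q_training_loss(soft_q, resp_q, prompt_keys):
    """MSE between the prompt-attention profile of the soft tokens and
    that of the true response tokens, after mean-pooling over queries.

    soft_q      : (k, D) query vectors of the appended soft tokens
    resp_q      : (r, D) query vectors of the response tokens
    prompt_keys : (p, D) key vectors of the prompt positions
    """
    a_soft = attention_over_prompt(soft_q, prompt_keys).mean(axis=0)  # (p,)
    a_resp = attention_over_prompt(resp_q, prompt_keys).mean(axis=0)  # (p,)
    return np.mean((a_soft - a_resp) ** 2)
```

Only the soft-token embeddings would receive gradients from this loss; the frozen model supplies the keys and response queries.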
Inference and Efficacy
During prefill inference:
- The soft tokens are appended and their attention scores over all prompt positions are computed.
- All key–value pairs are ranked by these importance scores; the top-ranked pairs within the token budget are retained.
- Soft tokens are discarded before decoding continues.
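The prefill-time selection step can be sketched as follows, assuming the soft-token attention map has already been extracted (names are illustrative):

```python
import numpy as np

def select_kv_to_keep(soft_attention, budget):
    """Choose which prompt positions' key-value pairs to retain.

    soft_attention : (num_soft_tokens, prompt_len) attention map from the
                     appended soft tokens to each prompt position
    budget         : number of KV pairs the cache may retain

    Returns indices of the top-`budget` positions, in original order,
    so the corresponding KV pairs survive eviction before decoding.
    """
    importance = soft_attention.mean(axis=0)   # per-position importance score
    keep = np.argsort(-importance)[:budget]    # highest-scoring positions
    return np.sort(keep)                       # restore original ordering
```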
Judge Q achieves a higher “critical KV hit rate” than the best window-based methods and smaller performance drops under tight memory constraints, yielding +1 to +3 points on LongBench and RULER across budgets (e.g., at a 512-token budget, Judge Q achieves 39.17 vs. SnapKV's 38.31 on LongBench, and 74.12 vs. 68.21 on RULER) (Liu et al., 13 Sep 2025).
4. Comparative Summary of Judge Q Approaches
| Application Domain | Mechanism | Notable Outcome |
|---|---|---|
| Dense Retrieval (Pimpalkhute et al., 2024) | Student soft query matches LLM-expanded teacher embedding | +2.83 nDCG@10 on BEIR |
| Preference Modeling (Sprejer et al., 29 Oct 2025) | Soft aggregation via GAM/MLP over rubric judges | R² ≈ 0.58, robust to bias |
| KV Cache Eviction (Liu et al., 13 Sep 2025) | Trainable soft tokens attend globally for KV scoring | +1–3 pt. on long-context |
All implementations share reliance on gradient-based learning of query objects that mediate between source input and a desired utility function—either semantic richness, preference alignment, or information coverage.
5. Implementation Considerations and Limitations
Integration of Learned Soft Queries is notably efficient:
- In dense retrieval, only query encoder weights are affected, with no change to latency, as LLM expansion is not required at inference (Pimpalkhute et al., 2024).
- In preference aggregation, only the aggregator is trained, and the interpretability of GAM splines aids in auditability and fairness analysis (Sprejer et al., 29 Oct 2025).
- For KV cache eviction, only soft-token embeddings are updated; no model-wide fine-tuning is needed, and memory/compute overhead is dominated by per-layer attention during prefill (Liu et al., 13 Sep 2025).
Known caveats include the synthetic nature of persona-based labels in preference modeling, potential LLM circularity, limitation to single-vector encoders or scalar aggregations in some settings, and the need for broader human calibration.
6. Broader Impact and Future Directions
Learned Soft Queries, both in “Judge Q” and related forms, directly enable:
- Improved zero-shot retrieval transfer, as LLM-driven expansions expose paraphrastic and rare event knowledge (Pimpalkhute et al., 2024).
- Fine-grained, robust, and interpretable aggregation of LLM judges for RLHF, with enhanced resistance to rubric-induced bias and instability (Sprejer et al., 29 Oct 2025).
- Globally-aware cache retention in long-context LLMs, crucial for efficient generation under resource pressure (Liu et al., 13 Sep 2025).
Promising directions include extending soft queries to multi-vector and late-interaction architectures, refining persona label distributions beyond uniform sampling, incorporating rank-based losses into aggregator training, and large-scale human validation of preference alignment.