Learned Soft Queries in Neural Systems
- Learned Soft Queries are adaptive, trainable query embeddings that capture nuanced semantics for improved retrieval, preference alignment, and cache management in Transformers.
- They integrate LLM-augmented teacher–student distillation, soft aggregation of multi-judge outputs, and trainable soft tokens to enhance model robustness and global attention.
- Empirical results indicate marginal in-domain gains but significant out-of-domain improvements and performance boosts in long-context generation tasks.
Learned Soft Queries (Judge Q) cover a spectrum of neural mechanisms in which queries—whether for retrieval, evaluation, or attention—are not restricted to hard-coded objects, tokens, or pooling strategies, but are themselves trainable or synthesized to better capture the intended semantics or utility. The “Judge Q” designation has been applied in three distinct research contexts: dense retrieval with LLM-augmented teacher–student distillation, soft aggregation of multi-rubric judge outputs for preference modeling, and trainable soft queries for key–value (KV) cache eviction in Transformer LLMs. Each instantiation leverages the learnability and adaptability of queries to improve efficiency, robustness, or generalization in modern AI systems.
1. Soft Queries in Dense Retrieval with LLM Expansion
In dense retrieval, “soft queries” denote a learned embedding space into which input queries are mapped, with the goal of inheriting the expressive semantics of expanded queries (typically produced by an LLM) without incurring inference-time cost. The SoftQE framework (Pimpalkhute et al., 2024) establishes this paradigm by mapping a vanilla query $q$ via a student encoder directly into the embedding space of the teacher's LLM-augmented expansions.
Technical Formulation
- Define $E_s(q)$ as the initial query embedding produced by the student side of a dual-encoder.
- Use an LLM (e.g., text-davinci-003) with a fixed prompt to generate a pseudo-document $d$ for the query $q$.
- The expanded query is $q^{+} = \operatorname{concat}(q, d)$.
- The teacher encoder $E_t$ embeds $q^{+}$ as $E_t(q^{+})$.
- The student encoder learns to produce $E_s(q) \approx E_t(q^{+})$, known as the “soft query” embedding.
Model Architecture and Optimization
Both teacher and student encoders are standard Transformer-based dual encoders (e.g., BERT), producing a [CLS]-pooled vector, possibly normalized. The training objective is a convex combination of a contrastive retrieval loss and a mean-squared-error (MSE) distillation loss:

$\mathcal{L} = \lambda \, \mathcal{L}_{\mathrm{MSE}} + (1 - \lambda) \, \mathcal{L}_{\mathrm{CL}},$

where $\lambda \in [0, 1]$, $\mathcal{L}_{\mathrm{MSE}} = \lVert E_s(q) - E_t(q^{+}) \rVert_2^2$, and $\mathcal{L}_{\mathrm{CL}}$ is the standard contrastive loss over (query, positive passage, negatives) triples. Empirically, a “warm-up” schedule that emphasizes the distillation term for the first three epochs, then trains on the combined objective for three more epochs, yields the best results.
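The combined objective can be sketched in NumPy as follows; the function and argument names are illustrative, and the in-batch softmax over one positive and several negatives is one common instantiation of the contrastive term, not necessarily the paper's exact choice:

```python
import numpy as np

def softqe_loss(student_q, teacher_q, pos_d, neg_d, lam=0.5):
    """Illustrative SoftQE-style objective: convex combination of an
    MSE distillation term and a contrastive retrieval loss.

    student_q : (B, D) student embeddings of the raw queries
    teacher_q : (B, D) teacher embeddings of the LLM-expanded queries
    pos_d     : (B, D) embeddings of positive passages
    neg_d     : (B, N, D) embeddings of negative passages
    """
    # MSE distillation: pull the soft query toward the teacher embedding.
    mse = np.mean((student_q - teacher_q) ** 2)

    # Contrastive loss: softmax over [positive, negatives] inner products.
    pos_scores = np.sum(student_q * pos_d, axis=1, keepdims=True)   # (B, 1)
    neg_scores = np.einsum("bd,bnd->bn", student_q, neg_d)          # (B, N)
    scores = np.concatenate([pos_scores, neg_scores], axis=1)       # (B, 1+N)
    scores -= scores.max(axis=1, keepdims=True)                     # stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    contrastive = -np.mean(log_probs[:, 0])  # negative log-likelihood of positive

    return lam * mse + (1.0 - lam) * contrastive
```

Setting `lam=1.0` recovers the pure distillation phase of the warm-up schedule.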
Inference and Impact
At inference, no LLM expansion is used: queries are mapped by the student encoder, and ranking is performed by inner product against precomputed document embeddings. Notably, SoftQE yields only marginal in-domain gains (+0.13 absolute MRR@10 on MS MARCO), but consistently improves out-of-domain BEIR tasks by an average of +2.83 nDCG@10, indicating enhanced robustness to domain shift (Pimpalkhute et al., 2024).
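This LLM-free inference path reduces to a single matrix product against a precomputed index; a minimal sketch (names are illustrative):

```python
import numpy as np

def rank_documents(query_emb, doc_embs, top_k=5):
    """Rank precomputed document embeddings by inner product with the
    student's soft-query embedding. No LLM call is made at inference.

    query_emb : (D,) soft-query embedding from the student encoder
    doc_embs  : (num_docs, D) precomputed document embeddings
    """
    scores = doc_embs @ query_emb            # (num_docs,) inner products
    order = np.argsort(-scores)              # descending by score
    top = order[:top_k]
    return top, scores[top]
```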
2. Learned Judge Q Aggregators for Preference Modeling
In the context of aligning LLM-based judges with human preferences, “Judge Q” refers to a learned soft aggregation mechanism that synthesizes preference predictions from multiple rubric-conditioned LLM judges. The key objective is to approximate (potentially diverse or conflicting) human-like persona-based preferences, enabling more robust reward modeling for RLHF or LLM routing decisions (Sprejer et al., 29 Oct 2025).
Multi-Judge Setup and Aggregation
- $m$ rubric-conditioned LLM judges yield scalar scores $s_1(x), \ldots, s_m(x)$ for each input $x$ (typically a prompt–answer pair).
- $n$ personas serve as synthetic human raters, each outputting ground-truth preference labels $y(x)$, to simulate human heterogeneity.
- For each $x$, collect the score vector $\mathbf{s}(x) = (s_1(x), \ldots, s_m(x))$.
- A parametric aggregator $f_\theta$ is trained to match persona outputs: $f_\theta(\mathbf{s}(x)) \approx y(x)$.
Aggregator Implementations
- Generalized Additive Model (GAM): $f_\theta(\mathbf{s}) = \beta_0 + \sum_{i=1}^{m} g_i(s_i)$, where each $g_i$ is a learned spline that recalibrates judge $i$'s output.
- Multi-Layer Perceptron (MLP): a small feed-forward network $f_\theta(\mathbf{s}) = \mathrm{MLP}(\mathbf{s})$, with depth and width selected via hyperparameter search.
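As a minimal stand-in for the GAM/MLP variants, the sketch below fits a purely linear aggregator $f_\theta(\mathbf{s}) = \mathbf{w} \cdot \mathbf{s} + b$ to persona labels by least squares; the actual aggregators are nonlinear, so this only illustrates the fit-to-persona-scores setup, and all names are assumptions:

```python
import numpy as np

def fit_linear_aggregator(S, y):
    """Fit a linear aggregator f(s) = w . s + b by least squares.

    S : (num_examples, m) matrix of per-judge scores s_i(x)
    y : (num_examples,) persona preference labels y(x)
    """
    X = np.hstack([S, np.ones((S.shape[0], 1))])    # append bias column
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    w, b = theta[:-1], theta[-1]
    return w, b

def aggregate(S, w, b):
    """Apply the fitted aggregator to a batch of judge-score vectors."""
    return S @ w + b
```

The fitted weights play the same diagnostic role as judge-importance measures: a judge whose weight shrinks toward zero contributes little to the synthetic preference.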
Objective, Evaluation, and Robustness
The core objective is MSE regression:

$\min_\theta \; \mathbb{E}_x \big[ \big( f_\theta(\mathbf{s}(x)) - y(x) \big)^2 \big].$

Robustness is assessed under both persona-label and judge-rubric perturbations. Notably, GAM and MLP aggregators maintain $R^2 \approx 0.58$ even under substantial noise, where naive score averaging deteriorates. Judge importance (assessed as $1-p$ of each spline term) reveals dimensions such as Truthfulness and Logical Consistency to be most influential on the synthetic preference metric (Sprejer et al., 29 Oct 2025).
3. Trainable Soft Queries for KV Cache Eviction in Transformers
The “Judge Q” approach in Transformer LLMs addresses the challenge of efficient KV cache eviction during long-context sequence generation. Traditional strategies use the last few tokens of the input (a local observation window) to compute importance scores for cache eviction, an approach biased toward local context. Judge Q instead learns a small set of soft-token embeddings that, once appended to the prompt, compute attention over all positions, yielding improved importance estimation for global information retention.
Methodology
- Define $k$ learnable soft tokens; their embeddings are the only tunable parameters.
- At each training step, two sequences are used:
  - Input A: (prompt, soft tokens)
  - Input B: (prompt, response tokens)
- For both, attention maps from the query tokens (soft or response) to each prompt token are extracted, yielding $A_{\mathrm{soft}}$ and $A_{\mathrm{resp}}$.
- Objective: minimize the MSE between $A_{\mathrm{soft}}$ and $A_{\mathrm{resp}}$:

$\mathcal{L} = \lVert A_{\mathrm{soft}} - A_{\mathrm{resp}} \rVert_2^2.$
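The training signal can be sketched as follows, assuming single-head dot-product attention and mean-pooling over query positions to reconcile the differing numbers of soft and response tokens (both are simplifications; names are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_over_prompt(queries, prompt_keys):
    """Attention weights from each query token to each prompt position."""
    d = queries.shape[-1]
    return softmax(queries @ prompt_keys.T / np.sqrt(d), axis=-1)

def judge_q_training_loss(soft_q, resp_q, prompt_keys):
    """MSE between the prompt-attention profile of the soft tokens and
    that of the true response tokens, after mean-pooling over queries.

    soft_q      : (k, D) query vectors of the appended soft tokens
    resp_q      : (r, D) query vectors of the response tokens
    prompt_keys : (p, D) key vectors of the prompt positions
    """
    a_soft = attention_over_prompt(soft_q, prompt_keys).mean(axis=0)  # (p,)
    a_resp = attention_over_prompt(resp_q, prompt_keys).mean(axis=0)  # (p,)
    return np.mean((a_soft - a_resp) ** 2)
```

Only the soft-token embeddings would receive gradients from this loss; the frozen model supplies the keys and response queries.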
Inference and Efficacy
During prefill inference:
- The soft tokens are appended and their attention scores over all prompt positions are computed.
- All key–value pairs are ranked by these importance scores; the top-ranked pairs within the token budget are retained.
- Soft tokens are discarded before decoding continues.
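The prefill-time selection step can be sketched as follows, assuming the soft-token attention map has already been extracted (names are illustrative):

```python
import numpy as np

def select_kv_to_keep(soft_attention, budget):
    """Choose which prompt positions' key-value pairs to retain.

    soft_attention : (num_soft_tokens, prompt_len) attention map from the
                     appended soft tokens to each prompt position
    budget         : number of KV pairs the cache may retain

    Returns indices of the top-`budget` positions, in original order,
    so the corresponding KV pairs survive eviction before decoding.
    """
    importance = soft_attention.mean(axis=0)   # per-position importance score
    keep = np.argsort(-importance)[:budget]    # highest-scoring positions
    return np.sort(keep)                       # restore original ordering
```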
Judge Q achieves a higher “critical KV hit rate” than the best window-based methods and smaller performance drops under tight memory constraints, yielding +1 to +3 points on LongBench and RULER across budgets (e.g., at a 512-token budget, Judge Q achieves 39.17 vs. SnapKV's 38.31 on LongBench, and 74.12 vs. 68.21 on RULER) (Liu et al., 13 Sep 2025).
4. Comparative Summary of Judge Q Approaches
| Application Domain | Mechanism | Notable Outcome |
|---|---|---|
| Dense Retrieval (Pimpalkhute et al., 2024) | Student soft query matches LLM-expanded teacher embedding | +2.83 nDCG@10 on BEIR |
| Preference Modeling (Sprejer et al., 29 Oct 2025) | Soft aggregation via GAM/MLP over rubric judges | R² ≈ 0.58, robust to bias |
| KV Cache Eviction (Liu et al., 13 Sep 2025) | Trainable soft tokens attend globally for KV scoring | +1–3 pt. on long-context |
All implementations share reliance on gradient-based learning of query objects that mediate between source input and a desired utility function—either semantic richness, preference alignment, or information coverage.
5. Implementation Considerations and Limitations
Integration of Learned Soft Queries is notably efficient:
- In dense retrieval, only query encoder weights are affected, with no change to latency, as LLM expansion is not required at inference (Pimpalkhute et al., 2024).
- In preference aggregation, only the aggregator is trained, and the interpretability of GAM splines aids in auditability and fairness analysis (Sprejer et al., 29 Oct 2025).
- For KV cache eviction, only soft-token embeddings are updated; no model-wide fine-tuning is needed, and memory/compute overhead is dominated by per-layer attention during prefill (Liu et al., 13 Sep 2025).
Known caveats include the synthetic nature of persona-based labels in preference modeling, potential LLM circularity, limitation to single-vector encoders or scalar aggregations in some settings, and the need for broader human calibration.
6. Broader Impact and Future Directions
Learned Soft Queries, both in “Judge Q” and related forms, directly enable:
- Improved zero-shot retrieval transfer, as LLM-driven expansions expose paraphrastic and rare event knowledge (Pimpalkhute et al., 2024).
- Fine-grained, robust, and interpretable aggregation of LLM judges for RLHF, with enhanced resistance to rubric-induced bias and instability (Sprejer et al., 29 Oct 2025).
- Globally-aware cache retention in long-context LLMs, crucial for efficient generation under resource pressure (Liu et al., 13 Sep 2025).
Promising directions include extending soft queries to multi-vector and late-interaction architectures, refining persona label distributions beyond uniform sampling, incorporating rank-based losses into aggregator training, and large-scale human validation of preference alignment.