SandwichR: Fast Query Correction Paradigm
- SandwichR is a query correction paradigm that decomposes outputs into initial correction, reasoning chain, and final correction to reduce latency.
- It employs an Answer–Reasoning–Answer structure with a consistency-aware reinforcement learning objective to align rapid initial responses with in-depth reasoning.
- The framework achieves state-of-the-art accuracy with a 40–70% reduction in inference latency, validated on multi-domain, error-injected query datasets.
Sandwich Reasoning (SandwichR) is a paradigm and training strategy for low-latency query correction in search pipelines. It addresses the latency bottleneck of Chain-of-Thought (CoT) reasoning, enabling LLMs to rapidly produce high-quality corrections while explicitly leveraging reasoning without incurring prohibitive computational costs. The framework is defined by its "Answer–Reasoning–Answer" output structure and a consistency-aware reinforcement learning objective, underpinned by targeted sampling and a newly constructed benchmark dataset. SandwichR achieves state-of-the-art accuracy with a 40–70% reduction in inference latency compared to standard reasoning-first approaches (Zhang et al., 7 Jan 2026).
1. "Answer–Reasoning–Answer" Inference Paradigm
SandwichR decomposes output generation for a noisy query $x$ into three sequential segments: initial correction ($C_{\text{init}}$), explicit reasoning ($R$), and final correction ($C_{\text{final}}$), so that $y = [C_{\text{init}}; R; C_{\text{final}}]$.
- Initial Correction ($C_{\text{init}}$): The model generates the $C_{\text{init}}$ tokens conditioned solely on $x$, yielding an immediate correction that is returned for downstream retrieval.
- Reasoning Chain ($R$): The $R$ tokens encode the model's analytic process, such as error location, type, and justification, and are wrapped in output tags (e.g., "<reasoning>…</reasoning>") so they can be parsed separately.
- Final Correction ($C_{\text{final}}$): Conditioned on $x$ and the preceding $C_{\text{init}}$ and $R$, the model emits $C_{\text{final}}$. This step aligns the initial answer with reasoning-derived insights and supervises the initial answer's quality.
During real-time inference under tight token budgets, only $C_{\text{init}}$ need be generated, ensuring low latency. Training aligns the initial answer with the downstream reasoning, so the emitted correction maintains reasoning-level accuracy even when the later output segments are omitted.
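A minimal sketch of how a downstream component might split a generation into its three segments. The "<reasoning>" tags come from the example above; the assumption that the initial and final corrections are plain text on either side of those tags, and the names `SandwichOutput` and `parse_sandwich`, are illustrative and not taken from the source.

```python
import re
from typing import NamedTuple, Optional


class SandwichOutput(NamedTuple):
    c_init: str
    reasoning: Optional[str]
    c_final: Optional[str]


# Assumed layout: "<C_init><reasoning>...</reasoning><C_final>".
_REASONING = re.compile(r"<reasoning>(.*?)</reasoning>", re.DOTALL)


def parse_sandwich(text: str) -> SandwichOutput:
    """Split a raw generation into (C_init, R, C_final)."""
    match = _REASONING.search(text)
    if match is None:
        # Generation stopped early: keep only the text before any open tag.
        c_init = text.split("<reasoning>", 1)[0].strip()
        return SandwichOutput(c_init, None, None)
    c_init = text[: match.start()].strip()
    c_final = text[match.end():].strip()
    return SandwichOutput(c_init, match.group(1).strip(), c_final or None)
```

A truncated generation still yields a usable `c_init`, which is the property the paradigm relies on under tight token budgets.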
2. Consistency-Aware Reinforcement Learning (RL) Objective
SandwichR employs two-stage training. After supervised fine-tuning (SFT) teaches the model to format outputs as $[C_{\text{init}}; R; C_{\text{final}}]$, RL is used to enforce consistency across segments and incentivize high-fidelity corrections.
- Reward Composition:
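- Accuracy Reward ($R_{\text{acc}}$): a piecewise reward on the final correction, defined in terms of $F_{0.5}$, the F0.5-score against the reference correction.
- Format × Consistency Reward ($R_{\text{fmt}} \times R_{\text{cons}}$): positive iff the output matches the SandwichR structure and $C_{\text{init}} = C_{\text{final}}$; 0 otherwise.
- Total reward ($R_{\text{total}}$): combines $R_{\text{acc}}$ with the format × consistency term.
Gradient Update: The objective is optimized via REINFORCE gradients:
$$\nabla_\theta J(\theta) = \mathbb{E}_{y \sim p_\theta(\cdot \mid x)}\left[ R_{\text{total}}(y)\, \nabla_\theta \log p_\theta(y \mid x) \right]$$
In practice, the policy gradient is stabilized using GRPO (Group Relative Policy Optimization, a variant of PPO/GRECO), rewarding high $F_{0.5}$, exact output format, and segment consistency. This alignment ensures that rapid initial corrections ($C_{\text{init}}$) reflect post-hoc reasoning ($R$, $C_{\text{final}}$), resolving the latency-accuracy trade-off.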
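The reward terms can be sketched as follows. Several choices here are assumptions rather than the source's definitions: the F0.5 scorer is approximated by token overlap (correction work often scores edits instead), the format × consistency gate is given the value 1.0, and the terms are combined by simple addition. `group_relative_advantages` shows the within-group reward normalization that GRPO-style updates typically use.

```python
from collections import Counter
from statistics import mean, pstdev
from typing import List


def f_beta(pred: str, gold: str, beta: float = 0.5) -> float:
    """Token-overlap F-beta (assumed stand-in for the F0.5 scorer)."""
    pred_tokens, gold_tokens = Counter(pred.split()), Counter(gold.split())
    overlap = sum((pred_tokens & gold_tokens).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_tokens.values())
    recall = overlap / sum(gold_tokens.values())
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def format_consistency(c_init: str, c_final: str, well_formed: bool) -> float:
    """Assumed gate: 1.0 only if the structure is valid and both answers agree."""
    return 1.0 if well_formed and c_init == c_final else 0.0


def total_reward(c_init: str, c_final: str, gold: str, well_formed: bool) -> float:
    # Assumption: additive combination of the accuracy term and the gate.
    return f_beta(c_final, gold) + format_consistency(c_init, c_final, well_formed)


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """Normalize rewards within one group of samples for the same query."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]
```

Because the gate is zero whenever the initial and final corrections disagree, a sample that reaches the right answer only after reasoning earns less than one whose first answer was already right, which is what pushes accuracy into the initial segment.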
3. Margin-Based Rejection Sampling for RL Dataset
To concentrate RL signal on challenging ("borderline") samples, SandwichR adopts targeted instance filtering:
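For each candidate input $x$ in the SFT pool, sample $N$ outputs from the initial model. Compute pass@$N$: the count of sampled outputs that meet the correctness criterion.
Accept $x$ into the RL dataset iff pass@$N$ satisfies the acceptance margin; otherwise, reject.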
This strategy restricts RL to queries where the model exhibits partial correctness, thus maximizing gradient leverage on inputs amenable to reasoning-driven improvement.
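The filtering step can be sketched as below. The margin bounds `low` and `high`, the `is_correct` predicate, and the function names are hypothetical parameters introduced for illustration; the source's actual thresholds are not reproduced here.

```python
from typing import Callable, Dict, List


def pass_count(samples: List[str], gold: str,
               is_correct: Callable[[str, str], bool]) -> int:
    """Number of sampled final corrections judged correct."""
    return sum(1 for s in samples if is_correct(s, gold))


def select_rl_pool(sampled: Dict[str, List[str]], gold: Dict[str, str],
                   is_correct: Callable[[str, str], bool],
                   low: int, high: int) -> List[str]:
    """Keep queries whose pass count lies in [low, high] (hypothetical margins)."""
    kept = []
    for query, samples in sampled.items():
        if low <= pass_count(samples, gold[query], is_correct) <= high:
            kept.append(query)
    return kept
```

With four samples per query, exact-match correctness, and margins `low=1, high=3`, a query the model always gets right and one it never gets right are both dropped, leaving only the mixed-outcome cases.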
4. High-Quality Multi-Domain Query Correction Dataset
Recognizing the absence of dedicated benchmarks for complex query correction in real-world pipelines, the authors established a multi-domain, error-injected dataset using noisy transformations of real user queries:
Domains: E-commerce, Video, Medical.
Error synthesis: Each noisy query $x$ is generated from a clean query via:
- Wrong Words (confusable substitution by phonetics/graphics),
- Missing Words (single word deletions),
- Disorder Words (swap of adjacent terms).
- Each query contains exactly one error, consistent with short, realistic queries.
- Reasoning annotations: Collected via GPT-4o to provide explicit reasoning traces.
- Statistics:
| Domain | #Train | #Dev | #Test | AvgLen |
|---|---|---|---|---|
| E-commerce | 90,511 | 1,000 | 1,000 | ≈6.4 |
| Video | 88,736 | 1,000 | 1,000 | ≈7.1 |
| Medical | 94,176 | 1,000 | 1,000 | ≈16.1 |
This dataset enables systematic evaluation in realistic, multi-domain settings and supports the explicit reasoning trajectories integral to SandwichR.
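The three error types described above can be illustrated with a small injection sketch. It assumes whitespace-tokenized queries and a caller-supplied confusion dictionary; both are simplifications for illustration, and the function names are hypothetical rather than the authors' tooling.

```python
import random
from typing import Dict, List


def inject_missing(words: List[str], rng: random.Random) -> List[str]:
    """Missing Words: delete one word."""
    i = rng.randrange(len(words))
    return words[:i] + words[i + 1:]


def inject_disorder(words: List[str], rng: random.Random) -> List[str]:
    """Disorder Words: swap one pair of adjacent words (needs >= 2 words)."""
    i = rng.randrange(len(words) - 1)
    out = list(words)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out


def inject_wrong(words: List[str], rng: random.Random,
                 confusions: Dict[str, List[str]]) -> List[str]:
    """Wrong Words: replace one word with a confusable alternative."""
    candidates = [i for i, w in enumerate(words) if w in confusions]
    if not candidates:
        return list(words)  # nothing confusable in this query
    i = rng.choice(candidates)
    out = list(words)
    out[i] = rng.choice(confusions[words[i]])
    return out
```

Each function applies exactly one edit and leaves the input list untouched, matching the one-error-per-query constraint.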
5. Comparative Evaluation and Latency Analysis
SandwichR was evaluated using Qwen2.5-1.5B-Instruct with LoRA SFT and GRPO RL, compared against baselines: traditional seq2seq (mT5, BART SFT), CoT prompting (Ans–Rea, Rea–Ans), and larger LLMs (GPT-4o, QwQ-32B, DeepSeek-R1, GrammarGPT-7B). Metrics included $F_{0.5}$, accuracy, and query latency under strict token budgets (20 tokens for E-commerce/Video, 40 for Medical).
| Model | E-com F0.5 | E-com Acc | Video F0.5 | Video Acc | Med F0.5 | Med Acc |
|---|---|---|---|---|---|---|
| Ans–Rea SFT+RL | 0.211 | 0.200 | 0.316 | 0.292 | 0.392 | 0.363 |
| Rea–Ans SFT+RL | 0.216 | 0.207 | 0.318 | 0.301 | 0.387 | 0.364 |
| SandwichR SFT+RL | 0.221 | 0.213 | 0.325 | 0.307 | 0.396 | 0.375 |
Latency evaluation (under limited token budgets):
| Method | Budget | E-com Acc | E-com Time (s) | Video Acc | Video Time (s) | Med Acc | Med Time (s) |
|---|---|---|---|---|---|---|---|
| Rea–Ans | Full (256) | 0.207 | 1.959 | 0.301 | 1.143 | 0.364 | 1.550 |
| Rea–Ans | Limited | ≈0.000 | 0.457 | ≈0.000 | 0.484 | 0.009 | 0.900 |
| Ans–Rea | Limited | 0.200 | 0.464 | 0.292 | 0.446 | 0.359 | 0.893 |
| SandwichR | Limited | 0.213 | 0.467 | 0.307 | 0.474 | 0.374 | 0.924 |
SandwichR achieves a 40–70% reduction in latency compared to standard CoT (Rea–Ans) with negligible drop in accuracy, thus resolving the fundamental trade-off in online query correction.
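As a quick check of the latency figures, the per-domain reduction can be recomputed from the table by comparing SandwichR under the limited budget with Rea–Ans at the full 256-token budget. The dictionary and function names are illustrative.

```python
# Seconds per query, copied from the latency table above.
REA_ANS_FULL = {"E-commerce": 1.959, "Video": 1.143, "Medical": 1.550}
SANDWICHR_LIMITED = {"E-commerce": 0.467, "Video": 0.474, "Medical": 0.924}


def latency_reduction(baseline: float, method: float) -> float:
    """Fractional reduction in latency relative to the baseline."""
    return (baseline - method) / baseline


reductions = {
    domain: latency_reduction(REA_ANS_FULL[domain], SANDWICHR_LIMITED[domain])
    for domain in REA_ANS_FULL
}

for domain, value in reductions.items():
    print(f"{domain}: {value:.1%}")
```

This gives roughly 76% for E-commerce, 59% for Video, and 40% for Medical, so the Medical and Video rows fall inside the quoted range and the E-commerce row comes out somewhat above it.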
6. Training Procedure, Inference Pseudocode, and Complexity Profile
Training proceeds via SFT to learn the SandwichR format, followed by RL on a filtered dataset to maximize alignment of initial and reasoning-informed corrections. The computational bottleneck in autoregressive decoding is mitigated by early emission of $C_{\text{init}}$, enabling inference in at most $B$ decoding steps under a token budget $B$.
SandwichR Training:
    procedure TRAIN_SANDWICHR(D_train):
        # Stage 1: SFT
        for (x, y_clean) in D_train:
            prompt ← format_SandwichR_prompt(x)
            target ← [y_clean; reasoning_by_GPT4o; y_clean]
            θ ← SFT_update(θ, prompt, target)
        end for
        # Stage 2: Consistency-Aware RL
        S_RL ← {}
        for x in small_subset(D_train):
            samples ← {model_sample(x) for i = 1..N}
            if any(F0.5(sample.C_final, y_clean) > 0):
                add x to S_RL
        end for
        for epoch in 1..E:
            for x in S_RL:
                y ~ p_θ(·|x)
                compute R_acc, R_fmt, R_cons, R_total
                θ ← GRPO_update(θ, y, R_total)
            end for
        end for
    end procedure
SandwichR Inference:

    procedure INFER_SANDWICHR(x, budget):
        y_seq ← []
        # Generate C_init
        while not end_of_C_init and token_count < budget:
            y_seq += sample_next_token()
        end while
        return y_seq  # use initial correction immediately
        # Optionally continue to produce R and C_final if tokens remain
    end procedure
Standard CoT (Rea–Ans) under strict budgets cannot emit a corrected answer, driving accuracy to zero. SandwichR, by front-loading $C_{\text{init}}$, achieves significant speedup while maintaining reasoning-level correction quality through consistency-aware training.
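The budgeted early-exit behaviour can be sketched in Python over any token iterator. The use of "<reasoning>" as the marker that closes the initial correction and the name `decode_initial` are assumptions for illustration; a real decoder would wrap the model's sampling loop instead of a list.

```python
from itertools import islice
from typing import Iterable, List, Tuple

END_OF_INIT = "<reasoning>"  # assumed marker that closes C_init


def decode_initial(token_stream: Iterable[str], budget: int) -> Tuple[List[str], bool]:
    """Consume at most `budget` tokens and return (C_init tokens, finished).

    `finished` is True only if the end-of-C_init marker was seen in budget.
    """
    tokens: List[str] = []
    for token in islice(token_stream, budget):
        if token == END_OF_INIT:
            return tokens, True
        tokens.append(token)
    return tokens, False
```

Because the stream is consumed lazily, the caller can keep pulling from the same iterator afterwards to obtain the reasoning and final correction when the budget allows.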
7. Context and Significance in Search Pipelines
SandwichR advances LLM-based search query correction under operational constraints where real-time accuracy is paramount. Its paradigm enables practical deployment by decoupling answer emission from exhaustive reasoning, then leveraging RL to ensure answer quality is not compromised. The approach is extensible to domains with structured outputs and similar latency demands. The introduction of a reasoning-annotated, error-injected benchmark sets a foundation for further research in domain-adaptive correction tasks and reinforces the utility of aligned multi-stage output in time-critical LLM applications (Zhang et al., 7 Jan 2026).